This is the mail archive of the gsl-discuss@sourceware.org mailing list for the GSL project.


Re: [Help-gsl] Spearman rank correlation coefficient


Yes I apologize for not responding sooner. I looked at your updated code, and just have a few comments:

1. I recommend making a gsl_stats_spearman_workspace struct to hold the allocated variables (ranks1, ranks2, d, p). This way the user doesn't need to know the details of everything that gets allocated and can simply call the alloc routine with the size n parameter and pass the struct pointer to subsequent routines (this is the way that most of gsl is designed).

2. Typically new features for GSL are made as extensions first (see: http://www.gnu.org/software/gsl/ under "Extensions"). This way people can try out the new feature and test it well before it is incorporated into the main source tree. There are examples there on how to make your code into an extension. I should be able to add your extension to the web page when it's ready.
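
As a rough illustration of point 1, a workspace struct with matching alloc/free routines might look like the sketch below. The names spearman_workspace, spearman_workspace_alloc and spearman_workspace_free are hypothetical, and the element types of the four arrays are assumptions, since the gist's actual declarations are not reproduced in this thread.

```c
#include <stdlib.h>

/* Hypothetical workspace. The field names follow the arrays mentioned
   above (ranks1, ranks2, d, p); their element types are assumptions. */
typedef struct
{
  size_t n;        /* number of observations the workspace was sized for */
  double *ranks1;  /* ranks of the first data set */
  double *ranks2;  /* ranks of the second data set */
  double *d;       /* scratch array of length n */
  size_t *p;       /* scratch permutation of length n */
} spearman_workspace;

/* Release a workspace; passing NULL is allowed, as with free(). */
void
spearman_workspace_free (spearman_workspace *w)
{
  if (w == NULL)
    return;

  free (w->ranks1);
  free (w->ranks2);
  free (w->d);
  free (w->p);
  free (w);
}

/* Allocate every array in one call so the caller only needs to know n.
   Returns NULL if n is zero or if any allocation fails. */
spearman_workspace *
spearman_workspace_alloc (size_t n)
{
  spearman_workspace *w;

  if (n == 0)
    return NULL;

  /* calloc zeroes the pointers, so a partial failure can be freed safely */
  w = (spearman_workspace *) calloc (1, sizeof (spearman_workspace));
  if (w == NULL)
    return NULL;

  w->n = n;
  w->ranks1 = (double *) malloc (n * sizeof (double));
  w->ranks2 = (double *) malloc (n * sizeof (double));
  w->d = (double *) malloc (n * sizeof (double));
  w->p = (size_t *) malloc (n * sizeof (size_t));

  if (w->ranks1 == NULL || w->ranks2 == NULL || w->d == NULL || w->p == NULL)
    {
      spearman_workspace_free (w);
      return NULL;
    }

  return w;
}
```

With this layout the caller allocates once, passes the pointer to the computing routine as often as needed, and frees once at the end.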

Patrick

On 03/02/2012 08:32 AM, Timothée Flutre wrote:
Hello,

a month ago I proposed an implementation of the Spearman rank
correlation coefficient as it is missing in the GSL (see emails
below). I took into account some advice and the updated code is
available here:
https://gist.github.com/1784199#file_spearman_v2.c

Since then, I haven't received any answer. I'm not an experienced C
programmer, so my code may need further improvements, but it can still
be useful to others. Can I submit it to the GSL main trunk?
I've never done that before. Can someone tell me what to do?
Should I request "developer write access" for instance?

Thanks in advance,
Tim


2012/2/11 Timothée Flutre <timflutre@gmail.com>
Thanks for your input!

1) Here is the text of the license under which the Apache code is:
http://www.apache.org/licenses/LICENSE-2.0. Indeed it seems that we
would have to indicate their copyright. Is this a problem? In a way,
there are not many different algorithms to compute the Spearman
coefficient...

2) I have made the changes and now have "gsl_stats_spearman_alloc" and
"gsl_stats_spearman_free" functions for the four arrays ranks1,
ranks2, d and p. I added the code as a 2nd file to the same gist:
https://gist.github.com/1784199#file_spearman_v2.c

3) Yes, we don't know in advance how many ties there will be. That's
why I reallocate inside the loop. I don't see how I can do
differently.

4) I added a function performing tests, using the data defined in
statistics/test_float_source.c. What do I do now? Do I need to have write
access to the GSL repository on Savannah? Or maybe someone else can do it
for me?

Thanks,
Tim


On Thu, Feb 9, 2012 at 6:04 PM, Patrick Alken <Patrick.Alken@colorado.edu> wrote:
Hello,

  It would be best to move this discussion over to gsl-discuss. I think
it would be very useful to have this function in GSL. Just a few comments on
your code:

1) The code looks clean and nicely commented. One issue is that since
you appear to have followed the apache code very closely, there may be a
licensing issue - I don't know if the Apache license is compatible with the
GPL. On a quick check, it's possible we can use it but it seems we need to
preserve the original copyright notice.

2) Dynamic allocation - it looks like you dynamically allocate 5
different arrays to do the calculation. It would be better to either make
functions like gsl_stats_spearman_alloc and gsl_stats_spearman_free, or to
pass in a pre-allocated workspace as one of the function arguments. Since
you're using workspace of different types (double, size_t), it's probably
better to make the alloc/free functions.

3) One of your dynamically allocated arrays is realloc()'d in a loop. Is
this because the size of the array is unknown before the loop? Perhaps there
is a way to avoid the realloc's.

4) We also need to think of some automated tests that can be added to
statistics/test.c to test this function exhaustively and make sure it's
working correctly - even if that consists simply of known output values for
a few different input cases.
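
Regarding point 3, one possible way to avoid the realloc calls is to sort the values once and then walk each run of equal values, assigning the average rank to the whole run. The sketch below assumes that approach; it is not the method used in the gist or in the Apache code, and the names rank_item and average_ranks are hypothetical.

```c
#include <stdlib.h>

/* A value paired with its original position, so ranks can be written
   back in input order after sorting. */
typedef struct
{
  double value;
  size_t index;
} rank_item;

static int
rank_item_cmp (const void *a, const void *b)
{
  double x = ((const rank_item *) a)->value;
  double y = ((const rank_item *) b)->value;
  return (x > y) - (x < y);
}

/* Write the 1-based ranks of data[0..n-1] into ranks[0..n-1]; tied values
   all receive the average of the ranks they span. The only allocation is
   one array of fixed size n. Returns 0 on success, -1 on failure. */
int
average_ranks (const double data[], size_t n, double ranks[])
{
  size_t i, j, k;
  double avg;
  rank_item *items;

  if (n == 0)
    return 0;

  items = (rank_item *) malloc (n * sizeof (rank_item));
  if (items == NULL)
    return -1;

  for (i = 0; i < n; i++)
    {
      items[i].value = data[i];
      items[i].index = i;
    }

  qsort (items, n, sizeof (rank_item), rank_item_cmp);

  i = 0;
  while (i < n)
    {
      /* find the end j (exclusive) of the run of values equal to items[i] */
      j = i + 1;
      while (j < n && items[j].value == items[i].value)
        j++;

      /* sorted positions i..j-1 carry ranks i+1..j; take their mean */
      avg = 0.5 * (double) (i + 1 + j);
      for (k = i; k < j; k++)
        ranks[items[k].index] = avg;

      i = j;
    }

  free (items);
  return 0;
}
```

Because a run of ties is handled as soon as its end is found, no list of tied indices has to grow inside the loop.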

Good work,
Patrick Alken


On 02/09/2012 04:26 PM, Timothée Flutre wrote:
Hello,

I noticed that only the Pearson correlation coefficient is implemented
in the GSL
(http://www.gnu.org/software/gsl/manual/html_node/Correlation.html).
However, in quantitative genetics, several authors are using the
Spearman coef (for instance, Stranger et al "Population genomics of
human gene expression", Nature Genetics, 2007) as it is less
influenced by outliers.

Current high-throughput data requires computing this coefficient several
million times. Thus I implemented the computation of the Spearman
coefficient in GSL-like code. In fact, one just needs to rank the input
vectors and then compute the Pearson coefficient on them. For the ranking, I
was inspired by the code from the Apache Math module.

I was thinking that it could be useful to other users to add my piece
of code to the file "covariance_source.c" of the GSL
(http://bzr.savannah.gnu.org/lh/gsl/trunk/annotate/head:/statistics/covariance_source.c#L77).
So here is the code: https://gist.github.com/1784199

I am not very proficient in C, so even if it is not possible to
include the code in the GSL, don't hesitate to give me advice.

Thanks,
Tim


