This is the mail archive of the gsl-discuss@sourceware.org mailing list for the GSL project.



Re: [Help-gsl] Spearman rank correlation coefficient


Hello,

It would be best to move this discussion over to gsl-discuss. I think it would be very useful to have this function in GSL. Just a few comments on your code:

1) The code looks clean and nicely commented. One issue is that, since you appear to have followed the Apache code very closely, there may be a licensing issue: I don't know whether the Apache license is compatible with the GPL. From a quick check, it's possible we can use it, but it seems we would need to preserve the original copyright notice.

2) Dynamic allocation: it looks like you dynamically allocate 5 different arrays to do the calculation. It would be better either to make functions like gsl_stats_spearman_alloc and gsl_stats_spearman_free, or to pass in a pre-allocated workspace as one of the function arguments. Since you're using workspace of different types (double, size_t), it's probably better to make the alloc/free functions.
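
To make the alloc/free suggestion concrete, here is a minimal sketch of what such a workspace could look like. The struct layout, the field names, and the names spearman_workspace_alloc and spearman_workspace_free are hypothetical and used only for illustration; they are not an existing GSL interface.

```c
#include <stdlib.h>

/* Hypothetical workspace: layout and names are illustrative only. */
typedef struct {
    size_t n;        /* number of observations the workspace was sized for */
    double *ranks1;  /* ranks of the first data set */
    double *ranks2;  /* ranks of the second data set */
    size_t *perm;    /* sorting permutation, reusable for both data sets */
} spearman_workspace;

/* Release a workspace; accepts NULL and partially built workspaces. */
void spearman_workspace_free(spearman_workspace *w)
{
    if (w == NULL) return;
    free(w->ranks1);
    free(w->ranks2);
    free(w->perm);
    free(w);
}

/* Allocate everything once, for n > 0 observations.
   Returns NULL if any allocation fails. */
spearman_workspace *spearman_workspace_alloc(size_t n)
{
    spearman_workspace *w = malloc(sizeof *w);
    if (w == NULL) return NULL;

    w->n = n;
    w->ranks1 = malloc(n * sizeof *w->ranks1);
    w->ranks2 = malloc(n * sizeof *w->ranks2);
    w->perm = malloc(n * sizeof *w->perm);

    if (w->ranks1 == NULL || w->ranks2 == NULL || w->perm == NULL) {
        spearman_workspace_free(w);
        return NULL;
    }
    return w;
}
```

With this shape, a caller who computes millions of coefficients on vectors of the same length allocates once, reuses the workspace for every call, and frees it at the end.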

3) One of your dynamically allocated arrays is realloc()'d in a loop. Is this because the size of the array is unknown before the loop? Perhaps there is a way to avoid the realloc() calls.
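
As one possible way around the realloc() calls, assuming the growing array is used to collect tied values (I have not confirmed this from the code), ties can be resolved by scanning runs of equal values in the sorted data, which needs no list of unknown size. A rough sketch, where value_index, compare_value_index and rank_average_ties are hypothetical names:

```c
#include <stdlib.h>

/* A value paired with its position in the original array. */
typedef struct {
    double value;
    size_t index;
} value_index;

/* qsort comparator: ascending by value. */
static int compare_value_index(const void *a, const void *b)
{
    const double x = ((const value_index *)a)->value;
    const double y = ((const value_index *)b)->value;
    return (x > y) - (x < y);
}

/* Write 1-based ranks of x[0..n-1] into ranks[]. Tied values all get
   the average of the ranks they span. work[] must hold n elements and
   is supplied by the caller, so nothing is allocated here. */
void rank_average_ties(const double *x, size_t n,
                       value_index *work, double *ranks)
{
    size_t i, j, k;

    for (i = 0; i < n; i++) {
        work[i].value = x[i];
        work[i].index = i;
    }
    qsort(work, n, sizeof *work, compare_value_index);

    i = 0;
    while (i < n) {
        /* find the end of the run of values equal to work[i].value */
        j = i + 1;
        while (j < n && work[j].value == work[i].value) j++;

        /* sorted positions i..j-1 hold ranks i+1..j; their mean is (i+1+j)/2 */
        {
            const double avg = 0.5 * (double)(i + 1 + j);
            for (k = i; k < j; k++) ranks[work[k].index] = avg;
        }
        i = j;
    }
}
```

For the input {20, 10, 30, 20} this gives the ranks {2.5, 1, 4, 2.5}. NaN handling is left out of the sketch.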

4) We also need to think of some automated tests that can be added to statistics/test.c to test this function exhaustively and make sure it's working correctly, even if that consists simply of known output values for a few different input cases.

Good work,
Patrick Alken

On 02/09/2012 04:26 PM, Timothée Flutre wrote:
Hello,

I noticed that only the Pearson correlation coefficient is implemented
in the GSL (http://www.gnu.org/software/gsl/manual/html_node/Correlation.html).
However, in quantitative genetics, several authors are using the
Spearman coef (for instance, Stranger et al "Population genomics of
human gene expression", Nature Genetics, 2007) as it is less
influenced by outliers.

Current high-throughput data requires computing such coefficients
several million times. Thus I implemented the computation of the
Spearman coef in GSL-like code. In fact, one just needs to rank the
input vectors and then compute the Pearson coef on them. For the
ranking, I was inspired by the code from the Apache Math module.

I was thinking that it could be useful to other users to add my piece
of code to the file "covariance_source.c" of the GSL
(http://bzr.savannah.gnu.org/lh/gsl/trunk/annotate/head:/statistics/covariance_source.c#L77).
So here is the code: https://gist.github.com/1784199

I am not very proficient in C, so even if it is not possible to
include the code in the GSL, don't hesitate to give me advice.

Thanks,
Tim


