This is the mail archive of the
gsl-discuss@sources.redhat.com
mailing list for the GSL project.
Re: CDF diff
- From: Martin Jansche <jansche at ling dot ohio-state dot edu>
- To: Jason Hooper Stover <jason at sakla dot net>
- Cc: <gsl-discuss at sources dot redhat dot com>
- Date: Wed, 15 Jan 2003 21:34:55 -0500 (EST)
- Subject: Re: CDF diff
On Tue, 14 Jan 2003, Jason Hooper Stover wrote:
> I hadn't planned to compute Pr(a<Z<b) directly since such
> a function isn't a "cumulative distribution function"
> according to the common definition. Also, users can
> compute Pr(a<Z<b) as Pr(Z<b)-Pr(Z<a).
True, but what if I need Pr(a<Z<b) for a discrete distribution where
Pr(Z<b) has to be computed via an explicit summation (or does that
never happen?) and a>>0? OK, I guess this could be pointed out in
documentation and one could then do the summation in user code.
> effort for a library like GSL. E.g., such an approach might make
> sense for a higher-level program that runs statistical routines
> for the exponential family. But I could be wrong; maybe it would
> make sense in GSL. If you have some coded examples of this kind
> of thing, send them along.
For example, you could define a type
typedef struct {
double shape;
double scale;
double pdf_lognorm;
double pdf_x0;
} gsl_ran_gamma_t;
with a corresponding constructor
gsl_ran_gamma_t *
gsl_ran_gamma_alloc (const double shape,
const double inverse_scale)
{
gsl_ran_gamma_t *p;
double scale;
if (shape <= 0 || inverse_scale <= 0)
GSL_ERROR_VAL ("requires shape > 0, inverse_scale > 0",
GSL_EDOM, NULL);
p = malloc (sizeof (gsl_ran_gamma_t));
if (p == NULL)
GSL_ERROR_VAL ("failed to allocate space for ran_gamma struct",
GSL_ENOMEM, NULL);
scale = 1 / inverse_scale;
p->shape = shape;
p->scale = scale;
if (shape == 1)
{
p->pdf_lognorm = log (scale);
p->pdf_x0 = scale;
}
else
{
p->pdf_lognorm = shape * log (scale) - gsl_sf_lngamma (shape);
p->pdf_x0 = 0;
}
return p;
}
Then you can define gamma_pdf and gamma_cdf like this:
double
gsl_ran_gamma_pdf (const double x, gsl_ran_gamma_t *params)
{
if (x < 0)
{
return 0;
}
else if (x == 0)
{
return params->pdf_x0;
}
else
{
double a = params->shape - 1;
double logp = params->pdf_lognorm;
logp -= params->scale * x;
if (a != 0)
logp += a * log (x);
return exp (logp);
}
}
double
gsl_ran_gamma_cdf (const double x, gsl_ran_gamma_t *params)
{
if (x <= 0)
{
return 0;
}
else
{
double y = x * params->scale;
if (params->shape == 1)
return fabs (gsl_expm1 (-y));
else
return gsl_sf_gamma_inc_P (params->shape, y);
}
}
This has the disadvantage of introducing yet another type and
allocator for each family, and so of course it breaks compatibility.
But it has several advantages *if you compute many values for the same
distribution with fixed parameters* (or if you repeatedly sample from
the same distribution), which is not too uncommon IMO:
First, domain checks on the parameters have to be done exactly once
in the _alloc() function. This is an easy way to avoid code
duplication, since the same checks would have to be carried out for
the pdf, the cdf, and any other functionality that may be added in
the future (inverse cdf, mode, etc.).
Second, computing the normalization term is typically an expensive
operation (e.g. binomial likelihood is easy to compute, but the
corresponding Beta density is more costly), but does not depend on
the value of x; in general, all values that are independent of x
can be precomputed in the _alloc() function and stored in the
parameter struct. In the code above, lngamma(shape) is evaluated
exactly once.
Third, the functions now have a form in which they could be used
directly in conjunction with gsl_function(_fdf), e.g. if a user
wanted to use a one-dimensional root finding algorithm to compute
inverse cdfs.
Fourth, this format would make it easy to incorporate distributions
that are simple reparametrizations of existing distributions. For
example, one could now have a function
gsl_ran_gamma_t *gsl_ran_chisq_alloc(const double df);
and simply not implement gsl_ran_chisq_pdf separately from
gsl_ran_gamma_pdf. (Or if this should be hidden from the user, it'd
be easy to generate the illusion of separate functionality with
#defines).
It would even be possible to do this in a somewhat object-oriented way
and include function pointers into the struct so that one would get a
a gamma density via:
gsl_ran_gamma_t *g = gsl_ran_gamma_alloc (1, 1);
double x = g->pdf (x, g); /* presumably hidden behind a macro */
Notice that gsl_ran_gamma_alloc could have quietly initialized the pdf
field with a pointer to a function that computes an exponential
density (because shape == 1).
One potentail counterargument against this design might be that it
makes it more difficult for the user to implement likelihood based
computations, since the parameters of the distribution would
presumably change frequently. I wouldn't consider this a valid
objection though, since for efficiently computing likelihood for many
samples the code has to be structured very differently anyway.
- martin