This is the mail archive of the gsl-discuss@sources.redhat.com mailing list for the GSL project.



Re: Multivariate minimization


Hi Robert.

I agree that you have to use some sort of global search even when a local minimum is good enough. But I disagree on some other points:

- when you estimate the parameters of a statistical model (such as the weights of a neural network), you mostly don't care about the _relative_ quality of the minimum. What you care about is the quality of the model, so as to calculate confidence intervals on predictions, for instance. The fact that with some other set of weights you might obtain better results is not always relevant. What is relevant in practice is whether the set of parameters gives a model that is precise enough for your goal. The difficult (and time-consuming) part of statistical modelling is not the estimation part but the quality-assessment part (in which optimisation plays no part), where you have to rely on brute-force methods such as cross-validation or the bootstrap (a sketch of a cross-validation loop is given after these points).

In practice I indeed use multiple starting points with neural networks, but all my experiments with real data show that the number of starting points is not really important and has no practical effect on the results, as long as it is kept above a minimum value (such as 5); a sketch of this restart procedure with GSL's vector BFGS minimizer is also given after these points. Moreover, there is evidence that going to the actual global minimum is _not_ a good idea for neural network training (and more generally for statistical learning) because it leads to overtraining. In general, we use a penalization term (weight decay or something more sophisticated). The main effect of this term is to give a much simpler shape to the error function and therefore to simplify the training procedure greatly. Another thing that simplifies training is the fact that neural networks are semi-parametric models to which you can easily add more power. The general effect of adding more power is to simplify the training (but not to speed it up, of course).

- of course, what I just wrote is false from a theoretical point of view. Indeed, most of the asymptotic theorems in statistical learning assume that you have indeed found the global minimum. But in practice, you can still apply them successfully with standard gradient-based training. Moreover, there are ways to make this valid from a theoretical point of view, for instance by restricting the search space.

- I think toy problems are what their name suggests: toys. They are interesting from a theoretical point of view, but they don't reflect real-world data. The main problem with your examples is that the function to learn is not smooth. No free lunch theorems are based on exactly this kind of function: given a minimisation algorithm, you can "easily" construct a problem that this algorithm will not be able to solve better than an exhaustive search. GA and NN work in practice because we apply them to smooth problems, which is not the case for the problems constructed in no free lunch theorems.

- I do agree that neural networks (and more generally statistical learning methods) work in high dimension because they are able to discover a low-dimensional structure in the data. BUT we were not talking about 1000 variables but rather 1000 _parameters_, which is obviously not the same thing. I agree that trying to model a 1000-variable problem without some special tricks is doomed to failure, but as I said before, I've used neural networks with 100 variables and 2000 weights without problems.
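
As an illustration of what that brute force looks like, here is a minimal
sketch of a k-fold cross-validation loop in C. The "model" below (which just
predicts the mean of the training targets) is only a placeholder for a real
estimation procedure such as network training, and the data and fold
assignment are made up. The point is simply that quality assessment needs K
complete re-estimations on K different splits, independently of how good each
individual optimisation is.

#include <stdio.h>

#define NDATA 100
#define KFOLD 10

/* Placeholder "model": predict the mean of the training targets.  A real
   application would train a network here; only the loop structure matters. */
static double train_mean(const double y[], const int fold_of[], int fold)
{
    double sum = 0.0;
    int count = 0;
    for (int i = 0; i < NDATA; i++)
        if (fold_of[i] != fold) {       /* train on everything outside the fold */
            sum += y[i];
            count++;
        }
    return sum / count;
}

int main(void)
{
    double y[NDATA];
    int fold_of[NDATA];
    double cv_error = 0.0;

    for (int i = 0; i < NDATA; i++) {   /* toy targets and fold assignment */
        y[i] = 0.5 * (i % 7);
        fold_of[i] = i % KFOLD;
    }

    for (int fold = 0; fold < KFOLD; fold++) {
        double pred = train_mean(y, fold_of, fold);
        double err = 0.0;
        int count = 0;
        for (int i = 0; i < NDATA; i++) /* test on the held-out fold */
            if (fold_of[i] == fold) {
                err += (y[i] - pred) * (y[i] - pred);
                count++;
            }
        cv_error += err / count;        /* mean squared error on this fold */
    }
    printf("%d-fold cross-validation MSE: %f\n", KFOLD, cv_error / KFOLD);
    return 0;
}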

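To make the restart procedure concrete, here is a minimal sketch using GSL's
vector BFGS minimizer (gsl_multimin_fdfminimizer_vector_bfgs). The objective
is only a toy stand-in for a penalized training error E(w) + lambda*||w||^2
with many local minima; a real application would plug in the network error
and its gradient, and the step size, tolerance and number of restarts are
arbitrary choices.

#include <math.h>
#include <stdio.h>
#include <gsl/gsl_errno.h>
#include <gsl/gsl_math.h>
#include <gsl/gsl_vector.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_multimin.h>

/* Toy stand-in for a penalized training error E(w) + lambda*||w||^2.
   It has many local minima, so restarting from several random points
   matters; a real network error function would go here instead. */
static double obj_f(const gsl_vector *w, void *params)
{
    double lambda = *(double *) params, e = 0.0;
    for (size_t i = 0; i < w->size; i++) {
        double wi = gsl_vector_get(w, i);
        e += cos(3.0 * wi + 1.0) + lambda * wi * wi;  /* error + weight decay */
    }
    return e;
}

static void obj_df(const gsl_vector *w, void *params, gsl_vector *g)
{
    double lambda = *(double *) params;
    for (size_t i = 0; i < w->size; i++) {
        double wi = gsl_vector_get(w, i);
        gsl_vector_set(g, i, -3.0 * sin(3.0 * wi + 1.0) + 2.0 * lambda * wi);
    }
}

static void obj_fdf(const gsl_vector *w, void *params, double *f, gsl_vector *g)
{
    *f = obj_f(w, params);
    obj_df(w, params, g);
}

int main(void)
{
    const size_t n = 10;       /* number of "weights"             */
    const int nstarts = 5;     /* a handful of restarts is enough */
    double lambda = 0.05;      /* weight-decay strength           */
    double best = GSL_POSINF;

    gsl_multimin_function_fdf F;
    F.f = obj_f;  F.df = obj_df;  F.fdf = obj_fdf;  F.n = n;  F.params = &lambda;

    gsl_multimin_fdfminimizer *s =
        gsl_multimin_fdfminimizer_alloc(gsl_multimin_fdfminimizer_vector_bfgs, n);
    gsl_rng *rng = gsl_rng_alloc(gsl_rng_mt19937);
    gsl_vector *w0 = gsl_vector_alloc(n);
    gsl_vector *w_best = gsl_vector_alloc(n);

    for (int start = 0; start < nstarts; start++) {
        /* random initial weights for this restart */
        for (size_t i = 0; i < n; i++)
            gsl_vector_set(w0, i, 4.0 * gsl_rng_uniform(rng) - 2.0);

        gsl_multimin_fdfminimizer_set(s, &F, w0, 0.01, 0.1);

        int status, iter = 0;
        do {
            status = gsl_multimin_fdfminimizer_iterate(s);
            if (status) break;   /* cannot improve: accept current point */
            status = gsl_multimin_test_gradient(
                         gsl_multimin_fdfminimizer_gradient(s), 1e-5);
        } while (status == GSL_CONTINUE && ++iter < 1000);

        double f = gsl_multimin_fdfminimizer_minimum(s);
        printf("restart %d: local minimum %.6f after %d iterations\n",
               start, f, iter);
        if (f < best) {          /* keep the best local minimum found */
            best = f;
            gsl_vector_memcpy(w_best, gsl_multimin_fdfminimizer_x(s));
        }
    }

    printf("best of %d restarts: %.6f\n", nstarts, best);

    gsl_vector_free(w0);
    gsl_vector_free(w_best);
    gsl_rng_free(rng);
    gsl_multimin_fdfminimizer_free(s);
    return 0;
}
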
Fabrice Rossi

Robert G. Brown wrote:
On Tue, 26 Aug 2003, Fabrice Rossi wrote:


Hi Robert.

In many applications, a local minimum is good enough. This is the case for instance with neural networks. It is not uncommon to use neural networks with more than 1000 numerical parameters and still to use vector BFGS. You're perfectly right to say that GLOBAL optimization is a major problem (and in fact no free lunch theorems show that even GA cannot do anything about that), but this is not always what we want to do.


I'd have to disagree, if by that you mean "just" use a gradient search
of any sort.  Or rather, whether you can or not depends strictly on your
prior knowledge of the complexity properties of the multivariate
objective function and its information-theoretic content.  In the
general case where the objective function is unknown and impossible to
functionally analyze but can still be evaluated by some procedure (e.g.
from training data and a set of trial weights in the case of a neural
network), without doing SOME sort of global search one cannot make any
statement at all about the RELATIVE quality of a minimum obtained from a
single gradient descent.

This is easily illustrated with a toy problem.  Consider the "divide-by"
problem.  If one takes a binary representation of integers with (say)
N = 16 bits and creates a training set out of a fraction of the integers
that are and aren't divisible by (say) m = 7 (say, half the integers
divisible by 7 and an equal number of numbers that aren't), it is
possible to train a classification network that correctly classifies
99% of the integers divisible by 7 (getting a mere handful wrong) in the
entire set of 16-bit ints, which is very cool in and of itself.
However, you aren't going to find the nets that "solve" for this highly
nonlinear function with a gradient descent from a single random set of
starting weights.  It requires quite a lot of intelligent exploration to
find starting networks that, when improved by gradient descent, achieve
the nearly perfect resolution of the problem.
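
For concreteness, a minimal sketch of how such a training set might be
built (the particular way of picking "half" the multiples of 7 and the
matching non-multiples is just an assumption, not rgb's actual setup):

#include <stdio.h>
#include <stdlib.h>

#define NBITS 16

/* Write the NBITS bits of n into x[0..NBITS-1], the network inputs. */
static void encode_bits(unsigned int n, double x[NBITS])
{
    for (int i = 0; i < NBITS; i++)
        x[i] = (double) ((n >> i) & 1u);
}

int main(void)
{
    double x[NBITS];
    int n_pos = 0, n_neg = 0;

    srand(1);

    /* Roughly half of the multiples of 7 become positive examples
       (target 1).  A real program would append x and the target to its
       training arrays instead of just counting. */
    for (unsigned int n = 0; n < (1u << NBITS); n += 7)
        if (rand() & 1) {
            encode_bits(n, x);
            n_pos++;
        }

    /* An equal number of non-multiples become negative examples (target 0). */
    for (unsigned int n = 1; n < (1u << NBITS) && n_neg < n_pos; n++)
        if (n % 7 != 0 && rand() % 7 == 0) {
            encode_bits(n, x);
            n_neg++;
        }

    printf("%d positive and %d negative examples out of %u integers\n",
           n_pos, n_neg, 1u << NBITS);
    return 0;
}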

This is interesting because in this case, you know precisely the
information content of the inputs: all sixteen bits. One cannot
generally tell if a binary number is divisible by seven until all of its
digits are known; all the intermediate weights from the 16 inputs have
to work together in a single, fully correlated pattern to produce the
correct result. There is no obvious decomposition of the problem into
simpler units where the network can do less global work and trivially
combine the results at the end.
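
A small arithmetic illustration of why every bit matters: n mod 7 is a
sum of per-bit residues, and since 2^i mod 7 cycles through 1, 2, 4, no
bit contributes zero, so none of the 16 inputs can be dropped.

#include <stdio.h>

/* Residue of n mod 7 computed bit by bit: 2^i mod 7 cycles through
   1, 2, 4, so every set bit adds a non-zero amount to the residue. */
static int mod7_from_bits(unsigned int n, int nbits)
{
    static const int pow2_mod7[3] = { 1, 2, 4 };
    int r = 0;
    for (int i = 0; i < nbits; i++)
        if (n & (1u << i))
            r += pow2_mod7[i % 3];
    return r % 7;
}

int main(void)
{
    /* 49 = 7*7 gives residue 0; 50 gives residue 1. */
    printf("%d %d\n", mod7_from_bits(49, 16), mod7_from_bits(50, 16));
    return 0;
}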


This solution lives in and is smoothly connected to only a tiny fraction
of the available phase volume.  Even starting off with a GA I've found
that it STILL takes multiple runs and a better-than-average GA to have a
good chance of finding one of the "solution" networks.  So forget 1000's
of variables -- gradient descent alone won't generally work even for an
easily understandable toy problem with only 16 inputs and a few tens of
hidden layer weight variables when the problem has no decompositions of
lower dimensionality.

Compare this to the divide by >>m = 2<< problem in binary (with an
obvious single-bit decomposition) where a gradient descent from nearly
anywhere will find a really perfect solution. This illustrates fairly
directly how decompositions that in general are not a priori known can
have a huge influence on whether a given methodology works at all and is
robust even for a simple "divide by" problem.

A third example from this toy problem illustrates how the methodology
used can easily de facto projectively reduce the problem's
dimensionality -- if one trains on division by 14, there exists a
trivial decomposition that immediately ELIMINATES all odd numbers
(improving classification by close to a factor of 2 compared to random
chance without trying hard at all).  Gradient descent will find this
part of the solution from nearly anywhere, and will likely obtain
something that is nearly noise on the issue of whether the remaining
even numbers are ALSO divisible by seven (it is still solving the
division by seven problem, but only on the 15 remaining bits).  One thus
sees a significant improvement in classification ability relative to
random chance, but one is (actually) rather far away from solving the
problem, although it might take you thousands of runs to learn that.
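
A sketch of that m = 14 decomposition (and, implicitly, of the trivial
m = 2 case, which is settled by the low bit alone): the low bit rejects
every odd number, and what remains is exactly the divide-by-7 problem on
the other 15 bits.

#include <stdio.h>

/* Divisibility by 14: the lowest bit alone eliminates all odd numbers
   (the "easy benefit"), leaving the hard divide-by-7 sub-problem on the
   shifted value, i.e. on the remaining 15 bits. */
static int divisible_by_14(unsigned int n)
{
    if (n & 1u)                    /* odd: cannot be divisible by 14 */
        return 0;
    return (n >> 1) % 7 == 0;      /* remaining divide-by-7 problem  */
}

int main(void)
{
    printf("%d %d %d\n",
           divisible_by_14(28),    /* 1                                      */
           divisible_by_14(35),    /* 0: odd, rejected by the low bit        */
           divisible_by_14(36));   /* 0: even, but 18 is not a multiple of 7 */
    return 0;
}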

This last example shows very clearly what a neural network trained on a
general problem using only a single gradient descent is almost certainly
doing.  It is finding a problem decomposition of much lower
dimensionality that yields an "easy benefit" by ignoring all but a few
degrees of freedom -- going quickly downhill where the going is easy,
then twisting around in the volume of phase space doing a local
optimization of the nearly irrelevant remaining variables.

To put it another way, only when you already know a great deal about the
information content, complexity and decompositions of the PARTICULAR
nonlinear problem from prior explorations do you get enough information
about the distribution of "distinct" valley minima in the nonlinear
function itself to be able to say whether the minimum obtained from a single
trial is likely to be any good.  You cannot get this information in the
most general case without "wasting" computational energy exploring
global solutions before expending the rest to locally optimize.

rgb


