
Re: hash table subtleties


Tel <telford@eng.uts.edu.au> writes:

 > > The two key assumptions are that the hash function approximates a
 > > random variable and that the input data is randomly distributed.
 > > Neither is really the case.
 > 
 > I see only one assumption here: the idea of a good hash algorithm
 > is that it converts non-random input data into what looks like a random
 > variable. The distribution of the input data really should not matter.

Well, yes, when phrased that way.  But the hash function *is* a
function, so the output distribution is a function of the input
distribution.  The real issue is how close to uniformly distributed
the output is for the given input distribution.
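
To make this concrete, here is a minimal sketch in plain Scheme; the
toy-hash procedure (just modulo) and the key sets are made up for
illustration, not Guile's real hashing.  It counts how many keys land
in each bucket, so you can see the input distribution showing through
in the output:

(define (toy-hash key n) (modulo key n))

(define (bucket-counts keys n)
  ;; Return a vector of length n whose i-th entry is the number of
  ;; keys that hash to bucket i.
  (let ((counts (make-vector n 0)))
    (for-each
     (lambda (k)
       (let ((i (toy-hash k n)))
         (vector-set! counts i (+ 1 (vector-ref counts i)))))
     keys)
    counts))

;; Evenly spread keys fill the buckets evenly...
(bucket-counts '(1 2 3 4 5 6 7 8) 8)      ; => #(1 1 1 1 1 1 1 1)
;; ...but keys that happen to all be multiples of 8 pile into one:
(bucket-counts '(8 16 24 32 40 48 56) 8)  ; => #(7 0 0 0 0 0 0 0)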

<snip>

 > OK, how about this proposal (for those who panic about the worst case):
 > Define ``bucket vector utilisation'' as the number of non-empty buckets
 > divided by the total number of buckets (not a difficult number to keep
 > track of).
 > Determine an upper and lower bound for bucket vector utilisation,
 > grow when you get bigger than the upper bound, shrink when you get less
 > than the lower bound (thus still giving some hysteresis).
 > 
 > In the average case (based on the probability stuff) bucket vector
 > utilisation will be 1 minus the probability of an empty bucket, from
 > Harvey's figures:
 > 
 > table utilisation              bucket vector utilisation
 >       1.0                          0.635
 >       2.0                          0.867
 >       3.0                          0.952
 > 
 > So on the average, choosing upper and lower bounds for bucket
 > vector utilisation is much the same as choosing upper and lower
 > bounds for table utilisation. You have to make sure that your
 > upper bound is less than 1.0! Finding really good values may take
 > some fiddling but you have the average case covered and you have
 > some theory to make you feel righteous.
 > 
 > On the other hand, in the worst case of all keys falling into
 > the same bucket, the bucket vector utilisation won't change, so
 > the table will never grow (i.e. linear performance, but no memory wasted).

This is no good.  All the keys can be in the same bucket but have
different hash values.  In that case you do want to double the hash
table size.  This situation is easy to hit with small hash tables -
with 4 buckets, the inverse of the hash function divides the key space
into 4 sets, which in the best case are of equal size.  If your keys
happen to all come from the same set, then you'll never resize, but
you should.
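
Just to spell out that failure mode, here is a small sketch; the
bucket index is plain modulo and the hash values 1, 5, 9 and 13 are
made up for illustration:

(define (bucket-of hash-value n-buckets)
  ;; Map a hash value to a bucket index by taking it modulo the
  ;; number of buckets, as a simple chaining table would.
  (modulo hash-value n-buckets))

(map (lambda (h) (bucket-of h 4))  '(1 5 9 13))  ; => (1 1 1 1)
(map (lambda (h) (bucket-of h 8))  '(1 5 9 13))  ; => (1 5 1 5)
(map (lambda (h) (bucket-of h 16)) '(1 5 9 13))  ; => (1 5 9 13)

With 4 buckets only one of them is ever non-empty, so the bucket
vector utilisation sits at 0.25 no matter how many such keys you add,
and the proposed grow rule never fires, even though doubling twice
would spread the four keys into four different buckets.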

If you resize on average bucket utilization then your table will keep
growing, but the bucket vector stays within a small constant factor of
number_of_keys/max_bucket_utilization buckets.  So although it can be
large and mostly empty, it only costs a small constant factor more
memory than the keys, values & cons cells hooking them up already use.
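
For what it's worth, here is a sketch of that grow rule; the upper
bound of 3 keys per bucket and the doubling step are illustrative
choices, not Guile's actual policy:

(define max-utilization 3)

(define (maybe-grow n-keys n-buckets)
  ;; Double the bucket count whenever the average number of keys per
  ;; bucket exceeds the bound; return the (possibly new) bucket count.
  (if (> (/ n-keys n-buckets) max-utilization)
      (* 2 n-buckets)
      n-buckets))

(maybe-grow 10 4)  ; => 4  (average utilization 2.5, below the bound)
(maybe-grow 13 4)  ; => 8  (average utilization 3.25, so double)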

-- 
Harvey J. Stein
BFM Financial Research
hjstein@bfr.co.il