This is the mail archive of the guile@cygnus.com mailing list for the guile project.
Tel <telford@eng.uts.edu.au> writes:

> > The two key assumptions are that the hash function approximates a
> > random variable and that the input data is randomly distributed.
> > Neither is really the case.
>
> I see only one assumption here; the idea of a good hash algorithm
> is that it converts non-random input data into what looks like a random
> variable. The distribution of the input data really should not matter.

Well, yes, when phrased that way. But the hash function *is* a function, so the output distribution is a function of the input distribution. So it's actually a question of how close to uniformly distributed the output is on the given input distribution.

<snip>

> OK, how about this proposal (for those who panic about the worst case):
> Define ``bucket vector utilisation'' as the number of non-empty buckets
> divided by the total number of buckets (not a difficult number to keep
> track of). Determine an upper and a lower bound for bucket vector
> utilisation: grow when you get bigger than the upper bound, shrink when
> you get less than the lower bound (thus still giving some hysteresis).
>
> In the average case (based on the probability stuff), bucket vector
> utilisation will be 1 minus the probability of an empty bucket. From
> Harvey's figures:
>
>     table utilisation    bucket vector utilisation
>     1.0                  0.635
>     2.0                  0.867
>     3.0                  0.952
>
> So on average, choosing upper and lower bounds for bucket vector
> utilisation is much the same as choosing upper and lower bounds for
> table utilisation. You have to make sure that your upper bound is less
> than 1.0! Finding really good values may take some fiddling, but you
> have the average case covered and you have some theory to make you
> feel righteous.
>
> On the other hand, in the worst case of all keys falling into the same
> bucket, the bucket vector utilisation won't change, so the table will
> never grow (i.e. linear performance with no memory loss).

This is no good.
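As an aside, the quoted average-case figures are easy to check: with n keys hashed uniformly into m buckets, the chance a given bucket is empty is (1 - 1/m)^n, which tends to e^(-load) for large m. A minimal sketch (Python here for illustration; the table size m = 64 is an assumption, not something the figures above state):

```python
import math

def bucket_vector_utilisation(load, m=64):
    """Expected fraction of non-empty buckets when load*m keys are
    hashed uniformly into m buckets (m = 64 is an assumed table size).
    P(a given bucket is empty) = (1 - 1/m)^n, so
    E[utilisation] = 1 - (1 - 1/m)^n  ->  1 - e^(-load) as m grows."""
    n = round(load * m)
    return 1.0 - (1.0 - 1.0 / m) ** n

for load in (1.0, 2.0, 3.0):
    exact = bucket_vector_utilisation(load)
    limit = 1.0 - math.exp(-load)
    # Both land within a few thousandths of the quoted 0.635 / 0.867 / 0.952.
    print(f"load {load:.1f}: {exact:.3f} (limit {limit:.3f})")
```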
All the keys can be in the same bucket but have different hash values. In this case you do want to double the hash table size. This will happen fairly easily with small hash tables - with 4 buckets, the inverse of the hash function divides the key space into 4 sets, which in the best case are of equal size. If your keys happen to be such that they all come from the same set, then you'll never resize, but you should.

If you resize on average bucket utilization, then your table will keep growing, but it'll always have <= number_of_keys/max_bucket_utilization elements, so although it can be large & mostly empty, it's only increasing memory usage by a small constant factor (times the amount of memory used by the keys, values & cons cells hooking them up).

--
Harvey J. Stein
BFM Financial Research
hjstein@bfr.co.il
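The two growth triggers discussed in this thread can be contrasted on exactly this adversarial case. Below is a hypothetical sketch (Python, with made-up class and threshold names; neither poster gives an implementation): 1024 distinct hash values, all congruent mod 4, so they share one bucket in a 4-bucket table. Growing on average bucket utilization (load factor) spreads them out as the modulus grows; growing on bucket vector utilisation never fires, since only 1 of 4 buckets is ever non-empty.

```python
class Table:
    """Toy chained hash table; grow_on is "load" (avg bucket utilisation,
    i.e. keys/buckets) or "vector" (non-empty buckets / total buckets).
    Thresholds 3.0 and 0.95 are illustrative choices, not from the thread."""

    def __init__(self, grow_on):
        self.grow_on = grow_on
        self.nbuckets = 4
        self.buckets = [[] for _ in range(4)]
        self.n = 0

    def insert(self, h):
        self.buckets[h % self.nbuckets].append(h)
        self.n += 1
        if self.grow_on == "load":
            trigger = self.n / self.nbuckets > 3.0
        else:
            nonempty = sum(1 for b in self.buckets if b)
            trigger = nonempty / self.nbuckets > 0.95
        if trigger:
            self._grow()

    def _grow(self):
        # Double and rehash; distinct hash values separate under the
        # larger modulus even though they collided under the old one.
        self.nbuckets *= 2
        old = [h for b in self.buckets for h in b]
        self.buckets = [[] for _ in range(self.nbuckets)]
        for h in old:
            self.buckets[h % self.nbuckets].append(h)

# 1024 distinct hash values, all = 0 mod 4: one chain in a 4-bucket table.
keys = [i * 4 for i in range(1024)]

load_t, vector_t = Table("load"), Table("vector")
for h in keys:
    load_t.insert(h)
    vector_t.insert(h)

print(load_t.nbuckets)    # grows: stays within a constant factor of n
print(vector_t.nbuckets)  # stays at 4: vector utilisation never moves
```

With the load-factor trigger the table ends at 512 buckets for 1024 keys, i.e. within the small constant factor of the key count described above; with the vector trigger it stays at 4 buckets and one 1024-element chain.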