
[Bug runtime/19802] New: bad hash value distribution and horrible performance for large arrays with multiple small-integer indices


https://sourceware.org/bugzilla/show_bug.cgi?id=19802

            Bug ID: 19802
           Summary: bad hash value distribution and horrible performance
                    for large arrays with multiple small-integer indices
           Product: systemtap
           Version: unspecified
            Status: NEW
          Severity: normal
          Priority: P2
         Component: runtime
          Assignee: systemtap at sourceware dot org
          Reporter: raeburn at permabit dot com
  Target Milestone: ---

Created attachment 9082
  --> https://sourceware.org/bugzilla/attachment.cgi?id=9082&action=edit
stap script to report on hash-value distribution for array indexing

I've been trying to figure out why a big SystemTap script of mine is so slow.
Running "perf top" against it shows that it's spending a huge amount of time
walking through the entries of a single bucket in an associative array (one
holding aggregates, if that matters).

It's a big array. I've declared it as holding 500000 elements, and after
noticing that MAXMAPENTRIES also controls the range of hash values, I set that
to 500000 as well. But the script still spends a huge amount of time in that
small loop. This made me question whether the distribution across buckets was
good.

In my script I'm using two small integers as keys; the first counts up from one
and sometimes reaches 10k or so, and the second never gets above about 70.
(Some combinations of the two index values are never used, and we trim some of
the array contents periodically, so we shouldn't be filling the array.)

So I wrote a loop using embedded C to pass pairs of small integers to hash_iix
and process the results. They were disappointing: while the values varied from
run to run because of seeding, within each run the low bits changed very
little.

In one run, testing hash_iix(i,j) for i in 0..2000 and j in 0..70, with
MAXMAPENTRIES=500000 (so 19 bits of hash value and a table size of 524288, more
than 3x the number of entries that would supposedly be stored in the array),
about 47 of the hash values came up 35 times each.
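
For reference, here's a minimal standalone sketch of that kind of census (not
the attached stap script). It assumes hash_64 is the kernel's
multiply-and-shift version using the GOLDEN_RATIO_PRIME_64 constant quoted
further down, models the combining step as a shift-left-and-XOR per the
behavior described in this report, and omits systemtap's per-run seeding, so
exact counts will differ:

   /* census.c: approximate the hash_iix bucket-distribution test */
   #include <stdio.h>
   #include <stdint.h>

   #define GOLDEN_RATIO_PRIME_64 0x9e37fffffffc0001UL
   #define HASH_BITS 19                 /* 19 bits of hash value, as above */
   #define TABLE_SIZE (1UL << HASH_BITS)

   /* assumed form of the kernel's hash_64() */
   static uint64_t hash_64(uint64_t val, unsigned int bits)
   {
       return (val * GOLDEN_RATIO_PRIME_64) >> (64 - bits);
   }

   int main(void)
   {
       static unsigned int counts[TABLE_SIZE];
       unsigned int worst = 0;

       for (uint64_t i = 0; i <= 2000; i++)
           for (uint64_t j = 0; j <= 70; j++) {
               /* combine the per-key hashes by shift-and-XOR, then
                * mask down to the table size */
               uint64_t hv = ((hash_64(i, HASH_BITS) << 1)
                              ^ hash_64(j, HASH_BITS)) & (TABLE_SIZE - 1);
               counts[hv]++;
           }

       for (uint64_t b = 0; b < TABLE_SIZE; b++)
           if (counts[b] > worst)
               worst = counts[b];
       printf("%d entries, %lu buckets, worst bucket: %u entries\n",
              2001 * 71, TABLE_SIZE, worst);
       return 0;
   }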

I see two particular problems:

First, the hash function in map-gen.c is not a terribly good mixing function.
If the per-key hashes vary only in a few bits, especially if those bit ranges
overlap between keys, the results won't be well distributed. And with several
keys, it shifts the earlier result to the left and XORs in the next hash, which
means the first key doesn't contribute at all to the lower bits of the output.
(If the hash table size were prime it would be another matter, but it's a power
of two, so the modulus operation just masks off the high bits.)
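
In sketch form (the exact combine in map-gen.c is paraphrased here from the
description above, so treat the details as an assumption): per-key hashes that
share a fixed high part and vary only in their low bits combine into values
that also vary only in their low bits, and the first key never reaches bit 0
at all:

   /* combine_sketch.c: shift-and-XOR keeps variation in the low bits */
   #include <stdio.h>
   #include <stdint.h>

   static uint64_t combine(uint64_t h1, uint64_t h2)
   {
       /* earlier result shifted left, next per-key hash XORed in;
        * bit 0 of the result is bit 0 of h2 alone */
       return (h1 << 1) ^ h2;
   }

   int main(void)
   {
       /* per-key hashes that differ from 0x70000 only in bits 0..3 */
       for (unsigned int d = 0; d < 4; d++) {
           uint64_t hv = combine(0x70000 | d, 0x70000 | ((d * 3) & 0xf));
           printf("combined: 0x%05lx\n", (unsigned long)hv);
       }
       return 0;
   }

Every value printed is 0x900xx, so masking down to the table size lands them
all in a handful of adjacent buckets.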

Second, hash_64 isn't a good choice of hash function when the input bits don't
vary much, you need a lot of output bits, and you don't follow it with a good
mixing function.

A hash table size of 524288 should provide adequate room for the 140000
possible index combinations.

But hash_64 returns a truncated product of the input with:

   /*  2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */
   #define GOLDEN_RATIO_PRIME_64 0x9e37fffffffc0001UL

So asking for 19 bits back from hash_64 discards the low 45 bits of the
product, and the contribution from the 2^51-and-higher terms doesn't really
affect the lowest 6 of the remaining bits: those terms produce a multiple of
2^51, so they can reach bits 45..50 only through borrows. Those bits are
influenced only by the "-2^18+1" part of the multiplier. But if the input
varies in fewer than about 26 low bits, its product with "-2^18+1" won't vary
in high enough bits to make much difference (if any) after the right shift.

Basically, the low bits coming back from hash_64 don't vary in this test.
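
That claim is easy to check outside the runtime. A minimal demonstration,
again assuming the multiply-and-shift form of hash_64 and ignoring seeding:

   /* lowbits.c: low output bits of hash_64 are constant for small inputs */
   #include <stdio.h>
   #include <stdint.h>

   #define GOLDEN_RATIO_PRIME_64 0x9e37fffffffc0001UL

   int main(void)
   {
       for (uint64_t val = 1; val <= 70; val++) {
           /* ask for 19 bits: keep bits 45..63 of the product */
           uint64_t h = (val * GOLDEN_RATIO_PRIME_64) >> (64 - 19);
           /* the 2^51-and-up terms contribute a multiple of 2^51, and
            * the borrow from subtracting val*2^18 leaves bits 45..50
            * all ones, so this prints 0x3f for every val here */
           printf("val=%2llu  low 6 bits: 0x%02llx\n",
                  (unsigned long long)val,
                  (unsigned long long)(h & 0x3f));
       }
       return 0;
   }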

Do this for two keys with small ranges, combine them poorly, and, as described
above, the XOR preserves the fixedness of the low bits.

There are tweaks that could be made to propagate variation in a few bits
throughout the result, but it might be better to borrow from people who've
already gone down that road and incorporate code from CityHash or one of the
other fast hashes with suitable licensing.

In my case, I've found a workaround: if instead of using "stat[A,B]" (A
counting up from 1, B in 1..70) I use:

  stat[(A * 128 + B) << 26]

... then the distribution is much more even, and the SystemTap probe routine
I'm using is no longer taking half a core's worth of CPU time in the thread
that triggers it.
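
A plausible account of why this helps, under the same assumed form of hash_64:
B stays below 128, so A * 128 + B packs the two keys into one integer without
overlap, and shifting left by 26 moves the varying bits up to bit 26 and
beyond, where their product with the multiplier's 2^18 term lands on bits 45
and up, inside the 19-bit window the right shift keeps (the 2^51-and-higher
terms shift off the top of 64 bits entirely for multiples of 2^26). Rerunning
the census sketch from above with the packed key:

   /* packed.c: bucket census with the (A * 128 + B) << 26 workaround */
   #include <stdio.h>
   #include <stdint.h>

   #define GOLDEN_RATIO_PRIME_64 0x9e37fffffffc0001UL
   #define HASH_BITS 19
   #define TABLE_SIZE (1UL << HASH_BITS)

   int main(void)
   {
       static unsigned int counts[TABLE_SIZE];
       unsigned int worst = 0;

       for (uint64_t a = 1; a <= 2000; a++)
           for (uint64_t b = 1; b <= 70; b++) {
               uint64_t packed = (a * 128 + b) << 26;
               uint64_t h = (packed * GOLDEN_RATIO_PRIME_64)
                            >> (64 - HASH_BITS);
               counts[h]++;
           }

       for (uint64_t i = 0; i < TABLE_SIZE; i++)
           if (counts[i] > worst)
               worst = counts[i];
       /* with these ranges this sketch reports a worst bucket of 2 */
       printf("worst bucket with packed keys: %u entries\n", worst);
       return 0;
   }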

