This is the mail archive of the
gdb@sourceware.org
mailing list for the GDB project.
Re: Note on choosing string hash functions
- From: Pedro Alves <palves at redhat dot com>
- To: Dmitry Antipov <dantipov at nvidia dot com>, gdb at sourceware dot org
- Date: Fri, 17 Nov 2017 13:42:25 +0000
- Subject: Re: Note on choosing string hash functions
- Authentication-results: sourceware.org; auth=none
- References: <33c45098-17a4-4c8a-fb14-137e70c7bb3f@nvidia.com>
On 11/17/2017 09:48 AM, Dmitry Antipov wrote:
> I'm curious why 'htab_hash_string' (libiberty/hashtab.c) uses:
>
> r = r * 67 + c - 113
>
> but 'SYMBOL_HASH_NEXT' (gdb/minsyms.h) prefers an extra 'tolower' with:
>
> ((hash) * 67 + tolower ((unsigned char) (c)) - 113)
>
> Everyone assumes that 'tolower' is simple, fast, and usually implemented
> as a macro. But when >1M of mangled C++ symbols should be hashed, results
> may be somewhat surprising ('perf report'):
I've noticed tolower show up in perf too, along with isspace,
in strncmp_iw/strncmp_iw_with_mode and friends.
I think the tolower is there for source languages that
are case insensitive, like Ada and Fortran. (Or if you do
"set case-sensitive off", but I think that's a
misfeature...).
I haven't explicitly tried measuring perf with this patch:
https://sourceware.org/ml/gdb-patches/2017-06/msg00022.html
but I think it should improve performance significantly,
since safe-ctype.h is locale independent.
(
Of course, this ignores non-ASCII code points, but I'm not
sure we really handle non-ASCII correctly everywhere else,
i.e., whether it makes a difference today. Certainly the
locale when the program was compiled generally has no relation
to the current user's locale, even though they frequently
match. If it does make a difference, one way to handle
it that may be still cheap is to hash all non-ASCII (and utf8
multi-byte sequences) to the same, while still doing proper
lowercase comparisons when actually comparing symbols that end
up in the bucket. Something like:
#define SYMBOL_HASH_NEXT(hash, c) \
((hash) * 67 + ((c & 0x80) ? 'z' : TOLOWER ((unsigned char) (c))) - 113)
I'd assume that most symbol names are wholly or at least mostly
ASCII and that that wouldn't result in too many collisions.
But then again, I don't really know how compilers for case-insensitive
languages handle non-ASCII symbol names in different cases, especially
considering multibyte sequences.
)
>
> Observed on current binutils-gdb trunk, glibc 2.25, gcc 7.2.1, -g -O3
> -march=native -mtune=native,
> "remote" (both gdb and gdbserver on the same system) debugging Firefox
> debug build, reports was generated
> with 'perf record -F 10000 gdb -batch --readnever -q -ex "set sysroot /"
> -ex "set pagination off" \
> -ex "target extended-remote :5555" -ex "thread apply all bt" -ex "quit"'.
Thanks,
Pedro Alves