This is the mail archive of the gdb@sourceware.org mailing list for the GDB project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: Note on choosing string hash functions

From: Pedro Alves <palves at redhat dot com>
To: Dmitry Antipov <dantipov at nvidia dot com>, gdb at sourceware dot org
Date: Fri, 17 Nov 2017 13:42:25 +0000
Subject: Re: Note on choosing string hash functions
Authentication-results: sourceware.org; auth=none
References: <33c45098-17a4-4c8a-fb14-137e70c7bb3f@nvidia.com>

On 11/17/2017 09:48 AM, Dmitry Antipov wrote:
> I'm curious why 'htab_hash_string' (libiberty/hashtab.c) uses:
> 
> r = r * 67 + c - 113
> 
> but 'SYMBOL_HASH_NEXT' (gdb/minsyms.h) prefers an extra 'tolower' with:
> 
> ((hash) * 67 + tolower ((unsigned char) (c)) - 113)
> 
> Everyone assumes that 'tolower' is simple, fast, and usually implemented
> as a macro. But when >1M of mangled C++ symbols should be hashed, results
> may be somewhat surprising ('perf report'):

I've noticed tolower show up in perf too, along with isspace,
in strncmp_iw/strncmp_iw_with_mode and friends.

I think the tolower is there for source languages that
are case insensitive, like Ada and Fortran.  (Or if you do
"set case-sensitive off", but I think that's a
misfeature...).

I haven't explicitly tried measuring perf with this patch:

 https://sourceware.org/ml/gdb-patches/2017-06/msg00022.html

but I think it should improve performance significantly,
since safe-ctype.h is locale independent.

(
Of course, this ignores non-ASCII code points, but I'm not
sure we really handle non-ASCII correctly everywhere else,
i.e., whether it makes a difference today.  Certainly the
locale when the program was compiled generally has no relation
to the current user's locale, even though they frequently
match.  If it does make a difference, one way to handle
it that may be still cheap is to hash all non-ASCII (and utf8
multi-byte sequences) to the same, while still doing proper
lowercase comparisons when actually comparing symbols that end
up in the bucket.  Something like:

#define SYMBOL_HASH_NEXT(hash, c)			\
  ((hash) * 67 + ((c & 0x80) ? 'z' : TOLOWER ((unsigned char) (c))) - 113)

I'd assume that most symbol names are wholly or at least mostly
ASCII and that that wouldn't result in too many collisions.

But then again, I don't really know how compilers for case-insensitive
languages handle non-ASCII symbol names in different cases, especially
considering multibyte sequences.
)

> 
> Observed on current binutils-gdb trunk, glibc 2.25, gcc 7.2.1, -g -O3
> -march=native -mtune=native,
> "remote" (both gdb and gdbserver on the same system) debugging Firefox
> debug build, reports was generated
> with 'perf record -F 10000 gdb -batch --readnever -q -ex "set sysroot /"
> -ex "set pagination off" \
> -ex "target extended-remote :5555" -ex "thread apply all bt" -ex "quit"'.

Thanks,
Pedro Alves

Follow-Ups:
- Re: Note on choosing string hash functions
  - From: Pedro Alves

References:
- Note on choosing string hash functions
  - From: Dmitry Antipov

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]