This is the mail archive of the gdb-patches@sourceware.org mailing list for the GDB project.


Re: [PATCH 1/3] 0xff chars in name components table; cp-name-parser lex UTF-8 identifiers

From: Simon Marchi <simon dot marchi at ericsson dot com>
To: Pedro Alves <palves at redhat dot com>, Simon Marchi <simark at simark dot ca>, <gdb-patches at sourceware dot org>
Date: Mon, 20 Nov 2017 11:50:11 -0500
Subject: Re: [PATCH 1/3] 0xff chars in name components table; cp-name-parser lex UTF-8 identifiers
References: <5d721d13-d886-0400-db6b-76485c545142@redhat.com> <1511138515-25996-1-git-send-email-palves@redhat.com> <087a3f24-13ec-77b3-3b2b-fff1d0814ec1@simark.ca> <db3aa595-1d35-51f5-bd34-1aa04707a6c8@redhat.com>

On 2017-11-20 06:56 AM, Pedro Alves wrote:
>>> +/* Starting from a search name, return the string that finds the upper
>>> +   bound of all strings that start with SEARCH_NAME in a sorted name
>>> +   list.  Returns the empty string to indicate that the upper bound is
>>> +   the end of the list.  */
>>> +
>>> +static std::string
>>> +make_sort_after_prefix_name (const char *search_name)
>>> +{
>>> +  /* When looking to complete "func", we find the upper bound of all
>>> +     symbols that start with "func" by looking for where we'd insert
>>> +     "func"-with-last-character-incremented, i.e. "fund".  */
>>> +  std::string after = search_name;
>>> +
>>> +  /* Mind 0xff though, which is a valid character in non-UTF-8 source
>>> +     character sets (e.g. Latin1 'ÿ'), and we can't rule out compilers
>>> +     allowing it in identifiers.  If we run into it, increment the
>>> +     previous character instead and shorten the string.  If the very
>>> +     first character turns out to be 0xff, then the upper bound is the
>>> +     end of the list.
>>
>> It's a bit of a nit, but I think this explanation could be a bit more
>> precise, and maybe simpler.  Maybe you could just say that you strip all
>> trailing 0xff characters, and increment the last non-0xff character of
>> the string.  If the string is composed only of 0xff characters, then the
>> upper bound is the end of the list.
> 
> My problem with that is that it wouldn't explain _why_ we strip
> the 0xffs.

Right, the comment should say why, not how.

>>
>> The "If the very first character turns out to be 0xff" threw me off a bit,
>> because if you have the string "\xffa\xff", the upper bound will be "\xffb",
>> not the end of the list, despite the very first character being 0xff.
> 
> I like that example.  How about the following.  It's even longer, but
> I think it's justified.
> 
> /* Starting from a search name, return the string that finds the upper
>    bound of all strings that start with SEARCH_NAME in a sorted name
>    list.  Returns the empty string to indicate that the upper bound is
>    the end of the list.  */
> 
> static std::string
> make_sort_after_prefix_name (const char *search_name)
> {
>   /* When looking to complete "func", we find the upper bound of all
>      symbols that start with "func" by looking for where we'd insert
>      the closest string that would follow "func" in lexicographical
>      order.  Usually, that's "func"-with-last-character-incremented,
>      i.e. "fund".  Mind non-ASCII characters, though.  Usually those
>      will be UTF-8 multi-byte sequences, but we can't be certain.
>      Especially mind the 0xff character, which is a valid character in
>      non-UTF-8 source character sets (e.g. Latin1 'ÿ'), and we can't
>      rule out compilers allowing it in identifiers.  Note that
>      conveniently, strcmp/strcasecmp are specified to compare
>      characters interpreted as unsigned char.  So what we do is treat
>      the whole string as a base 256 number composed of a sequence of
>      base 256 "digits" and add 1 to it.  I.e., adding 1 to 0xff wraps
>      to 0, and carries 1 to the following more-significant position.
>      If the very first character carries/overflows, then the upper
>      bound is the end of the list.  The string after the empty
>      string is also the empty string.

Making an analogy with base-10 arithmetic is actually what made me
understand it.  The number after 149 is not 140, it's 150.  We're
doing the string equivalent of that.  Your explanation with base-256
numbers is very good.  It doesn't really work for all-0xff strings,
because adding one (with carry) to "\xff\xff" would give "\x01\x00\x00",
but it doesn't really matter for the explanation :).
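
To make the carry behaviour concrete, here is a minimal sketch of the
idea discussed above.  It is not the code from the patch: the names
sort_after_prefix and count_with_prefix are hypothetical, and the
sketch assumes the name list is sorted with strcmp.  Trailing 0xff
bytes are dropped (the carry), the last remaining byte is incremented,
and an empty result stands for "the upper bound is the end of the
list", which also covers the all-0xff case mentioned above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical sketch, not the function from the patch.  Return a
// string that sorts after every string starting with PREFIX when
// bytes compare as unsigned char.  An empty result means there is no
// such string, i.e. the upper bound is the end of the list.
static std::string
sort_after_prefix (const std::string &prefix)
{
  std::string after = prefix;

  // A trailing 0xff byte cannot be incremented, so drop it and let
  // the carry move to the preceding byte.
  while (!after.empty () && (unsigned char) after.back () == 0xff)
    after.pop_back ();

  // Increment the last remaining byte, if there is one.
  if (!after.empty ())
    after.back () = (char) ((unsigned char) after.back () + 1);

  return after;
}

// Hypothetical helper: count the entries of SORTED_NAMES (assumed to
// be sorted with strcmp) that start with PREFIX.
static std::size_t
count_with_prefix (const std::vector<std::string> &sorted_names,
                   const std::string &prefix)
{
  // strcmp compares characters interpreted as unsigned char.
  auto less = [] (const std::string &a, const std::string &b)
    { return std::strcmp (a.c_str (), b.c_str ()) < 0; };

  assert (std::is_sorted (sorted_names.begin (), sorted_names.end (), less));

  // First entry that is not less than PREFIX.
  auto lower = std::lower_bound (sorted_names.begin (),
                                 sorted_names.end (), prefix, less);

  // First entry that is not less than the "after" string, or the end
  // of the list when no such string exists.
  std::string after = sort_after_prefix (prefix);
  auto upper = (after.empty ()
                ? sorted_names.end ()
                : std::lower_bound (lower, sorted_names.end (),
                                    after, less));

  return static_cast<std::size_t> (upper - lower);
}
```

With this sketch, "func" maps to "fund", the "\xffa\xff" example from
earlier in the thread maps to "\xffb", and both "\xff\xff" and the
empty string map to the empty string.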

Simon

Follow-Ups:
- Re: [PATCH 1/3] 0xff chars in name components table; cp-name-parser lex UTF-8 identifiers
  - From: Pedro Alves

References:
- Re: [PATCH 26/40] Optimize .gdb_index symbol name searching
  - From: Pedro Alves
- [PATCH 1/3] 0xff chars in name components table; cp-name-parser lex UTF-8 identifiers
  - From: Pedro Alves
- Re: [PATCH 1/3] 0xff chars in name components table; cp-name-parser lex UTF-8 identifiers
  - From: Simon Marchi
- Re: [PATCH 1/3] 0xff chars in name components table; cp-name-parser lex UTF-8 identifiers
  - From: Pedro Alves
