This is the mail archive of the
gdb-patches@sourceware.org
mailing list for the GDB project.
Re: [PATCH 1/3] 0xff chars in name components table; cp-name-parser lex UTF-8 identifiers
On 2017-11-20 06:56 AM, Pedro Alves wrote:
>>> +/* Starting from a search name, return the string that finds the upper
>>> + bound of all strings that start with SEARCH_NAME in a sorted name
>>> + list. Returns the empty string to indicate that the upper bound is
>>> + the end of the list. */
>>> +
>>> +static std::string
>>> +make_sort_after_prefix_name (const char *search_name)
>>> +{
>>> + /* When looking to complete "func", we find the upper bound of all
>>> + symbols that start with "func" by looking for where we'd insert
>>> + "func"-with-last-character-incremented, i.e. "fund". */
>>> + std::string after = search_name;
>>> +
>>> + /* Mind 0xff though, which is a valid character in non-UTF-8 source
>>> + character sets (e.g. Latin1 'ÿ'), and we can't rule out compilers
>>> + allowing it in identifiers. If we run into it, increment the
>>> + previous character instead and shorten the string. If the very
>>> + first character turns out to be 0xff, then the upper bound is the
>>> + end of the list.
>>
>> It's a bit of a nit, but I think this explanation could be a bit more
>> precise, and maybe simpler. Maybe you could just say that you strip all
>> trailing 0xff characters, and increment the last non-0xff character of
>> the string. If the string is composed only of 0xff characters, then the
>> upper bound is the end of the list.
>
> My problem with that is that it wouldn't explain _why_ we strip
> the 0xffs.
Right, the comment should say why, not how.
>>
>> The "If the very first character turns out to be 0xff" threw me off a bit,
>> because if you have the string "\xffa\xff", the upper bound will be "\xffb",
>> not the end of the list, despite the very first character being 0xff.
>
> I like that example. How about the following. It's even longer, but
> I think it's justified.
>
> /* Starting from a search name, return the string that finds the upper
> bound of all strings that start with SEARCH_NAME in a sorted name
> list. Returns the empty string to indicate that the upper bound is
> the end of the list. */
>
> static std::string
> make_sort_after_prefix_name (const char *search_name)
> {
> /* When looking to complete "func", we find the upper bound of all
> symbols that start with "func" by looking for where we'd insert
> the closest string that would follow "func" in lexicographical
> order. Usually, that's "func"-with-last-character-incremented,
> i.e. "fund". Mind non-ASCII characters, though. Usually those
> will be UTF-8 multi-byte sequences, but we can't be certain.
> Especially mind the 0xff character, which is a valid character in
> non-UTF-8 source character sets (e.g. Latin1 'ÿ'), and we can't
> rule out compilers allowing it in identifiers. Note that
> conveniently, strcmp/strcasecmp are specified to compare
> characters interpreted as unsigned char. So what we do is treat
> the whole string as a base 255 number composed of a sequence of
> base 255 "digits" and add 1 to it. I.e., adding 1 to 0xff wraps
> to 0, and carries 1 to the following more-significant position.
> If the very first character carries/overflows, then the upper
> bound is the end of the list. Also the string after the empty
> string is also the empty string.
Making an analogy with base-10 arithmetic is actually what made me
understand it. The number after 149 is not 140, it's 150. We're
doing the string equivalent of that. Your explanation with base-255
numbers is very good. It doesn't really work for all-0xff strings,
because adding one (with carry) to "\xff\xff" would give "\x01\x00\x00",
but it doesn't really matter for the explanation :).
Simon