This is the mail archive of the gdb-prs@sourceware.org mailing list for the GDB project.



[Bug python/17138] New: C strings, gdb.Value.__str__, and Python 3


https://sourceware.org/bugzilla/show_bug.cgi?id=17138

            Bug ID: 17138
           Summary: C strings, gdb.Value.__str__, and Python 3
           Product: gdb
           Version: 7.7
            Status: NEW
          Severity: normal
          Priority: P2
         Component: python
          Assignee: unassigned at sourceware dot org
          Reporter: naesten at gmail dot com

I wanted to see how GDB's Python support dealt with strange C strings in Python
3 (using a build of 7.7.1 based on the Debian packaging git).

Now, I should probably start by reminding everyone that Python 3 has changed
the rules for strings. In Python 2, the "" syntax and the corresponding
function-like class, str(), implicitly refer to byte strings of no particular
encoding; in Python 3 they refer to Unicode strings, though code units can
nevertheless be 1, 2, or 4 bytes long.  Python 3 (and 2.6+) has a new b""
syntax and bytes() type for strings of bytes (which, while they might sometimes
resemble text, should never be confused with actual text, unless of course they
actually do represent text, in which case they should be decoded).
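
For anyone who wants to see the distinction concretely, here is a minimal
sketch in plain Python 3 (no GDB involved). The variable names are mine and
purely illustrative.

```python
# Python 3: str holds Unicode code points, bytes holds raw octets.
text = "foo\u00e9"            # four code points: f, o, o, e-acute
data = text.encode("utf-8")   # five bytes: e-acute takes two bytes in UTF-8

# Decoding is the explicit step that turns bytes back into text.
round_trip = data.decode("utf-8")

print(type(text).__name__, len(text))
print(type(data).__name__, len(data))
print(round_trip == text)
```

The length difference is the whole point: a str counts characters, a bytes
object counts octets, and converting between them always involves an encoding.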

Note: Probably all of the str() calls in the following are technically
redundant with the use of print(), but for clarity I will include them anyway. 
The parentheses around the argument to print are mandatory in Python 3, as the
print keyword has been replaced by a builtin function.

The first thing I tried had what looked like VERY strange results:

(gdb) python print(str(gdb.parse_and_eval('"foo\x80"')))
"foo\302\200"
(gdb) 

... until I realized that the escape was presumably being handled by Python
here, and so was treated as referring to U+0080, so GDB just encoded it as
UTF-8 before trying to parse it, with the obvious results.
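
The arithmetic behind that output can be checked outside GDB. This is only a
sketch: it assumes the string is encoded as UTF-8 on its way to the expression
parser, which is my guess from the output rather than something I verified in
the GDB source. The octal_escape helper is hypothetical and merely mimics the
style of the display.

```python
def octal_escape(data):
    """Render bytes with octal escapes for non-ASCII octets (illustrative only)."""
    return "".join(chr(b) if b < 0x80 else "\\%o" % b for b in data)

# With a single backslash, Python itself turns \x80 into the code point U+0080.
py_string = "foo\x80"

# Encoding U+0080 as UTF-8 yields the two bytes 0xC2 0x80.
encoded = py_string.encode("utf-8")

print(encoded)
print(octal_escape(encoded))
```

0xC2 is 302 in octal and 0x80 is 200, which matches the "foo\302\200" seen
above.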

So next I tried:

(gdb) python print(str(gdb.parse_and_eval('"foo\\x80"')))
"foo\200"
(gdb) 

... which looks like GDB just invented UCS-1.
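
To make the difference between the two attempts explicit, here is a small
plain-Python sketch of what each string literal actually contains before GDB
ever sees it. The last part assumes that a C parser reading the escape produces
the single byte 0x80; octal_escape is a hypothetical helper of mine that
imitates the display.

```python
def octal_escape(data):
    """Render bytes with octal escapes for non-ASCII octets (illustrative only)."""
    return "".join(chr(b) if b < 0x80 else "\\%o" % b for b in data)

# Single backslash: Python consumes the escape and produces U+0080.
one_backslash = '"foo\x80"'

# Double backslash: Python passes a literal backslash, x, 8, 0 through.
two_backslashes = '"foo\\x80"'

print(len(one_backslash), len(two_backslashes))

# Assumed result of a C parser handling the escape itself: one byte, 0x80.
c_bytes = b"foo\x80"
print(octal_escape(c_bytes))
```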

I also tried calling functions like len() and bytes() on these char* values,
only to find that they were not implemented.

Around this point, I decided to consult the documentation, which I discovered
did not mention the __str__() method *anywhere*, but did talk of a string()
method, so I tried that out instead:

(gdb) python print(gdb.parse_and_eval('"foo\\x80"').string())
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3: invalid start byte
Error while executing Python code.
(gdb) 

At last, results that I can actually understand!  (Unhelpful though they may
be.)
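
The same failure is easy to reproduce with plain Python 3 bytes, along with a
few ways around it. This is a sketch using only the standard codecs. The GDB
manual documents optional encoding and errors arguments to string(); that they
behave exactly like the plain decode() calls below is an assumption on my part.

```python
raw = b"foo\x80"   # the four bytes of the C string, without the terminator

# Strict UTF-8 decoding fails: 0x80 is a continuation byte with no lead byte.
error = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc

# Latin-1 maps every byte to the code point with the same number, so it never fails.
as_latin1 = raw.decode("latin-1")

# errors="replace" substitutes U+FFFD for each undecodable byte (lossy).
replaced = raw.decode("utf-8", errors="replace")

# errors="surrogateescape" is lossless: the bad byte survives a round trip.
escaped = raw.decode("utf-8", errors="surrogateescape")
restored = escaped.encode("utf-8", errors="surrogateescape")

print(error)
print(repr(as_latin1), repr(replaced), repr(escaped), restored == raw)
```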

Are we sure this is the right default here?  Might it not make more sense to
return bytes unless specifically asked for an encoding?

At the very least, we should definitely provide a way to get the uninterpreted
bytes as a bytes object in Python 2.6+.
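
As a rough illustration of what such an interface could do, here is a sketch of
a helper that collects the bytes of a NUL-terminated string through a
caller-supplied memory reader. The names read_c_string_bytes and
fake_read_memory are hypothetical. Inside GDB, I would expect
gdb.selected_inferior().read_memory to serve as the reader for a char* that
points into inferior memory, but I have not tested that; the stand-in below
only simulates it.

```python
def read_c_string_bytes(read_memory, address, limit=4096):
    """Return the bytes of a NUL-terminated string, without the terminator.

    read_memory(address, length) must return a bytes-like object.
    Reads one byte at a time so it never touches memory past the terminator.
    """
    out = bytearray()
    for offset in range(limit):
        byte = bytes(read_memory(address + offset, 1))
        if byte == b"\x00":
            break
        out += byte
    return bytes(out)

# Stand-in for inferior memory so the sketch runs outside GDB.
FAKE_BASE = 0x1000
FAKE_MEMORY = b"foo\x80\x00junk"

def fake_read_memory(address, length):
    start = address - FAKE_BASE
    return memoryview(FAKE_MEMORY[start:start + length])

result = read_c_string_bytes(fake_read_memory, FAKE_BASE)
print(result)
```

Returning bytes leaves the choice of encoding, if any, to the caller, which is
the behaviour I am arguing for.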


