This is the mail archive of the
gdb-prs@sourceware.org
mailing list for the GDB project.
[Bug python/17138] New: C strings, gdb.Value.__str__, and Python 3
- From: "naesten at gmail dot com" <sourceware-bugzilla at sourceware dot org>
- To: gdb-prs at sourceware dot org
- Date: Thu, 10 Jul 2014 02:46:21 +0000
- Subject: [Bug python/17138] New: C strings, gdb.Value.__str__, and Python 3
- Auto-submitted: auto-generated
https://sourceware.org/bugzilla/show_bug.cgi?id=17138
Bug ID: 17138
Summary: C strings, gdb.Value.__str__, and Python 3
Product: gdb
Version: 7.7
Status: NEW
Severity: normal
Priority: P2
Component: python
Assignee: unassigned at sourceware dot org
Reporter: naesten at gmail dot com
I wanted to see how GDB's Python support dealt with strange C strings in Python
3 (using a build of 7.7.1 based on the Debian packaging git).
Now, I should probably start by reminding everyone that Python 3 has changed
the rules for strings. In Python 2, the "" syntax and the corresponding
function-like class, str(), implicitly refer to byte strings of no particular
encoding. In Python 3 they refer to Unicode strings, though internally the code
units can still be 1, 2, or 4 bytes wide. Python 3 (and 2.6+) also has a b""
syntax and a bytes() type for strings of bytes. These may sometimes resemble
text but should not be confused with actual text; when they really do
represent text, they should be decoded.
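To make the distinction concrete, here is a small sketch in plain Python 3 (no
GDB involved; the variable names are only illustrative):

```python
# Python 3: "" literals are Unicode text (str); b"" literals are raw bytes.
text = "foo\x80"        # 4 code points; the last one is U+0080
raw = b"foo\x80"        # 4 bytes; the last one is the byte 0x80

assert isinstance(text, str) and isinstance(raw, bytes)
assert text != raw      # str and bytes never compare equal in Python 3

# Going from bytes to text requires naming an encoding explicitly.
decoded = raw.decode("latin-1")   # latin-1 maps byte N to code point N
print(decoded == text)
```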
Note: Probably all of the str() calls in the following are technically
redundant with the use of print(), but for clarity I will include them anyway.
The parentheses around the argument to print are mandatory in Python 3, as the
print keyword has been replaced by a builtin function.
The first thing I tried had what looked like VERY strange results:
(gdb) python print(str(gdb.parse_and_eval('"foo\x80"')))
"foo\302\200"
(gdb)
... until I realized that the escape was presumably being handled by Python
here, and so was treated as referring to U+0080, so GDB just encoded it as
UTF-8 before trying to parse it, with the obvious results.
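That reading can be checked in plain Python 3 without GDB. The sketch below
assumes, as guessed above, that the expression text is encoded as UTF-8 before
GDB parses it; the variable names are illustrative only:

```python
# Inside a normal (non-raw) Python literal, \x80 is consumed by Python,
# so the string handed to GDB already contains the character U+0080.
expr = '"foo\x80"'
assert len(expr) == 6 and ord(expr[4]) == 0x80

# Encoding that text as UTF-8 turns U+0080 into the two bytes C2 80 ...
encoded = expr.encode("utf-8")
assert encoded == b'"foo\xc2\x80"'

# ... which, written as octal escapes, are \302 and \200.
octal = "".join("\\%o" % b for b in encoded[4:6])
print(octal)
```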
So next I tried:
(gdb) python print(str(gdb.parse_and_eval('"foo\\x80"')))
"foo\200"
(gdb)
... which looks like GDB just invented UCS-1.
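For comparison, this is what Python hands to GDB in this second attempt. It is
a plain Python 3 sketch; that GDB's own parser then turns the escape into a
single byte is my inference from the output above:

```python
# With the backslash doubled, Python leaves the escape alone, so GDB
# receives the four characters  \ x 8 0  and interprets them itself.
expr = '"foo\\x80"'
assert expr == r'"foo\x80"'
assert len(expr) == 9

# The \200 in GDB's output is octal notation for the byte 0x80.
assert 0o200 == 0x80
```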
I also tried calling functions like len() and bytes() on these char* values,
only to find that they were not implemented.
Around this point, I decided to consult the documentation, which I discovered
did not mention the __str__() method *anywhere*, but did talk of a string()
method, so I tried that out instead:
(gdb) python print(gdb.parse_and_eval('"foo\\x80"').string())
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3: invalid start byte
Error while executing Python code.
(gdb)
At last, results that I can actually understand! (Unhelpful though they may
be.)
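The same error can be reproduced outside GDB, assuming string() ends up
decoding the target bytes as UTF-8 (which the codec name in the traceback
suggests):

```python
raw = b"foo\x80"   # bytes of the C string, without the terminating NUL

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0x80 is a continuation byte and cannot begin a UTF-8 sequence.
    message = str(exc)
    print(message)
```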
Are we sure this is the right default here? Might it not make more sense to
return bytes unless specifically asked for an encoding?
At the very least, we should definitely provide a way to get uninterpreted
bytes in a bytes() object for Python 2.6+.
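In the meantime, one possible workaround is to round-trip through latin-1,
which maps every byte to exactly one code point. The following is only a
sketch: value_bytes() and FakeValue are hypothetical names of mine, it relies
on the documented encoding and length arguments of gdb.Value.string(), and I
have only run it against the stand-in class, not inside GDB.

```python
def value_bytes(value, length=-1):
    """Return the bytes behind a char* value (hypothetical helper).

    Assumes `value` offers gdb.Value.string(encoding, errors, length)
    as documented; latin-1 maps each byte to one code point, so the
    round trip is lossless.
    """
    return value.string(encoding="latin-1", length=length).encode("latin-1")


class FakeValue:
    """Stand-in for gdb.Value so the sketch runs outside GDB."""

    def __init__(self, raw):
        self._raw = raw

    def string(self, encoding="utf-8", errors="strict", length=-1):
        data = self._raw if length < 0 else self._raw[:length]
        return data.decode(encoding, errors)


print(value_bytes(FakeValue(b"foo\x80")))
```

Inside GDB the call would presumably look like
value_bytes(gdb.parse_and_eval('"foo\\x80"')).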