This is the mail archive of the
mailing list for the Archer project.
Re: Python pretty-printers and non-ASCII strings do not play well together :-(
- From: Doug Evans <dje at google dot com>
- To: Tom Tromey <tromey at redhat dot com>
- Cc: Paul Pluzhnikov <ppluzhnikov at google dot com>, archer at sourceware dot org
- Date: Wed, 5 Nov 2008 13:52:46 -0800
- Subject: Re: Python pretty-printers and non-ASCII strings do not play well together :-(
On Tue, Nov 4, 2008 at 4:58 PM, Tom Tromey <email@example.com> wrote:
> Tom> What should happen here, though? The string contains invalid
> Tom> characters for its declared (via set target-charset) encoding.
> Paul> As an end-user, I would expect something like
> Paul> $2 = <"\xef\xcd\xab">
> It occurs to me I am not completely certain where this error
> originates. My theory is that it is the call to PyUnicode_Decode in
> If so, then we aren't seeing a value representation problem, which is
> what I was worried about. Instead, I think common_val_print is
> emitting a string which is not actually valid according to
> host_charset. That seems wrong.
> We could work around this in valpy_str, I suppose. But I'm curious to
> know why this is happening -- why isn't common_val_print printing the
> escape sequences itself?
> My guess is that the target and host charsets are the same, and
> charset.c is passing characters through without checking them for
> validity. I didn't debug it, but when I set host-charset to ASCII (my
> target-charset is ISO-8859-1), I do see the escapes.
> Every time I look at this stuff I'm reminded that the gdb charset code
> could use a good scrubbing. For example, the default host charset
> ought to come from the locale settings. I have a patch to implement
> this, but there's no point submitting it since it breaks gdb on
> typical Linux systems -- most people use UTF-8 locales, but gdb
> doesn't handle UTF-8.
> Maybe we should just install a smart Python printer for 'char *' ;-)
It seems(!) like the right solution is to make gdb Unicode-aware. That
might mean using UTF-8 internally and converting only at the
boundaries; I don't know.
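
For illustration only, here is a minimal sketch of the "convert at the boundaries" idea in plain Python, not gdb's C code. The names to_internal and to_host are hypothetical and are not part of gdb or its Python API. It assumes Python 3.5 or later, where the "backslashreplace" error handler also works for decoding. It uses the three bytes from Paul's example and an ASCII charset, the setting under which Tom reported seeing the escapes.

```python
# Hypothetical helpers: decode once on the way in, encode once on the way out.

def to_internal(raw: bytes, charset: str) -> str:
    # Bytes that are invalid in `charset` are kept as \xNN escapes
    # instead of raising an error.
    return raw.decode(charset, errors="backslashreplace")


def to_host(text: str, host_charset: str) -> bytes:
    # Characters the host charset cannot represent are emitted as escapes.
    return text.encode(host_charset, errors="backslashreplace")


raw = b"\xef\xcd\xab"

# A strict decode rejects the bytes outright when the charset is ASCII.
try:
    raw.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

escaped = to_internal(raw, "ascii")       # every byte is invalid ASCII
latin1 = to_internal(raw, "iso-8859-1")   # every byte is valid Latin-1

print(strict_failed)
print(escaped)
print(to_host(latin1, "ascii"))
```

With one internal representation, the question of whether to escape a byte is decided in only two places, on input and on output, rather than in each printing routine.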