This is the mail archive of the archer@sourceware.org mailing list for the Archer project.


Re: Python pretty-printers and non-ASCII strings do not play well together :-(

From: Paul Pluzhnikov <ppluzhnikov at google dot com>
To: Tom Tromey <tromey at redhat dot com>
Cc: archer at sourceware dot org
Date: Tue, 4 Nov 2008 17:39:02 -0800
Subject: Re: Python pretty-printers and non-ASCII strings do not play well together :-(
References: <20081104192834.32F4D3A6B0B@localhost> <m3fxm7nuoo.fsf@fleche.redhat.com> <8ac60eac0811041159r14232e8br688601e55f6bf313@mail.gmail.com> <m38wrzm1ik.fsf@fleche.redhat.com>

On Tue, Nov 4, 2008 at 4:58 PM, Tom Tromey <tromey@redhat.com> wrote:

> Tom> What should happen here, though?  The string contains invalid
> Tom> characters for its declared (via set target-charset) encoding.
>
> Paul> As an end-user, I would expect something like
> Paul>   $2 = <"\xef\xcd\xab">
>
> It occurs to me I am not completely certain where this error
> originates.  My theory is that it is the call to PyUnicode_Decode in
> valpy_str.

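The PyUnicode_Decode() call in valpy_str succeeds and returns a valid
PyObject. The failure comes later, in PyObject_Str, where
PyUnicode_AsEncodedString() is called on that object and returns NULL.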

Here is the trace of this happening:

Breakpoint 1, valpy_str (self=0x2aaaaaae7250) at
../../gdb/python/python-value.c:246
246       char *s = NULL;
(top) n
253       stb = mem_fileopen ();
(top)
254       old_chain = make_cleanup_ui_file_delete (stb);
(top)
256       TRY_CATCH (except, RETURN_MASK_ALL)
(top)
258           common_val_print (((value_object *) self)->value, stb, 0, 0, 0,
(top)
260           s = ui_file_xstrdup (stb, &dummy);
(top)
256       TRY_CATCH (except, RETURN_MASK_ALL)
(top) p s
$4 = 0xb04c90 "\"ïÍ\""
(top) n
262       GDB_PY_HANDLE_EXCEPTION (except);
(top)
264       do_cleanups (old_chain);
(top)
266       result = PyUnicode_Decode (s, strlen (s), host_charset (), NULL);
(top)
267       xfree (s);
(top) p result
$5 = (PyObject *) 0x2aaaaab71a80
(top) n
269       return result;
(top)
270     }
(top)

### Now return into Python interpreter ###

PyObject_Str (v=<value optimized out>) at ../../Objects/object.c:361
361             if (res == NULL)
(top)
360             res = (*v->ob_type->tp_str)(v);
(top)
361             if (res == NULL)
(top) p res
$6 = (PyObject *) 0x2aaaaab71a80
(top) n
364             if (PyUnicode_Check(res)) {
(top)
366                     str = PyUnicode_AsEncodedString(res, NULL, NULL);
(top)
367                     Py_DECREF(res);
(top) p str
$7 = (PyObject *) 0x0
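
A plausible reading of the trace above, sketched in Python 3 rather than through the Python 2.5 C API. It rests on two assumptions that the trace does not confirm: that host_charset() is ISO-8859-1 here, and that passing NULL as the encoding makes PyUnicode_AsEncodedString() fall back to the interpreter's default codec, which is ASCII on a stock Python 2. Under those assumptions the decode step cannot fail, but the encode step must:

```python
# Sketch only: Python 3 stand-ins for the two C API calls seen in the trace.
# Assumption: host charset is ISO-8859-1; default codec is ASCII.
raw = b"\xef\xcd\xab"

# Stand-in for PyUnicode_Decode(s, strlen(s), host_charset(), NULL).
# Every byte value maps to a code point in ISO-8859-1, so this always succeeds.
text = raw.decode("iso-8859-1")

# Stand-in for PyUnicode_AsEncodedString(res, NULL, NULL) with an ASCII default.
# The code points U+00EF, U+00CD, U+00AB are outside ASCII, so this raises.
try:
    text.encode("ascii")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

print(encode_failed)
```

In the C API the same failure shows up as a NULL return with a Python exception set, which would match $7 above.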

> If so, then we aren't seeing a value representation problem, which is
> what I was worried about.  Instead, I think common_val_print is
> emitting a string which is not actually valid according to
> host_charset.  That seems wrong.
>
> We could work around this in valpy_str, I suppose.  But I'm curious to
> know why this is happening -- why isn't common_val_print printing the
> escape sequences itself?

I don't see any escape sequences here.
Note that 'raw' GDB (without Python) doesn't print any escape sequences
either, just the raw contents of the buffer.

> My guess is that the target and host charsets are the same, and
> charset.c is passing characters through without checking them for
> validity.  I didn't debug it, but when I set host-charset to ASCII (my
> target-charset is ISO-8859-1), I do see the escapes.
>
> Every time I look at this stuff I'm reminded that the gdb charset code
> could use a good scrubbing.  For example, the default host charset
> ought to come from the locale settings.  I have a patch to implement
> this, but there's no point submitting it since it breaks gdb on
> typical Linux systems -- most people use UTF-8 locales, but gdb
> doesn't handle UTF-8.
>
> Maybe we should just install a smart Python printer for 'char *' ;-)
>
> Paul> What are some of the good Python references?
> Tom> http://www.python.org/doc/2.5.2/api/api.html
>
> Paul> Yes, I've seen the above, but it didn't have the answers I was
> Paul> looking for :(
>
> What do you want to know?  Both Thiago and I have worked in this area,
> maybe one of us knows.

How to turn raw buffer contents containing unprintable characters into
something that prints as "\xef\xcd\xab" :)

Or: what is PyUnicode_AsEncodedString() actually supposed to do?
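
For the first question, a minimal sketch in Python 3 of two ways to get the "\xef\xcd\xab" form. Neither is taken from gdb or from this thread: escape_raw is a hypothetical helper name, and the choice to escape the quote and backslash bytes as \x22 and \x5c is an arbitrary simplification. The second approach uses the standard "backslashreplace" error handler, and assumes the bytes were decoded as ISO-8859-1 so that each byte maps to one code point below U+0100.

```python
def escape_raw(raw):
    """Hypothetical helper: quote a byte buffer, writing every byte that is
    not plain printable ASCII (plus '"' and '\\') as a \\xNN escape."""
    pieces = []
    for byte in raw:
        if 0x20 <= byte < 0x7F and byte not in (0x22, 0x5C):
            pieces.append(chr(byte))
        else:
            pieces.append("\\x%02x" % byte)
    return '"' + "".join(pieces) + '"'


# Approach 1: escape the raw bytes directly, without decoding at all.
print(escape_raw(b"\xef\xcd\xab"))

# Approach 2: decode first, then let the codec machinery escape whatever
# the target encoding cannot represent.
text = b"\xef\xcd\xab".decode("iso-8859-1")
escaped = text.encode("ascii", "backslashreplace")
print(escaped.decode("ascii"))
```

The first approach never depends on any charset, so it cannot fail on invalid input. The second only reproduces the original byte values when the decode step was a one-to-one byte mapping such as ISO-8859-1; with a multi-byte charset the escapes would show code points, not buffer bytes.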

-- 
Paul Pluzhnikov

Follow-Ups:
- Re: Python pretty-printers and non-ASCII strings do not play well together :-(
  - From: Thiago Jung Bauermann

References:
- Python pretty-printers and non-ASCII strings do not play well together :-(
  - From: Paul Pluzhnikov
- Re: Python pretty-printers and non-ASCII strings do not play well together :-(
  - From: Tom Tromey
- Re: Python pretty-printers and non-ASCII strings do not play well together :-(
  - From: Paul Pluzhnikov
- Re: Python pretty-printers and non-ASCII strings do not play well together :-(
  - From: Tom Tromey
