This is the mail archive of the gdb@sourceware.org mailing list for the GDB project.

Re: printing wchar_t*

From: Vladimir Prus <ghost at cs dot msu dot su>
To: Eli Zaretskii <eliz at gnu dot org>
Cc: jimb at red-bean dot com, gdb at sources dot redhat dot com
Date: Mon, 17 Apr 2006 16:16:26 +0400
Subject: Re: printing wchar_t*
References: <e1lsqg$aml$1@sea.gmane.org> <200604171301.59881.ghost@cs.msu.su> <uzmikxxab.fsf@gnu.org>

On Monday 17 April 2006 15:21, Eli Zaretskii wrote:

> > > What I was saying that indeed this conversion is easy, but it's not
> > > even close to doing what the front end generally would like to do with
> > > the string.  You want to _process_ the string, which means you want to
> > > know its length in characters (not bytes), you want to know what
> > > character set they encode, you want to be able to find the n-th
> > > character in the string, etc.  The encoding suggested by Jim makes
> > > these tasks very hard, much harder than if we send the string as an
> > > array of fixed-length wide characters.
> >
> > That's a *completely* different topic.
>
> Yes, it is.  But we must keep it in mind because the front ends want
> strings to do something with them.

Eli, I think we're running in circles. I'd like to reiterate what I ideally
want from gdb:

  1. For any wchar_t* value, be it value of a variable, or function
     parameter three levels up the stack, or member of structure, I want
     gdb to print that value in specific format that's easy for frontend
     to use. String with escapes is fine.
  2. I want that formatting to take effect both for MI commands and for
     'print' command, since the user can issue 'print' command manually.
  3. I don't mind having this behaviour only when --interpreter=mi is
     specified.

I think the two questions we did not agree on are:

  1. When talking to the FE, should literals be used at all, or should the
     string consist of just \x escapes?
  2. When talking to the user, should we use string literals, or just \x
     escapes?

I hope you'll agree that using \x escapes when talking to the user is not
acceptable. And since gdb right now assumes the ASCII charset for output, I
don't think there will be any problems if ASCII characters are output as-is,
without escaping.

> > Second, frontend needs to display the data, however it will operate
> > using its own data structures, and it does not matter if \x escapes
> > were used or not. No frontend will ever work on a string containing
> > embedded "\x" escapes.
>
> I was saying that the ASCII encoding suggested by Jim makes it harder
> to convert the text into wide characters, that's all.

I don't see why that is so, but never mind.

> > > That's extra job for
> > > GDB.  (Again, we were originally talking about wchar_t, not multibyte
> > > strings.)
> >
> > I don't understand what's this extra job. This is as simple as:
> >
> >    for c in wchar_t* literal:
> >        if c is representable in host encoding:
> >             output_literal
> >        else
> >             output_hex_escape
>
> That might sound simple for you, but it isn't, in general.  The
> ``representable in host encoding'' part is very non-trivial; for
> example, how do you tell whether the Unicode codepoints 0x05C3 and
> 0x05C4 can be represented in the Windows codepage 1255 (the former
> can, the latter cannot)?  This is generally impossible without using
> very complicated algorithms and/or large data bases.
>
> The other complex part is ``output_literal'': again, there's no simple
> algorithm to map Unicode's 0x05C3 into cp1255's 0xD3.  You need tables
> again, and you need separate tables for each possible encoding (Hebrew
> has at least 3 widely used ones, Russian has at least 5, etc.).

iconv has those tables. You see problems where there are none.
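
To make the loop quoted above concrete, here is a minimal sketch in Python. It
is not gdb code: Python's built-in codec tables stand in for the iconv tables,
and the function name, the L"..." wrapping and the four-digit \x format are
assumptions made only for illustration.

```python
def format_wide_string(code_points, host_encoding):
    """Render code points as a quoted string, hex-escaping anything the
    host charset cannot represent (hypothetical helper, not a gdb API)."""
    out = []
    for cp in code_points:
        ch = chr(cp)
        try:
            # The codec's table answers "is this representable?" for us.
            ch.encode(host_encoding)
            representable = True
        except UnicodeEncodeError:
            representable = False
        # Also escape quotes, backslashes and non-printable characters so
        # the result stays an unambiguous literal.
        if representable and ch.isprintable() and ch not in '"\\':
            out.append(ch)
        else:
            out.append("\\x%04x" % cp)
    return 'L"' + "".join(out) + '"'
```

With the code points from Eli's example and cp1255 as the host encoding,
0x05C3 comes out as a literal character and 0x05C4 as \x05c4. With "ascii" as
the host encoding, both are escaped.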

> > > > Really, using strings with \x escapes differs from array
> > > > printing in just one point: some characters are printed not as hex
> > > > values, but as characters in local 8-bit encoding. Why do you think
> > > > this is a problem?
> > >
> > > Because knowing what is the ``local 8-bit encoding'' is in itself a
> > > huge problem.
> >
> > [...]
> > I trust you on that, but nothing prevents user/frontend to explicitly
> > specify the encoding.
>
> What makes you think the user and/or front end will know what to
> specify?  Experience shows they generally don't.

First you say it's not possible to detect the encoding from the environment.
Then you say you can't trust the user/frontend. Together, that sounds like
making gdb print char* literals reliably is impossible. Is that what you're
trying to say?

> > 1. GDB should be modified to print wchar_t* literals.
>
> ``Print'' is ambiguous in this context.  I believe you mean ``send to
> the front end'', since this was your original problem.  If the front
> end is charged with displaying the wchar_t strings, GDB does not need
> to print anything by itself.  Am I right?
>
> > It should use the same
> > logic as for char* to decide if value is representable in the host
> > charset,
>
> I hope I explained above why this part is highly non-trivial.  

Using existing logic is in fact absolutely trivial -- that logic already 
*exists*, you don't need to do anything. 

> That is 
> why I think GDB should use hex notation for all characters, and leave
> it for the FE to deal with their display.

I disagree, for the simple reason that for char* values, the existing logic
did not cause any problems. Also, while I can take a stab at wchar_t* output,
I would not be comfortable with special-casing wchar_t* output for the
frontend.

- Volodya

Follow-Ups:
- Re: printing wchar_t*
  - From: Eli Zaretskii

References:
- printing wchar_t*
  - From: Vladimir Prus
- Re: printing wchar_t*
  - From: Vladimir Prus
- Re: printing wchar_t*
  - From: Eli Zaretskii
