This is the mail archive of the mailing list for the GDB project.

Re: printing wchar_t*

On Monday 17 April 2006 12:35, Eli Zaretskii wrote:

> >   - If it sees \x, look at the following hex digits and convert it to
> > either code point or code unit
> >   - If it sees anything else, convert it from local 8-bit to Unicode
> That's what Jim was saying.  He thought (or so it seemed to me) that,
> once the ASCII-encoded string was read by the front end and converted
> back to the integer values, the job is done.  That is, in Jim's
> example with L"123\x0f04\x0fccxyz", the character `1' is converted to
> its code 49 decimal, \x0f04 is converted to the 16-bit code 3844
> decimal, `x' is converted to 120 decimal, etc.
> What I was saying that indeed this conversion is easy, but it's not
> even close to doing what the front end generally would like to do with
> the string.  You want to _process_ the string, which means you want to
> know its length in characters (not bytes), you want to know what
> character set they encode, you want to be able to find the n-th
> character in the string, etc.  The encoding suggested by Jim makes
> these tasks very hard, much harder than if we send the string as an
> array of fixed-length wide characters.

That's a *completely* different topic. First, the frontend needs to get the
data, in whatever form. Using \x escapes is just as suitable as using a list
of hex values -- the two approaches are isomorphic. Second, the frontend needs
to display the data, but it will do so using its own data structures, and it
does not matter whether \x escapes were used or not. No frontend will ever
work on a string containing embedded "\x" escapes.

> > Note that due to charset function interface using 'int', you can't use
> > UTF-8 for encoding passed to frontend, but using ASCII + \x is still
> > feasible.
> I don't understand why UTF-8 cannot be used (an int can hold an 8-bit
> byte just fine), 

An int can't hold 6 bytes, at least on common machines. And the interface in
charset.h requires that the result of converting one host character to one
target character fit into an int. Anyway, I don't think charset.h was designed
with Unicode in mind, so we should probably stop discussing it.

> > There's one nice thing about this approach. If there's new 'print array
> > until XX" syntax, I indeed need to special-case processing of values in
> > several contexts -- most notably arguments in stack trace. With "\x"
> > escapes I'd need to write a code to handle them once. In fact, I can add
> > this code right to MI parser (which operates using Unicode-enabled
> > QString class already). That will be more convenient than invoking 'print
> > array' for any wchar_t* I ever see.
> I don't think we should optimize GDB for one specific toolkit, even if
> that toolkit is Qt.

Replace QString with Gtkmm::ustring and the same argument holds. Whatever
string type is used inside the frontend to represent Unicode strings, you can
perform the conversion from \x escapes to that string class in one place,
rather than doing it separately inside the variable display widget, inside
the stack display widget, and so on.
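
As a minimal sketch of doing that conversion in one place, a frontend-side
decoder in Python might look like the following. It assumes a single-byte
local encoding and escapes of exactly four hex digits; both are assumptions
made for illustration, since the thread does not fix an output format. The
name decode_escaped is hypothetical.

```python
def decode_escaped(data, local_encoding="ascii"):
    """Turn bytes holding literal characters plus \\xHHHH escapes into a str."""
    chars = []
    i = 0
    while i < len(data):
        if data[i:i + 2] == b"\\x":
            # Assumed format: exactly four hex digits follow the "\x".
            hex_digits = data[i + 2:i + 6].decode("ascii")
            chars.append(chr(int(hex_digits, 16)))
            i += 6
        else:
            # Anything else is one character in the local 8-bit encoding.
            chars.append(data[i:i + 1].decode(local_encoding))
            i += 1
    return "".join(chars)


text = decode_escaped(b"123\\x0f04\\x0fccxyz")
print([ord(ch) for ch in text])  # [49, 50, 51, 3844, 4044, 120, 121, 122]
```

A fixed digit count is assumed because a variable-length escape followed by a
literal character such as 'a' would otherwise be ambiguous. Values are
treated as code points; combining UTF-16 surrogate pairs is left out.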

> > I don't quite get. First you say you want \x05D2 to display using Unicode
> > font on console, now you say it's very hard.
> No, I said that a GUI front end will be able to display the _binary_
> _code_ 0x05D2 with a suitable Unicode font.  Jim suggested that seeing
> the _string_ "\x05D2" in GDB's output will allow me to read the text,
> to which I replied that it will not be easy at all, since humans
> generally don't remember Unicode codepoints by heart, even for their
> native languages.

Ok, seeing the string "\x05D2" will be sufficient for frontend.

> > > GDB cannot be asked to know about all of those complications, but I
> > > think it should at least provide a few simple translation services so
> > > that a front end will not have to work too hard to handle and display
> > > strings as mostly readable text.  Passing the characters as fixed-size
> > > codepoints expressed as ASCII hex strings leaves the front-end with
> > > only very simple job.  What's more, it uses an existing feature: array
> > > printing.
> >
> > Using \x escapes, provided they encode *code units*, leaves frontend with
> > the same simple job.
> Yes, but GDB will need to generate the code units first, e.g. convert
> fixed-size Unicode wide characters into UTF-8.  

Sorry, where does that UTF-8 come from? If you generate ASCII + \x escapes,
you don't need UTF-8.

> That's extra job for 
> GDB.  (Again, we were originally talking about wchar_t, not multibyte
> strings.)

I don't understand what this extra job is. It is as simple as:

   for c in wchar_t* literal:
       if c is representable in host encoding:
           print c as a literal character
       else:
           print c as a \x escape
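
The loop above can be sketched as runnable Python. This is only an
illustration, not GDB code: the function name escape_wide_string, the
four-digit escape width, and the choice to escape backslashes and
non-printable characters are all assumptions made here.

```python
def escape_wide_string(code_units, host_encoding="ascii"):
    """Render wide-character values as host text plus \\xHHHH escapes."""
    out = []
    for c in code_units:
        ch = chr(c)
        try:
            # "Representable" here means the host encoding can encode it.
            ch.encode(host_encoding)
            literal_ok = ch.isprintable() and ch != "\\"
        except UnicodeEncodeError:
            literal_ok = False
        # Assumed format: four hex digits, i.e. 16-bit wchar_t values.
        out.append(ch if literal_ok else "\\x%04x" % c)
    return "".join(out)


print(escape_wide_string([0x31, 0x32, 0x33, 0x0F04, 0x0FCC, 0x78, 0x79, 0x7A]))
# prints 123\x0f04\x0fccxyz
```

With host_encoding="koi8_r" the same function would print Cyrillic letters
literally and still fall back to escapes for everything else.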

> > Really, using strings with \x escapes differs from array
> > printing in just one point: some characters are printed not as hex
> > values, but as characters in local 8-bit encoding. Why do you think this
> > is a problem?
> Because knowing what is the ``local 8-bit encoding'' is in itself a
> huge problem.  Emacs is trying to solve it since 1996, and it still
> haven't got all the details right in some marginal cases, although we
> have people on the Emacs development team who understand more about
> i18n than I ever will.  In short, there's no reliable method of
> finding out what is the correct 8-bit encoding in which to talk to any
> given text-mode display.

I trust you on that, but nothing prevents the user/frontend from explicitly
specifying the encoding.

> And you certainly do NOT want any local 8-bit encodings when you are
> going to display the string on a GUI, because that would require that
> the front end does some extra job of converting the encoded text back
> to what it needs to communicate with the text widgets.

I would expect that any GUI toolkit that claims to support Unicode *has* to
support conversion from local 8-bit encodings. Otherwise, such a toolkit is
of no use in the real world.

By the way, unless your target encoding is ASCII, the frontend has to be aware
of the local 8-bit encoding anyway. If I write a program using KOI8-R and the
frontend shows the char* (not wchar_t*) strings as ASCII, the frontend is
broken.
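
A small Python illustration of that point, as a sketch only: the same bytes
are readable when decoded as KOI8-R and degrade to escapes when treated as
ASCII. The helper name show_char_string and the sample word are made up for
this example.

```python
# Bytes as a program using KOI8-R would store the string in a char* buffer.
raw = "Привет".encode("koi8_r")


def show_char_string(data, encoding):
    """Decode target bytes; undecodable bytes become \\xNN escapes."""
    return data.decode(encoding, errors="backslashreplace")


print(show_char_string(raw, "koi8_r"))  # readable Cyrillic text
print(show_char_string(raw, "ascii"))   # only \xNN escapes, one per byte
```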

> > > And why are you talking about host character set?  The
> > > L"123\x0f04\x0fccxyz" string came from the target, GDB simply
> > > converted it to 7-bit ASCII.  These are characters from the target
> > > character set.  And the target doesn't necessarily talk in the host
> > > locale's character set and language, you could be debugging a program
> > > which talks Farsi with GDB that runs in a German locale.
> >
> > So, characters that happen to exist in German locale are printed as
> > literal chars. Other characters are printed using \x. FE reads the
> > string, and when it sees literal char, it converts it from German locale
> > to Unicode used internally. Where's the problem?
> If this conversion is lossless, it's redundant.  It is easier to just
> send everything as hex escapes, since no human will see them, only the
> FE.  This saves the needless conversion (and potential problems with
> incorrect notion of the current locale and encoding).

Well, using a string with just hex escapes is fine for the frontend. It might
not be as fine for the user.

> But some conversions to ``literal characters'' (i.e. to 8-bit binary
> codes) are lossy, because the underlying converter needs state
> information to correctly interpret the byte stream.  This state
> information is thrown away once the conversion is done, and so the
> opposite conversion fails to reconstruct the original codepoints.
> This is usually the case with ISO-2022 encodings.
> So I think on balance it's better to send the original wide characters
> as hex, the only downside being that it uses more bytes per character.
> (Again, this is about GUI front ends, not about GDB's own CLI output
> routines.)

Well, I'd prefer to address one problem at a time:

1. GDB should be modified to print wchar_t* literals. It should use the same
logic as for char* to decide if a value is representable in the host charset,
and use \x escapes otherwise.

2. If you believe that using literals is not suitable for MI, that can be a 
separate change.

- Volodya
