This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: regression caused by fix of bug #13691
- From: Bruno Haible <bruno at clisp dot org>
- To: Tulio Magno Quites Machado Filho <tuliom at linux dot vnet dot ibm dot com>
- Cc: libc-alpha at sourceware dot org
- Date: Mon, 14 May 2012 04:31:31 +0200
- Subject: Re: regression caused by fix of bug #13691
- Bcc: bruno at haible dot de
- References: <1995140.sSugJaaxUI@linuix>
I wrote:
> the most important
> place to fix is the mbrtowc() behaviour. But this is also the most
> difficult one. I cannot see how to make the following requirements
> coexist:
>
> * mbrtowc(&wc, "A", 1, &ps) shall set wc = L'A'.
>
> * mbrtowc(&wc, "A\xb0", 2, &ps) shall set wc = 0x00C0
> (LATIN CAPITAL LETTER A WITH GRAVE)
>
> * mbrtowc can be used to process a string byte by byte; it returns
> (size_t) -2 when a byte sequence is incomplete. In particular, this
> means that the sequence of calls
> mbrtowc(&wc, "A", 1, &ps) => -2
> mbrtowc(&wc, "\xb0", 1, &ps) => 1, wc = 0x00C0
> produces an intermediate -2 without setting wc, and only the second
> call sets wc.
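To make the third requirement concrete, here is a minimal sketch of a caller
that feeds mbrtowc() one byte at a time. The helper name feed_bytewise is
hypothetical and not part of glibc; only mbrtowc() and mbstate_t are standard.
The sketch assumes that a return value of (size_t) -2 leaves wc untouched and
keeps the pending bytes in ps, which is what the requirement relies on.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not from glibc): feed N bytes to mbrtowc() one at a
   time and store each completed wide character in OUT.  Returns the number
   of wide characters stored, or (size_t) -1 on an invalid sequence.  */
size_t
feed_bytewise (const char *s, size_t n, wchar_t *out, size_t outcap)
{
  mbstate_t ps;
  size_t count = 0;

  assert (s != NULL && out != NULL);
  memset (&ps, 0, sizeof ps);   /* start in the initial conversion state */

  for (size_t i = 0; i < n; i++)
    {
      wchar_t wc;
      size_t ret = mbrtowc (&wc, s + i, 1, &ps);

      if (ret == (size_t) -1)
        return (size_t) -1;     /* invalid byte sequence */
      if (ret == (size_t) -2)
        continue;               /* incomplete: wc not set, state kept in ps */
      if (count == outcap)
        break;                  /* output buffer full */
      out[count++] = wc;        /* ret is 0 (for L'\0') or 1 here */
    }
  return count;
}
```

With the first requirement in force, such a loop sees "A" complete immediately
and has no way to learn that a following "\xb0" should have been combined
with it. That is the conflict described above.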
A possible approach would be to exploit the fact that the gconv converters
can be programmed to behave differently in the wcsmbs case than in the
iconv() and stdio cases: in the wcsmbs case, consume_incomplete is 1,
whereas in the other cases it is 0. This parameter could be passed down
to the loop function through EXTRA_LOOP_DECLS and EXTRA_LOOP_ARGS.
The idea would be to produce decomposed Unicode (NFD) rather than the usual
precomposed Unicode (NFC) in the mb*towc* functions. That is, have
  mbrtowc(&wc, "A", 1, &ps) => 1, wc = L'A',
  mbrtowc(&wc, "\xb0", 1, &ps) => 1, wc = 0x0300,
and then hope that every caller, from the 'wc' program to the display
engine in X11, can cope with decomposed Unicode character sequences.
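As a rough illustration of what this would mean for callers, here is a toy
sketch. Both functions are hypothetical and exist only for this example:
toy_decode_nfd stands in for the decomposing converter and handles only
printable ASCII plus the byte 0xb0 from the example above (it is not the real
charset table), and toy_compose shows the kind of recomposition a caller
would have to do itself if it needed U+00C0 back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the decomposing decoder: every accepted input byte yields
   exactly one code point, so no call ever has to report "incomplete".
   Returns 1 on success, -1 for bytes this toy does not handle.  */
int
toy_decode_nfd (unsigned char byte, uint32_t *out)
{
  assert (out != NULL);
  if (byte >= 0x20 && byte <= 0x7e)
    {
      *out = byte;              /* printable ASCII maps to itself */
      return 1;
    }
  if (byte == 0xb0)
    {
      *out = 0x0300;            /* COMBINING GRAVE ACCENT, as in the example */
      return 1;
    }
  return -1;
}

/* Toy recomposition a caller might need: only knows 'A' + U+0300 -> U+00C0.
   OUT must have room for N code points.  Returns the number written.  */
size_t
toy_compose (const uint32_t *in, size_t n, uint32_t *out)
{
  size_t j = 0;

  assert (in != NULL && out != NULL);
  for (size_t i = 0; i < n; i++)
    {
      if (in[i] == 0x0300 && j > 0 && out[j - 1] == 0x41)
        out[j - 1] = 0x00C0;    /* LATIN CAPITAL LETTER A WITH GRAVE */
      else
        out[j++] = in[i];
    }
  return j;
}
```

The decoder becomes trivial, but the composition work moves into every
consumer of the wide characters.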
But in my opinion this is not worth the effort, because non-Unicode
Vietnamese locales don't have a real user base.
Bruno