This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: regression caused by fix of bug #13691


I wrote:
> the most important
> place to fix is the mbrtowc() behaviour. But this is also the most
> difficult one. I cannot see how to make the following requirements
> coexist:
> 
>   * mbrtowc(&wc, "A", 1, &ps) shall set wc = L'A'.
> 
>   * mbrtowc(&wc, "A\xb0", 2, &ps) shall set wc = 0x00C0
>     (LATIN CAPITAL LETTER A WITH GRAVE)
> 
>   * mbrtowc can be used to process a string byte for byte; it returns -2
>     when a byte sequence is incomplete. In particular this means that the
>     sequence of calls
>       mbrtowc(&wc, "A", 1, &ps) => -2
>       mbrtowc(&wc, "\xb0", 1, &ps) => 1, wc = 0x00C0
>     produces an intermediate -2 without setting wc and then sets wc in the
>     second call.
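
To make the byte-for-byte requirement concrete, here is a minimal sketch of such a caller. The helper name decode_bytewise is made up for this illustration and is not part of glibc; only mbrtowc() and its documented return values are real. Which bytes produce -2 depends entirely on the locale in effect.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: feed buf to mbrtowc() one byte at a time.
   Stores up to out_cap wide characters in out and returns the number of
   wide characters decoded, or (size_t)-1 on an invalid byte sequence. */
size_t decode_bytewise(const char *buf, size_t len,
                       wchar_t *out, size_t out_cap)
{
    mbstate_t ps;
    size_t n_out = 0;

    assert(buf != NULL && out != NULL);
    memset(&ps, 0, sizeof ps);          /* initial conversion state */

    for (size_t i = 0; i < len; i++) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, buf + i, 1, &ps);

        if (r == (size_t)-1)
            return (size_t)-1;          /* invalid sequence (EILSEQ) */
        if (r == (size_t)-2)
            continue;                   /* incomplete: state is kept in ps,
                                           wc has not been set */
        /* r is 0 (null character) or 1: one wide character is complete. */
        if (n_out < out_cap)
            out[n_out] = wc;
        n_out++;
    }
    return n_out;
}
```

A caller written like this has no way to receive L'A' from the first call and 0x00C0 from the second: once a call has returned a wide character, it cannot be taken back.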

A possible approach would be to exploit the fact that the gconv converters
can be programmed to behave differently in the wcsmbs case than in the
iconv() and stdio cases: in the wcsmbs case, consume_incomplete is 1,
whereas in the other cases it is 0. This parameter could be passed
down to the loop function through EXTRA_LOOP_DECLS and EXTRA_LOOP_ARGS.

The idea would be to produce Unicode in NFD form (rather than the usual
NFC form) in the mb*towc* functions. That is, have
  mbrtowc(&wc, "A", 1, &ps) => 1, wc = L'A',
  mbrtowc(&wc, "\xb0", 1, &ps) => 1, wc = 0x0300,
and then hope that the caller can cope with decomposed Unicode character
sequences, from the 'wc' program to the display engine in X11.
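
As a rough illustration of why the decomposed output avoids the conflict, here is a toy decoder. It is not the gconv module; toy_map and toy_decode_nfd are names invented for this sketch, and the mapping covers only ASCII and the single byte 0xb0 discussed above. It also assumes that wchar_t holds Unicode code points, as it does in glibc.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Toy mapping, for illustration only: ASCII plus the byte 0xb0.
   Returns 1 on success, 0 for a byte this sketch does not know. */
static int toy_map(unsigned char b, wchar_t *wc)
{
    if (b < 0x80) {
        *wc = (wchar_t)b;       /* ASCII maps to itself */
        return 1;
    }
    if (b == 0xb0) {
        *wc = 0x0300;           /* COMBINING GRAVE ACCENT */
        return 1;
    }
    return 0;
}

/* Decomposed (NFD-style) decoding: every input byte yields exactly one
   wide character right away, so no lookahead and no pending state are
   needed. out must have room for len elements.
   Returns the number of wide characters, or (size_t)-1 on an unknown byte. */
size_t toy_decode_nfd(const unsigned char *in, size_t len, wchar_t *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        if (!toy_map(in[i], &out[n]))
            return (size_t)-1;
        n++;
    }
    return n;
}
```

With the input bytes 'A', 0xb0 this yields the two wide characters L'A' and 0x0300, whether the bytes arrive together or one at a time. Composing them into 0x00C0 is left to the caller.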

But in my opinion this is not worth the effort, because non-Unicode
Vietnamese locales don't have a real user base.

Bruno

