This is the mail archive of the newlib@sourceware.org mailing list for the newlib project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [PATCH] Bug in _mbtowc_r UTF-8 handling

From: Jeff Johnston <jjohnstn at redhat dot com>
To: newlib at sourceware dot org
Date: Tue, 28 Jul 2009 12:36:59 -0400
Subject: Re: [PATCH] Bug in _mbtowc_r UTF-8 handling
References: <20090728101425.GA20809@calimero.vinschen.de>

Corinna Vinschen wrote:

Hi,

there's a bug in __utf8_mbtowc, which appears to be in that code for
ages.  If you give an UTF-8 multibyte character to this function one
at a time, it screws up counting and testing the value `n', the number
of bytes in the input stream.

Here's a testcase:

#include <stdio.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

int main(void) {
  wchar_t wc;
  size_t ret;
  mbstate_t s = { 0 };
  puts(setlocale(LC_CTYPE, "en_GB.UTF-8"));
  printf("%i\n", mbrtowc(&wc, "\xe2", 1, &s));
  printf("%i\n", mbrtowc(&wc, "\x94", 1, &s));
  printf("%i\n", mbrtowc(&wc, "\x84", 1, &s));
  printf("%x\n", wc);
  return 0;
}

That's the expected output:

en_GB.UTF-8
-2
-2
1
2504

That's what happens with newlib's implementation:

en_GB.UTF-8
-2
-1
-1
<arbitrary value>

The sequence E2 94 84 should translate to U+2504. Instead, the second
and third calls to mbrtowc report encoding errors. It does work
correctly if the three bytes are passed to mbrtowc() in one go:

printf("%i\n", mbrtowc(&wc, "\xe2\x94\x84", 3, 0));

As I mentioned afore, the problem is how the argument `n' is counted
when `state' is already filled and subsequent calls encounter a
non-empty state.  As soon as the incoming byte sequence consists of
a multibyte followup sequence, and this sequence is in itself incomplete
again, the value of `n' is wrong and the tests which are supposed to
return -2 fail.  The code slips through, reading a char too much, and
either fails, or, worst case, returns an incorrect wchar_t value.
Given that, the problem never happens for two-byte sequences, only
for three- and four-byte sequences.

I created a patch to fix this.  The difference to the original code is a
more unified way to increment `n'.  The original code incremented `n' by
one too much in the above scenario.  Tested on Cygwin with 2, 3, and 4
byte sequences.

This way I found another problem in the UTF-16 surrogate handling which
only occurs with incomplete 4-byte sequences.  I have no fix for this
and I'm not sure yet if there's a satisfying way to fix this at all.

So the below patch only fixes the above problem, not the UTF-16
related problem.

Ok to apply?

Please go ahead.

-- Jeff J.

Thanks,
Corinna
	* libc/stdlib/mbtowc_r.c (__utf8_mbtowc): Fix incrementing n
	in case of handling incomplete sequences.
Index: libc/stdlib/mbtowc_r.c =================================================================== RCS file: /cvs/src/src/newlib/libc/stdlib/mbtowc_r.c,v retrieving revision 1.13 diff -u -p -r1.13 mbtowc_r.c --- libc/stdlib/mbtowc_r.c 9 Apr 2009 08:20:10 -0000 1.13 +++ libc/stdlib/mbtowc_r.c 28 Jul 2009 10:01:20 -0000 @@ -220,11 +220,7 @@ _DEFUN (__utf8_mbtowc, (r, pwc, s, n, ch if (state->__count == 0) ch = t[i++]; else - { - if (n < (size_t)-1) - ++n; - ch = state->__value.__wchb[0]; - } + ch = state->__value.__wchb[0]; if (ch == '\0') { @@ -244,7 +240,10 @@ _DEFUN (__utf8_mbtowc, (r, pwc, s, n, ch { /* two-byte sequence */ state->__value.__wchb[0] = ch; - state->__count = 1; + if (state->__count == 0) + state->__count = 1; + else if (n < (size_t)-1) + ++n; if (n < 2) return -2; ch = t[i++]; @@ -288,7 +287,10 @@ _DEFUN (__utf8_mbtowc, (r, pwc, s, n, ch return -1; } state->__value.__wchb[1] = ch; - state->__count = 2; + if (state->__count == 1) + state->__count = 2; + else if (n < (size_t)-1) + ++n; if (n < 3) return -2; ch = t[i++]; @@ -347,7 +349,10 @@ _DEFUN (__utf8_mbtowc, (r, pwc, s, n, ch return -1; } state->__value.__wchb[2] = ch; - state->__count = 3; + if (state->__count == 2) + state->__count = 3; + else if (n < (size_t)-1) + ++n; if (n < 4) return -2; ch = t[i++];

Follow-Ups:
- Re: [PATCH] Bug in _mbtowc_r UTF-8 handling
  - From: Corinna Vinschen

References:
- [PATCH] Bug in _mbtowc_r UTF-8 handling
  - From: Corinna Vinschen

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]