Hi,
there's a bug in __utf8_mbtowc which appears to have been in the code
for ages. If you feed a UTF-8 multibyte character to this function one
byte at a time, it gets the counting and testing of `n', the number of
bytes in the input stream, wrong.
Here's a testcase:
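#include <stdio.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>
int main(void) {
wchar_t wc = 0;
mbstate_t s = { 0 };
/* setlocale returns NULL if the locale is unavailable. */
const char *loc = setlocale(LC_CTYPE, "en_GB.UTF-8");
puts(loc ? loc : "(setlocale failed)");
/* mbrtowc returns size_t; cast so that (size_t)-2 prints as -2. */
printf("%i\n", (int) mbrtowc(&wc, "\xe2", 1, &s));
printf("%i\n", (int) mbrtowc(&wc, "\x94", 1, &s));
printf("%i\n", (int) mbrtowc(&wc, "\x84", 1, &s));
printf("%x\n", (unsigned) wc);
return 0;
}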
This is the expected output:
en_GB.UTF-8
-2
-2
1
2504
This is what newlib's implementation prints:
en_GB.UTF-8
-2
-1
-1
<arbitrary value>
The sequence E2 94 84 should translate to U+2504. Instead, the second
and third calls to mbrtowc report encoding errors. It does work
correctly if the three bytes are passed to mbrtowc() in one go:
printf("%i\n", (int) mbrtowc(&wc, "\xe2\x94\x84", 3, 0));
As mentioned above, the problem is how the argument `n' is counted
when a call finds `state' already holding part of a character from an
earlier call. If the incoming bytes are continuation bytes of that
character and still do not complete it, the value of `n' is wrong and
the tests which are supposed to return -2 fail. The code falls through,
reads one char too many, and either fails or, worst case, returns an
incorrect wchar_t value.
Consequently the problem never happens for two-byte sequences, only
for three- and four-byte sequences: a two-byte sequence has just one
continuation byte, so a continuation can never leave it incomplete.
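To illustrate the bookkeeping involved, here is a minimal sketch of a
restartable UTF-8 decoder. It is not newlib's code and not the patch:
sketch_state, sketch_seq_len and sketch_mbrtowc are hypothetical names,
the state layout is an assumption, and validation is deliberately
simplified. The point is only that the bytes already held in the state
and the bytes consumed from the current call are counted separately, so
"incomplete" is decided the same way on every call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state for this sketch; not newlib's mbstate_t layout. */
typedef struct {
  unsigned char buf[4];  /* bytes of the pending character seen so far */
  int count;             /* number of valid bytes in buf */
} sketch_state;

/* Sequence length implied by a lead byte, or 0 if it cannot start one. */
static int sketch_seq_len(unsigned char lead)
{
  if (lead < 0x80) return 1;
  if (lead >= 0xc2 && lead <= 0xdf) return 2;
  if (lead >= 0xe0 && lead <= 0xef) return 3;
  if (lead >= 0xf0 && lead <= 0xf4) return 4;
  return 0;
}

/* Simplified: no overlong/surrogate checks, no special 0 return for NUL. */
static size_t sketch_mbrtowc(uint32_t *pwc, const char *s, size_t n,
                             sketch_state *st)
{
  size_t used = 0;  /* bytes consumed from s in this call only */
  int need;         /* total length of the character being assembled */
  uint32_t wc;
  int i;

  assert(pwc != NULL && st != NULL);
  need = st->count ? sketch_seq_len(st->buf[0]) : 0;

  while (st->count == 0 || st->count < need) {
    unsigned char c;

    if (used == n)
      return (size_t) -2;          /* input exhausted, character pending */
    c = (unsigned char) s[used++];
    if (st->count == 0) {
      need = sketch_seq_len(c);
      if (need == 0)
        return (size_t) -1;        /* invalid lead byte */
    } else if ((c & 0xc0) != 0x80) {
      st->count = 0;
      return (size_t) -1;          /* not a continuation byte */
    }
    st->buf[st->count++] = c;
  }

  /* Strip the length marker from the lead byte, then add 6 bits per byte. */
  wc = (need == 1) ? st->buf[0] : (st->buf[0] & (0xffu >> (need + 1)));
  for (i = 1; i < need; i++)
    wc = (wc << 6) | (st->buf[i] & 0x3f);
  st->count = 0;                   /* character complete, state is initial */
  *pwc = wc;
  return used;
}
```

Fed "\xe2", "\x94" and "\x84" in three calls with n = 1, this sketch
returns (size_t)-2, (size_t)-2 and 1, and stores 0x2504.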
I created a patch to fix this. The difference from the original code is a
more uniform way to increment `n'. The original code incremented `n' by
one too many in the above scenario. Tested on Cygwin with 2-, 3-, and
4-byte sequences.
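A byte-at-a-time check of this kind could look roughly like the
following. This is a sketch, not the test I actually ran; check_bytewise
is a hypothetical helper name. It expects every call but the last to
return (size_t)-2 and the last to return 1.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: feed seq to mbrtowc one byte at a time.
   Returns 0 if all calls behave as expected, otherwise the 1-based
   index of the first call with an unexpected return value. */
static int check_bytewise(const char *seq, size_t len, wchar_t *out)
{
  mbstate_t st;
  size_t i;

  assert(seq != NULL && out != NULL);
  memset(&st, 0, sizeof st);       /* start in the initial state */
  for (i = 0; i < len; i++) {
    size_t ret = mbrtowc(out, seq + i, 1, &st);
    size_t want = (i + 1 < len) ? (size_t) -2 : 1;

    if (ret != want)
      return (int) (i + 1);
  }
  return 0;
}
```

After a successful setlocale(LC_CTYPE, "en_GB.UTF-8"), calling
check_bytewise("\xe2\x94\x84", 3, &wc) should return 0 and leave 0x2504
in wc. On targets with a 16-bit wchar_t, four-byte sequences involve
surrogates, so this simple expectation may not hold for them.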
While testing I found another problem in the UTF-16 surrogate handling
which only occurs with incomplete 4-byte sequences. I have no fix for it
yet, and I'm not sure whether there's a satisfying way to fix it at all.
So the patch below only fixes the problem described above, not the
UTF-16 related one.
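For background on why 4-byte sequences are special here: assuming a
target with a 16-bit wchar_t holding UTF-16, a 4-byte UTF-8 sequence
encodes a code point above U+FFFF, which needs two wchar_t units. The
arithmetic is sketched below; sketch_to_surrogates is a hypothetical
name and this is not newlib code.

```c
#include <assert.h>
#include <stdint.h>

/* Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair. */
static void sketch_to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
  assert(cp >= 0x10000 && cp <= 0x10ffff);
  cp -= 0x10000;                             /* 20 bits remain */
  *hi = (uint16_t) (0xd800 | (cp >> 10));    /* upper 10 bits */
  *lo = (uint16_t) (0xdc00 | (cp & 0x3ff));  /* lower 10 bits */
}
```

For example, U+1F600 (F0 9F 98 80 in UTF-8) becomes D83D DE00.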
Ok to apply?