[PATCH] Fix UTF-16 surrogate handling in __utf8_mbtowc
Jeff Johnston
jjohnstn@redhat.com
Tue Jul 28 19:59:00 GMT 2009
Corinna Vinschen wrote:
> Hi,
>
>
> here's a fix for the UTF-16 surrogate pair handling in __utf8_mbtowc,
> as mentioned in http://sourceware.org/ml/newlib/2009/msg00778.html.
> The original code only worked in the context of application calls to
> mbs[nr]towcs. The new code below should also work in most cases where
> the application calls mbrtowc by itself.
>
> What it does now is this. After three bytes of the 4-byte input
> sequence have been read, it checks if we're on a sizeof(wchar_t) == 2
> system. If so, it extracts the first surrogate half from these three
> bytes, since three bytes of UTF-8 input are sufficient to generate the
> first surrogate half. If the input is valid (I added a comment to
> explain what an invalid sequence is in this context), the return code is
> set to the actual number of input bytes, rather than the fixed value of
> 2, which is one reason the old code only worked when called from
> mbs[nr]towcs. To mark this as the first surrogate half, the
> state->__count field is set to 4.
>
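The first-half computation described above can be sketched in isolation. This is a minimal illustration that mirrors the arithmetic in the patch, not the newlib code itself. The helper name high_surrogate_from_prefix and its signature are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of newlib: derive the first (high)
   surrogate from the first three bytes of a 4-byte UTF-8 sequence.
   Returns 0 on success and stores the surrogate in *hi; returns -1 if
   the sequence cannot be represented as a UTF-16 surrogate pair.  */
static int
high_surrogate_from_prefix (const unsigned char b[3], uint16_t *hi)
{
  uint32_t tmp;

  assert (b != NULL && hi != NULL);
  tmp = ((uint32_t) (b[0] & 0x07) << 18)
        | ((uint32_t) (b[1] & 0x3f) << 12)
        | ((uint32_t) (b[2] & 0x3f) << 6);
  /* The low 6 bits come from the fourth byte and only affect the
     second surrogate, so leaving them as zero is harmless here.  */
  tmp = (tmp - 0x10000) >> 10;
  /* A result above 0x3ff means the code point is above U+10FFFF, or
     below U+10000 (the unsigned subtraction wraps around).  */
  if (tmp > 0x3ff)
    return -1;
  *hi = (uint16_t) (0xd800 | tmp);
  return 0;
}
```

For example, the prefix F0 9F 98 of U+1F600 yields 0xD83D, while the prefix F4 90 80 is rejected because it can only encode values above U+10FFFF.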
> This conditional expression isn't true if state->__count is 4; in that
> case the code continues as on sizeof(wchar_t) == 4 systems. Only at
> the end, if state->__count is 4, *pwc is computed as the second
> surrogate half, while on sizeof(wchar_t) == 4 systems the entire
> Unicode value is written to *pwc.
>
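The second-half computation can be sketched the same way, assuming all four bytes of an already validated sequence are at hand. The helper name low_surrogate_from_sequence is hypothetical. The real code keeps the first three bytes in state->__value.__wchb rather than taking an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of newlib: compute the second (low)
   surrogate from a complete, already validated 4-byte UTF-8 sequence.  */
static uint16_t
low_surrogate_from_sequence (const unsigned char b[4])
{
  uint32_t cp;

  assert (b != NULL);
  /* Reassemble the full code point from all four bytes.  */
  cp = ((uint32_t) (b[0] & 0x07) << 18)
       | ((uint32_t) (b[1] & 0x3f) << 12)
       | ((uint32_t) (b[2] & 0x3f) << 6)
       | (uint32_t) (b[3] & 0x3f);
  /* The low surrogate carries the bottom 10 bits of cp - 0x10000.  */
  return (uint16_t) (0xdc00 | ((cp - 0x10000) & 0x3ff));
}
```

For U+1F600 (F0 9F 98 80) this gives 0xDE00, which together with the first half 0xD83D forms the complete surrogate pair.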
> The downside of this implementation is that an application could be
> happy with the result after only having read the first three bytes
> of the four byte sequence from the input string and just stop. This
> results in an incomplete surrogate pair. However, as far as I can see
> it's rather unlikely, and it's still better than not handling Unicode
> values outside the base plane at all.
>
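An application worried about the incomplete-pair case could check its converted output before using it. The following is only a sketch of such a check. The helper name ends_with_unpaired_high_surrogate is hypothetical, and the buffer is assumed to hold UTF-16 code units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of newlib: return 1 if the last code
   unit in buf is a high surrogate (0xD800..0xDBFF) with no low
   surrogate after it, i.e. conversion stopped in the middle of a
   pair; return 0 otherwise.  */
static int
ends_with_unpaired_high_surrogate (const uint16_t *buf, size_t len)
{
  assert (buf != NULL || len == 0);
  if (len == 0)
    return 0;
  return buf[len - 1] >= 0xd800 && buf[len - 1] <= 0xdbff;
}
```

A caller that gets 1 back would need to feed the remaining input byte to the conversion function to obtain the second half.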
> Tested on Cygwin.
>
> Ok to apply?
>
>
>
Please go ahead.
-- Jeff J.
> Thanks,
> Corinna
>
>
> * libc/stdlib/mbtowc_r.c (__utf8_mbtowc): Rework UTF-16 surrogate
> pair handling to be more bullet-proof even with incomplete UTF-8
> sequences.
>
>
> Index: libc/stdlib/mbtowc_r.c
> ===================================================================
> RCS file: /cvs/src/src/newlib/libc/stdlib/mbtowc_r.c,v
> retrieving revision 1.14
> diff -u -p -r1.14 mbtowc_r.c
> --- libc/stdlib/mbtowc_r.c 28 Jul 2009 16:49:19 -0000 1.14
> +++ libc/stdlib/mbtowc_r.c 28 Jul 2009 16:52:48 -0000
> @@ -205,18 +205,6 @@ _DEFUN (__utf8_mbtowc, (r, pwc, s, n, ch
> if (n == 0)
> return -2;
>
> - if (state->__count == 4)
> - {
> - /* Create the second half of the surrogate pair. For a description
> - see the comment below. */
> - wint_t tmp = (wchar_t)((state->__value.__wchb[0] & 0x07) << 18)
> - | (wchar_t)((state->__value.__wchb[1] & 0x3f) << 12)
> - | (wchar_t)((state->__value.__wchb[2] & 0x3f) << 6)
> - | (wchar_t)(state->__value.__wchb[3] & 0x3f);
> - state->__count = 0;
> - *pwc = 0xdc00 | ((tmp - 0x10000) & 0x3ff);
> - return 2;
> - }
> if (state->__count == 0)
> ch = t[i++];
> else
> @@ -353,6 +341,36 @@ _DEFUN (__utf8_mbtowc, (r, pwc, s, n, ch
> state->__count = 3;
> else if (n < (size_t)-1)
> ++n;
> + if (state->__count == 3 && sizeof(wchar_t) == 2)
> + {
> + /* On systems which have wchar_t being UTF-16 values, the value
> + doesn't fit into a single wchar_t in this case. So what we
> + do here is to store the state with a special value of __count
> + and return the first half of a surrogate pair. The first
> + three bytes of a UTF-8 sequence are enough to generate the
> + first half of a UTF-16 surrogate pair. As return value we
> + choose to return the number of bytes actually read up to
> + here.
> + The second half of the surrogate pair is returned in case we
> + recognize the special __count value of four, and the next
> + byte is actually a valid value. See below*/
> + tmp = (wint_t)((state->__value.__wchb[0] & 0x07) << 18)
> + | (wint_t)((state->__value.__wchb[1] & 0x3f) << 12)
> + | (wint_t)((state->__value.__wchb[2] & 0x3f) << 6);
> + tmp = (tmp - 0x10000) >> 10;
> + /* Check if the sequence can fit into a surrogate pair at all.
> + If tmp is > 0x3ff at this point, the full Unicode value will
> + be > 0x10ffff. This is an invalid Unicode value and outside
> + of the definition of UTF-16 surrogate pairs. */
> + if (tmp > 0x3ff)
> + {
> + r->_errno = EILSEQ;
> + return -1;
> + }
> + state->__count = 4;
> + *pwc = 0xd800 | tmp;
> + return i;
> + }
> if (n < 4)
> return -2;
> ch = t[i++];
> @@ -365,21 +383,14 @@ _DEFUN (__utf8_mbtowc, (r, pwc, s, n, ch
> | (wint_t)((state->__value.__wchb[1] & 0x3f) << 12)
> | (wint_t)((state->__value.__wchb[2] & 0x3f) << 6)
> | (wint_t)(ch & 0x3f);
> - if (tmp > 0xffff && sizeof(wchar_t) == 2)
> + if (state->__count == 4 && sizeof(wchar_t) == 2)
> {
> - /* On systems which have wchar_t being UTF-16 values, the value
> - doesn't fit into a single wchar_t in this case. So what we
> - do here is to store the state with a special value of __count
> - and return the first half of a surrogate pair. As return
> - value we choose to return the half of the actual UTF-8 char.
> - The second half is returned in case we recognize the special
> - __count value above. */
> - state->__value.__wchb[3] = ch;
> - state->__count = 4;
> - *pwc = 0xd800 | (((tmp - 0x10000) >> 10) & 0x3ff);
> - return 2;
> + /* Create the second half of the surrogate pair for systems with
> + wchar_t == UTF-16 . */
> + *pwc = 0xdc00 | ((tmp - 0x10000) & 0x3ff);
> }
> - *pwc = tmp;
> + else
> + *pwc = tmp;
> state->__count = 0;
> return i;
> }
>
>
>