[PATCH] Fix UTF-16 surrogate handling in __utf8_mbtowc

Tue Jul 28 19:59:00 GMT 2009

Corinna Vinschen wrote:
> Hi,
>
>
> here's a fix for the UTF-16 surrogate pair handling in __utf8_mbtowc,
> as mentioned in http://sourceware.org/ml/newlib/2009/msg00778.html.
> The original code only worked in the context of application calls to
> mbs[nr]towcs.  The new code below should also work in most cases where
> the application calls mbrtowc by itself.
>
> What it does now is this.  After three bytes of the 4-byte input
> sequence have been read, it checks if we're on a sizeof(wchar_t) == 2
> system.  If so, it extracts the first surrogate pair from these three
> bytes, since three bytes of UTF-8 input are sufficient to generate the
> first surrogate half.  If the input is valid (I added a comment to
> explain what an invalid sequence is in this context), the return code is
> set to the actual number of input bytes, rather than the fixed value of
> 2, which is one reason this only worked when called from mbs[nr]towcs.
> To mark this as the first surrogate half, the state->__count field is
> set to 4.
>
> This conditional expression isn't true if state->__count is 4, rather
> the code goes along as on sizeof(wchar_t) == 4 systems.  Just at the
> end, if state->__count is 4, *pwc is computed as the second surrogate
> half, while on sizeof(wchar_t) == 4 systems the entire Unicode value is
> written to *pwc.
>
> The downside of this implementation is that an application could be
> happy with the result after only having read the first three bytes
> of the four byte sequence from the input string and just stop.  This
> results in an incomplete surrogate pair.  However, as far as I can see
> it's rather unlikely, and it's still better that not handling Unicode
> values outside the base plane at all.
>
> Tested on Cygwin.
>
> Ok to apply?
>
>
>   
Please go ahead.

-- Jeff J.

> Thanks,
> Corinna
>
>
> 	* libc/stdlib/mbtowc_r.c (__utf8_mbtowc): Rework UTF-16 surrogate
> 	pair handling to be more bullet-proof even with incomplete UTF-8
> 	sequences.
>
>
> Index: libc/stdlib/mbtowc_r.c
> ===================================================================
> RCS file: /cvs/src/src/newlib/libc/stdlib/mbtowc_r.c,v
> retrieving revision 1.14
> diff -u -p -r1.14 mbtowc_r.c
> --- libc/stdlib/mbtowc_r.c	28 Jul 2009 16:49:19 -0000	1.14
> +++ libc/stdlib/mbtowc_r.c	28 Jul 2009 16:52:48 -0000
> @@ -205,18 +205,6 @@ _DEFUN (__utf8_mbtowc, (r, pwc, s, n, ch
>    if (n == 0)
>      return -2;
>  
> -  if (state->__count == 4)
> -    {
> -      /* Create the second half of the surrogate pair.  For a description
> -	 see the comment below. */
> -      wint_t tmp = (wchar_t)((state->__value.__wchb[0] & 0x07) << 18)
> -	|   (wchar_t)((state->__value.__wchb[1] & 0x3f) << 12)
> -	|   (wchar_t)((state->__value.__wchb[2] & 0x3f) << 6)
> -	|   (wchar_t)(state->__value.__wchb[3] & 0x3f);
> -      state->__count = 0;
> -      *pwc = 0xdc00 | ((tmp - 0x10000) & 0x3ff);
> -      return 2;
> -    }
>    if (state->__count == 0)
>      ch = t[i++];
>    else
> @@ -353,6 +341,36 @@ _DEFUN (__utf8_mbtowc, (r, pwc, s, n, ch
>  	state->__count = 3;
>        else if (n < (size_t)-1)
>  	++n;
> +      if (state->__count == 3 && sizeof(wchar_t) == 2)
> +	{
> +	  /* On systems which have wchar_t being UTF-16 values, the value
> +	     doesn't fit into a single wchar_t in this case.  So what we
> +	     do here is to store the state with a special value of __count
> +	     and return the first half of a surrogate pair.  The first
> +	     three bytes of a UTF-8 sequence are enough to generate the
> +	     first half of a UTF-16 surrogate pair.  As return value we
> +	     choose to return the number of bytes actually read up to
> +	     here.
> +	     The second half of the surrogate pair is returned in case we
> +	     recognize the special __count value of four, and the next
> +	     byte is actually a valid value.  See below*/
> +	  tmp = (wint_t)((state->__value.__wchb[0] & 0x07) << 18)
> +	    |   (wint_t)((state->__value.__wchb[1] & 0x3f) << 12)
> +	    |   (wint_t)((state->__value.__wchb[2] & 0x3f) << 6);
> +	  tmp = (tmp - 0x10000) >> 10;
> +	  /* Check if the sequence can fit into a surrogate pair at all.
> +	     If tmp is > 0x3ff at this point, the full Unicode value will
> +	     be > 0x10ffff.  This is an invalid Unicode value and outside
> +	     of the defintion of UTF-16 surrogate pairs. */
> +	  if (tmp > 0x3ff)
> +	    {
> +	      r->_errno = EILSEQ;
> +	      return -1;
> +	    }
> +	  state->__count = 4;
> +	  *pwc = 0xd800 | tmp;
> +	  return i;
> +	}
>        if (n < 4)
>  	return -2;
>        ch = t[i++];
> @@ -365,21 +383,14 @@ _DEFUN (__utf8_mbtowc, (r, pwc, s, n, ch
>  	|   (wint_t)((state->__value.__wchb[1] & 0x3f) << 12)
>  	|   (wint_t)((state->__value.__wchb[2] & 0x3f) << 6)
>  	|   (wint_t)(ch & 0x3f);
> -      if (tmp > 0xffff && sizeof(wchar_t) == 2)
> +      if (state->__count == 4 && sizeof(wchar_t) == 2)
>  	{
> -	  /* On systems which have wchar_t being UTF-16 values, the value
> -	     doesn't fit into a single wchar_t in this case.  So what we
> -	     do here is to store the state with a special value of __count
> -	     and return the first half of a surrogate pair.  As return
> -	     value we choose to return the half of the actual UTF-8 char.
> -	     The second half is returned in case we recognize the special
> -	     __count value above. */
> -	  state->__value.__wchb[3] = ch;
> -	  state->__count = 4;
> -	  *pwc = 0xd800 | (((tmp - 0x10000) >> 10) & 0x3ff);
> -	  return 2;
> +	  /* Create the second half of the surrogate pair for systems with
> +	     wchar_t == UTF-16 . */
> +	  *pwc = 0xdc00 | ((tmp - 0x10000) & 0x3ff);
>  	}
> -      *pwc = tmp;
> +      else
> +	*pwc = tmp;
>        state->__count = 0;
>        return i;
>      }
>
>
>