This is the mail archive of the newlib@sourceware.org mailing list for the newlib project.


Re: [PATCH] Handle surrogate pairs in _wctomb_r/_mbtowc_r

From: Corinna Vinschen <vinschen at redhat dot com>
To: newlib at sourceware dot org
Date: Tue, 24 Feb 2009 10:26:40 +0100
Subject: Re: [PATCH] Handle surrogate pairs in _wctomb_r/_mbtowc_r
References: <20090218155514.GB29052@calimero.vinschen.de>
Reply-to: newlib at sourceware dot org

Ping?

On Feb 18 16:55, Corinna Vinschen wrote:
> Hi,
> 
> 
> below is a patch which adds handling of UTF-16 surrogate pairs on
> systems which define wchar_t as a two-byte type.  Unfortunately the
> POSIX functions wcrtomb and mbrtowc don't define surrogate handling at
> all because POSIX assumes that wchar_t is big enough to hold an entire
> wide char under all circumstances.  On a two-byte wchar_t system this
> assumption breaks these functions for all Unicode chars beyond 0xffff,
> which is quite a lot of chars.
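
To illustrate the point about chars beyond 0xffff, here is a minimal sketch of what is lost when such a value is squeezed into a single 16-bit unit. The helper names are made up for this example and are not part of newlib.

```c
#include <stdint.h>

/* Hypothetical helper: the value that survives when a code point is
   stored in a single 16-bit unit without any surrogate handling. */
static uint16_t truncate_to_utf16_unit(uint32_t cp)
{
    return (uint16_t)(cp & 0xffff);
}

/* Hypothetical helper: returns 1 if the code point lies outside the
   16-bit range and therefore needs a surrogate pair in UTF-16. */
static int needs_surrogate_pair(uint32_t cp)
{
    return cp > 0xffff && cp <= 0x10ffff;
}
```

For example, U+1F600 would come out as 0xF600, an unrelated value, while U+20AC fits unchanged.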
> 
> Given that Cygwin is based on Windows and Windows is a wchar_t == UTF-16
> system, that breaks Cygwin in conjunction with a significant number of
> languages.  That's why I created the below patch which is, quite
> certainly, a hack to work around this limitation of the underlying system.
> 
> How the patch works on UTF-16 systems:
> 
> - _wctomb_r: If the first half of a surrogate pair is detected in wchar,
>   it creates a temporary wint_t value based on the 10 value bits in the
>   surrogate wchar_t.  This value is then stored in state, and the first
>   byte of the resulting UTF-8 char is returned.  If the second half of a
>   surrogate pair is detected, _wctomb_r checks whether it detected a
>   first half in the previous call.  If not, it's an invalid wchar value.
>   Otherwise it creates the full Unicode value, resets the state, and
>   returns the trailing 3 UTF-8 bytes in s.
> 
> - _mbtowc_r: If the detected UTF-8 char results in a Unicode char in the
>   range 0x10000 <= unicode_char <= 0x10ffff, it stores the value in
>   state and returns the first half of the surrogate pair.  In the next
>   call, if the state indicates that we're in the middle of a surrogate
>   pair, it resets the state, and returns the second half of the
>   surrogate pair.
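
The _wctomb_r side of the description above can be sketched roughly as follows. This is only an illustration of the arithmetic under simplifying assumptions: `struct pair_state` and `encode_surrogate` are hypothetical names, the state is a plain integer rather than newlib's mbstate_t, and only surrogate input is handled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical state: the partial code point built from a high
   surrogate, or 0 when no pair is in progress. */
struct pair_state {
    uint32_t pending;
};

/* Feed one UTF-16 surrogate.  Writes UTF-8 bytes to out and returns how
   many were written (1 for a high half, 3 for a low half), or -1 if the
   surrogate is not valid at this point.  Non-surrogates are rejected,
   since this sketch covers the surrogate case only. */
static int encode_surrogate(struct pair_state *st, uint16_t unit,
                            unsigned char *out)
{
    assert(st != 0 && out != 0);

    if (unit >= 0xd800 && unit <= 0xdbff) {
        if (st->pending != 0)
            return -1;                      /* two high halves in a row */
        /* 10 value bits of the high half, moved into place. */
        st->pending = ((uint32_t)(unit & 0x3ff) << 10) + 0x10000;
        out[0] = (unsigned char)(0xf0 | (st->pending >> 18));
        return 1;
    }
    if (unit >= 0xdc00 && unit <= 0xdfff) {
        uint32_t cp;
        if (st->pending == 0)
            return -1;                      /* low half without high half */
        cp = st->pending | (uint32_t)(unit & 0x3ff);
        st->pending = 0;
        out[0] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        out[2] = (unsigned char)(0x80 | (cp & 0x3f));
        return 3;
    }
    return -1;
}
```

Feeding 0xD83D and then 0xDE00 (the pair for U+1F600) yields one byte on the first call and three on the second, which together form the UTF-8 sequence F0 9F 98 80.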
> 
> This *might* break applications on UTF-16 systems which are ignorant of
> the fact that wchar_t doesn't hold a complete Unicode char, *and* use
> wcrtomb/mbrtowc directly.  However, most applications will use the
> higher level string functions (wcstombs/mbstowcs), and these use
> _wctomb_r/_mbtowc_r transparently from the application's point of view.
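
To make the concern about direct callers concrete: a loop that drives the conversion one call at a time sees two 16-bit units come out of one four-byte UTF-8 sequence. The following is a toy model, not the patched _mbtowc_r. `struct dec_state`, `toy_mbtowc16` and `decode_all` are hypothetical names, only four-byte sequences are handled, and each call reports 2 bytes consumed, following the return-value convention described in the patch comment.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state for this sketch; not newlib's mbstate_t. */
struct dec_state {
    uint32_t pending;   /* code point awaiting its low half, else 0 */
};

/* Toy decoder for four-byte UTF-8 sequences only.  Each call yields one
   16-bit unit and returns 2, so two calls cover the four input bytes.
   Returns -1 on input this sketch does not handle. */
static int toy_mbtowc16(struct dec_state *st, uint16_t *out,
                        const unsigned char *s, size_t n)
{
    uint32_t cp;
    size_t i;

    if (st->pending != 0) {
        /* Second call: emit the low half from the saved state. */
        *out = (uint16_t)(0xdc00 | ((st->pending - 0x10000) & 0x3ff));
        st->pending = 0;
        return 2;
    }
    if (n < 4 || s[0] < 0xf0 || s[0] > 0xf4)
        return -1;
    for (i = 1; i < 4; i++)
        if (s[i] < 0x80 || s[i] > 0xbf)
            return -1;
    cp = ((uint32_t)(s[0] & 0x07) << 18)
       | ((uint32_t)(s[1] & 0x3f) << 12)
       | ((uint32_t)(s[2] & 0x3f) << 6)
       |  (uint32_t)(s[3] & 0x3f);
    if (cp < 0x10000 || cp > 0x10ffff)
        return -1;
    st->pending = cp;
    *out = (uint16_t)(0xd800 | ((cp - 0x10000) >> 10));
    return 2;
}

/* A caller that advances by the return value, as a direct user of the
   conversion function would.  Returns the number of units written. */
static size_t decode_all(const unsigned char *s, size_t n,
                         uint16_t *out, size_t max)
{
    struct dec_state st = {0};
    size_t used = 0, count = 0;

    while (used < n && count < max) {
        int r = toy_mbtowc16(&st, &out[count], s + used, n - used);
        if (r < 0)
            break;
        used += (size_t)r;
        count++;
    }
    return count;
}
```

A caller written this way keeps working, but one that assumes "one call, one complete character" would treat each half as a character of its own.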
> 
> So, the bottom line is, I'm not entirely sure if that's a good idea in
> all cases, but IMHO the advantages outweigh the potential problems.
> 
> Btw., the patch for _mbtowc_r also fixes two compiler warnings.
> 
> 
> Corinna
> 
> 
> 	* mbtowc_r.c (_mbtowc_r): Fix two compiler warnings.
> 	Handle surrogate pairs in case of wchar_t == UTF-16.
> 	* wctomb_r.c (_wctomb_r): Handle surrogate pairs in case of
> 	wchar_t == UTF-16.
> 
> 
> --- mbtowc_r.c-UNI	2009-02-18 10:02:35.000000000 +0100
> +++ mbtowc_r.c	2009-02-18 16:22:41.000000000 +0100
> @@ -65,8 +65,7 @@ _DEFUN (_mbtowc_r, (r, pwc, s, n, state)
>      return -2;
>  
>  #ifdef _MB_CAPABLE
> -  if (__lc_ctype == NULL ||
> -      (strlen (__lc_ctype) <= 1))
> +  if ((strlen (__lc_ctype) <= 1))
>      { /* fall-through */ }
>    else if (!strcmp (__lc_ctype, "C-UTF-8"))
>      {
> @@ -76,6 +75,18 @@ _DEFUN (_mbtowc_r, (r, pwc, s, n, state)
>        if (s == NULL)
>          return 0; /* UTF-8 character encodings are not state-dependent */
>  
> +      if (state->__count == 4)
> +	{
> +	  /* Create the second half of the surrogate pair.  For a description
> +	     see the comment below. */
> +	  wint_t tmp = (wchar_t)((state->__value.__wchb[0] & 0x07) << 18)
> +	    |   (wchar_t)((state->__value.__wchb[1] & 0x3f) << 12)
> +	    |   (wchar_t)((state->__value.__wchb[2] & 0x3f) << 6)
> +	    |   (wchar_t)(state->__value.__wchb[3] & 0x3f);
> +	  state->__count = 0;
> +	  *pwc = 0xdc00 | ((tmp - 0x10000) & 0x3ff);
> +	  return 2;
> +	}
>        if (state->__count == 0)
>  	ch = t[i++];
>        else
> @@ -153,8 +164,7 @@ _DEFUN (_mbtowc_r, (r, pwc, s, n, state)
>        else if (ch >= 0xf0 && ch <= 0xf7)
>  	{
>  	  /* four-byte sequence */
> -	  if (sizeof(wchar_t) < 4)
> -	    return -1; /* we can't store such a value */
> +	  wint_t tmp;
>  	  state->__value.__wchb[0] = ch;
>  	  if (state->__count == 0)
>  	    state->__count = 1;
> @@ -185,11 +195,25 @@ _DEFUN (_mbtowc_r, (r, pwc, s, n, state)
>  	  ch = t[i++];
>  	  if (ch < 0x80 || ch > 0xbf)
>  	    return -1;
> -	  *pwc = (wchar_t)((state->__value.__wchb[0] & 0x07) << 18)
> -	    |    (wchar_t)((state->__value.__wchb[1] & 0x3f) << 12)
> -	    |    (wchar_t)((state->__value.__wchb[2] & 0x3f) << 6)
> -	    |    (wchar_t)(ch & 0x3f);
> -	
> +	  tmp = (wint_t)((state->__value.__wchb[0] & 0x07) << 18)
> +	    |   (wint_t)((state->__value.__wchb[1] & 0x3f) << 12)
> +	    |   (wint_t)((state->__value.__wchb[2] & 0x3f) << 6)
> +	    |   (wint_t)(ch & 0x3f);
> +	  if (tmp > 0xffff && sizeof(wchar_t) == 2)
> +	    {
> +	      /* On systems which have wchar_t being UTF-16 values, the value
> +		 doesn't fit into a single wchar_t in this case.  So what we
> +		 do here is to store the state with a special value of __count
> +		 and return the first half of a surrogate pair.  As return
> +		 value we choose half the byte length of the actual UTF-8 char.
> +		 The second half is returned in case we recognize the special
> +		 __count value above. */
> +	      state->__value.__wchb[3] = ch;
> +	      state->__count = 4;
> +	      *pwc = 0xd800 | (((tmp - 0x10000) >> 10) & 0x3ff);
> +	      return 2;
> +	    }
> +	  *pwc = tmp;
>  	  state->__count = 0;
>  	  return i;
>  	}
> @@ -330,7 +354,7 @@ _DEFUN (_mbtowc_r, (r, pwc, s, n, state)
>                *pwc = (((wchar_t)state->__value.__wchb[0]) << 8) + (wchar_t)(t[i]);
>                return (i + 1);
>              case MAKE_A:
> -              ptr = (char *)(t + i + 1);
> +              ptr = (unsigned char *)(t + i + 1);
>                break;
>              case ERROR:
>              default:
> --- wctomb_r.c-UNI	2009-02-18 10:02:26.000000000 +0100
> +++ wctomb_r.c	2009-02-18 16:11:46.000000000 +0100
> @@ -28,6 +28,11 @@ _DEFUN (_wctomb_r, (r, s, wchar, state),
>        if (s == NULL)
>          return 0; /* UTF-8 encoding is not state-dependent */
>  
> +      if (state->__count == -4 && (wchar < 0xdc00 || wchar > 0xdfff))
> +	{
> +	  /* At this point only the second half of a surrogate pair is valid. */
> +	  return -1;
> +	}
>        if (wchar <= 0x7f)
>          {
>            *s = wchar;
> @@ -41,10 +46,39 @@ _DEFUN (_wctomb_r, (r, s, wchar, state),
>          }
>        else if (wchar >= 0x800 && wchar <= 0xffff)
>          {
> -          /* UTF-16 surrogates -- must not occur in normal UCS-4 data */
>            if (wchar >= 0xd800 && wchar <= 0xdfff)
> -            return -1;
> -
> +	    {
> +	      wint_t tmp;
> +	      /* UTF-16 surrogates -- must not occur in normal UCS-4 data */
> +	      if (sizeof (wchar_t) != 2)
> +		return -1;
> +	      if (wchar >= 0xdc00)
> +		{
> +		  /* Second half of a surrogate pair. It's not valid if
> +		     we haven't already read a first half of a surrogate
> +		     before. */
> +		  if (state->__count != -4)
> +		    return -1;
> +		  /* If it's valid, reconstruct the full Unicode value and
> +		     return the trailing three bytes of the UTF-8 char. */
> +		  tmp = (state->__value.__wchb[0] << 16)
> +			| (state->__value.__wchb[1] << 8)
> +			| (wchar & 0x3ff);
> +		  state->__count = 0;
> +		  *s++ = 0x80 | ((tmp &  0x3f000) >> 12);
> +		  *s++ = 0x80 | ((tmp &    0xfc0) >> 6);
> +		  *s   = 0x80 |  (tmp &     0x3f);
> +		  return 3;
> +	      	}
> +	      /* First half of a surrogate pair.  Store the state and return
> +	         the first byte of the UTF-8 char. */
> +	      tmp = ((wchar & 0x3ff) << 10) + 0x10000;
> +	      state->__value.__wchb[0] = (tmp >> 16) & 0xff;
> +	      state->__value.__wchb[1] = (tmp >> 8) & 0xff;
> +	      state->__count = -4;
> +	      *s = (0xf0 | ((tmp & 0x1c0000) >> 18));
> +	      return 1;
> +	    }
>            *s++ = 0xe0 | ((wchar & 0xf000) >> 12);
>            *s++ = 0x80 | ((wchar &  0xfc0) >> 6);
>            *s   = 0x80 |  (wchar &   0x3f);
> 
> -- 
> Corinna Vinschen
> Cygwin Project Co-Leader
> Red Hat

-- 
Corinna Vinschen
Cygwin Project Co-Leader
Red Hat

Follow-Ups:
- Re: [PATCH] Handle surrogate pairs in _wctomb_r/_mbtowc_r
  - From: Jeff Johnston

References:
- [PATCH] Handle surrogate pairs in _wctomb_r/_mbtowc_r
  - From: Corinna Vinschen
