This is the mail archive of the libc-alpha@sourceware.cygnus.com mailing list for the glibc project.


fixes for UTF-8 decoder

To: libc-alpha at sourceware dot cygnus dot com
Subject: fixes for UTF-8 decoder
From: Bruno Haible <haible at ilog dot fr>
Date: Fri, 21 Jan 2000 19:06:25 +0100 (MET)
Cc: Markus Kuhn <Markus dot Kuhn at cl dot cam dot ac dot uk>


Hi Ulrich,

There are two problems with the UTF-8 decoder in iconv/gconv_simple.c.

Bug 1: The "break;" statement inside the "for (i = 1; i < cnt; ++i)" loop
terminates only that loop, whereas it should terminate the outer loop.
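
To illustrate Bug 1 outside the macro, here is a minimal sketch of the
control flow. It is a simplified model, not the code in gconv_simple.c:
the function names, the fixed sequence length cnt, and the SKETCH_*
status values are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status values, standing in for the gconv result codes.  */
enum { SKETCH_OK = 0, SKETCH_ILLEGAL_INPUT = 1 };

/* Buggy pattern: the break leaves only the for loop, so the position
   still advances past the malformed sequence and scanning continues.  */
static size_t
scan_buggy (const unsigned char *in, size_t len, size_t cnt, int *result)
{
  size_t pos = 0;

  assert (cnt >= 1);
  *result = SKETCH_OK;
  while (pos + cnt <= len)
    {
      size_t i;

      for (i = 1; i < cnt; ++i)
        if ((in[pos + i] & 0xc0) != 0x80)
          {
            *result = SKETCH_ILLEGAL_INPUT;
            break;              /* Leaves the for loop only.  */
          }
      pos += cnt;               /* Still executed after a bad trail byte.  */
    }
  return pos;
}

/* Fixed pattern: the for loop only detects the problem; the test on i
   after the loop is what leaves the while loop.  */
static size_t
scan_fixed (const unsigned char *in, size_t len, size_t cnt, int *result)
{
  size_t pos = 0;

  assert (cnt >= 1);
  *result = SKETCH_OK;
  while (pos + cnt <= len)
    {
      size_t i;

      for (i = 1; i < cnt; ++i)
        if ((in[pos + i] & 0xc0) != 0x80)
          break;
      if (i < cnt)
        {
          *result = SKETCH_ILLEGAL_INPUT;
          break;                /* Leaves the while loop.  */
        }
      pos += cnt;
    }
  return pos;
}
```

With the input bytes c3 41 c3 a9 and cnt = 2, scan_buggy reports the
error but still consumes all 4 bytes, while scan_fixed stops at offset 0.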

Bug 2: The UTF-8 decoder currently accepts multibyte representations that
are longer than necessary. RFC 2279 says that "It is important to note
that ... there is only one valid way to encode a given UCS-4 character."
As Markus Kuhn pointed out, this could some day become security relevant:
if a malformed UTF-8 sequence decodes to a special ASCII character (like
backquote, slash, escape) that was not visible as such before the
UTF-8 -> UCS-4 conversion, a check performed on the raw bytes would miss
it, and many programs could become vulnerable.
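
As a concrete case of Bug 2, the two bytes c0 af are an overlong form of
the slash (0x2f). The sketch below is only an illustration with made-up
helper names (decode2_naive, decode2_strict); it handles two-byte
sequences only and is not the code in gconv_simple.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Naive two-byte decode: any lead byte of the form 110xxxxx is accepted.
   Returns 0 on success, -1 on malformed input.  */
static int
decode2_naive (const unsigned char *in, uint32_t *out)
{
  assert (in != NULL && out != NULL);
  if ((in[0] & 0xe0) != 0xc0 || (in[1] & 0xc0) != 0x80)
    return -1;
  *out = ((uint32_t) (in[0] & 0x1f) << 6) | (uint32_t) (in[1] & 0x3f);
  return 0;
}

/* Strict two-byte decode: lead bytes 0xc0 and 0xc1 are rejected, because
   the resulting value would be below 0x80 and fit in a single byte.  */
static int
decode2_strict (const unsigned char *in, uint32_t *out)
{
  assert (in != NULL && out != NULL);
  if (in[0] < 0xc2 || in[0] >= 0xe0 || (in[1] & 0xc0) != 0x80)
    return -1;
  *out = ((uint32_t) (in[0] & 0x1f) << 6) | (uint32_t) (in[1] & 0x3f);
  return 0;
}
```

decode2_naive turns c0 af into 0x2f, whereas decode2_strict rejects it
and still accepts a well-formed sequence such as c3 a9 (0xe9).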

The appended patch fixes both.
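
For sequences of three or more bytes the patch uses a single shift to
detect overlong forms. Under RFC 2279 a cnt-byte sequence (cnt >= 3)
carries 5*cnt+1 payload bits, so a (cnt-1)-byte sequence carries
5*cnt-4; any value below 2^(5*cnt-4) therefore had a shorter encoding.
The two-byte case does not follow this rule (one byte carries 7 bits,
not 6), which is why it is handled by the lead byte range instead. A
small sketch of the test, with a hypothetical function name:

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if ch, decoded from a cnt-byte sequence, could have been
   represented with fewer than cnt bytes.  Only meaningful for the
   sequence lengths 2..6 defined by RFC 2279; for cnt == 2 it always
   returns 0, since that case is covered by the lead byte check.  */
static int
is_overlong (uint32_t ch, int cnt)
{
  assert (cnt >= 2 && cnt <= 6);
  return cnt > 2 && (ch >> (5 * cnt - 4)) == 0;
}
```

For example, e0 80 af decodes to 0x2f with cnt = 3, and
is_overlong (0x2f, 3) is nonzero; the smallest value that passes for
cnt = 3 is 0x800, and for cnt = 4 it is 0x10000.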

Bruno


        * iconv/gconv_simple.c (utf8_internal_loop): Reject invalid UTF-8
          input.

*** glibc/iconv/gconv_simple.c.bak	Sun Apr 25 20:06:02 1999
--- glibc/iconv/gconv_simple.c	Fri Jan 21 01:14:39 2000
***************
*** 255,262 ****
        }									      \
      else								      \
        {									      \
! 	if ((ch & 0xe0) == 0xc0)					      \
  	  {								      \
  	    cnt = 2;							      \
  	    ch &= 0x1f;							      \
  	  }								      \
--- 255,265 ----
        }									      \
      else								      \
        {									      \
! 	if (ch >= 0xc2 && ch < 0xe0)					      \
  	  {								      \
+ 	    /* We expect two bytes.  The first byte cannot be 0xc0 or 0xc1,   \
+ 	       otherwise the wide character could have been represented	      \
+ 	       using a single byte.  */					      \
  	    cnt = 2;							      \
  	    ch &= 0x1f;							      \
  	  }								      \
***************
*** 304,318 ****
  	    uint32_t byte = inptr[i];					      \
  									      \
  	    if ((byte & 0xc0) != 0x80)					      \
! 	      {								      \
! 		/* This is an illegal encoding.  */			      \
! 		result = GCONV_ILLEGAL_INPUT;				      \
! 		break;							      \
! 	      }								      \
  									      \
  	    ch <<= 6;							      \
  	    ch |= byte & 0x3f;						      \
  	  }								      \
  	inptr += cnt;							      \
        }									      \
  									      \
--- 307,329 ----
  	    uint32_t byte = inptr[i];					      \
  									      \
  	    if ((byte & 0xc0) != 0x80)					      \
! 	      /* This is an illegal encoding.  */			      \
! 	      break;							      \
  									      \
  	    ch <<= 6;							      \
  	    ch |= byte & 0x3f;						      \
  	  }								      \
+ 									      \
+ 	/* If i < cnt, some trail byte was not >= 0x80, < 0xc0.		      \
+ 	   If cnt > 2 and ch < 2^(5*cnt-4), the wide character ch could	      \
+ 	   have been represented with fewer than cnt bytes.  */		      \
+ 	if (i < cnt || (cnt > 2 && (ch >> (5 * cnt - 4)) == 0))		      \
+ 	  {								      \
+ 	    /* This is an illegal encoding.  */				      \
+ 	    result = GCONV_ILLEGAL_INPUT;				      \
+ 	    break;							      \
+ 	  }								      \
+ 									      \
  	inptr += cnt;							      \
        }									      \
  									      \

Follow-Ups:
- Re: fixes for UTF-8 decoder
  - From: Ulrich Drepper
