This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.



[Bug libc/10093] New: iconv accepts UTF-8-encoded UTF-16 surrogates


According to 'man utf-8':

| The UCS code values 0xd800–0xdfff (UTF-16 surrogates) as well as 0xfffe
| and 0xffff (UCS non-characters) should not appear in conforming UTF-8
| streams.

This is confirmed by RFC 2279:
| The algorithm for encoding UCS-2 (or Unicode) to UTF-8 can be
| obtained from the above, in principle, by simply extending each
| UCS-2 character with two zero-valued octets.  However, pairs of
| UCS-2 values between D800 and DFFF (surrogate pairs in Unicode
| parlance), being actually UCS-4 characters transformed through
| UTF-16, need special treatment: the UTF-16 transformation must be
| undone, yielding a UCS-4 character that is then transformed as
| above.
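
As an illustration of the transformation the RFC describes, here is a small
sketch in Python (not glibc code; the helper name combine_surrogates is made
up for this example). It undoes the UTF-16 transformation for the pair
0xd808 + 0xdf45 and prints the UTF-8 encoding of the resulting UCS-4
character:

```python
def combine_surrogates(high, low):
    """Undo the UTF-16 transformation for one surrogate pair (hypothetical helper)."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    # Each surrogate carries 10 bits; the pair encodes code points from 0x10000 up.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD808, 0xDF45)
print(hex(code_point))                        # 0x12345
print(chr(code_point).encode("utf-8").hex())  # f0928d85
```

So a conforming UTF-8 stream would carry this character as the four bytes
f0 92 8d 85, not as two three-byte sequences starting with 0xed.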

However, the following commands show that iconv accepts such invalid
characters:

$ s='\xed\xa0\x88\xed\xbd\x85' # 0xd808 + 0xdf45 
$ for e in UTF-8 UTF-16 UTF-32 UCS-4 ; do printf "$e\t" ; printf $s | iconv -f UTF-8 -t $e > /dev/null && printf 'OK\n' ; done
UTF-8   OK
UTF-16  iconv: illegal input sequence at position 0
UTF-32  iconv: illegal input sequence at position 0
UCS-4   OK
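
For comparison, a strict UTF-8 decoder rejects the same bytes. The sketch
below uses Python's built-in codec only as a reference point; it says nothing
about how iconv is implemented, and the helper name is_strict_utf8 is an
assumption made for this example:

```python
data = b"\xed\xa0\x88\xed\xbd\x85"  # UTF-8-style encoding of 0xd808 + 0xdf45


def is_strict_utf8(buf):
    """Return True if buf decodes under Python's strict UTF-8 codec."""
    try:
        buf.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_strict_utf8(data))  # False

# The "surrogatepass" error handler lets the surrogates through, which shows
# what the bytes actually contain.
decoded = data.decode("utf-8", errors="surrogatepass")
print([hex(ord(c)) for c in decoded])  # ['0xd808', '0xdf45']
```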

-- 
           Summary: iconv accepts UTF-8-encoded UTF-16 surrogates
           Product: glibc
           Version: unspecified
            Status: NEW
          Severity: normal
          Priority: P2
         Component: libc
        AssignedTo: drepper at redhat dot com
        ReportedBy: aurelien at aurel32 dot net
                CC: glibc-bugs at sources dot redhat dot com
 GCC build triplet: x86_64-unknown-linux-gnu
  GCC host triplet: x86_64-unknown-linux-gnu
GCC target triplet: x86_64-unknown-linux-gnu


http://sourceware.org/bugzilla/show_bug.cgi?id=10093


