This is the mail archive of the libc-help@sourceware.org mailing list for the glibc project.



Question about iconv, UTF-8/16/32 and error reporting due to UTF-16 surrogates.


Hi,

When converting from UTF-16 to UTF-32, iconv() reports the error "invalid multibyte sequence" if the byte sequence contains a single low UTF-16 surrogate (0xdc00 .. 0xdfff).

Because of this requirement, the s390 hardware instructions for converting from UTF-16 to UTF-8 / UTF-32 were disabled, since they do not report this error.
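
The error on a lone low surrogate can be observed from user code with a small wrapper around iconv(). This is only a sketch: the helper name utf16le_to_utf32le_errno is made up for illustration, and it assumes the installed iconv knows the charset names "UTF-16LE" and "UTF-32LE".

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (not part of glibc): converts LEN bytes of
   UTF-16LE input to UTF-32LE and returns 0 on success, or the errno
   value set by iconv_open() / iconv() on failure.  */
int
utf16le_to_utf32le_errno (const char *in, size_t len)
{
  char out[64];
  char *inp = (char *) in;      /* iconv() takes a non-const pointer.  */
  char *outp = out;
  size_t inleft = len;
  size_t outleft = sizeof out;
  int result = 0;

  iconv_t cd = iconv_open ("UTF-32LE", "UTF-16LE");
  if (cd == (iconv_t) -1)
    return errno;

  if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
    result = errno;

  iconv_close (cd);
  return result;
}
```

Passing the two bytes 0x00 0xdc (the code unit 0xdc00 in little-endian order) should make the helper return a non-zero errno value, EILSEQ according to the behaviour described above, while the bytes 0x41 0x00 should convert cleanly.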

When converting from UTF-32 to UTF-8 / UTF-16, the s390 hardware instructions do not report an error if a UTF-32 character is in the range of a UTF-16 low surrogate (0xdc00 .. 0xdfff).
Should iconv() report the error "invalid multibyte sequence" in such cases?
If yes, then these two hardware instructions have to be disabled, too!

For comparison, the common code does not report an error for such a low-surrogate character when converting from UTF-32 to INTERNAL or from INTERNAL to UTF-8.

In the other direction, from UTF-8 to INTERNAL, characters in the range of a UTF-16 surrogate are not accepted and iconv returns the error "invalid multibyte sequence". The same happens when converting from INTERNAL to UTF-32.

According to the comment
"/* Surrogate characters in UCS-4 input are not valid. We must catch this. If we let surrogates pass through, attackers could make a security hole exploit by generating "irregular UTF-32" sequences. */"
in utf-32.c, this is a security issue.
What is the reason for reporting an error in the direction from UTF-8 to UTF-32, but not in the direction from UTF-32 to UTF-8?
Or is it a bug?
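
To make the question concrete, here is a sketch of what a strict UTF-32 to UTF-8 step could look like if surrogates were rejected in this direction too. The function utf8_encode_strict is hypothetical and is not taken from the glibc sources; it only illustrates where the extra range check would sit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not glibc code.  Writes the UTF-8 encoding of
   CP to OUT (which must have room for 4 bytes) and returns the number
   of bytes written.  Returns 0 if CP is a surrogate code point or lies
   above 0x10FFFF, i.e. if the input would be ill-formed UTF-32.  */
size_t
utf8_encode_strict (uint32_t cp, unsigned char *out)
{
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return 0;                   /* Surrogate: reject.  */
  if (cp > 0x10FFFF)
    return 0;                   /* Outside the Unicode code space.  */

  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  out[0] = (unsigned char) (0xF0 | (cp >> 18));
  out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
  out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
  out[3] = (unsigned char) (0x80 | (cp & 0x3F));
  return 4;
}
```

Without the first check, a value such as 0xdc00 would simply be emitted as a three-byte sequence, which is the asymmetry asked about above.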


According to the latest Unicode Standard, an error should be reported in all cases:
See http://www.unicode.org/versions/Unicode8.0.0/ch03.pdf
in chapter 3.9 Unicode Encoding Forms:
"D76    Unicode scalar value:
Any Unicode code point except high-surrogate and low-surrogate code points.
• As a result of this definition, the set of Unicode scalar values consists of the ranges 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆, inclusive.

D84 Ill-formed: A Unicode code unit sequence that purports to be in a Unicode encoding form is called ill-formed if and only if it does not follow the specification of that Unicode encoding form.
• Any code unit sequence that would correspond to a code point outside the defined range of Unicode scalar values would, for example, be ill-formed.

UTF-32: D90: ...
• Because surrogate code points are not included in the set of Unicode scalar values, UTF-32 code units in the range 0000D800₁₆..0000DFFF₁₆ are ill-formed.

UTF-16: D91: ...
• Because surrogate code points are not Unicode scalar values, isolated UTF-16 code units in the range D800₁₆..DFFF₁₆ are ill-formed.

UTF-8: D92: ...
• Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points U+D800..U+DFFF is ill-formed.

Encoding Form Conversion: D93: ...
• A conformant encoding form conversion will treat any ill-formed code unit sequence as an error condition. (See conformance clause C10.) This guarantees that it will neither interpret nor emit an ill-formed code unit sequence. Any implementation of encoding form conversion must take this requirement into account, because an encoding form conversion implicitly involves a verification that the Unicode strings being converted do, in fact, contain well-formed code unit sequences."
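
Expressed in C, the scalar-value definition (D76) quoted above amounts to a simple range test. The function names below are chosen for illustration only and do not exist in glibc.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative helpers, not glibc code.  */

/* True for high- and low-surrogate code points (D800..DFFF).  */
bool
is_surrogate (uint32_t cp)
{
  return cp >= 0xD800 && cp <= 0xDFFF;
}

/* True for Unicode scalar values as defined in D76:
   0..D7FF and E000..10FFFF, inclusive.  */
bool
is_unicode_scalar_value (uint32_t cp)
{
  return cp <= 0x10FFFF && !is_surrogate (cp);
}
```

Under this reading of the standard, a conformant converter would apply the same test to every decoded code point, regardless of whether the source is UTF-8, UTF-16 or UTF-32.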


There is a further issue in utf-16.c when converting from UTF-16 to INTERNAL. If a uint16_t value is in the range 0xd800 .. 0xdfff, the next uint16_t value is checked to see whether it is in the low-surrogate range (0xdc00 .. 0xdfff). The two uint16_t values are then interpreted as a high/low surrogate pair. But there is no test that the first uint16_t value is really in the high-surrogate range (0xd800 .. 0xdbff). If two consecutive uint16_t values are both in the low-surrogate range, they are treated as a valid high/low surrogate pair.
Should iconv() report the error "invalid multibyte sequence" in such a case?
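
For illustration, a decoder that performs the missing high-surrogate test could look roughly like the sketch below. It is not the code from utf-16.c; the function utf16_decode_strict and its return convention are assumptions made for this example, and a real converter would additionally distinguish truncated input from invalid input.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder, not taken from utf-16.c.  Decodes one code
   point from the N code units starting at IN.  Returns the number of
   code units consumed (1 or 2) and stores the code point in *CP, or
   returns 0 for an isolated or mismatched surrogate, a truncated
   pair, or empty input.  */
size_t
utf16_decode_strict (const uint16_t *in, size_t n, uint32_t *cp)
{
  if (n == 0)
    return 0;

  uint16_t u1 = in[0];
  if (u1 < 0xD800 || u1 > 0xDFFF)
    {
      *cp = u1;                 /* Ordinary BMP code unit.  */
      return 1;
    }

  /* U1 is a surrogate.  Only a high surrogate may start a pair.  */
  if (u1 > 0xDBFF)
    return 0;                   /* Low surrogate in first position.  */
  if (n < 2)
    return 0;                   /* Pair is cut off.  */

  uint16_t u2 = in[1];
  if (u2 < 0xDC00 || u2 > 0xDFFF)
    return 0;                   /* High surrogate without a low one.  */

  *cp = 0x10000 + (((uint32_t) (u1 - 0xD800) << 10)
                   | (uint32_t) (u2 - 0xDC00));
  return 2;
}
```

With this check in place, the input 0xdc00 0xdc01 is rejected instead of being combined into a code point.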

Bye
Stefan

