This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



What is the intended behaviour of recoding characters outside the target range?


Hi guys,

What happens (and what should happen) when I try to recode a character
existing in one encoding but missing in another?

This has two sides:

1. What should happen when I try to recode a file with iconv 
   for example?
2. What should happen when I try to use the file via the glibc library
   in a different locale with a different encoding?

Let's take the following example.
The file "plentitude" is encoded in UTF-8 and contains the following
characters:
======
ΑאАاႠԱ
======

These are as follows:

U+0391 GREEK CAPITAL LETTER ALPHA
U+05D0 HEBREW LETTER ALEF
U+0410 CYRILLIC CAPITAL LETTER A
U+0627 ARABIC LETTER ALEF
U+10A0 GEORGIAN CAPITAL LETTER AN
U+0531 ARMENIAN CAPITAL LETTER AYB

If I try to convert it to some 8-bit encoding, at least one of the
characters will most probably be missing from the target. Sometimes all
of them are missing, for example in the encoding of the C locale, which
is ASCII.
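
To make this concrete, here is a minimal Python sketch that checks which
of the six characters a given target encoding can represent. It uses
Python's own codec tables, not glibc's charmaps, and the helper name
unencodable and the choice of target encodings are assumptions made only
for illustration.

```python
# The six sample characters from the list above, built from their code points.
TEXT = "\u0391\u05d0\u0410\u0627\u10a0\u0531"


def unencodable(text, encoding):
    """Return the characters of `text` that `encoding` cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


# Codec names are Python's; the selection of encodings is illustrative only.
for enc in ("ascii", "iso8859_7", "iso8859_8", "koi8_r", "iso8859_6"):
    missing = unencodable(TEXT, enc)
    print(enc, ["U+%04X" % ord(c) for c in missing])
```

With these codecs ASCII lacks all six characters, while each of the
other encodings covers exactly one of them.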

In such cases:

1. What I get when trying to convert the file via iconv is usually:
iconv -f UTF-8 -t ASCII plentitude
iconv: illegal input sequence at position 0
or some other position, then iconv exits and the file is not recoded in
its entirety.
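
The same "stop at the first character that does not fit" behaviour can
be reproduced outside iconv. The following is a sketch using Python's
strict codec error handling as an analogy, not a description of how
glibc implements it. The helper name first_failure is hypothetical, and
note that Python reports a character index, whereas the position printed
by iconv refers to the input stream.

```python
TEXT = "\u0391\u05d0\u0410\u0627\u10a0\u0531"


def first_failure(text, encoding):
    """Return the index of the first unencodable character, or None."""
    try:
        text.encode(encoding)  # errors="strict" is the default
    except UnicodeEncodeError as err:
        return err.start  # index of the offending character
    return None


print(first_failure(TEXT, "ascii"))          # fails on the very first character
print(first_failure("abc" + TEXT, "ascii"))  # the ASCII prefix is fine
```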

2. However, when I change my locale to C
export LANG=C
I can still cat the file.
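
One likely reason is that cat copies bytes and never decodes them, so
the locale's encoding does not come into play. A minimal Python sketch
of such a byte-transparent copy follows. It is an illustration under
that assumption, not cat's actual source, and copy_bytes is a
hypothetical helper.

```python
import io

# The file contents as raw UTF-8 bytes.
data = "\u0391\u05d0\u0410\u0627\u10a0\u0531".encode("utf-8")


def copy_bytes(src, dst, chunk_size=4096):
    """Copy src to dst as raw bytes, without decoding anything."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)


out = io.BytesIO()
copy_bytes(io.BytesIO(data), out)
print(out.getvalue() == data)
```

Only a program that interprets the bytes as characters in the locale's
encoding has to decide what to do with the ones that do not fit.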


Now, suppose the behavior of iconv is meant to follow this passage.

The Unicode Standard states:
http://www.unicode.org/versions/Unicode4.0.0/ch05.pdf
Section 5.3 Unknown and Missing Characters
Reserved and Private-Use Character Codes
=================================================
An implementation should not attempt to interpret such code points.
However, in practice, applications must deal with unassigned code points
or private use characters. This may occur, for example, when the
application is handling text that originated on a system implementing a
later release of the Unicode Standard, with additional assigned
characters.
Options for rendering such unknown code points include printing the code
point as four to six hexadecimal digits, printing a black or white box,
using appropriate glyphs such as à for reserved and | for private use,
or simply displaying nothing. An implementation should not blindly
delete such characters, nor should it unintentionally transform them
into something else.
==================================================

Since iconv can neither print boxes nor ignore characters, and simply
deleting them or transforming them into another character is
unacceptable, it simply fails the conversion.

Is there a rule stating that a nonexistent character should be recoded
to some symbol, or should the conversion fail in such cases?
For example this
http://www.unicode.org/glossary/#replacement_character
states
Character used as a substitute for an uninterpretable character from
another encoding. The Unicode Standard uses U+FFFD REPLACEMENT CHARACTER
for this function.
ï
However, in most encodings this character does not exist.
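
For comparison, here is how one other implementation exposes the
choices discussed above. This is a sketch using Python's codec error
handlers, which are Python's own policy names and say nothing about
what glibc does or should do. The helper can_encode is hypothetical.

```python
TEXT = "\u0391\u05d0\u0410\u0627\u10a0\u0531"

# Each handler is a different policy for characters missing in the target.
for handler in ("replace", "ignore", "backslashreplace", "xmlcharrefreplace"):
    print(handler, TEXT.encode("ascii", errors=handler))


def can_encode(ch, encoding):
    """Return True if `encoding` has a representation for `ch`."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# U+FFFD itself is only available in Unicode encodings.
for enc in ("ascii", "latin-1", "utf-8"):
    print(enc, can_encode("\ufffd", enc))
```

Here "replace" substitutes "?" because U+FFFD is not available in the
target, "ignore" silently drops the characters, and the last two
handlers print the code point in an escaped form.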

What should be the behavior of programs trying to do such a conversion?

Kind regards:
al_shopov

