This is the mail archive of the libc-locales@sourceware.org mailing list for the GNU libc locales project.



[Bug localedata/17588] Update UTF-8 charmap and width to Unicode 7.0.0


https://sourceware.org/bugzilla/show_bug.cgi?id=17588

--- Comment #3 from Mike FABIAN <maiku.fabian at gmail dot com> ---
(In reply to Pravin S from comment #2)
> Created attachment 7958 [details]
> Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0
> 
> Mike did review on it earlier and done updates to glibc-i18n git.
> https://github.com/pravins/glibc-i18n
> 
> I have updated patch based on those improvement.
> 
> Latest report on backward compatibility is available AT
> https://raw.githubusercontent.com/pravins/glibc-i18n/master/report-utf8  
> 
> Note: Please file word Analysis, it is done after report is generated to
> make sure changes are correct.
> 
> Mike please review patch and give your comments.

To check whether the new generated UTF-8 file is correct,
I ran the utf8-compatibility.py script (updated version) like this:

python3 utf8-compatibility.py -o ../glibc/localedata/charmaps/UTF-8 -n UTF-8 \
    -u unicode7-0/UnicodeData.txt -e unicode7-0/EastAsianWidth.txt -c
Report on CHARMAP:
This character might be missing in the generated charmap:  <U9F80>..<U9FC3>
************************************************************

Report on WIDTH:
Total changed characters in newly generated WIDTH:  88827
changed width: 0x00ad : 1->0 eaw=A category=Cf bidi=BN  name=SOFT HYPHEN
...
changed width: 0xa960 : 1->2 eaw=W category=Lo bidi=L   name=HANGUL CHOSEONG TIKEUT-MIEUM
...
many such lines
...

Now I go through these lines. For example, the change mentioned above,
where the width of a character changes from 1 to 2 and the character has
East Asian Width “W” and category “Lo”, is certainly correct.
(This character was not in the old UTF-8 file. Only characters with
width 0 and 2 are listed in the file; 1 is the default width, so every
character not in the UTF-8 file gets the default width 1.)
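
The default-width rule described above can be sketched as a small lookup.
This is only an illustration, not code from glibc or from the glibc-i18n
scripts: the names WIDTH_TABLE and char_width are hypothetical, and the two
entries are just the sample values taken from the report.

```python
# Hypothetical excerpt of a WIDTH section: only widths 0 and 2 are listed.
WIDTH_TABLE = {
    0x00AD: 0,  # SOFT HYPHEN (newly generated file, per the report)
    0xA960: 2,  # HANGUL CHOSEONG TIKEUT-MIEUM (newly generated file)
}

DEFAULT_WIDTH = 1  # every character not listed gets this width


def char_width(code_point, table=WIDTH_TABLE):
    """Return the listed width, or the default width 1 if not listed."""
    return table.get(code_point, DEFAULT_WIDTH)


# Assumption for the demo: the old file listed neither character.
old_table = {}
print("0xa960:", char_width(0xA960, old_table), "->", char_width(0xA960))
```

A character that is absent from the old table and listed with width 2 in the
new one therefore shows up in the report as a 1->2 change.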

As this change looks correct, I remove all lines like this from my Emacs
buffer with:

    “M-x flush-lines RET 1->2 eaw=W category=Lo”
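
For readers who do not use Emacs, roughly the same filtering step could be
done in Python. This is a sketch, not part of utf8-compatibility.py; the
function name flush_lines is hypothetical (borrowed from the Emacs command),
and the sample lines are copied from the report above.

```python
import re


def flush_lines(lines, pattern):
    """Drop every line that matches the regular expression `pattern`."""
    regex = re.compile(pattern)
    return [line for line in lines if not regex.search(line)]


report = [
    "changed width: 0x00ad : 1->0 eaw=A category=Cf bidi=BN  name=SOFT HYPHEN",
    "changed width: 0xa960 : 1->2 eaw=W category=Lo bidi=L   "
    "name=HANGUL CHOSEONG TIKEUT-MIEUM",
]

# Remove the changes already judged to be correct.
remaining = flush_lines(report, r"1->2 eaw=W category=Lo")
for line in remaining:
    print(line)
```
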

Removing lines with obviously correct changes like this quickly
reduces the number of lines to look at, and after a while only these remain:

changed width: 0x00ad : 1->0 eaw=A category=Cf bidi=BN  name=SOFT HYPHEN
changed width: 0x3248 : 2->1 eaw=A category=No bidi=L   name=CIRCLED NUMBER TEN ON BLACK SQUARE
changed width: 0x3249 : 2->1 eaw=A category=No bidi=L   name=CIRCLED NUMBER TWENTY ON BLACK SQUARE
changed width: 0x324a : 2->1 eaw=A category=No bidi=L   name=CIRCLED NUMBER THIRTY ON BLACK SQUARE
changed width: 0x324b : 2->1 eaw=A category=No bidi=L   name=CIRCLED NUMBER FORTY ON BLACK SQUARE
changed width: 0x324c : 2->1 eaw=A category=No bidi=L   name=CIRCLED NUMBER FIFTY ON BLACK SQUARE
changed width: 0x324d : 2->1 eaw=A category=No bidi=L   name=CIRCLED NUMBER SIXTY ON BLACK SQUARE
changed width: 0x324e : 2->1 eaw=A category=No bidi=L   name=CIRCLED NUMBER SEVENTY ON BLACK SQUARE
changed width: 0x324f : 2->1 eaw=A category=No bidi=L   name=CIRCLED NUMBER EIGHTY ON BLACK SQUARE

The change for the characters with eaw=A (East Asian Width
“Ambiguous”), where the width changed from 2 to 1, is also correct, I think.
The UTF-8 file is a generic file, not one specifically for an East Asian
locale, so the “Ambiguous” characters should not have width 2.

Then only the soft hyphen remains, which puzzles me a bit:

changed width: 0x00ad : 1->0 eaw=A category=Cf bidi=BN  name=SOFT HYPHEN

Our script gives width 0 to this character because of category=Cf.

But the display width of the soft hyphen depends on whether it is
in the middle of a line (where it is invisible) or happens to be at the end
of a line, where it should be visible (and doesn’t it have a width greater
than zero if it is visible?).
Still, giving width 0 to the soft hyphen in the UTF-8 file seems the
right thing to me.
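
The rules discussed in this comment can be tried out with Python's
unicodedata module. This is a rough sketch covering only the cases mentioned
here (category Cf gives 0, East Asian Width “W” gives 2, everything else
including “Ambiguous” gives 1); it is not the logic of the actual generator
script, the function name sketch_width is hypothetical, and unicodedata uses
whatever Unicode version the Python interpreter ships, not necessarily 7.0.0.

```python
import unicodedata


def sketch_width(char):
    """Simplified width rule based only on the cases in this comment."""
    if unicodedata.category(char) == "Cf":
        return 0  # format characters such as SOFT HYPHEN
    if unicodedata.east_asian_width(char) == "W":
        return 2  # East Asian Wide
    return 1  # default, which also covers eaw=A (Ambiguous)


for char in ("\u00ad", "\ua960", "\u3248"):
    print(
        "0x%04x" % ord(char),
        "eaw=" + unicodedata.east_asian_width(char),
        "category=" + unicodedata.category(char),
        "width=%d" % sketch_width(char),
    )
```
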

