This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0

From: Alexandre Oliva <aoliva at redhat dot com>
To: "Carlos O'Donell" <carlos at redhat dot com>
Cc: Pravin Satpute <psatpute at redhat dot com>, Siddhesh Poyarekar <siddhesh at redhat dot com>, Mike FABIAN <mfabian at redhat dot com>, libc-alpha at sourceware dot org, Jens Petersen <petersen at redhat dot com>
Date: Wed, 18 Feb 2015 21:23:45 -0200
Subject: Re: [PATCH] [BZ 17588 13064] Update UTF-8 charmap and width to Unicode 7.0.0
Authentication-results: sourceware.org; auth=none
References: <573624784 dot 8871393 dot 1416848051220 dot JavaMail dot zimbra at redhat dot com> <orzjb3o7yf dot fsf at free dot home> <s9dy4qir6fu dot fsf at ari dot site> <orfvce7y90 dot fsf at free dot home> <s9d388duu5r dot fsf at ari dot site> <orioh35mbq dot fsf at free dot home> <20141223111038 dot GA5172 at spoyarek dot pnq dot redhat dot com> <119234933 dot 5523688 dot 1422972847328 dot JavaMail dot zimbra at redhat dot com> <or7fvnlbeo dot fsf at livre dot home> <orwq3njuvc dot fsf at livre dot home> <54E23EC9 dot 5020400 at redhat dot com>

On Feb 16, 2015, "Carlos O'Donell" <carlos@redhat.com> wrote:

> On 02/12/2015 05:18 AM, Alexandre Oliva wrote:
>>> Regression tested on x86_64-linux-gnu.  Ok to install?

> Yes, this version is OK to install if you fix all the nits.

Thanks.

> Despite complaints that a change in the generator would create
> a smaller diff, that doesn't matter to me.

The script changes were small and I figured it wouldn't hurt to merge
them and reduce the diff, so I did.  Since that changes the patch, I'll
wait for another ACK before checking this in.

I also added the downloaded files to the tree, so that binary
distributors don't risk running afoul of the LGPL for lack of the .txt
files.  It's not clear that they would be required, but it doesn't hurt
to put them in.  I also added unicode-license.txt, copied from other
packages that ship it.  I couldn't find the text file for download from
unicode.org, though I admittedly didn't search very thoroughly.

> Nit: ChangeLog needs [BZ #xxx] etc.

*check*.  Heh, I didn't realize there were open bugs about this, in
spite of the mention in the Subject.  Doh!

> Nit: This covers bugs 17588, 13064, *AND* 14094.

*check*

> Nit: Needs a NEWS entry describing this in full glory :-)

* Character encoding and ctype tables were updated to Unicode 7.0.0, using
  new generator scripts contributed by Pravin Satpute and Mike FABIAN (Red
  Hat).  These updates cause user visible changes, such as the fix for bug
  17998.

> Some might argue it fits better under "scripts" e.g. scripts/unicode-gen,
> but I don't care. We can move it later if we think it should move at all.

*nod*

>>> * unicode-gen/gen_unicode_ctype.py: New generator.

> Nit: Wrong copyright year e.g. 2014 -> 2015.

*check*.  I added ", 2015" after 2014 in the scripts.

> Nit: We don't use "Contributed by" statements, they are instead part of what
>      git records as Author or in the git commit message.

*check*.  I removed them from the scripts, and added them as "from" in
the ChangeLog and in NEWS.

I also removed the links that pointed to github as upstream, since I
understand the GNU libc repository is going to hold the master copy, and
the repository that was linked to is thus obsolescent.

>>> * tst-ctype-de_DE.ISO-8859-1.in: Adjust, islower now returns
>>> true for ordinal indicators.

> Nit: This needs a specific new BZ for the fix to user-visible behaviour.

*check*: [BZ #17998]

Here's the header of the patch and the incremental changes to the
scripts, relative to the previously posted version.

The entire patch can be found in the lzip-compressed attachment.


for  localedata/ChangeLog

	[BZ #17588]
	[BZ #13064]
	[BZ #14094]
	[BZ #17998]
	* unicode-gen/Makefile: New.
	* unicode-gen/unicode-license.txt: New, from Unicode.
	* unicode-gen/UnicodeData.txt: New, from Unicode.
	* unicode-gen/DerivedCoreProperties.txt: New, from Unicode.
	* unicode-gen/EastAsianWidth.txt: New, from Unicode.
	* unicode-gen/gen_unicode_ctype.py: New generator, from Mike
	FABIAN <mfabian@redhat.com>.
	* unicode-gen/ctype_compatibility.py: New verifier, from
	Pravin Satpute <psatpute@redhat.com> and Mike FABIAN.
	* unicode-gen/ctype_compatibility_test_cases.py: New verifier
	module, from Mike FABIAN.
	* unicode-gen/utf8_gen.py: New generator, from Pravin Satpute
	and Mike FABIAN.
	* unicode-gen/utf8_compatibility.py: New verifier, from Pravin
	Satpute and Mike FABIAN.
	* charmaps/UTF-8: Update.
	* locales/i18n: Update.
	* gen-unicode-ctype.c: Remove.
	* tst-ctype-de_DE.ISO-8859-1.in: Adjust, islower now returns
	true for ordinal indicators.
---
 NEWS                                               |   11 
 localedata/charmaps/UTF-8                          |11946 ++++++---
 localedata/gen-unicode-ctype.c                     |  784 -
 localedata/locales/i18n                            | 2652 +-
 localedata/tst-ctype-de_DE.ISO-8859-1.in           |    2 
 localedata/unicode-gen/DerivedCoreProperties.txt   |10794 ++++++++
 localedata/unicode-gen/EastAsianWidth.txt          | 2121 ++
 localedata/unicode-gen/Makefile                    |   99 
 localedata/unicode-gen/UnicodeData.txt             |27268 ++++++++++++++++++++
 localedata/unicode-gen/ctype_compatibility.py      |  546 
 .../unicode-gen/ctype_compatibility_test_cases.py  |  951 +
 localedata/unicode-gen/gen_unicode_ctype.py        |  751 +
 localedata/unicode-gen/unicode-license.txt         |   50 
 localedata/unicode-gen/utf8_compatibility.py       |  399 
 localedata/unicode-gen/utf8_gen.py                 |  286 
 15 files changed, 53278 insertions(+), 5382 deletions(-)
 delete mode 100644 localedata/gen-unicode-ctype.c
 create mode 100644 localedata/unicode-gen/DerivedCoreProperties.txt
 create mode 100644 localedata/unicode-gen/EastAsianWidth.txt
 create mode 100644 localedata/unicode-gen/Makefile
 create mode 100644 localedata/unicode-gen/UnicodeData.txt
 create mode 100755 localedata/unicode-gen/ctype_compatibility.py
 create mode 100644 localedata/unicode-gen/ctype_compatibility_test_cases.py
 create mode 100755 localedata/unicode-gen/gen_unicode_ctype.py
 create mode 100644 localedata/unicode-gen/unicode-license.txt
 create mode 100755 localedata/unicode-gen/utf8_compatibility.py
 create mode 100755 localedata/unicode-gen/utf8_gen.py

diff --git a/NEWS b/NEWS
index 0501d51..a59b68d 100644
--- a/NEWS
+++ b/NEWS
@@ -9,8 +9,15 @@ Version 2.22
 
 * The following bugs are resolved with this release:
 
-  4719, 15319, 15467, 15790, 16560, 17569, 17792, 17912, 17932, 17944,
-  17949, 17964, 17965, 17967, 17969, 17978, 17987, 17991, 17996.
+  4719, 13064, 14094, 15319, 15467, 15790, 16560, 17569, 17588, 17792,
+  17912, 17932, 17944, 17949, 17964, 17965, 17967, 17969, 17978, 17987,
+  17991, 17996, 17998.
+
+* Character encoding and ctype tables were updated to Unicode 7.0.0, using
+  new generator scripts contributed by Pravin Satpute and Mike FABIAN (Red
+  Hat).  These updates cause user visible changes, such as the fix for bug
+  17998.
+

 Version 2.21
 

Incremental changes to the scripts:

diff --git a/localedata/unicode-gen/ctype_compatibility.py b/localedata/unicode-gen/ctype_compatibility.py
index 9535f81..19e9ee5 100755
--- a/localedata/unicode-gen/ctype_compatibility.py
+++ b/localedata/unicode-gen/ctype_compatibility.py
@@ -1,10 +1,7 @@
 #!/usr/bin/python3
 # -*- coding: utf-8 -*-
-# Copyright (C) 2014 Free Software Foundation, Inc.
+# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
 # This file is part of the GNU C Library.
-# Contributed by
-#   Pravin Satpute <psatpute@redhat.com>, 2014.
-#   Mike FABIAN <mfabian@redhat.com>, 2014.
 #
 # The GNU C Library is free software; you can redistribute it and/or
 # modify it under the terms of the GNU Lesser General Public
diff --git a/localedata/unicode-gen/ctype_compatibility_test_cases.py b/localedata/unicode-gen/ctype_compatibility_test_cases.py
index 09438d7..ab7f6dd 100644
--- a/localedata/unicode-gen/ctype_compatibility_test_cases.py
+++ b/localedata/unicode-gen/ctype_compatibility_test_cases.py
@@ -1,8 +1,6 @@
 # -*- coding: utf-8 -*-
-# Copyright (C) 2014 Free Software Foundation, Inc.
+# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
 # This file is part of the GNU C Library.
-# Contributed by
-#   Mike FABIAN <mfabian@redhat.com>, 2014.
 #
 # The GNU C Library is free software; you can redistribute it and/or
 # modify it under the terms of the GNU Lesser General Public
diff --git a/localedata/unicode-gen/gen_unicode_ctype.py b/localedata/unicode-gen/gen_unicode_ctype.py
index 24155bd..559af79 100755
--- a/localedata/unicode-gen/gen_unicode_ctype.py
+++ b/localedata/unicode-gen/gen_unicode_ctype.py
@@ -1,9 +1,8 @@
 #!/usr/bin/python3
 #
 # Generate a Unicode conforming LC_CTYPE category from a UnicodeData file.
-# Copyright (C) 2014 Free Software Foundation, Inc.
+# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
 # This file is part of the GNU C Library.
-# Contributed by Mike FABIAN <maiku.fabian@gmail.com>, 2014.
 # Based on gen-unicode-ctype.c by Bruno Haible <haible@clisp.cons.org>, 2000.
 #
 # The GNU C Library is free software; you can redistribute it and/or
diff --git a/localedata/unicode-gen/utf8_compatibility.py b/localedata/unicode-gen/utf8_compatibility.py
index 4928e3e..e11327b 100755
--- a/localedata/unicode-gen/utf8_compatibility.py
+++ b/localedata/unicode-gen/utf8_compatibility.py
@@ -1,9 +1,7 @@
 #!/usr/bin/python3
 # -*- coding: utf-8 -*-
-# Copyright (C) 2014 Free Software Foundation, Inc.
+# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
 # This file is part of the GNU C Library.
-# Contributed by Pravin Satpute <psatpute@redhat.com>, 2014.
-#                Mike FABIAN <mfabian@redhat.com>, 2014
 #
 # The GNU C Library is free software; you can redistribute it and/or
 # modify it under the terms of the GNU Lesser General Public
@@ -27,8 +25,6 @@ To see how this script is used, call it with the “-h” option:
 
     $ ./utf8_compatibility.py -h
     … prints usage message …
-
-For issues upstream https://github.com/pravins/glibc-i18n
 '''
 
 import sys
diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
index 9ffb7f6..670a628 100755
--- a/localedata/unicode-gen/utf8_gen.py
+++ b/localedata/unicode-gen/utf8_gen.py
@@ -1,10 +1,7 @@
 #!/usr/bin/python3
 # -*- coding: utf-8 -*-
-# Copyright (C) 2014 Free Software Foundation, Inc.
+# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
 # This file is part of the GNU C Library.
-# Contributed by
-# Pravin Satpute <psatpute AT redhat DOT com> and
-# Mike Fabian <mfabian At redhat DOT com> - 2014
 #
 # The GNU C Library is free software; you can redistribute it and/or
 # modify it under the terms of the GNU Lesser General Public
@@ -28,13 +25,30 @@ from Unicode data.
 Usage: python3 utf8_gen.py UnicodeData.txt EastAsianWidth.txt
 
 It will output UTF-8 file
-
-For issues upstream https://github.com/pravins/glibc-i18n
 '''
 
 import sys
 import re
 
+# Auxiliary tables for Hangul syllable names, see the Unicode 3.0 book,
+# sections 3.11 and 4.4.
+
+jamo_initial_short_name = [
+    'G', 'GG', 'N', 'D', 'DD', 'R', 'M', 'B', 'BB', 'S', 'SS', '', 'J', 'JJ',
+    'C', 'K', 'T', 'P', 'H'
+]
+
+jamo_medial_short_name = [
+    'A', 'AE', 'YA', 'YAE', 'EO', 'E', 'YEO', 'YE', 'O', 'WA', 'WAE', 'OE',
+    'YO', 'U', 'WEO', 'WE', 'WI', 'YU', 'EU', 'YI', 'I'
+]
+
+jamo_final_short_name = [
+    '', 'G', 'GG', 'GS', 'N', 'NI', 'NH', 'D', 'L', 'LG', 'LM', 'LB', 'LS',
+    'LT', 'LP', 'LH', 'M', 'B', 'BS', 'S', 'SS', 'NG', 'J', 'C', 'K', 'T',
+    'P', 'H'
+]
+
 def ucs_symbol(code_point):
     '''Return the UCS symbol string for a Unicode character.'''
     if code_point < 0x10000:
@@ -57,8 +71,15 @@ def process_range(start, end, outfile, name):
         #
         # So we expand the Hangul Syllables here:
         for i in range(int(start, 16), int(end, 16)+1 ):
-            outfile.write('{:s}     {:s} {:s}\n'.format(
-                ucs_symbol(i), convert_to_hex(i), name))
+            index2, index3 = divmod(i - 0xaC00, 28)
+            index1, index2 = divmod(index2, 21)
+            hangul_syllable_name = 'HANGUL SYLLABLE ' \
+                                   + jamo_initial_short_name[index1] \
+                                   + jamo_medial_short_name[index2] \
+                                   + jamo_final_short_name[index3]
+            outfile.write('{:<11s} {:<12s} {:s}\n'.format(
+                ucs_symbol(i), convert_to_hex(i),
+                hangul_syllable_name))
         return
     # UnicodeData.txt file has contains code point ranges like this:
     #
@@ -73,13 +94,13 @@ def process_range(start, end, outfile, name):
     # <U4D80>..<U4DB5>     /xe4/xb6/x80         <CJK Ideograph Extension A>
     for i in range(int(start, 16), int(end, 16), 64 ):
         if i > (int(end, 16)-64):
-            outfile.write('{:s}..{:s}     {:s} {:s}\n'.format(
+            outfile.write('{:s}..{:s} {:<12s} {:s}\n'.format(
                     ucs_symbol(i),
                     ucs_symbol(int(end,16)),
                     convert_to_hex(i),
                     name))
             break
-        outfile.write('{:s}..{:s}     {:s} {:s}\n'.format(
+        outfile.write('{:s}..{:s} {:<12s} {:s}\n'.format(
                 ucs_symbol(i),
                 ucs_symbol(i+63),
                 convert_to_hex(i),
@@ -146,7 +167,7 @@ def process_charmap(flines, outfile):
             # the original UTF-8 file in glibc had them as
             # comments, so we keep these comment lines.
             outfile.write('%')
-        outfile.write('{:s}     {:s} {:s}\n'.format(
+        outfile.write('{:<11s} {:<12s} {:s}\n'.format(
                 ucs_symbol(int(fields[0], 16)),
                 convert_to_hex(int(fields[0], 16)),
                 fields[1]))
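
For readers following the Hangul hunk in process_range above, here is a
minimal standalone sketch of the name computation.  The three tables are
copied from the patch as posted; the function name hangul_syllable_name
and the constant HANGUL_BASE are illustrative and not part of the patch.
The offset from U+AC00 is split with two divmod calls into an initial
(19 values), a medial (21 values) and a final (28 values) index.

```python
# Tables copied from the utf8_gen.py hunk in this patch.
jamo_initial_short_name = [
    'G', 'GG', 'N', 'D', 'DD', 'R', 'M', 'B', 'BB', 'S', 'SS', '', 'J', 'JJ',
    'C', 'K', 'T', 'P', 'H'
]

jamo_medial_short_name = [
    'A', 'AE', 'YA', 'YAE', 'EO', 'E', 'YEO', 'YE', 'O', 'WA', 'WAE', 'OE',
    'YO', 'U', 'WEO', 'WE', 'WI', 'YU', 'EU', 'YI', 'I'
]

jamo_final_short_name = [
    '', 'G', 'GG', 'GS', 'N', 'NI', 'NH', 'D', 'L', 'LG', 'LM', 'LB', 'LS',
    'LT', 'LP', 'LH', 'M', 'B', 'BS', 'S', 'SS', 'NG', 'J', 'C', 'K', 'T',
    'P', 'H'
]

HANGUL_BASE = 0xAC00   # first Hangul syllable (illustrative constant name)
HANGUL_LAST = 0xD7A3   # last Hangul syllable


def hangul_syllable_name(code_point):
    '''Return the algorithmic name of a precomposed Hangul syllable.'''
    if not HANGUL_BASE <= code_point <= HANGUL_LAST:
        raise ValueError('not a Hangul syllable: U+{:04X}'.format(code_point))
    # Peel off the final (28 choices), then the medial (21 choices);
    # what remains is the initial index.
    rest, final = divmod(code_point - HANGUL_BASE, 28)
    initial, medial = divmod(rest, 21)
    return ('HANGUL SYLLABLE '
            + jamo_initial_short_name[initial]
            + jamo_medial_short_name[medial]
            + jamo_final_short_name[final])


print(hangul_syllable_name(0xAC00))  # HANGUL SYLLABLE GA
print(hangul_syllable_name(0xD55C))  # HANGUL SYLLABLE HAN
print(hangul_syllable_name(0xD7A3))  # HANGUL SYLLABLE HIH
```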

Attachment: unicode7-update-and-scripts.patch.lz
Description: Binary data
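
The range hunks in utf8_gen.py above emit one charmap line per block of
64 code points, with the byte sequence padded to 12 columns.  The sketch
below is a simplified variant of that loop, not the patch's code: it
clamps the last block with min() instead of the explicit break.  The
bodies of ucs_symbol and convert_to_hex are not shown in this message,
so the versions here are assumptions chosen to match the sample line in
the hunk's comment, and range_lines is a hypothetical helper name.

```python
def ucs_symbol(code_point):
    '''Assumed format: <UXXXX> below U+10000, <UXXXXXXXX> from there on.'''
    if code_point < 0x10000:
        return '<U{:04X}>'.format(code_point)
    return '<U{:08X}>'.format(code_point)


def convert_to_hex(code_point):
    '''Hypothetical stand-in: the UTF-8 bytes written as /xNN escapes.'''
    return ''.join('/x{:02x}'.format(byte)
                   for byte in chr(code_point).encode('utf-8'))


def range_lines(start, end, name, step=64):
    '''Yield one charmap line per block of at most `step` code points.'''
    for low in range(start, end + 1, step):
        high = min(low + step - 1, end)  # last block may be shorter
        yield '{:s}..{:s} {:<12s} {:s}'.format(
            ucs_symbol(low), ucs_symbol(high), convert_to_hex(low), name)


lines = list(range_lines(0x3400, 0x4DB5, '<CJK Ideograph Extension A>'))
print(lines[0])   # <U3400>..<U343F> /xe3/x90/x80 <CJK Ideograph Extension A>
print(lines[-1])  # <U4D80>..<U4DB5> /xe4/xb6/x80 <CJK Ideograph Extension A>
```

The last line agrees with the <U4D80>..<U4DB5> example quoted in the
hunk's comment, apart from column padding.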

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist|Red Hat Brasil GNU Toolchain Engineer

