This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] [BZ 14094] Update locale data to Unicode 7.0.0


On Sun, 22 Jun 2014, Pravin Satpute wrote:

> Hi All,
> 
>   Attached patch to fix this long pending issue.

Thanks for working on this issue.

>  A.  Process for updating locales/i18n ctype with new Unicode release is
> documented @ [1], I think it should get added either in WIKI, or docs
> folder of glibc.

The process should ideally be running a single command - no manual editing 
at all.  (That command might be a script that wraps some other commands.)  
If tempted to write instructions for running a sequence of commands and 
editing the result, writing a script to automate that is better.
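
A minimal sketch of such a wrapper, in Python since the patch already adds 
Python scripts.  The function name and the step commands are placeholders 
for illustration, not existing glibc scripts; a real wrapper would list the 
generator and checker invocations instead.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each command in order, stopping at the first failure.

    Returns the number of steps that completed successfully.
    """
    done = 0
    for cmd in steps:
        # check=True raises CalledProcessError on a non-zero exit status,
        # so a failed step cannot silently feed bad data to the next one.
        subprocess.run(cmd, check=True)
        done += 1
    return done


# Placeholder steps (an assumption for illustration): each one just runs
# a trivial Python command.
EXAMPLE_STEPS = [
    [sys.executable, "-c", "pass"],
    [sys.executable, "-c", "pass"],
]

if __name__ == "__main__":
    run_steps(EXAMPLE_STEPS)
```

The point of the design is that the whole update is reproducible from one 
invocation, with no hand-edited intermediate files.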

>  B. Patch adds two scripts to scripts folder and updates locales/i18n file
>      gen-unicode-ctype-dcp.py - To generate upper, lower and alpha class
> from DerivedCoreProperties.txt [2]
>      check-backcompatibility.py - to test whether updated locales/i18n
> is backward compatible with older one.
> C.  Best way to check new updated i18n file is compatible with existing
> i18n file is
>      a. copy new i18n file as a i18nnew
>      b. check-backcompatibility.py i18n i18nnew > Report
> D.  By using better file DerivedCoreProperties.txt from UCD for
> generating CTYPE, we found number of characters were improperly mapped
> to 'alpha' categories.
>      Report/Analysis for backward compatibility is available AT
> backward-compatibility5_1-to-7_0 [3]

That report is a very useful starting point, but doesn't seem to explain 
things at the human level.  What changes have there been to previously 
supported characters, and why, in terms of Unicode character properties, 
are those changes correct changes?  Maybe something more verbose that 
names the characters individually and states what the old ctype 
information was, and what the new information is, and what the relevant 
Unicode properties are that explain the new information, would help.
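
To illustrate the kind of per-character report meant here, a rough sketch 
follows.  It assumes the old and new ctype data have already been parsed 
into dicts mapping a class name to a set of code points; that parsing step 
and the function name are hypothetical.  Names and general categories come 
from Python's unicodedata module, which reflects whatever Unicode version 
the Python build ships, not necessarily 7.0.0.

```python
import unicodedata


def describe_changes(old, new):
    """Return one line per code point whose class membership changed.

    old and new map a class name (e.g. "alpha") to a set of code points.
    """
    classes = sorted(set(old) | set(new))
    changed = set()
    for cls in classes:
        # Symmetric difference: code points in exactly one of the two sets.
        changed |= old.get(cls, set()) ^ new.get(cls, set())
    lines = []
    for cp in sorted(changed):
        ch = chr(cp)
        was = [c for c in classes if cp in old.get(c, set())]
        now = [c for c in classes if cp in new.get(c, set())]
        lines.append("U+%04X %s (gc=%s): %s -> %s" % (
            cp,
            unicodedata.name(ch, "<no name>"),
            unicodedata.category(ch),
            ",".join(was) or "-",
            ",".join(now) or "-"))
    return lines
```

Each output line names the character, gives its general category, and 
shows the old and new class lists side by side, which is the level of 
detail a reviewer needs to judge whether a change is correct.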

You're changing how upper/lower/alpha properties are generated.  Does that 
fix bug 14010?  If so, you can include [BZ #14010] in your ChangeLog 
entry.  Does it obsolete the special cases in 
gen-unicode-ctype.c:is_alpha?  If so, you should remove the parts of 
gen-unicode-ctype.c that are no longer used.  You should also confirm that 
each of the special cases there is properly handled by the new logic - or 
state explicitly that the handling of certain identified characters with 
special cases is being deliberately changed, because the Unicode 
properties for those characters are better than the special-case handling.
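
One way to do that confirmation mechanically is sketched below.  It parses 
lines in the "range ; property # comment" format of 
DerivedCoreProperties.txt and reports which of a given list of code points 
lack the property.  The function names and the sample data are assumptions 
for illustration; the list of special-cased code points would have to be 
taken from gen-unicode-ctype.c itself.

```python
def parse_property(lines, prop):
    """Collect the code points that carry property prop.

    lines are in the UCD "range ; property # comment" format used by
    DerivedCoreProperties.txt.
    """
    result = set()
    for line in lines:
        # Drop the trailing comment, then skip blank and comment-only lines.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != prop:
            continue
        # The first field is either "XXXX" or "XXXX..YYYY" (inclusive).
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        result.update(range(start, end + 1))
    return result


def missing_from(special_cases, derived):
    """Return the special-cased code points the derived set does not cover."""
    return sorted(cp for cp in special_cases if cp not in derived)


# Illustrative input only; a real check would read the full UCD file.
SAMPLE = [
    "# Derived Property: Alphabetic",
    "0041..005A    ; Alphabetic # L&  [26] LATIN CAPITAL LETTER A..Z",
    "00AA          ; Alphabetic # Lo       FEMININE ORDINAL INDICATOR",
    "0061..007A    ; Lowercase # L&  [26] LATIN SMALL LETTER A..Z",
]
```

Any code point that missing_from reports is one whose handling changes 
with the new logic, and so needs either a retained special case or an 
explicit justification.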

> diff --git a/include/stdc-predef.h b/include/stdc-predef.h
> index 87e3666..f96d308 100644
> --- a/include/stdc-predef.h
> +++ b/include/stdc-predef.h
> @@ -50,8 +50,9 @@
>  #endif
>  
>  /* wchar_t uses ISO/IEC 10646 (2nd ed., published 2011-03-15) /
> -   Unicode 6.0.  */
> -#define __STDC_ISO_10646__		201103L
> +   Unicode 6.0.
> +   Unicode 7.0.0 Published on 2014 June 16   */
> +#define __STDC_ISO_10646__		201406L

The date is meant to correspond to ISO/IEC 10646 publication dates, not 
Unicode publication dates.

Now, the most recent published amendment is amendment 1 from 2013-04-15 
(Linear A, Palmyrene, Manichaean, Khojki, Khudawadi, Bassa Vah, Duployan, 
and other characters).  WG2 N4566 states an intent for Unicode 7.0 to 
synchronize with amendment 2 to the 2012 edition of ISO/IEC 10646.  
However, I can't locate a proposed publication date for that amendment (or 
for the 2014 edition of ISO/IEC 10646 - and work appears to be underway on 
amendments 1 and 2 to the 2014 edition, even before it's published).  So 
maybe put 201304L there until such an amendment is published.
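
For reference, the macro value has the form yyyymmL, taken from the year 
and month of the ISO/IEC 10646 edition or amendment.  A small sketch of 
that mapping (the helper name is hypothetical):

```python
from datetime import date


def stdc_iso_10646_value(published):
    """Format a publication date as the yyyymmL value of __STDC_ISO_10646__."""
    return "%04d%02dL" % (published.year, published.month)


# The two ISO/IEC 10646 dates mentioned in this message.
assert stdc_iso_10646_value(date(2011, 3, 15)) == "201103L"
assert stdc_iso_10646_value(date(2013, 4, 15)) == "201304L"
```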

> diff --git a/scripts/check-backcompatibility.py b/scripts/check-backcompatibility.py
> new file mode 100755
> index 0000000..a56ac0a
> --- /dev/null
> +++ b/scripts/check-backcompatibility.py

I think in scripts/ the name should be more specific about *what* is 
having compatibility checked - scripts/ is for all of glibc, not just 
locale data.

> +# Copyright (C) 2013-14, Pravin Satpute <psatpute@redhat.com>

glibc contributions should be assigned to the FSF (and miscellaneous 
programs would normally be GPLv2+ / LGPLv2.1+ unless there is some reason 
to deviate from the norm for such programs in glibc).

-- 
Joseph S. Myers
joseph@codesourcery.com

