This is the mail archive of the glibc-bugs@sources.redhat.com mailing list for the glibc project.



[Bug libc/645] New: localedef does not respect rule definitions in LC_COLLATE


Executive summary: several bugs in ld-collate.c make localedef produce
  wrong collation data.  Here is a detailed analysis and a patch.

Sorting with French locales is special because diacritics are considered
from right to left, as described in ISO-14651 and many other documents.
And indeed, localedata/locales/iso14651_t1 contains
  order_start <LATIN>;forward;backward;forward;forward,position
An example is available at
  http://www.open-std.org/jtc1/sc22/wg20/docs/n602.htm#AnnexC
but fr_FR sorts this text as if the backward directive had no effect.
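
To make the effect of the backward directive concrete, here is a minimal
sketch of a single-level comparison that can scan in either direction.
It is not taken from glibc; the function name compare_level and the
secondary weights (0 = no accent, 1 = acute, 2 = circumflex) are
assumptions chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Compare two equal-length weight sequences at one collation level.
   backward == 0 scans left to right, backward != 0 scans right to left.
   Returns <0, 0 or >0, like strcmp.  Hypothetical helper, not glibc code.  */
static int
compare_level (const int *w1, const int *w2, size_t n, int backward)
{
  assert (w1 != NULL && w2 != NULL);
  for (size_t k = 0; k < n; ++k)
    {
      /* Pick the k-th position counted from the chosen end.  */
      size_t i = backward ? n - 1 - k : k;
      if (w1[i] != w2[i])
        return w1[i] < w2[i] ? -1 : 1;
    }
  return 0;
}
```

With the assumed weights, "coté" is {0,0,0,1} and "côte" is {0,2,0,0}.
A forward scan stops at the second letter and puts "coté" first; a
backward scan stops at the last letter and puts "côte" first, which is
the order the French rule asks for.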

I wrote simple tests to debug this problem; the xx_XX.tmpl locale file
defines a and A characters with the rule forward;forward;forward;forward,
and b, B with the rule forward;backward;forward;forward.
The tst-coll-rule program gets pairs of characters (with the same
primary level but different secondary level) as arguments, and
displays the direction of the 2nd level (f=forward, b=backward) for each
pair.
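
The source of tst-coll-rule is not reproduced in this message.  As a
rough sketch of how such a check could be written (an assumption, not
the actual test program): for a pair "xy", compare the single
characters x and y, then compare "xy" with "yx".  With a forward second
level both comparisons have the same sign; with a backward second level
the signs differ.  The function name second_level_direction is
hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not the real tst-coll-rule.  PAIR holds two
   characters with equal primary and different secondary weights.
   Returns 'f' for forward, 'b' for backward, '?' if strcoll cannot
   tell the characters apart.  */
static char
second_level_direction (const char *pair)
{
  assert (pair != NULL && strlen (pair) == 2);

  char x[2] = { pair[0], '\0' };
  char y[2] = { pair[1], '\0' };
  char xy[3] = { pair[0], pair[1], '\0' };
  char yx[3] = { pair[1], pair[0], '\0' };

  int single = strcoll (x, y);     /* order of x relative to y */
  int swapped = strcoll (xy, yx);  /* decided at first or last position */

  if (single == 0 || swapped == 0)
    return '?';
  /* Forward: "xy" vs "yx" is decided by x vs y at the first position.
     Backward: it is decided by y vs x at the last position.  */
  return ((single < 0) == (swapped < 0)) ? 'f' : 'b';
}
```

A caller would first select the locale under test with
setlocale (LC_COLLATE, "") from <locale.h>, then print the result for
each argument.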
  $ export LOCPATH=$(mktemp -d /tmp/localedef.XXXXXX)
  $ localedef -i xx_XX.tmpl -f ISO-8859-1 $LOCPATH/xx_XX
  $ LC_ALL=xx_XX ./tst-coll-rule aA bB 
  bb
After switching definitions for S1 and S2:
  $ localedef -i xx_XX.tmpl -f ISO-8859-1 $LOCPATH/xx_XX
  $ LC_ALL=xx_XX ./tst-coll-rule aA bB 
  ff

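So the last definition wins and overwrites the other one.  This is
due to the optimization of rulesets in ld-collate.c, which compares the
rules of two sections.  The third argument of memcmp is a byte count,
but nrules is an element count, so only part of each rule array is
compared.  Line 1843 needs to be changed from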
  memcmp (osect->rules, sect->rules, nrules) == 0
to
  memcmp (osect->rules, sect->rules, nrules * sizeof (*osect->rules)) == 0
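
The difference between the two calls can be reproduced outside
localedef.  The following is a minimal sketch, assuming a stand-in enum
demo_rule in place of the real rule type in ld-collate.c; the enum, its
values and the two helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the rule type; the values are illustrative only.  */
enum demo_rule { demo_forward = 1, demo_backward = 2 };

/* Mirrors the original call: compares only NRULES bytes.  */
static int
same_rules_buggy (const enum demo_rule *a, const enum demo_rule *b,
                  size_t nrules)
{
  assert (a != NULL && b != NULL);
  return memcmp (a, b, nrules) == 0;
}

/* Mirrors the corrected call: compares all NRULES elements.  */
static int
same_rules_fixed (const enum demo_rule *a, const enum demo_rule *b,
                  size_t nrules)
{
  assert (a != NULL && b != NULL);
  return memcmp (a, b, nrules * sizeof (*a)) == 0;
}
```

For forward;forward;forward;forward against
forward;backward;forward;forward with nrules = 4, the fixed version
reports a difference.  On platforms where the enum occupies four bytes,
the buggy version only looks at the first element and reports the two
rule arrays as identical.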

With this patch applied and xx_XX.tmpl reverted to its initial contents,
we now get:
  $ localedef -i xx_XX.tmpl -f ISO-8859-1 $LOCPATH/xx_XX
  $ LC_ALL=xx_XX ./tst-coll-rule aA bB 
  bb

Huh?  This patch alone is clearly not sufficient, and some more digging
in ld-collate.c is needed.  There are three kinds of sections: named
sections, at most one unnamed section (defined without script name,
e.g. order_start forward;forward) and a symbol section, which stores
symbols if they are read before the first rule.

The test-collate.sh shell script defines all combinations of two-level
scripts, and runs tst-coll-rule to check whether the stored collation
data match their definition.  Each entry of the output has four fields:
  1st field: LC_COLLATE definition
    s: there is a symbol section, i.e. symbols are defined before the
       first order_start keyword.
    N: order_start <script_name>;forward;forward
    n: order_start <script_name>;forward;backward
    U: order_start forward;forward
    u: order_start forward;backward
  2nd field: output of "LC_ALL=xx_XX tst-coll-rule aA bB", or **
      when localedef segfaults.
  3rd field: expected output
  4th field: 0=match  1=mismatch  *=localedef segfaults
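
The expected output in the 3rd field follows directly from the letters
of the 1st field: an upper-case N or U means a forward second level, a
lower-case n or u means a backward one, and a leading s is ignored.
The sketch below spells out that mapping.  It assumes that the first
listed section holds a/A and the second holds b/B, which is consistent
with the tables in this report; the function name expected_directions
is hypothetical and is not part of test-collate.sh.

```c
#include <assert.h>
#include <stddef.h>

/* Derive the expected two-letter result from a definition code such as
   "sNn".  OUT must have room for 3 bytes.  Returns 0 on success and -1
   if CODE is malformed.  Hypothetical helper for illustration.  */
static int
expected_directions (const char *code, char out[3])
{
  size_t n = 0;

  assert (code != NULL && out != NULL);
  if (*code == 's')   /* optional symbol section marker */
    ++code;
  for (; *code != '\0'; ++code)
    {
      if (n == 2)
        return -1;    /* more than two sections */
      switch (*code)
        {
        case 'N': case 'U': out[n++] = 'f'; break;  /* forward;forward */
        case 'n': case 'u': out[n++] = 'b'; break;  /* forward;backward */
        default: return -1;
        }
    }
  if (n != 2)
    return -1;
  out[2] = '\0';
  return 0;
}
```

For instance, "sNn" maps to "fb" and "uN" maps to "bf", matching the
3rd column of the corresponding entries.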

Current CVS version:
  snn bb bb 0 | sNn bb fb 1 | nn ** bb * | Nn ** fb * 
  snu bb bb 0 | sNu bb fb 1 | nu bb bb 0 | Nu bb fb 1 
  snN ff bf 1 | sNN ff ff 0 | nN ** bf * | NN ** ff * 
  snU ff bf 1 | sNU ff ff 0 | nU ff bf 1 | NU ff ff 0 
  sun bb bb 0 | sUn bb fb 1 | un bb bb 0 | Un bb fb 1 
  suN ff bf 1 | sUN ff ff 0 | uN ff bf 1 | UN ff ff 0 
After applying the one-line patch described above:
  snn bb bb 0 | sNn bb fb 1 | nn ** bb * | Nn ** fb *
  snu bb bb 0 | sNu fb fb 0 | nu bb bb 0 | Nu bb fb 1
  snN ff bf 1 | sNN ff ff 0 | nN ** bf * | NN ** ff *
  snU bf bf 0 | sNU ff ff 0 | nU ff bf 1 | NU ff ff 0
  sun bb bb 0 | sUn bb fb 1 | un bb bb 0 | Un bb fb 1
  suN ff bf 1 | sUN ff ff 0 | uN ff bf 1 | UN ff ff 0
After applying ld-collate.patch:
  snn bb bb 0 | sNn fb fb 0 | nn bb bb 0 | Nn fb fb 0
  snu bb bb 0 | sNu fb fb 0 | nu bb bb 0 | Nu fb fb 0
  snN bf bf 0 | sNN ff ff 0 | nN bf bf 0 | NN ff ff 0
  snU bf bf 0 | sNU ff ff 0 | nU bf bf 0 | NU ff ff 0
  sun bb bb 0 | sUn fb fb 0 | un bb bb 0 | Un fb fb 0
  suN bf bf 0 | sUN ff ff 0 | uN bf bf 0 | UN ff ff 0
which looks much better.  And indeed, my French locale now sorts
the sample file as expected, great.

-- 
           Summary: localedef does not respect rule definitions in
                    LC_COLLATE
           Product: glibc
           Version: 2.3.4
            Status: NEW
          Severity: normal
          Priority: P2
         Component: libc
        AssignedTo: gotom at debian dot or dot jp
        ReportedBy: barbier at linuxfr dot org
                CC: glibc-bugs at sources dot redhat dot com


http://sources.redhat.com/bugzilla/show_bug.cgi?id=645
