Sources Bugzilla – Bug 368
localedef fails with complex LC_COLLATE rules
Last modified: 2005-10-14 22:57:09 UTC
I reached what seems to be a limitation in the number of LC_COLLATE collating-elements. I was trying to build a dz_BT locale (Dzongkha language, Bhutan). The sorting rules are quite special: for example, next to the <ka> entry are words starting with a prefix attached to the ka radical, e.g. <da>-<ka>, <ba>-<ka>, etc. These come just after words starting with <ka>, and not with words starting with <da>, <ba>, etc. In other words, the base collating elements are the 30 base letters, plus 103 prefix-radical collating elements.

Now, it is even more complex than that: some letter sequences are prefix-radical or not depending on what follows them. For example, <da>-<ga> is a prefix if followed by <ga>, <nga>, <da>, ... but not otherwise. So it is necessary to define collating elements consisting of the prefix element and the next character, which are then sorted as a digraph, e.g.:

    collating-element <rad-ga-d-ga> from "<U0F51><U0F42><U0F42>"
    ...
    <rad-ga-d-ga> "<TIB-GA-R_D><TIB-GA>";....

That means there are a lot of collating-elements to define: 303 in total. But if I use more than 265, the locale doesn't compile (localedef just runs forever, taking 90% of CPU resources and doing nothing), while if I comment out some of them so that no more than 265 are in use, it compiles nicely.

I attach the preliminary dz_BT locale I was working on. Some entries are commented out with %%%% (four percent signs) so that the file can compile, but to have the rules complete, all the lines commented out with "%%%%" should be enabled as well.
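To illustrate the ordering described above, here is a minimal sketch in C of a longest-match lookup for the leading collating element of a word. Everything in it is hypothetical: single ASCII letters stand in for the Tibetan letters (k = ka, g = ga, d = da, b = ba), the two-letter sequences stand in for prefix-radical elements, and the weights are invented for the example. It is not how localedef or glibc represents this data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical collating element: a letter sequence and its sort weight. */
struct element {
    const char *seq;
    int weight;
};

/* Invented weights. Longer sequences come first so that the first match
   is also the longest one. "dk" (da-ka) and "bk" (ba-ka) sort right
   after "k" (ka), not next to "d" (da) or "b" (ba). */
static const struct element elements[] = {
    { "dk", 11 }, { "bk", 12 },
    { "k", 10 }, { "g", 20 }, { "d", 30 }, { "b", 40 },
};

/* Return the weight of the leading collating element of word and store
   its length in *len. Returns -1 (and *len = 0) if nothing matches. */
static int leading_weight(const char *word, size_t *len)
{
    size_t i;

    assert(word != NULL && len != NULL);
    for (i = 0; i < sizeof elements / sizeof elements[0]; i++) {
        size_t n = strlen(elements[i].seq);
        if (strncmp(word, elements[i].seq, n) == 0) {
            *len = n;
            return elements[i].weight;
        }
    }
    *len = 0;
    return -1;
}
```

With this table, a word starting with "dk" gets weight 11 and so lands between the "k" words (10) and the "g" words (20), while a word starting with "d" followed by anything else keeps the plain weight 30.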
Created attachment 187 [details] sample dz_BT locale (with several lines commented out with "%%%%" that should be enabled)
Created attachment 332 [details] allow more than 256 collating-element definitions I could not find out why elem_size has to be less than 257, so I dropped this constraint. elem_size then had to be computed more accurately to prevent allocation of large amounts of unused data. But your dz_BT file still did not compile, because the secondary hashing function seems to do a poor job: iter was zero, which caused an endless loop. A better secondary hashing function would be to add 1 to the current one, but the functions which read collation data would then need to be fixed too. Instead, I chose to enlarge the table when such a loop is encountered.
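The failure mode and the workaround described in this comment can be sketched in C as follows. This is not the ld-collate.c code or the attached patch. The table layout, the hash functions and all names (struct table, try_insert, grow, insert) are assumptions made for illustration. Keys are assumed to be distinct and non-zero, with 0 marking an empty slot.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical open-addressing table; a slot value of 0 means "empty". */
struct table {
    size_t size;
    unsigned *slots;
};

static int table_init(struct table *t, size_t size)
{
    assert(size >= 2);
    t->size = size;
    t->slots = calloc(size, sizeof *t->slots);
    return t->slots ? 0 : -1;
}

/* Try to place key without resizing. The secondary hash (step) may be 0,
   in which case the probe never leaves the first slot. Giving up after
   t->size probes turns that endless loop into an error return. */
static int try_insert(struct table *t, unsigned key)
{
    size_t idx = key % t->size;
    size_t step = key % (t->size - 1);
    size_t probes;

    for (probes = 0; probes < t->size; probes++) {
        if (t->slots[idx] == 0) {
            t->slots[idx] = key;
            return 0;
        }
        idx = (idx + step) % t->size;
    }
    return -1;
}

/* Rebuild the table at a larger size, growing again if rehashing fails. */
static int grow(struct table *t)
{
    size_t new_size = t->size;

    for (;;) {
        struct table bigger;
        size_t i;
        int ok = 1;

        new_size = new_size * 2 + 1;
        if (table_init(&bigger, new_size) != 0)
            return -1;
        for (i = 0; i < t->size && ok; i++)
            if (t->slots[i] != 0 && try_insert(&bigger, t->slots[i]) != 0)
                ok = 0;
        if (ok) {
            free(t->slots);
            *t = bigger;
            return 0;
        }
        free(bigger.slots);
    }
}

/* Insert key, enlarging the table whenever probing loops without success. */
static int insert(struct table *t, unsigned key)
{
    while (try_insert(t, key) != 0)
        if (grow(t) != 0)
            return -1;
    return 0;
}
```

For instance, in a table of size 4 the keys 3 and 15 both hash to slot 3 with a step of 0, so the second insertion can only succeed after the table has been enlarged. The alternative mentioned in the comment, a secondary hash that can never be zero, avoids this particular stall, but it changes the probe sequence that readers of the table must follow.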
Since this patch only changes the handling of multi-byte sequences, we can check whether wide-char and multi-byte collations give the same results, in which case this patch is certainly right. I created a file containing sequences of 2 Tibetan characters:

    $ for i in `seq 0x0F00 0x0FCF`; do
        for j in `seq 0x0F00 0x0FCF`; do
          printf "0: %08x %08x 0000000a " $i $j | xxd -r -g4
        done
      done | iconv -f ucs4 -t utf8 > input_file

Then ran

    $ LC_ALL=en_US.UTF-8 ./tst-wcscoll < input_file > out.wc-en_US
    $ LC_ALL=en_US.UTF-8 ./tst-strcoll < input_file > out.mb-en_US
    $ cmp out.wc-en_US out.mb-en_US
    $

So the results are identical. But to show that this patch allows more than 256 collating elements, we need to check with more complex LC_COLLATE sections. I took Pablo's locale file, applied s/^%%%%</</ to have more than 256 collating elements, and re-ran this test:

    $ export LOCPATH=`mktemp -d /tmp/test.XXXXXX`
    $ localedef.patched -i dz_BT -f UTF-8 $LOCPATH/dz_BT
    $ LC_ALL=dz_BT ./tst-wcscoll < input_file > out.wc-dz_BT
    $ LC_ALL=dz_BT ./tst-strcoll < input_file > out.mb-dz_BT
    $ cmp out.wc-dz_BT out.mb-dz_BT
    $

Looks good. Note that tst-strcoll is much slower than tst-wcscoll, which seems quite logical since the primary key is the first UTF-8 byte, and that byte does not change in the range 0x0F00-0x0FCF.
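The same consistency check can also be done inside a single process. The sketch below is a hypothetical helper, not the attached tst-strcoll or tst-wcscoll sources (which are not reproduced here). It compares the sign of strcoll() on two multi-byte strings with the sign of wcscoll() on their wide-character conversions. The function name and the fixed 64-character buffer are assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

#define MAX_WCHARS 64  /* arbitrary limit chosen for this sketch */

static int sign(int v)
{
    return (v > 0) - (v < 0);
}

/* Return 1 if strcoll() and wcscoll() order a and b the same way in the
   current locale, 0 if they disagree, and -1 if a string cannot be
   converted or does not fit in the buffer. */
static int collations_agree(const char *a, const char *b)
{
    wchar_t wa[MAX_WCHARS], wb[MAX_WCHARS];
    size_t na, nb;

    assert(a != NULL && b != NULL);
    na = mbstowcs(wa, a, MAX_WCHARS);
    nb = mbstowcs(wb, b, MAX_WCHARS);
    /* A return value of MAX_WCHARS means no terminating null was stored. */
    if (na == (size_t)-1 || nb == (size_t)-1
        || na >= MAX_WCHARS || nb >= MAX_WCHARS)
        return -1;
    return sign(strcoll(a, b)) == sign(wcscoll(wa, wb));
}
```

A caller would normally run setlocale(LC_ALL, "") first so that LC_ALL from the environment takes effect; without it, the comparison runs in the "C" locale. Comparing signs rather than raw return values matters because the two functions are only required to agree on ordering, not on the exact integer returned.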
Created attachment 372 [details] C source file for the tst-strcoll program This program can only process files composed of lines of 2 UTF-8 characters; some modifications are needed to accept arbitrary input.
Created attachment 373 [details] C source file for the tst-wcscoll program This program can only process files composed of lines of 2 UTF-8 characters; some modifications are needed to accept arbitrary input.
localedef *still* handles only 256 collating-element definitions. Culturally correct sorting (standard dictionary order) of languages like Dzongkha (dz_BT) and Tibetan (bo_CN) *requires* over 350 elements in LC_COLLATE.
Created attachment 567 [details] dz_BT Collation - generated automatically from CLDR *
*** Bug 307 has been marked as a duplicate of this bug. ***
The ld-collate patch is wrong. I fixed it myself. I checked in the first locale. The second one is completely useless. If there are bugs in the file in CVS, file a new bug and justify the change. As for the test programs: they work just fine the way they are.