This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.



[Bug localedata/10580] New: New file for hr_HR localedata


----------------
HISTORY:
----------------

The hr_HR locale started out as a copy of the sl_SI locale in glibc-2.0 and was
maintained by Borka Jerman-Blažić (from Slovenia). Shortly afterwards Tomislav
Vujec (then at CARNet, now at Red Hat) changed it to suit hr-specific needs.
After around 1998, the locale was only updated by glibc maintainer Ulrich
Drepper, who added or changed portions of it as part of mass updates to many
locales.

I contacted the current maintainer, Tomislav Vujec, last week and he is willing
to support the changes. Since it's been more than a decade since he last changed
this locale, he noted that he'd be willing to pass maintainership to someone
else. BTW, he is also the maintainer of bs_BA; I hope the Bosnian translation
team will take over maintenance of that locale...


----------------
RATIONALE:
----------------

The point is: the hr_HR locale is now in a state of flux. It kind of works, but
fails in fairly subtle ways when sorting digraphs. I have made numerous changes,
which I describe below...

Croatia doesn't have a language law or any real specification of the rules for
writing dates, monetary data and so on. Most language decisions in real life
follow commonly established conventions. I justify my decisions in the change
descriptions below, using URLs where needed...

I really wanted to get this right, so I've read the whole archive of the
libc-locale mailing list (2004-now), and also ISO/IEC TR 14652 (albeit the 2002
edition, which I found for free on the Internet). I've looked at the history of
changes to the hr_HR locale through "git blame". I've also studied the sr_RS
locale, which is somewhat related to hr_HR since Croatian, Bosnian and Serbian
have (or had) a lot of common conventions. Initially I only wanted to change
LC_COLLATE, but it made sense to update the locale as a whole. That required far
more time than I anticipated, but the changes are worth it. I've (heavily)
commented the locale, so it should be easy to maintain from now on. (UTF-8
characters are used only in comments.) I've also contacted all hr translation
team leaders and pointed them to this bug report to give their opinion on these
changes, since the changes will be system-wide when accepted, and the
translators are, by definition, at the forefront of i18n and l10n efforts.

There are some general locale system errors which are not specific to the
Croatian locale. If Ulrich Drepper (if he has some time) or someone else versed
in glibc internals could look at the change descriptions for LC_COLLATE,
LC_ADDRESS and LC_TELEPHONE, and help me a bit with the errors localedef reports
there, I'd be really thankful :o)


----------------
CHANGES:
----------------

% <initial comments>
	Mostly cleaned up comments and moved repeated information into
LC_IDENTIFICATION. Added a note that the charset used in Croatia should
primarily be UTF-8. Previously we used ISO-8859-2 (which should be phased out
since it doesn't support the digraph characters [ǆ, ǉ and ǌ]).
	Added my email to the authors list, just so I can be notified in the future
when the locale changes.

LC_IDENTIFICATION:
	I bumped the revision to 2.0 (from 1.0) since this is a major rewrite of this
locale.
	I have left "CARNet" and their address, although I'm not really sure why CARNet
(Croatian Academic and Research Network) would have jurisdiction over the hr_HR
locale. Not even the Ministry of Education of Croatia has jurisdiction over it,
as they don't supply rules for writing dates or monetary strings, for example.
	The category statements were updated to reflect the new changes. The standard
requires the first parameter to name the standard this category complies with,
but all other locales just use the locale name and a year here, so I did that
too.
	BTW most locales don't list all categories which are included in their file. For
example, they usually don't include LC_MEASUREMENT. I did...

LC_CTYPE:
	Although ISO/IEC TR 14652 has the controversial LC_XLITERATE category, glibc
uses "translit_start" inside LC_CTYPE. Hence I've added transliteration info
(how to transliterate digraphs to ISO-8859-2 and ASCII). I'm not really sure how
to test this; I hope I got it right.
	There is some weird behaviour in the included "i18n"... For example, it has the
same character in both the "upper" AND "lower" class, so both iswupper() and
iswlower() return TRUE for <U01C5> {ǅ}. I guess this is ok.
	Another quirk is that towupper() maps <U01C6> {ǆ} -> <U01C4> {Ǆ}, which can be
wrong in cases where <U01C5> {ǅ} is needed. This is not ok, but not fixable in
the current implementation anyway, so let's add it to the curiosities for now
:o)
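
The three case forms of the dž digraph can be inspected with a minimal Python
sketch. It uses Python's own Unicode tables, not glibc's "i18n" file, so it only
illustrates the upper/title/lower distinction and does not reproduce the
iswupper()/iswlower() result described above. The helper names are made up for
this example.

```python
import unicodedata

# The three code points of the "dz with caron" digraph.
DZ_UPPER = "\u01c4"  # both letters capital
DZ_TITLE = "\u01c5"  # capital D, small z with caron
DZ_LOWER = "\u01c6"  # both letters small


def case_report(ch):
    """Return the Unicode category and the three case mappings of ch."""
    return {
        "category": unicodedata.category(ch),
        "upper": ch.upper(),
        "title": ch.title(),
        "lower": ch.lower(),
    }


def capitalize_word(word):
    """Title-case only the first character, leaving the rest unchanged."""
    return word[:1].title() + word[1:]
```

A plain uppercase mapping always yields <U01C4>, which suits an all-caps word.
Capitalizing only the first letter of a word needs the titlecase mapping
instead, and that is the case a single towupper() call cannot express.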

LC_COLLATE:
	Major revision. I have included "iso14651_t1" like most locales, to reap the
benefits of "iso14651_t1" updates, as well as to significantly reduce the size
of the hr_HR locale and increase readability.
	collating-elements are created and linked to the right digraphs [ǆ, ǉ and ǌ].
	BTW "collating-element" shouldn't be used after "copy", but many locales use it
since there is no other way, except putting them in "iso14651_t1".
	The Croatian alphabet considers č, ć, ǆ, đ, ǉ, ǌ, š and ž distinct letters, and
that was implemented with reorder-after statements.
	localedef says I have a SYNTAX ERROR in LC_COLLATE, probably because it doesn't
like the "<d><z>" digraph literal. Is this really a SYNTAX ERROR? It works
though...
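
The intended ordering can be illustrated with a minimal Python sketch of a
digraph-aware sort key. This is not how glibc or "iso14651_t1" implements
collation; it is a single-level approximation. The ALPHABET list (with q, w, x
and y placed at their usual Latin positions) and the function name are
assumptions made for this example.

```python
# Croatian alphabet with digraphs as single letters; q, w, x, y are added
# at their Latin positions (an assumption for foreign words).
ALPHABET = [
    "a", "b", "c", "č", "ć", "d", "dž", "đ", "e", "f", "g", "h", "i", "j",
    "k", "l", "lj", "m", "n", "nj", "o", "p", "q", "r", "s", "š", "t", "u",
    "v", "w", "x", "y", "z", "ž",
]
RANK = {letter: index for index, letter in enumerate(ALPHABET)}

# Single-code-point digraphs (U+01C6, U+01C9, U+01CC) and their two-letter forms.
SINGLE_TO_PAIR = {"\u01c6": "dž", "\u01c9": "lj", "\u01cc": "nj"}


def collation_key(word):
    """Return a tuple of alphabet ranks, treating digraphs as one letter."""
    text = word.lower()
    for single, pair in SINGLE_TO_PAIR.items():
        text = text.replace(single, pair)
    key = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in RANK:
            # Greedy match: any "dž", "lj" or "nj" is taken as a digraph.
            key.append(RANK[pair])
            i += 2
        else:
            # Unknown characters sort after the alphabet, by code point.
            key.append(RANK.get(text[i], len(ALPHABET) + ord(text[i])))
            i += 1
    return tuple(key)
```

For example, sorted(["luk", "ljeto", "lopata"], key=collation_key) puts "ljeto"
last, while a plain code point sort puts it first. The greedy matching is a
known simplification: it also treats "nj" in "injekcija" or "dž" in "nadživjeti"
as digraphs, although those are two separate letters.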

LC_TIME:
	Names of days and months are now written with the right digraphs, and not as a
combination of ASCII letters. (Digraphs can nowadays be seen in CLI apps as
well, for example in `cal 2009`.)
	d_t_fmt was changed to a format like: "Ponedjeǉak, 31. Kolovoz 2009. 16:35:05
CEST". (This is the best we can do in the current implementation; Croatian uses
declension in month names, like most Slavic languages.) [ This format can be
seen on Croatian government pages http://vlada.hr/ ]
	date_fmt was changed to a format like: "Pon, 31.08.2009.  16:49:36 CEST" [
Croatia in general doesn't use short versions of month or day names. For the
month we usually use a number, as seen on the pages of the Croatian president
http://www.predsjednik.hr/ ]
	d_fmt was changed to a format like: "01.09.2009.", for the same reasons as in
the date_fmt explanation. Croatians have read and written the dd.mm.yyyy format
for decades. If someone objects that it confuses people who use the mm.dd.yyyy
(US) format, I agree, but this is the hr_HR locale and this form is widely used
in Croatia. System software should use the YYYY-MM-DD format anyway, regardless
of locale.
	t_fmt was changed to a format like: "HH:MM:SS"
	I've added week, first_weekday and first_workday. first_weekday and
first_workday are set to Monday.
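
The numeric date and time formats can be sketched in Python as follows. The
strftime strings are reconstructed from the example outputs above and are an
assumption; they are not copied from the locale file. Only numeric conversions
are used, so the result does not depend on which locale is installed.

```python
from datetime import datetime

# Assumed strftime equivalents of the examples "01.09.2009." and "HH:MM:SS".
D_FMT = "%d.%m.%Y."
T_FMT = "%H:%M:%S"


def format_date(moment):
    """Format a date as dd.mm.yyyy. with a trailing dot."""
    return moment.strftime(D_FMT)


def format_time(moment):
    """Format a time of day as 24-hour HH:MM:SS."""
    return moment.strftime(T_FMT)
```

For datetime(2009, 9, 1, 16, 35, 5) these return "01.09.2009." and "16:35:05".
The longer d_t_fmt and date_fmt forms need day and month names, which come from
the locale itself and are left out of this sketch.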

LC_NUMERIC:
	I've set thousands_sep to '.', so numbers are formatted as "12.345.678,90" or
"-12.345.678,90".

LC_MONETARY:
	I've lowercased currency_symbol to "kn", since that is the form the majority of
citizens/shops use nowadays. See online shops: http://www.links.hr/ ,
http://www.profil.hr/ , and many others. You can see there is no rule for this
at Wikipedia: http://hr.wikipedia.org/wiki/Hrvatska_kuna , where they note that
the symbol is "Kn" but use "kn" a lot on the same page.
	I've set thousands_sep to '.', as in LC_NUMERIC.
	I've changed the monetary string format to: "14.986,42 kn", "-14.986,42 kn" and
for international to "HRK 14.986,42" and "-HRK 14.986,42", as was agreed upon in
2003 by Tomislav Vujec on libc-alpha [
http://sourceware.org/ml/libc-alpha/2003-04/msg00254.html ]. I'm not really sure
that in the international version HRK should come before the value (as said at
the top, there is no law on how to write monetary values in Croatia, just
conventions). I'd rather leave them the same as the local versions and just use
HRK instead of kn, but I've complied with the libc-alpha agreement of 2003 for
now.
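
The number and monetary formats described for LC_NUMERIC and LC_MONETARY can be
reproduced with a minimal Python sketch. It does not call strfmon() or read the
locale; the function names are made up for this example, and rounding half up
to two decimals is an assumption the report does not specify.

```python
from decimal import Decimal, ROUND_HALF_UP


def group_digits(digits, sep="."):
    """Insert sep between groups of three digits, counting from the right."""
    groups = []
    while len(digits) > 3:
        groups.insert(0, digits[-3:])
        digits = digits[:-3]
    groups.insert(0, digits)
    return sep.join(groups)


def format_number(value, thousands_sep=".", decimal_point=","):
    """Format value with two decimals, '.' grouping and ',' decimal point."""
    amount = Decimal(str(value)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    sign = "-" if amount < 0 else ""
    whole, fraction = format(abs(amount), ".2f").split(".")
    return f"{sign}{group_digits(whole, thousands_sep)}{decimal_point}{fraction}"


def format_local(value):
    """Local form: the amount followed by the lowercase symbol 'kn'."""
    return f"{format_number(value)} kn"


def format_international(value):
    """International form: optional sign, then 'HRK', then the amount."""
    amount = Decimal(str(value))
    sign = "-" if amount < 0 else ""
    return f"{sign}HRK {format_number(abs(amount))}"
```

With these helpers, format_local("-14986.42") gives "-14.986,42 kn" and
format_international("-14986.42") gives "-HRK 14.986,42", matching the strings
quoted above.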

LC_MESSAGES:
	I've removed the trailing .* in yesexpr and noexpr, as it was discussed on the
libc-locales mailing list [
http://sources.redhat.com/bugzilla/show_bug.cgi?id=71 ] that it's not really
necessary.
	I didn't include 1 in yesexpr and 0 in noexpr, although this was discussed on
the libc-locales mailing list too. Not many locales use it, so I've skipped it
for now.
	I've added yesstr and nostr.
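
Why the trailing .* is redundant can be shown with a minimal Python sketch. The
two patterns below are assumptions for illustration (d/D for "da", y/Y for
"yes", n/N for "ne"/"no"); the report does not quote the actual yesexpr and
noexpr values. A regular expression search only needs the anchored first
character to match, so anything after it is irrelevant.

```python
import re

# Hypothetical patterns, not copied from the locale file.
YESEXPR = re.compile(r"^[dDyY]")
NOEXPR = re.compile(r"^[nN]")


def classify(answer):
    """Return True for yes, False for no, None if neither pattern matches."""
    if YESEXPR.search(answer):
        return True
    if NOEXPR.search(answer):
        return False
    return None
```

classify("Da, naravno") is True even though the pattern has no trailing .*, and
classify("1") is None because digits are not part of the assumed patterns.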

LC_NAME:
	Changed name_fmt to "salutation name other_name surnames".
	I've added name_mr, name_mrs and name_miss. Croatian doesn't have a
gender-neutral salutation, nor a neutral female (name_ms) salutation.

LC_ADDRESS:
	postal_fmt is changed, so that address now looks like:
		Company name
		Department name
		Person's name
		C/O Person or Organization
		Street name and house number
		ZIP Code and City name
		Country
	localedef complains that postal_fmt has an invalid escape sequence; I don't
know why!?!
	I've added definitions for many missing attributes: country_post, country_car,
country_isbn, lang_name, lang_ab, lang_term and lang_lib.

LC_TELEPHONE:
	I've changed tel_int_fmt to look like: "+<country code> <area code without
leading 0> <local number> <possible ext>"
	I've changed tel_dom_fmt to look like: "<possible area code with leading 0>
<local number> <possible ext>"
	localedef complains that tel_int_fmt and tel_dom_fmt have an invalid escape
sequence; I don't know why!?!

LC_PAPER:
	A4 is used in Croatia.

LC_MEASUREMENT:
	Croatia uses metric measurements.


----------------
TESTING:
----------------

To see the file without <Uxxxx> literals, I made this ugly one-liner which makes
an HTML version of it. Just change the file name at the start and you can use it
with other locale files as well.

( FILE=hr_HR; echo "<pre>" > "$FILE.html";
  sed -e 's/<U\([0-9A-F][0-9A-F][0-9A-F][0-9A-F]\)>/\&lt;\&#x\1;\&gt;/g' \
      -e 's/</\&lt;/g' -e 's/>/\&gt;/g' "$FILE" >> "$FILE.html" )

Also, to test collation in the hr_HR locale I made a small dictionary which has
the Croatian digraphs in all forms, as well as the letters which are considered
distinct. To test collation with it I do the following: randomize it with
`sort -R`, and re-sort it. The resulting file should have the same MD5 as the
starting one...

Testing the other locale categories is a bit harder, but small C programs work
well, and most code templates are available in the glibc source / localedata
anyway.
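
The same shuffle-and-re-sort check can be sketched in Python. This is a minimal
sketch, not the procedure actually used for this report: the function names are
made up, the reference list is assumed to be in the expected order already, and
the sort key is passed in by the caller.

```python
import hashlib
import random


def digest(lines):
    """Return the MD5 hex digest of the lines joined by newlines."""
    return hashlib.md5("\n".join(lines).encode("utf-8")).hexdigest()


def roundtrip_matches(reference, key, seed=0):
    """Shuffle the reference list, re-sort it with key, and compare digests."""
    shuffled = list(reference)
    random.Random(seed).shuffle(shuffled)
    resorted = sorted(shuffled, key=key)
    return digest(resorted) == digest(reference)
```

To check an installed locale, one could call
locale.setlocale(locale.LC_COLLATE, "hr_HR.UTF-8") and pass locale.strxfrm as
the key; this assumes the locale has been built and installed on the system.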

-- 
           Summary: New file for hr_HR localedata
           Product: glibc
           Version: unspecified
            Status: NEW
          Severity: normal
          Priority: P2
         Component: localedata
        AssignedTo: libc-locales at sources dot redhat dot com
        ReportedBy: invisible at hidden-city dot net
                CC: glibc-bugs at sources dot redhat dot com


http://sourceware.org/bugzilla/show_bug.cgi?id=10580


