This is the mail archive of the cygwin-developers mailing list for the Cygwin project.



Re: charset changes


On 23.01.2010 12:05, Andy Koppe wrote:
> I'm in awe at Corinna's latest locale changes. Getting closer and
> closer to the real thing.

Me too.

> A couple of points:
> ...

And here is my couple of points, after some checking:

I found the following inconsistencies, and since the agreed strategy seems to be to prefer Linux compatibility over Windows mapping,
I think especially the first group of a few incompatible mappings should be fixed before the 1.7.2 release.



------------------------------------------------------------------------
These locales have inconsistent encodings:

Locale      Linux        Cygwin
et_EE       ISO-8859-1   ISO-8859-15
ja_JP.sjis  SHIFT_JIS    CP932
ka_GE       GEORGIAN-PS  UTF-8
kk_KZ       PT154        ISO-8859-5
sr_CS       ISO-8859-5   UTF-8
uz_UZ       ISO-8859-1   UTF-8
zh_CN       GB2312       GBK
zh_HK       BIG5-HKSCS   BIG5
zh_SG       GB2312       GBK

Notes:
- SHIFT_JIS -> CP932 has been discussed extensively and I think it's OK
- GB2312 -> GBK is basically a superset, should be OK too
- zh_HK is the dedicated Hong Kong locale, so it should use the Hong Kong extension (BIG5-HKSCS)
- With respect to the other differences above, Linux has these two distinct locales:
et_EE.iso885915 ISO-8859-15
uz_UZ@cyrillic UTF-8
- getlocale -a lists the following twice, without indicating a difference:
sr_SP
sr_BA
az_AZ
se_FI
uz_UZ (see above)
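
For anyone who wants to reproduce the comparison, here is a minimal sketch of how the encoding of a locale can be queried on each system. It assumes a Python build that provides locale.nl_langinfo (true on Linux and, as far as I know, on Cygwin); the helper name codeset_of is made up for this illustration.

```python
import locale


def codeset_of(locale_name):
    """Return the codeset the C library reports for locale_name.

    Returns None when the C library rejects the locale name.
    """
    saved = locale.setlocale(locale.LC_CTYPE)  # remember the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, locale_name)
        return locale.nl_langinfo(locale.CODESET)
    except locale.Error:
        return None
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # always restore it


if __name__ == "__main__":
    # Locale names taken from the table above.
    for name in ("et_EE", "ja_JP.sjis", "ka_GE", "kk_KZ", "zh_HK"):
        print(f"{name:12} {codeset_of(name) or '(not available)'}")
```

Running the same script on a Linux box and under Cygwin and diffing the output should show mismatches like the ones listed above; the actual result depends on which locales are installed.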



------------------------------------------------------------------------
Also, some generic encoding suffixes are not handled:
- .iso885915 and .iso8859-15 (cygwin only recognizes .iso-8859-15 and .ISO-8859-15)
- .koi8r (cygwin only recognizes .koi8-r and .KOI8-R)
- .koi8u (cygwin only recognizes .koi8-u and .KOI8-U)
- .tcvn (in vi_VN.tcvn)
- .gb18030 (in zh_CN.gb18030)
- .eucjp (in ja_JP.eucjp)
- .euctw (in zh_TW.euctw)
(Maybe the latter lack Windows support or depend on Windows configuration...)
- .koi8t
- .armscii8
- .big5hkscs
- .gb2312
- .georgianps
- .pt154
- .ujis (-> EUC-JP)
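
One possible way to accept all of these spellings is to squash the suffix before looking it up: drop every non-alphanumeric character and lower-case the rest, so that .iso885915, .iso8859-15 and .ISO-8859-15 all end up as the same key. The following is only a sketch of that idea, not a description of what Cygwin does; the names squash, canonical_charset, CANONICAL and ALIASES are invented here, and .tcvn is left out because its canonical charset name is not given above.

```python
# Canonical charset names as they appear in this message.
CANONICAL = [
    "ISO-8859-15", "KOI8-R", "KOI8-U", "KOI8-T",
    "GB18030", "GB2312", "EUC-JP", "EUC-TW",
    "ARMSCII-8", "BIG5-HKSCS", "GEORGIAN-PS", "PT154",
]

# Suffixes that squashing alone cannot map to a canonical name.
ALIASES = {"ujis": "EUC-JP"}


def squash(name):
    """Drop every non-alphanumeric character and lower-case the rest."""
    return "".join(ch for ch in name if ch.isalnum()).lower()


# Lookup table keyed by the squashed spelling.
LOOKUP = {squash(name): name for name in CANONICAL}
LOOKUP.update(ALIASES)


def canonical_charset(suffix):
    """Map a locale suffix such as '.koi8r' to a canonical name, or None."""
    return LOOKUP.get(squash(suffix))


if __name__ == "__main__":
    for suffix in (".iso885915", ".iso8859-15", ".koi8r", ".ujis", ".big5hkscs"):
        print(suffix, "->", canonical_charset(suffix))
```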



------------------------------------------------------------------------
These locales are not known or handled on cygwin at all:
aa_DJ ISO-8859-1
aa_ER UTF-8
aa_ET UTF-8
am_ET UTF-8
an_ES ISO-8859-15
ar_IN UTF-8
ar_SD ISO-8859-6
ast_ES ISO-8859-15
ber_DZ UTF-8
ber_MA UTF-8
bn_BD UTF-8
bo_CN UTF-8
bo_IN UTF-8
br_FR ISO-8859-1
byn_ER UTF-8
ca_AD ISO-8859-15
ca_FR ISO-8859-15
ca_IT ISO-8859-15
crh_UA UTF-8
csb_PL UTF-8
de_BE ISO-8859-1
dz_BT UTF-8
el_CY ISO-8859-7
en_AG UTF-8
en_BE ISO-8859-1
en_BW ISO-8859-1
en_DK ISO-8859-1
en_HK ISO-8859-1
en_IN UTF-8
en_NG UTF-8
en_SG ISO-8859-1
es_US ISO-8859-1
fur_IT UTF-8
fy_DE UTF-8
ga_IE ISO-8859-1
gd_GB ISO-8859-15
gez_ER UTF-8
gez_ET UTF-8
gv_GB ISO-8859-1
ha_NG UTF-8
hne_IN UTF-8
hsb_DE ISO-8859-2
ht_HT UTF-8
ig_NG UTF-8
ik_CA UTF-8
iu_CA UTF-8
iw_IL ISO-8859-8
kl_GL ISO-8859-1
km_KH UTF-8
ks_IN UTF-8
ku_TR ISO-8859-9
kw_GB ISO-8859-1
lg_UG ISO-8859-10
li_BE UTF-8
li_NL UTF-8
lo_LA UTF-8
mai_IN UTF-8
mg_MG ISO-8859-15
nds_DE UTF-8
nds_NL UTF-8
ne_NP UTF-8
nl_AW UTF-8
no_NO ISO-8859-1
nr_ZA UTF-8
nso_ZA UTF-8
oc_FR ISO-8859-1
om_ET UTF-8
om_KE ISO-8859-1
or_IN UTF-8
pap_AN UTF-8
pa_PK UTF-8
ru_UA KOI8-U
rw_RW UTF-8
sc_IT UTF-8
sd_IN UTF-8
shs_CA UTF-8
sh_YU ISO-8859-2
sid_ET UTF-8
si_LK UTF-8
so_DJ ISO-8859-1
so_ET UTF-8
so_KE ISO-8859-1
so_SO ISO-8859-1
ss_ZA UTF-8
st_ZA ISO-8859-1
tg_TJ KOI8-T
ti_ER UTF-8
ti_ET UTF-8
tig_ER UTF-8
tk_TM UTF-8
tl_PH ISO-8859-1
tr_CY ISO-8859-9
ts_ZA UTF-8
ug_CN UTF-8
ve_ZA UTF-8
wa_BE ISO-8859-1
wo_SN UTF-8
yi_US CP1255
yo_NG UTF-8
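
Lists like the two above (inconsistent encodings and unknown locales) can be produced mechanically once both systems' locale-to-charset mappings are available as tables. Below is a rough sketch under that assumption; the function name compare_charsets is made up, and the two dictionaries hold only a handful of entries copied from this message rather than a complete dump of either system.

```python
def compare_charsets(reference, candidate):
    """Compare two {locale: charset} tables.

    Returns (missing, differing): the locales absent from candidate, and
    (locale, reference_charset, candidate_charset) tuples for mismatches.
    """
    missing = sorted(name for name in reference if name not in candidate)
    differing = sorted(
        (name, reference[name], candidate[name])
        for name in reference
        if name in candidate and reference[name] != candidate[name]
    )
    return missing, differing


# A few sample entries taken from the lists in this message.
LINUX = {
    "et_EE": "ISO-8859-1",
    "kk_KZ": "PT154",
    "zh_HK": "BIG5-HKSCS",
    "aa_DJ": "ISO-8859-1",
    "tg_TJ": "KOI8-T",
}
CYGWIN = {
    "et_EE": "ISO-8859-15",
    "kk_KZ": "ISO-8859-5",
    "zh_HK": "BIG5",
}

if __name__ == "__main__":
    missing, differing = compare_charsets(LINUX, CYGWIN)
    print("Not known on cygwin:", ", ".join(missing))
    for name, linux_cs, cygwin_cs in differing:
        print(f"{name:8} {linux_cs:12} {cygwin_cs}")
```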


------------------------------------------------------------------------
And finally, some systems (e.g. Fedora) maintain a number of full-word locales (locale aliases?) that are not known on cygwin either (maybe not harmful):
(Note: non-ASCII letters in some of the locale names on those systems are in 8-bit, Latin-1)
bokmal ISO-8859-1
bokmål ISO-8859-1
catalan ISO-8859-1
croatian ISO-8859-2
czech ISO-8859-2
danish ISO-8859-1
dansk ISO-8859-1
deutsch ISO-8859-1
dutch ISO-8859-1
eesti ISO-8859-1
estonian ISO-8859-1
finnish ISO-8859-1
français ISO-8859-1
french ISO-8859-1
galego ISO-8859-1
galician ISO-8859-1
german ISO-8859-1
greek ISO-8859-7
hebrew ISO-8859-8
hrvatski ISO-8859-2
hungarian ISO-8859-2
icelandic ISO-8859-1
italian ISO-8859-1
japanese EUC-JP
korean EUC-KR
lithuanian ISO-8859-13
norwegian ISO-8859-1
nynorsk ISO-8859-1
polish ISO-8859-2
portuguese ISO-8859-1
romanian ISO-8859-2
russian ISO-8859-5
slovak ISO-8859-2
slovene ISO-8859-2
slovenian ISO-8859-2
spanish ISO-8859-1
swedish ISO-8859-1
thai TIS-620
turkish ISO-8859-9



------------------------------------------------------------------------
Thomas

