Bug 1354

Summary:	Problem in Turkish Locale Data (tr_TR.UTF-8 , Unicode)
Product:	glibc	Reporter:	Devrim GUNDUZ <devrim>
Component:	localedata	Assignee:	GNU C Library Locale Maintainers <libc-locales>
Status:	RESOLVED INVALID
Severity:	normal	CC:	glibc-bugs
Priority:	P2	Flags:	fweimer: security-
Version:	2.3.4
Target Milestone:	---
Host:		Target:
Build:		Last reconfirmed:

Description Devrim GUNDUZ 2005-09-19 08:25:32 UTC

Hi,

We are experiencing problems in the Turkish locale.

$ cat /etc/redhat-release
Red Hat Enterprise Linux ES release 4 (Nahant Update 1)
$ rpm -qv glibc
glibc-2.3.4-2.9

Here is a short description of our problem:

==========================================================
test=# SELECT * from unicode_test WHERE a ILIKE 'ö%';
 a
----
 ös
(1 row)

test=# SELECT * from unicode_test WHERE a ILIKE 'Ö%';
 a
---
(0 rows)

test=#
==========================================================

Now details:

In Turkish, we have a special letter "ö" (o with dots on it). Its capital form
is "Ö" (capital O with dots on it). The same applies to i and İ, and to ı and I.

I'm using PostgreSQL 8.0.3. PostgreSQL relies on the operating system for string
operations. So if there is something wrong in glibc, PostgreSQL also fails on
that locale.

As you can see in the short description, ILIKE (case-insensitive LIKE) cannot
find the correct result. The same problem occurs for i and dotless ı (whose
capitals are İ and I, respectively). Both database queries should return the
same result.

I hope this can be fixed in the next release of glibc.

Comment 1 Jakub Jelinek 2005-09-19 08:38:28 UTC

This is very likely an application bug (PostgreSQL in this case); you need
to sort it out there, not here.  Even if it were a glibc bug
(unlikely, e.g. towlower/towupper etc. are known to work just fine
with Turkish dotless i/I and i/I with dot above), this would still be the
wrong kind of bug report here.  For a bug report here, you need to provide a
self-contained testcase that uses just glibc and shows the bug, or show, say
in ltrace, which calls return incorrect values.  Otherwise everybody could
claim something is a glibc bug and we'd have to debug all application bugs
just in case they might be glibc bugs.
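
A self-contained testcase of the kind requested above could look roughly like
the following sketch. It is illustrative only: the helper name
count_turkish_case_mismatches is made up, it assumes the tr_TR.UTF-8 locale is
installed, and it assumes wchar_t values are Unicode code points (true on
glibc). The expected mappings are the Turkish pairs i/İ and ı/I plus ö/Ö.

```c
#include <locale.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: returns the number of case mappings that differ
   from the Turkish expectation, or -1 if the locale cannot be selected. */
int count_turkish_case_mismatches(void)
{
    /* Code points, assuming wchar_t holds Unicode values. */
    static const struct { wint_t lower; wint_t upper; } pairs[] = {
        { 0x0069, 0x0130 },  /* i  <-> I WITH DOT ABOVE */
        { 0x0131, 0x0049 },  /* DOTLESS I <-> I         */
        { 0x00F6, 0x00D6 },  /* o WITH DIAERESIS, both cases */
    };
    int mismatches = 0;

    /* Note: this changes the process-wide locale. */
    if (setlocale(LC_ALL, "tr_TR.UTF-8") == NULL)
        return -1;

    for (size_t i = 0; i < sizeof pairs / sizeof pairs[0]; i++) {
        wint_t up = towupper(pairs[i].lower);
        wint_t lo = towlower(pairs[i].upper);

        if (up != pairs[i].upper) {
            printf("towupper(U+%04X) = U+%04X, expected U+%04X\n",
                   (unsigned)pairs[i].lower, (unsigned)up,
                   (unsigned)pairs[i].upper);
            mismatches++;
        }
        if (lo != pairs[i].lower) {
            printf("towlower(U+%04X) = U+%04X, expected U+%04X\n",
                   (unsigned)pairs[i].upper, (unsigned)lo,
                   (unsigned)pairs[i].lower);
            mismatches++;
        }
    }
    return mismatches;
}
```

Called from a small main that prints the return value, a result of 0 would
point away from glibc's case tables, a positive count would give concrete
calls to cite in a report, and -1 means the locale itself is missing.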

Especially with Turkish i/I and case insensitivity, where the UTF-8
representation is one byte for one case and two bytes for the other (and one
is ASCII, while the other is not), a great many applications don't handle
this well.
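
The length mismatch described above can be made concrete with a small sketch.
The function name utf8_encoded_length is hypothetical; it only applies the
standard UTF-8 length rules to a code point and does not depend on any locale.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes UTF-8 uses for one code point,
   or 0 for surrogates and values beyond U+10FFFF. */
size_t utf8_encoded_length(uint32_t cp)
{
    if (cp < 0x80)
        return 1;               /* ASCII */
    if (cp < 0x800)
        return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogates are not encodable */
    if (cp < 0x10000)
        return 3;
    if (cp < 0x110000)
        return 4;
    return 0;
}
```

For the Turkish pairs, i (U+0069) and I (U+0049) take one byte each, while
İ (U+0130) and ı (U+0131) take two. Code that changes case one byte at a
time, or that assumes case conversion preserves the byte length of a string,
cannot map i to İ or I to ı correctly.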