This is the mail archive of the libc-help@sourceware.org mailing list for the glibc project.



Re: Identifying when collations change


On 03 Jul 2015 15:16, Craig Ringer wrote:
> The PostgreSQL database relies on the collation support of the
> underlying platform, which in GNU/Linux is glibc. This works very well
> for most purposes, but a problem arises when the collation rules are
> updated by the platform due to bug fixes or changes in accepted
> language rules.
> 
> PostgreSQL builds persistent on-disk b-tree indexes by executing the
> system C library collation functions - strcoll or strcoll_l. Correct
> searching of these indexes requires that the C library collation
> function behaviour be pure and immutable, i.e. that any two calls over
> any time period will return the same result for any given input.
> Collation updates break that assumption, and indexes must be rebuilt
> (REINDEXed) to ensure correct queries.
> 
> If PostgreSQL had a way to detect when the collation definition an
> index was built with differed from the current collation definition it
> would be very helpful, as we could then alert users to the situation,
> or even repair the index if we could tell *what* changed, not just
> that something changed.

i don't know of a portable answer, but perhaps extending nl_langinfo would be
the less painful route? e.g. adding a GNU-specific item that returns a hash of
the collation data, so you could easily check whether it changed. </naive>
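
No such nl_langinfo item exists in glibc today. As a rough stand-in, a caller
can fingerprint the observable collation behaviour instead: transform a fixed
set of probe strings with strxfrm_l and hash the resulting sort keys. The
sketch below assumes POSIX.1-2008 (newlocale, strxfrm_l). The function name
collation_fingerprint, the probe list, and the choice of FNV-1a are all
illustrative and not part of any library.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical probe set. A real one would need far broader coverage
 * (non-ASCII text, punctuation, digits, multi-level differences). */
static const char *const probes[] = {
    "a", "A", "b", "B", "z", "Z", "ab", "a b", "a-b", "1", "10", "2", "_"
};

/* 64-bit FNV-1a over a byte range, continuing from hash state h. */
static uint64_t fnv1a(uint64_t h, const unsigned char *p, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Hash the strxfrm_l sort keys of the probe strings under locale_name.
 * Returns 0 and stores the hash in *out on success, or -1 if the locale
 * cannot be loaded or memory runs out. Does not touch the global locale. */
int collation_fingerprint(const char *locale_name, uint64_t *out)
{
    assert(locale_name != NULL && out != NULL);

    locale_t loc = newlocale(LC_COLLATE_MASK, locale_name, (locale_t)0);
    if (loc == (locale_t)0)
        return -1;

    uint64_t h = 14695981039346656037ULL;   /* FNV-1a offset basis */
    int rc = 0;
    for (size_t i = 0; i < sizeof probes / sizeof probes[0]; i++) {
        /* With n == 0 the destination may be NULL; the return value is
         * the key length, not counting the terminating NUL. */
        size_t need = strxfrm_l(NULL, probes[i], 0, loc);
        char *key = malloc(need + 1);
        if (key == NULL) {
            rc = -1;
            break;
        }
        strxfrm_l(key, probes[i], need + 1, loc);
        /* Include the NUL so adjacent keys cannot run together. */
        h = fnv1a(h, (const unsigned char *)key, need + 1);
        free(key);
    }
    freelocale(loc);

    if (rc == 0)
        *out = h;
    return rc;
}
```

The value could be stored when an index is built and compared again later. A
mismatch only says that something changed for the probed strings, not what
changed, and a match does not prove that unprobed strings still sort the same.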

> This isn't only an issue with collation updates on one machine. It
> also applies when a database is binary-replicated to another host with
> a different glibc version. Queries on the replica may produce
> incorrect results if the collations differ, and currently we have no
> way to detect this situation.

what about binary replication between different OSes or different C libraries?
or is that not supported?
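
For the glibc-to-glibc case, a coarse check is to record the library version
alongside the data and compare it on the replica. The sketch below uses
gnu_get_libc_version() from <gnu/libc-version.h>, which is glibc-specific;
the helper name libc_version_matches and the idea of a stored version string
are assumptions for illustration. The version is only a proxy: locale data
can differ between builds that report the same version, and other C libraries
would need a different identifier.

```c
#include <gnu/libc-version.h>
#include <string.h>

/* Returns 1 if stored_version (hypothetically recorded when the index was
 * built) equals the version of the glibc we are running on, else 0. */
int libc_version_matches(const char *stored_version)
{
    if (stored_version == NULL)
        return 0;
    return strcmp(stored_version, gnu_get_libc_version()) == 0;
}
```

A mismatch here would be a reason to warn or to fall back to a finer check,
not proof that any collation actually changed.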

> The alternative to detecting and reporting issues with platform
> collation changes is dropping the use of operating system collation
> support in favour of a portable library like ICU. That's undesirable
> for a number of reasons: ICU uses UTF-16 internally while PostgreSQL
> uses UTF-8, so there'd be ugly conversion overheads, and that's just
> one of the issues. It'd also potentially cause PostgreSQL's collation
> results to differ from that of the platform it runs on. I'd rather
> avoid that, so I'm really interested in a way to find out when glibc
> collations change, or even better a portable way to do it and possibly
> even derive what changed.

how would ICU help you determine when the collation data is updated? ICU's
collation database gets updated too, and you'd need to detect that at runtime.
-mike


