This is the mail archive of the cygwin mailing list for the Cygwin project.


Re: Unicode width data inconsistent/outdated

From: Thomas Wolff <towo at towo dot net>
To: cygwin at cygwin dot com
Date: Tue, 8 Aug 2017 02:28:54 +0200
Subject: Re: Unicode width data inconsistent/outdated
References: <f3c1b415-7a26-8bbe-a67f-5619d356f058@towo.net> <20170726080859.GA24312@calimero.vinschen.de> <5d3cb047-49f8-26a6-d816-387a71486e99@cygwin.com> <20170726095016.GA25666@calimero.vinschen.de> <289bd98b-e644-888d-07f8-8965b6538373@towo.net> <20170728195826.GI24013@calimero.vinschen.de> <1244bd24-bb27-d185-1f24-61beae02c2cd@towo.net> <20170804170156.GL25551@calimero.vinschen.de> <30486790-c59d-9a78-6000-b3c20fb86d9d@towo.net> <20170807092820.GQ25551@calimero.vinschen.de> <401b6d26-35cb-3026-afde-6bd5d09b2d71@SystematicSw.ab.ca> <9f7a8d16-6ebc-52ff-15ae-b1a52d23986b@towo.net> <0f8f1535-ed48-d170-7e57-c554bec23942@SystematicSw.ab.ca>

Am 07.08.2017 um 23:29 schrieb Brian Inglis:

On 2017-08-07 13:30, Thomas Wolff wrote:

Am 07.08.2017 um 21:07 schrieb Brian Inglis:

Implementation considerations for handling the Unicode tables described in
     http://www.unicode.org/versions/Unicode10.0.0/ch05.pdf
and implemented in
     https://www.strchr.com/multi-stage_tables

ICU icu4[cj] uses a folded trie of the properties, where the unique property
combinations are indexed, strings of those indices are generated for fixed size
groups of character codes, unique values of those strings are then indexed, and
those indices assigned to each character code group. The result is a multi-level
indexing operation that returns the required property combination for each
character.

https://slidegur.com/doc/4172411/folded-trie--efficient-data-structure-for-all-of-unicode
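
The block-deduplication idea described above can be sketched in a few lines. This is a minimal two-stage illustration, not ICU's actual trie: the block size of 256 code points, the toy "wide" property, and the names build and lookup are all assumptions made for the example.

```python
import unicodedata

BLOCK_BITS = 8                      # assumed: 256 code points per block
BLOCK_SIZE = 1 << BLOCK_BITS
MAX_CP = 0x110000                   # one past the last Unicode code point

def wide(cp):
    """Toy property: 1 if East Asian Width is W or F, else 0."""
    return 1 if unicodedata.east_asian_width(chr(cp)) in ("W", "F") else 0

def build(prop):
    """Return (stage1, stage2): block number -> block index, unique blocks."""
    stage1, stage2, seen = [], [], {}
    for start in range(0, MAX_CP, BLOCK_SIZE):
        block = tuple(prop(cp) for cp in range(start, start + BLOCK_SIZE))
        if block not in seen:       # store each distinct block only once
            seen[block] = len(stage2)
            stage2.append(block)
        stage1.append(seen[block])
    return stage1, stage2

def lookup(stage1, stage2, cp):
    """Two indexing steps: pick the block, then the entry inside it."""
    return stage2[stage1[cp >> BLOCK_BITS]][cp & (BLOCK_SIZE - 1)]

stage1, stage2 = build(wide)
print(len(stage1), "blocks,", len(stage2), "unique")
```

Because the tables are generated by a script from the property data, updating to a new Unicode release in this sketch means rerunning build, which is the kind of creation procedure the thread is asking about.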


The FOX Toolkit uses a similar approach, splitting the 21 bit character code
into 7 bit groups, with two higher levels of 7 bit indices, and more tweaks to
eliminate redundancy.

ftp://ftp.fox-toolkit.org/pub/FOX_Unicode_Tables.pdf
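
A three-level split of a 21-bit code into 7-bit groups, as described above, might look like the following. It is only a sketch of the indexing scheme, without the further tweaks the FOX document mentions; the toy property and the helper names dedup, build and lookup are assumptions.

```python
import unicodedata

BITS = 7
SIZE = 1 << BITS                    # 128 entries per table row
MASK = SIZE - 1

def is_combining(cp):
    """Toy property (assumed): nonzero canonical combining class."""
    if cp > 0x10FFFF:               # 21 bits cover more than the Unicode range
        return 0
    return 1 if unicodedata.combining(chr(cp)) else 0

def dedup(rows):
    """Store each distinct row once; return (unique_rows, index_per_row)."""
    unique, seen, index = [], {}, []
    for row in rows:
        if row not in seen:
            seen[row] = len(unique)
            unique.append(row)
        index.append(seen[row])
    return unique, index

def build(prop):
    values = [prop(cp) for cp in range(1 << (3 * BITS))]
    leaves, leaf_idx = dedup(tuple(values[i:i + SIZE])
                             for i in range(0, len(values), SIZE))
    mids, top = dedup(tuple(leaf_idx[i:i + SIZE])
                      for i in range(0, len(leaf_idx), SIZE))
    return top, mids, leaves

def lookup(tables, cp):
    top, mids, leaves = tables
    mid = mids[top[cp >> (2 * BITS)]]           # highest 7 bits
    leaf = leaves[mid[(cp >> BITS) & MASK]]     # middle 7 bits
    return leaf[cp & MASK]                      # lowest 7 bits

tables = build(is_combining)
```

Only the lowest level holds property values; the two upper levels hold small indices, which is why the choice of integer size for those indices matters for the total footprint.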

Thanks for the interesting links, I'll check them out.
But such multi-level tables don't really help without a given procedure for
updating them (that's only available for the lowest level, not for the
code-embedded levels).

Unicode estimates that property tables can be reduced to 7-8 KB using these
techniques, including minimal integer sizes for indices and array elements
(e.g. char or short, if you can keep the indices small) rather than pointers.
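
To illustrate the point about minimal integer sizes: a table generator can pick the narrowest element type that still holds the largest index. The sketch below uses Python's array module as a stand-in for C's unsigned char, short and int; the helper name smallest_array is hypothetical and the input values are assumed to be non-negative.

```python
from array import array

def smallest_array(values):
    """Pack non-negative ints into the narrowest unsigned array type."""
    top = max(values, default=0)
    for code in ("B", "H", "I", "L"):           # unsigned char, short, int, long
        if top < (1 << (8 * array(code).itemsize)):
            return array(code, values)
    raise ValueError("value too large for a machine integer")

def table_bytes(table):
    """Storage used by the packed table, excluding object overhead."""
    return len(table) * table.itemsize

small = smallest_array([0, 3, 255])             # fits in one byte per entry
wider = smallest_array([0, 3, 256])             # needs two bytes per entry
print(table_bytes(small), table_bytes(wider))
```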

Creation scripts used by PCRE and Python projects are linked from the bottom of
the second link above. Source and docs for these packages and ICU are available
under Cygwin, and FOX Toolkit is available in some distros and by FTP.

Also, as I've demonstrated, my more straight-forward and more efficient approach
will even use less total space than the multi-level approach if packed table
entries are used.
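
The message does not spell out the approach it refers to. Purely as an assumption for illustration, one common alternative to multi-level tables is a sorted table of runs searched by bisection, with the start code point and the property value packed into a single integer per entry. The field widths and the names build_runs and lookup below are hypothetical.

```python
from bisect import bisect_right
import unicodedata

VALUE_BITS = 8                      # assumed: property value fits in 8 bits
VALUE_MASK = (1 << VALUE_BITS) - 1

def wide(cp):
    """Toy property: 1 if East Asian Width is W or F, else 0."""
    return 1 if unicodedata.east_asian_width(chr(cp)) in ("W", "F") else 0

def build_runs(prop, limit=0x110000):
    """One packed entry per run: (first code point << 8) | value."""
    table, prev = [], None
    for cp in range(limit):
        value = prop(cp)
        if value != prev:           # a new run starts here
            table.append((cp << VALUE_BITS) | value)
            prev = value
    return table                    # ascending, since cp only increases

def lookup(table, cp):
    """Find the last run whose start is <= cp and return its value."""
    key = (cp << VALUE_BITS) | VALUE_MASK
    return table[bisect_right(table, key) - 1] & VALUE_MASK

runs = build_runs(wide)
print(len(runs), "packed entries")
```

Each entry here needs 21 + 8 = 29 bits, so it fits a 32-bit integer; the lookup costs a binary search rather than a fixed number of index operations.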

Unicode recommends the double table index approach as a means of eliminating the
massive redundancy that exists in char property entries and char groups, and
using small integers instead of pointers, that can be optimized to meet
conformance levels and platform speed and size limits, at the cost of an annual
review of properties and rebuild. The amount of redundancy removed by this
approach is estimated in the FOX Toolkit doc and ranges across orders of
magnitude. Unfortunately none of these docs or sources quote sizes for any
Unicode release!

My own first take on these was to use run length encoded bitstrings for each
binary property, similar to database bitmap indices, but the grouping of
property blocks in Unicode, and their recommendation, persuaded me their
approach was likely backed by a bunch of supporting corps' and devs' R&D, and is
similar to those used for decades in database queries handling (lots of) small
value set equivalence class columns to reduce memory pressure while speeding up
selections.
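
The run-length-encoded bitstring idea for a single binary property can be sketched as follows. The convention that the first run counts 0-bits, and the names rle_encode and rle_lookup, are assumptions for the example; the property used is simply "ASCII uppercase letter".

```python
from bisect import bisect_right
from itertools import accumulate, groupby

def rle_encode(bits):
    """Alternating run lengths; by convention the first run counts 0-bits."""
    groups = [(bit, sum(1 for _ in g)) for bit, g in groupby(bits)]
    runs = [n for _, n in groups]
    if groups and groups[0][0] == 1:
        runs.insert(0, 0)           # empty leading run of 0-bits
    return runs

def rle_lookup(ends, cp):
    """ends holds cumulative run ends; odd-numbered runs are 1-bits.
    Only valid for cp below the total encoded length."""
    return bisect_right(ends, cp) & 1

bits = [1 if chr(cp).isupper() else 0 for cp in range(128)]
runs = rle_encode(bits)             # [65, 26, 37]
ends = list(accumulate(runs))       # [65, 91, 128]
print(runs, rle_lookup(ends, ord("Q")), rle_lookup(ends, ord("q")))
```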

I am not quite sure what you're trying to suggest or recommend now, but
the thing is, I just wanted to get an update of width data in the first
place, which is an easy and undisputed change; then Corinna pointed out
that the ctype functions are based on old Unicode data too, so I made an
attempt to update them too. I use the approach that I also use for two
other projects (mined and mintty) and I didn't mean this to become a
research project for me :/
I am certainly willing to consider specs and all that to achieve a
suitable result, but I don't feel like implementing any fancy algorithm
recommended by Unicode with unconvincing rationale, especially after
I've calculated that my method uses even less memory.

Thomas


