This is the mail archive of the guile@cygnus.com mailing list for the guile project.


Re: mbstrings

To: Shiro Kawai <shiro@sqush.squareusa.com>
Subject: Re: mbstrings
From: Jim Blandy <jimb@red-bean.com>
Date: Wed, 15 Oct 1997 04:15:37 -0400
Cc: Guile Mailing List <guile@cygnus.com>
References: <199710142037.QAA03503@totoro.red-bean.com><199710150320.RAA05070@id004>


>I've heard the drawback of Unicode is it's not organized well for
>converting to/from existing character set.   (Unicode depends on
>character shape, but existing Chinese character encodings and
>Japanese character encodings are completely different even they
>share a lot of same shape characters, that means you need big
>lookup table for conversion.)

(Jim digs through the boxes in the hallway, and pulls out his copy of
"Understanding Japanese Information Processing", published by O'Reilly...)

True, there's no terse way to express the correspondence between
Unicode and JIS.  But the tables aren't too bad.  The Unicode
Consortium provides them in machine-readable form.

The most important Japanese character set encoding is JIS X 0208-1990,
a sixteen-bit character set.  We would like to be able to translate
between JIS 0208 and Unicode.

Taking the simplest and fastest approach, one needs two tables of
65536 characters each, mapping characters from Unicode to JIS 0208 and
back.  65536 chars/table * (2 bytes/char) * 2 tables = 256k bytes.
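
The arithmetic for the naive approach is easy to check at a Guile
prompt.  This is only a sketch; the names below are illustrative and
are not part of unicode.scm.

```scheme
;; Size of the naive approach: two flat tables, each indexed by a full
;; 16-bit code and holding one 16-bit code per slot.
;; (Illustrative names, not part of unicode.scm.)
(define codes-per-table (expt 2 16))        ; 65536 slots
(define bytes-per-code 2)
(define naive-table-bytes
  (* codes-per-table bytes-per-code 2))     ; two tables

(display naive-table-bytes)                 ; 262144
(newline)
(display (quotient naive-table-bytes 1024)) ; 256 (kilobytes)
(newline)
```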

You could reduce this by noticing that each of the two bytes of a JIS
character falls in the 0x21 -- 0x7e range, and using a two-dimensional
table instead of a one-dimensional one.  This would reduce the storage
required for the JIS->Unicode table to:
  94 rows * (4 bytes/row)   (for the row table)
  + 94 rows * (94 chars / row) * (2 bytes / char)
  = 18048 bytes for JIS->Unicode table

If you take advantage of the fact that only the I-Zone (Ideograph
zone) of Unicode maps to JIS (characters in the range 0x4e00 --
0x9fff), then the storage required for the Unicode->JIS table is:
  (0x9fff - 0x4e00 + 1) chars * (2 bytes / char)
  = 41984 bytes for Unicode->JIS table
(This isn't quite true, as JIS contains Cyrillic and Greek characters,
and other miscellanea.  But it's not too much.)

So the total required space would be:
   18048 bytes + 41984 bytes
   = 60032 bytes
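
The same figures can be reproduced in a few lines of Guile.  This is
just a check of the estimate above, under the same assumptions (4 bytes
per row pointer, 2 bytes per character); the names are illustrative.

```scheme
;; Storage estimate for the compact tables described above.
;; (Illustrative names, not part of unicode.scm.)
(define jis-rows 94)                     ; bytes 0x21 .. 0x7e
(define jis-cols 94)

;; Row table (4 bytes per row) plus 94 rows of 94 two-byte entries.
(define jis->unicode-bytes
  (+ (* jis-rows 4)
     (* jis-rows jis-cols 2)))           ; 376 + 17672 = 18048

;; One two-byte entry per code point in 0x4e00 .. 0x9fff.
(define unicode->jis-bytes
  (* (+ (- #x9fff #x4e00) 1) 2))         ; 20992 * 2 = 41984

(define total-bytes
  (+ jis->unicode-bytes unicode->jis-bytes))

(display total-bytes)                    ; 60032
(newline)
```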

That's not so bad, considering the benefits of Unicode.

The file ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/JIS/JIS0208.TXT
contains a mapping between JIS 0208 and Unicode.  Comments at the top
explain:

  General notes:

  This table contains the data the Unicode Consortium has on how
  JIS X 0208 (1983) characters map into Unicode.

  Format:  Four tab-separated columns
	   Column #1 is the shift-JIS code (in hex)
	   Column #2 is the JIS X 0208 code (in hex as 0xXXXX)
	   Column #3 is the Unicode (in hex as 0xXXXX)
	   Column #4 the Unicode name (follows a comment sign, '#')

	      The official names for Unicode characters U+4E00
	      to U+9FA5, inclusive, is "CJK UNIFIED IDEOGRAPH-XXXX",
	      where XXXX is the code point.  Including all these
	      names in this file increases its size substantially
	      and needlessly.  The token "<CJK>" is used for the
	      name of these characters.  If necessary, it can be
	      expanded algorithmically by a parser or editor.

  The entries are in JIS X 0208 order
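
The notes say the "<CJK>" token can be expanded algorithmically.  A
minimal sketch of that expansion in Guile might look like this;
expand-cjk-name is a hypothetical helper, not something the Consortium
or unicode.scm provides.

```scheme
;; Expand the "<CJK>" placeholder into the official character name.
;; CODE is a Unicode code point; the naming rule quoted above applies
;; to U+4E00 .. U+9FA5 inclusive, so anything else yields #f.
;; (expand-cjk-name is a hypothetical helper, not part of unicode.scm.)
(define (expand-cjk-name code)
  (if (<= #x4e00 code #x9fa5)
      (string-append "CJK UNIFIED IDEOGRAPH-"
                     (string-upcase (number->string code 16)))
      #f))

(display (expand-cjk-name #x4e9c))       ; CJK UNIFIED IDEOGRAPH-4E9C
(newline)
```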

I got sick of not really knowing what's going on here, so here's some
Guile code which builds the conversion tables.  Download the file
mentioned above, and then load the text below, and you'll get
functions that convert between Unicode and JIS 0208.  (The code
expects to find the file as JIS0208.TXT in the current directory.)

I had heard rumors that you couldn't convert from JIS-X-0208 to
Unicode and back perfectly, but you can.  The code below checks for
collisions, and it doesn't report any.  I would be curious to hear
whether this is true when one throws JIS-X-0212 (another character
set, with more Kanji) into the works as well.

If both work, then I am at a loss to understand why there is so much
resistance to Unicode in Japan: the tables are not large, characters
aren't lost, and you get access to the rest of the world's scripts...


;;;; unicode.scm --- Unicode<->JIS tables
;;;; Jim Blandy <jimb@red-bean.com> --- October 1997

(use-modules (ice-9 regex)
	     (ice-9 rdelim))	; current Guile keeps read-line here


;;;; Parsing the tables provided by the Unicode Consortium

;;; (read-jis-x-0208-table PORT)
;;;
;;; Parse the Unicode/JIS-X-0208 table, read from PORT.  Return it as
;;; an association list, mapping JIS to UNICODE.
(define read-jis-x-0208-table

  ;; Precompile these regexps.  Perl would do this automatically.  *pout*
  (let ((line-regexp
	 (make-regexp "^0x([0-9A-Fa-f]+)\t0x([0-9A-Fa-f]+)\t0x([0-9A-Fa-f]+)"))
	(comment-regexp
	 (make-regexp "^[ \t]*#")))

    (lambda (port)
      (display "Parsing Unicode tables, and rewriting in Schemey form...")
      (newline)
      (let loop ((table '())
		 (i 0))
	(if (and (> i 0) (zero? (remainder i 100)))
	    (begin
	      (display "\r")
	      (display i)
	      (display " entries processed...")
	      (force-output)))
	(let ((line (read-line port)))
	  (cond
	   ((eof-object? line)
	    (display "\r")
	    (display i)
	    (display " entries processed... done.") (newline)
	    table)
	   ((regexp-exec line-regexp line)
	    => (lambda (m)
		 (let ((jis     (string->number (match:substring m 2) 16))
		       (unicode (string->number (match:substring m 3) 16)))
		   (loop (cons (cons jis unicode) table) (+ i 1)))))
	   ((regexp-exec comment-regexp line)
	    (loop table i))
	   (else
	    (error "read-jis-x-0208-table: odd line:" line))))))))

(define (build-scm-from-jis-x-0208)
  (let ((in (open-input-file "JIS0208.TXT")))
    (let ((table (read-jis-x-0208-table in)))
      (close-port in)
      (let ((out (open-output-file "jis-x-0208.scm")))
	(display ";;;; jis-x-0208.scm --- tables mapping between" out)
	(display " JIS-X-0208 and Unicode" out)
	(newline out)
	(display ";;;; Generated automatically by unicode.scm" out)
	(newline out)
	(newline out)
	(write (list 'define 'jis-x-0208<->unicode (list 'quote table)) out)
	(newline out)
	(close-port out)))))

;;; If we haven't parsed the Consortium's data files, do so now.
(if (not (file-exists? "jis-x-0208.scm"))
    (build-scm-from-jis-x-0208))

(load "jis-x-0208.scm")



;;;; Mapping from JIS-X-0208 to Unicode

;;; A vector of vectors mapping JIS-X-0208 to Unicode.
;;;
;;; If T is table:jis-x-0208->unicode, then given a JIS character
;;; whose high byte is H and whose low byte is L:
;;;   (vector-ref (vector-ref T (- H 33)) (- L 33))
;;; is the corresponding Unicode character, or #f if there is no
;;; corresponding Unicode character, or the character is not a valid
;;; JIS character.
(define table:jis-x-0208->unicode
  (let ((t (make-vector 94 '())))
    (do ((i 0 (+ i 1)))
	((>= i 94))
      (vector-set! t i (make-vector 94 #f)))
    (for-each (lambda (entry)
		(let* ((jis (car entry))
		       (row (- (quotient jis 256) 33))
		       (column (- (remainder jis 256) 33))
		       (unicode (cdr entry)))
		  (if (vector-ref (vector-ref t row) column)
		      (error "make-jis-x-0208->unicode-table:"
			     "mapping is not unique:"
			     jis "maps to" unicode "and"
			     (vector-ref (vector-ref t row) column)))
		  (vector-set! (vector-ref t row) column unicode)))
	      jis-x-0208<->unicode)
    t))


;;; (jis-x-0208->unicode JIS) is the Unicode equivalent of the
;;; JIS-X-0208 character JIS, or #f if there is no equivalent.
(define (jis-x-0208->unicode jis)
  (let ((row (quotient jis 256))
	(col (remainder jis 256)))
    (if (and (<= #x21 row #x7e)
	     (<= #x21 col #x7e))
	(vector-ref (vector-ref table:jis-x-0208->unicode (- row 33))
		    (- col 33))
	#f)))


;;;; Mapping from Unicode to JIS-X-0208

;;; table:unicode->jis-x-0208 is a vector of vectors mapping Unicode
;;; space to JIS-X-0208.  Given a Unicode character whose most
;;; significant byte is ROW and whose least significant byte is COL,
;;; if
;;;    (vector-ref table:unicode->jis-x-0208 ROW)
;;; is #f, then there is no corresponding JIS character; otherwise,
;;;    (vector-ref (vector-ref table:unicode->jis-x-0208 ROW) COL)
;;; is the corresponding JIS character, or #f if there is none.
(define table:unicode->jis-x-0208
  (let ((t (make-vector 256 #f)))
    (for-each (lambda (entry)
		(let* ((jis (car entry))
		       (unicode (cdr entry))
		       (row (quotient unicode 256))
		       (col (remainder unicode 256)))
		  (if (not (vector-ref t row))
		      (vector-set! t row (make-vector 256 #f)))
		  (if (vector-ref (vector-ref t row) col)
		      (error "make-unicode->jis-x-0208-table:"
			     "mapping is not unique:"
			     unicode "maps to" jis "and"
			     (vector-ref (vector-ref t row) col)))
		  (vector-set! (vector-ref t row) col jis)))
	      jis-x-0208<->unicode)
    t))

;;; (unicode->jis-x-0208 UNICODE) is the JIS-X-0208 equivalent of the
;;; Unicode character UNICODE, or #f if there is no equivalent.
;;; Codes outside the 16-bit range the table covers also yield #f.
(define (unicode->jis-x-0208 unicode)
  (and (<= 0 unicode #xffff)
       (let ((row (vector-ref table:unicode->jis-x-0208
			      (quotient unicode 256))))
	 (and row (vector-ref row (remainder unicode 256))))))
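
The round-trip property claimed above can be illustrated without the
full data file.  The sketch below uses a three-entry stand-in for the
generated alist (the three mappings are taken from JIS0208.TXT); the
sample-* names and round-trips? are hypothetical helpers, not part of
unicode.scm.

```scheme
;; A three-entry stand-in for the generated jis-x-0208<->unicode alist.
;; Each entry is (JIS . UNICODE).
(define sample-table
  '((#x2121 . #x3000)      ; IDEOGRAPHIC SPACE
    (#x2422 . #x3042)      ; HIRAGANA LETTER A
    (#x3021 . #x4e9c)))    ; CJK UNIFIED IDEOGRAPH-4E9C

;; Look up by JIS code (the car of an entry).
(define (sample-jis->unicode jis)
  (let ((entry (assv jis sample-table)))
    (and entry (cdr entry))))

;; Look up by Unicode code (the cdr of an entry).
(define (sample-unicode->jis unicode)
  (let loop ((rest sample-table))
    (cond ((null? rest) #f)
          ((= (cdar rest) unicode) (caar rest))
          (else (loop (cdr rest))))))

;; #t if every entry survives JIS -> Unicode -> JIS unchanged.
(define (round-trips? table)
  (let loop ((rest table))
    (or (null? rest)
        (and (eqv? (caar rest)
                   (sample-unicode->jis
                    (sample-jis->unicode (caar rest))))
             (loop (cdr rest))))))

(display (round-trips? sample-table))    ; #t
(newline)
```

With the real tables loaded, the same loop could be run over
jis-x-0208<->unicode using jis-x-0208->unicode and unicode->jis-x-0208
in place of the sample lookups.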
