This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



RFC: locale-source validation script


In the thread about using fewer <Uxxxx> escapes in the locale source
files, Carlos was concerned that, if we went over to UTF-8 for
everything, rather than just decoding the escapes that represent ASCII, it
would be easy for people to miss incorrectly encoded character
sequences - text that isn't normalized, for instance, or homograph
characters that *look* OK but are incorrect for the locale's language.

It seems to me that this sort of check is not something that humans
should have to do by eye; rather, it's a job for a linter.  So I wrote
one. :)  It currently looks for "inappropriate" escape sequences and
characters, using a quite strict notion of "inappropriate"; for
strings that are not in Unicode Normalization Form C; and for strings
that cannot be transcoded to the legacy charset for the locale (as
defined by a "% Charset: xxx" annotation in the file - note that not
all the files have such annotations).
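In outline, the two content checks amount to only a few lines of
Python.  This is just a sketch of the idea, not the attached script,
and the charset name below is only an example:

```python
import codecs
import unicodedata

def check_string(s, charsets):
    """Return a list of problems with S: not being in Unicode NFC,
       or not being representable in a declared legacy charset."""
    problems = []
    if unicodedata.normalize("NFC", s) != s:
        problems.append("not NFC")
    for cs in charsets:
        try:
            codecs.lookup(cs).encode(s)
        except UnicodeEncodeError:
            problems.append("not representable in " + cs)
    return problems

# 'e' followed by COMBINING ACUTE ACCENT is not NFC, and the combining
# character has no encoding in ISO-8859-5 (a Cyrillic charset):
print(check_string("e\u0301", ["ISO-8859-5"]))
# -> ['not NFC', 'not representable in ISO-8859-5']
```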

It is not ready for prime time; it is very slow (Python isn't really
designed to go character-by-character through a file; it can probably
be sped up with a cleverer lexer) and it finds a whole bunch of
existing errors, some of which may not actually be _problems_, if you
see what I mean.  I've attached the script and the result of running
it over all of the files in localedata/locales/.  But it's ready for
people to poke at.

Some notes on what I found with it:

 - Many of the existing locale files have non-ASCII text in their
comments.  This text is _invariably_ encoded in UTF-8.  That no one
has complained about this is a weak argument in favor of it being safe
to go ahead with UTF-8 - there might be localedef implementations that
accept non-ASCII in comments but not elsewhere, I suppose.

 - A few of the existing locale files have non-ASCII text in strings
already.  Again, this is invariably encoded in UTF-8, and I think so
far it's limited to LC_IDENTIFICATION (accented characters in the
author's name, that sort of thing).

 - Quite a few of the existing locale files have strings outside
LC_IDENTIFICATION that contain "raw" ASCII already.  (This is why I
had to write a full-on lexer for the format; existing files contain
both % inside "" strings, and " characters inside % comments.  That
was a step beyond what I felt like doing with regexes.)
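To make the regex problem concrete, here is a toy example (not taken
from any real locale file): a pattern that grabs quoted strings will
happily fire inside a comment, because it has no notion of context.

```python
import re

# A naive "find the quoted strings" regex:
naive_string_re = re.compile(r'"([^"]*)"')

comment_line = '% see the "Charset" annotation above'
match = naive_string_re.search(comment_line)
# False positive: the regex cannot tell that this line is a comment.
print(match.group(1))  # -> Charset
```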

 - There are quite a few strings that aren't NFC and I suspect it's
going to take expert knowledge of the languages involved to tell if
that's desirable.
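For anyone who hasn't run into normalization before, this is what a
non-NFC string looks like at the codepoint level (illustrative
snippet, not from any locale file):

```python
import unicodedata

decomposed = "Cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

print([hex(ord(c)) for c in decomposed])  # five codepoints, ends ...0x65, 0x301
print([hex(ord(c)) for c in composed])    # four; ends 0xe9 (e WITH ACUTE)
print(decomposed == composed)             # False, though they render identically
```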

 - A significantly cleverer homograph checker is wanted, one that keys
off of the ISO language code, rather than the legacy charset.  (The
legacy-charset check is already done by localedef, AFAIK, and
localedef has more complete information when it does that.)
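As a very crude first approximation of what such a checker might do,
one can detect mixed-script strings by taking the first word of each
character's Unicode name.  A real checker would want the Script
property (UAX #24) and the confusables data (UTS #39) instead; this is
only a sketch:

```python
import unicodedata

def scripts_used(s):
    """Crude script detection: the first word of each alphabetic
       character's Unicode name ('LATIN', 'CYRILLIC', 'GREEK', ...).
       More than one script in a single word is suspicious."""
    scripts = set()
    for c in s:
        if c.isalpha():
            scripts.add(unicodedata.name(c, "UNKNOWN").split()[0])
    return scripts

print(scripts_used("paypal"))       # -> {'LATIN'}
print(scripts_used("p\u0430ypal"))  # Cyrillic 'а': -> {'CYRILLIC', 'LATIN'}
```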

 - The complaints about "inappropriate character '\t'" are all caused
by _unintentional_ tabs inside strings.  If you write

message "xyz/
         abc"

the whitespace on the second line gets included in the string, which
is not what you want.  The linter currently only detects this when
that indentation is done with tabs, but I think it should probably
detect spaces as well.  If you _mean_ to put a tab in a string write
<U0009>. :-)
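Detecting the space version of this mistake could be as simple as
flagging any escaped newline that is followed by leading whitespace.
A standalone sketch (assuming the file's escape_char is already known;
a real check would only look inside strings):

```python
import re

def suspicious_continuations(text, escape_char="/"):
    """Return True if an escaped newline is followed by leading
       whitespace, which would silently end up inside the string."""
    pat = re.compile(re.escape(escape_char) + r"\n[ \t]+")
    return bool(pat.search(text))

print(suspicious_continuations('message "xyz/\n         abc"'))  # True
print(suspicious_continuations('message "xyz/\nabc"'))           # False
```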

 - All of the complaints about "inappropriate escape sequences" boil
down to people forgetting that / is an escape character in these
files, and writing strings with slashes in them.  This is limited to
LC_IDENTIFICATION, so it's cosmetic, but it's still wrong and IMNSHO
justifies the linter insisting that you're only supposed to use / to
escape <>/".

 - Speaking of, why is it that every single locale source file uses %
for comments and / for escapes, instead of the default # for comments
and \ for escapes?  It seems gratuitous and it made the linter harder
to write.

 - Suggestions for additional checks are welcome.

zw

Attachment: locale-errs.txt
Description: Text document

#!/usr/bin/python3
# Validate locale definitions.
# Copyright (C) 2017 Free Software Foundation, Inc.
# This file is part of the GNU C Library.
#
# The GNU C Library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# The GNU C Library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with the GNU C Library; if not, see
# <http://www.gnu.org/licenses/>.

"""Validate locale definition files in ways that are too complicated
or too expensive to code into localedef.  This script is run over all
locale definitions as part of 'make check', when Python 3 is available.

Currently this performs two checks on each string within each file on
the command line: it must be unchanged by Unicode NFC normalization,
and it must be representable in the legacy character set(s) declared in
an annotation (e.g. % Charset: ISO-8859-5, KOI8-R).
"""

import argparse
import codecs
import contextlib
import functools
import os
import re
import sys
import textwrap
import traceback
import unicodedata

from curses.ascii import isgraph

class ErrorLogger:
    def __init__(self, ofp, verbose):
        self.ofp     = ofp
        self.verbose = verbose
        self.status  = 0
        self.fname   = None
        self.fstatus = 0

    def begin_file(self, fname):
        self.fname   = fname
        self.fstatus = 0
        if self.verbose:
            self.ofp.write(self.fname)
            self.ofp.write("... ")

    def end_file(self):
        if self.fstatus:
            self.status = 1
        elif self.verbose:
            self.ofp.write("OK\n")

    def error(self, lineno, message, *args):
        if self.verbose:
            if self.fstatus == 0:
                self.ofp.write("\n")
            self.ofp.write("  ")
        if args:
            message = message.format(*args)
        self.ofp.write("{}:{}: {}\n".format(self.fname, lineno, message))

        self.fstatus = 1

    def oserror(self, filename, errmsg):
        # If all these things are true, the last thing printed was the
        # filename that provoked an OS error (e.g. we failed to open the
        # file we're logging for) so just print the error message.
        if self.verbose and self.fname == filename and self.fstatus == 0:
            self.ofp.write(errmsg)
            self.ofp.write("\n")
        else:
            if self.verbose:
                if self.fstatus == 0:
                    self.ofp.write("\n")
                self.ofp.write("  ")
            self.ofp.write("{}: {}\n".format(filename, errmsg))

        self.fstatus = 1

    def exception(self):
        if self.verbose:
            if self.fstatus == 0:
                self.ofp.write("\n")
            prefix = "  "
        else:
            prefix = ""
            self.ofp.write("{}: error:\n".format(self.fname))

        for msg in traceback.format_exc().split("\n"):
            self.ofp.write(prefix)
            self.ofp.write(msg)
            self.ofp.write("\n")

        self.fstatus = 1

    def dump_codepoints(self, label, s):
        codepoints = [ord(c) for c in s]
        if any(c > 0xFFFF for c in codepoints):
            form = "06X"
        else:
            form = "04X"
        dumped = " ".join(format(c, form) for c in codepoints)
        if self.verbose:
            label = "  " + label
        self.ofp.write(textwrap.fill(dumped, width=78,
                                     initial_indent=label,
                                     subsequent_indent=" "*len(label)))
        self.ofp.write("\n")

@contextlib.contextmanager
def logging_for_file(log, fname):
    try:
        log.begin_file(fname)
        yield
    except OSError as e:
        log.oserror(e.filename, e.strerror)
    except Exception:
        log.exception()
    finally:
        log.end_file()

class PushbackWrapper:
    """Wrap around a file-like object and provide a pushback stack.
       Also counts line numbers for you, so that you don't double-count
       pushed-back newlines.

       This is not itself a file-like object; its only methods are
       get(), which returns a single character, and pushback().
       Also, although calling get() without pushback() will eventually
       _consume_ all of the underlying stream, this object does _not_
       own the underlying stream; in particular it will not close the
       underlying stream for you.
    """
    def __init__(self, fp):
        self.lineno = 1
        self._fp = fp
        self._pushback = []

    def get(self):
        if self._pushback:
            return self._pushback.pop()

        c = self._fp.read(1)
        if c == '\n':
            self.lineno += 1
        return c

    def pushback(self, c):
        self._pushback.append(c)

def inappropriate_unichar(c):
    """A relaxed definition of 'inappropriate character', currently used in
       comments only: arbitary Unicode characters are allowed, but not
       the legacy control characters (except TAB), nor the Unicode NIH
       line-breaking characters, nor bare surrogates, nor noncharacters.
       Private-use, not-yet-assigned, and format controls (Cf) are fine,
       except that BYTE ORDER MARK (U+FEFF) is not allowed.  OBJECT
       REPLACEMENT CHARACTER (U+FFFC) and REPLACEMENT CHARACTER (U+FFFD)
       are officially "symbols", but we weed them out as well, because
       their presence in a locale file means something has gone wrong
       somewhere.
    """
    cat = unicodedata.category(c)
    if cat == 'So' and (c == '\uFFFC' or c == '\uFFFD'):
        return True
    if cat == 'Zl' or cat == 'Zp' or cat == 'Cs':
        return True
    if cat == 'Cc' and c != '\t':
        return True
    if cat == 'Cf' and c == '\uFEFF':
        return True
    if cat == 'Cn' and ord(c) in {
            0x00FDD0, 0x00FDD1, 0x00FDD2, 0x00FDD3, 0x00FDD4, 0x00FDD5,
            0x00FDD6, 0x00FDD7, 0x00FDD8, 0x00FDD9, 0x00FDDA, 0x00FDDB,
            0x00FDDC, 0x00FDDD, 0x00FDDE, 0x00FDDF, 0x00FDE0, 0x00FDE1,
            0x00FDE2, 0x00FDE3, 0x00FDE4, 0x00FDE5, 0x00FDE6, 0x00FDE7,
            0x00FDE8, 0x00FDE9, 0x00FDEA, 0x00FDEB, 0x00FDEC, 0x00FDED,
            0x00FDEE, 0x00FDEF,

            0x00FFFE, 0x00FFFF, 0x01FFFE, 0x01FFFF, 0x02FFFE, 0x02FFFF,
            0x03FFFE, 0x03FFFF, 0x04FFFE, 0x04FFFF, 0x05FFFE, 0x05FFFF,
            0x06FFFE, 0x06FFFF, 0x07FFFE, 0x07FFFF, 0x08FFFE, 0x08FFFF,
            0x09FFFE, 0x09FFFF, 0x0AFFFE, 0x0AFFFF, 0x0BFFFE, 0x0BFFFF,
            0x0CFFFE, 0x0CFFFF, 0x0DFFFE, 0x0DFFFF, 0x0EFFFE, 0x0EFFFF,
            0x0FFFFE, 0x0FFFFF, 0x10FFFE, 0x10FFFF,
    }:
        return True
    return False

def tok_escape(fp, log, escape_char):
    """Consume an escape sequence from FP and return its value.  If the
       character escaped is not the escape_char, a newline, '"', '<', or
       '<', issue an error -- we want only <Uxxxx> used for anything
       else -- but do properly crunch the escape regardless."""
    c = fp.get()
    if c == 'x':
        # \x consumes one or two hexadecimal digits.
        maxchars = 2
        base = 16
        ok = "0123456789abcdef"
        digits = []
        prefix = escape_char + c
    elif c == 'd':
        # \d consumes one, two, or three decimal digits.
        maxchars = 3
        base = 10
        ok = "0123456789"
        digits = []
        prefix = escape_char + c
    elif c in "01234567":
        # \0 consumes one, two, or three octal digits.
        maxchars = 3
        base = 8
        ok = "01234567"
        digits = [c]
        prefix = escape_char
    else:
        # Not a numeric escape.
        if c not in ('\n', '"', '<', '>', escape_char):
            log.error(fp.lineno, "inappropriate escape sequence '{}{}'",
                      escape_char, c)
        return c

    while len(digits) < maxchars:
        d = fp.get()
        if d not in ok:
            fp.pushback(d)
            break
        digits.append(d)

    s = "".join(digits)
    log.error(fp.lineno, "inappropriate escape sequence '{}{}'",
              prefix, s)

    return chr(int(s, base))

def tokenize(fp, log):
    """Tokenize a locale definition file.  Yields a sequence of pairs
       (lineno, string).  May also emit error messages.
    """
    # Tokenizer state codes
    S_START   = 0  # in between tokens
    S_WORD    = 1  # foo, 123
    S_STRING  = 2  # "foo" or <foo>
    S_COMMENT = 3  # comment_char to EOL

    tbuf         = []
    tline        = None
    comment_char = '#'
    escape_char  = '\\'
    end_char     = None
    state        = S_START
    fp           = PushbackWrapper(fp)

    while True:
        c = fp.get()

        if state == S_START:
            if c == '': # end of file
                break

            if c == ' ' or c == '\t' or c == '\n':
                pass
            elif c == ',' or c == ';':
                yield (fp.lineno, c)
            elif c == comment_char:
                state = S_COMMENT
                tline = fp.lineno
                tbuf.append(c)
            elif c == '<':
                state = S_STRING
                end_char = '>'
                tline = fp.lineno
                tbuf.append(c)
            elif c == '"':
                state = S_STRING
                end_char = '"'
                tline = fp.lineno
                tbuf.append(c)
            elif c == escape_char:
                c = tok_escape(fp, log, escape_char)
                state = S_WORD
                tline = fp.lineno
                tbuf.append(c)
            elif isgraph(c):
                state = S_WORD
                tline = fp.lineno
                tbuf.append(c)
            else:
                log.error(fp.lineno, "inappropriate character {!r}", c)

        elif state == S_WORD:
            if c == escape_char:
                c = tok_escape(fp, log, escape_char)
                if c != '\n':
                    tbuf.append(c)

            elif (c != '' and isgraph(c)
                  and c != comment_char and c not in ',;<"'):
                tbuf.append(c)

            else:
                fp.pushback(c)
                state = S_START
                word = ''.join(tbuf)
                tbuf.clear()

                if word == "escape_char" or word == "comment_char":
                    c = fp.get()
                    while c == ' ' or c == '\t':
                        c = fp.get()
                    if c == '\n' or c == '':
                        log.error(fp.lineno - (1 if c == '\n' else 0),
                                  "empty {} directive", word)
                    elif c in ',;<"' or not isgraph(c):
                        log.error(fp.lineno, "{} may not be set to {!r}",
                                  word, c)

                    elif word == "escape_char":
                        if c == comment_char:
                            log.error(fp.lineno,
                                      "escape_char and comment_char "
                                      "may not be the same")
                        else:
                            escape_char = c

                    else:
                        if c == escape_char:
                            log.error(fp.lineno,
                                      "escape_char and comment_char "
                                      "may not be the same")
                        else:
                            comment_char = c
                else:
                    yield (tline, word)

        elif state == S_STRING:
            if c == escape_char:
                c = tok_escape(fp, log, escape_char)
                if c != '\n':
                    tbuf.append(c)

            elif c == '\n' or c == '' or c == end_char:
                if c != end_char:
                    log.error(fp.lineno - (0 if c == '' else 1),
                              "end of {} in {}",
                              "file" if c == '' else "line",
                              "string" if end_char == '"' else "symbol")

                state = S_START
                yield (tline, ''.join(tbuf))
                tbuf.clear()
                end_char = None

            else:
                # We don't accept tab here; inside a string, tab
                # should be <U0009> to make clear that it is
                # intentional.
                if c != ' ' and not isgraph(c):
                    log.error(fp.lineno, "inappropriate character {!r} in {}",
                              c, "string" if end_char == '"' else "symbol")
                else:
                    tbuf.append(c)

        elif state == S_COMMENT:
            # POSIX specifically says that comments are _not_ continued
            # onto the next line by the escape_char.
            if c == '\n' or c == '':
                state = S_START
                yield (tline, ''.join(tbuf))
                tbuf.clear()

            else:
                # In comments, we relax the definition of "inappropriate
                # character"; arbitrary Unicode is allowed.
                if inappropriate_unichar(c):
                    log.error(fp.lineno, "inappropriate character {!r}", c)
                else:
                    tbuf.append(c)

charset_re = re.compile(r"(?i)\bcharset: (.+)$")
charset_split_re = re.compile(r"[,; \t][ \t]*")

def add_charsets(line, lno, charsets, log):
    m = charset_re.search(line)
    if not m:
        return

    for cs in charset_split_re.split(m.group(1)):
        try:
            co = codecs.lookup(cs)
            if co.name not in charsets:
                charsets[co.name] = co

        except LookupError:
            log.error(lno, "unknown charset {!r}", cs)

unicode_symbol_re = re.compile("(?i)<U([0-9a-f]+)>")
def decode_unicode_symbols(s, lineno, log):
    """Convert <Uxxxx> tokens to the corresponding characters.
       Other symbolic names are left untouched."""
    try:
        return unicode_symbol_re.sub(lambda c: chr(int(c.group(1), 16)), s)
    except (UnicodeError, ValueError) as e:
        log.error(lineno, "invalid <Uxxxx> token in string: {}", str(e))
        return s

def process(fp, log):
    strings = []
    charsets = {}

    for lno, tok in tokenize(fp, log):
        if tok[0] == '"':
            s = decode_unicode_symbols(tok[1:], lno, log)
            canon_s = unicodedata.normalize("NFC", s)
            if canon_s != s:
                log.error(lno, "string not normalized:")
                log.dump_codepoints("  source: ", s)
                log.dump_codepoints("     nfc: ", canon_s)

            strings.append((lno, canon_s))

        elif tok[0] == '%':
            # All the glibc locale files use '%' as comment_char, and
            # the "% Charset:" annotation only appears in comments.
            add_charsets(tok[1:], lno, charsets, log)

        else:
            pass # ignore all other tokens for now

    for charset, codec in sorted(charsets.items()):
        for lno, s in strings:
            try:
                _ = codec.encode(s)
            except UnicodeEncodeError:
                log.error(lno, "string not representable in {}:", charset)
                log.dump_codepoints("    ", s)

def process_files(args):
    logger = ErrorLogger(sys.stderr, args.verbose)

    for f in args.files:
        with logging_for_file(logger, f), \
             open(f, "rt", encoding=args.encoding) as fp:
            process(fp, logger)

    return logger.status

def main():
    ap = argparse.ArgumentParser(description=__doc__)
    ap.add_argument("-v", "--verbose", action="store_true")
    ap.add_argument("-e", "--encoding", default="utf-8")
    ap.add_argument("files", nargs="+")
    args = ap.parse_args()
    sys.exit(process_files(args))

if __name__ == "__main__":
    main()
