This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.



[Bug runtime/14487] New: need better UTF-8 handling


http://sourceware.org/bugzilla/show_bug.cgi?id=14487

             Bug #: 14487
           Summary: need better UTF-8 handling
           Product: systemtap
           Version: unspecified
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: runtime
        AssignedTo: systemtap@sourceware.org
        ReportedBy: jistone@redhat.com
    Classification: Unclassified


We generally take the blissful approach that all strings are merely
0-terminated byte sequences, and we don't care much about the meaning of those
bytes.

This breaks down wherever we start splitting up those bytes, though.  The most
obvious case is truncation at MAXSTRINGLEN, which can leave an incomplete UTF-8
sequence at the tail.  (Fortunately, UTF-8 is robust enough that this corrupts
only one Unicode character in the output.)  We also have functions like
substr() that count by bytes rather than by characters.
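
To make the byte-versus-character gap concrete, here is a minimal sketch of
the two helpers a character-counting substr() would need.  The names
utf8_count and utf8_offset are hypothetical, not existing runtime functions,
and both assume the input is already valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: count code points in a 0-terminated string.
 * Assumes valid UTF-8, where every byte that is not a continuation
 * byte (10xxxxxx) starts a new code point. */
static size_t utf8_count(const char *s)
{
    size_t count = 0;

    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}

/* Hypothetical helper: byte offset of the code point with index idx.
 * Returns the string's byte length if idx is past the end. */
static size_t utf8_offset(const char *s, size_t idx)
{
    size_t off = 0;

    while (s[off]) {
        if (((unsigned char)s[off] & 0xC0) != 0x80) {
            if (idx == 0)
                break;          /* found the start of code point idx */
            idx--;
        }
        off++;
    }
    return off;
}
```

For the 6-byte string "h\xC3\xA9llo" ("héllo"), utf8_count reports 5
characters, and the character at index 2 starts at byte offset 3 rather than 2.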

It's not clear that we can solve this completely, but if we commit to the
worldview that all strings are UTF-8, then we could write and use our own
runtime strlcpy8, strlcat8, etc. functions which preserve character boundaries.
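
As a rough illustration of what such a function might look like, here is a
sketch of a boundary-preserving copy.  Only the name strlcpy8 comes from the
proposal above; the signature and behavior are assumptions modeled on strlcpy,
the code uses plain libc calls rather than the runtime's own, and it assumes
the source is valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Nonzero if b is a UTF-8 continuation byte (10xxxxxx). */
static int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Sketch of a hypothetical strlcpy8: copy src into dst (of capacity
 * size), always 0-terminating when size > 0.  If the copy must be
 * truncated, back up so dst never ends in a partial UTF-8 sequence.
 * Returns strlen(src), following the strlcpy convention.
 */
static size_t strlcpy8(char *dst, const char *src, size_t size)
{
    size_t srclen = strlen(src);
    size_t n;

    if (size == 0)
        return srclen;

    n = srclen < size ? srclen : size - 1;

    if (n < srclen) {
        /* src[n] is the first byte we drop.  If it is a continuation
         * byte, the cut fell inside a sequence, so back up to that
         * sequence's lead byte and drop the whole character. */
        while (n > 0 && utf8_is_continuation((unsigned char)src[n]))
            n--;
    }

    memcpy(dst, src, n);
    dst[n] = '\0';
    return srclen;
}
```

Copying "caf\xC3\xA9" ("café") into a 5-byte buffer yields "caf" instead of
"caf" followed by a stray 0xC3 lead byte.  The return value still lets the
caller detect truncation, as with strlcpy.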

Even then, this preserves only *code points*, whereas one may really have
composite characters built from a base character plus combining diacritical
marks and such.  I believe combining characters are in specific ranges (though
new Unicode versions can expand these), so really fancy runtime functions
might preserve these connections too.
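
A sketch of how a "fancy" function might do this, given a cut point that is
already on a code point boundary.  All names here are hypothetical, and the
range table is deliberately incomplete: it lists only five Unicode blocks
dedicated to combining marks, whereas a real implementation would need the
full, version-dependent set.  Valid UTF-8 input is assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode the code point starting at s (assumes valid UTF-8). */
static uint32_t utf8_decode(const unsigned char *s)
{
    if (s[0] < 0x80)
        return s[0];
    if ((s[0] & 0xE0) == 0xC0)
        return ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
    if ((s[0] & 0xF0) == 0xE0)
        return ((uint32_t)(s[0] & 0x0F) << 12) |
               ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
    return ((uint32_t)(s[0] & 0x07) << 18) |
           ((uint32_t)(s[1] & 0x3F) << 12) |
           ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
}

/* Partial check: only blocks dedicated to combining marks. */
static int is_combining_mark(uint32_t cp)
{
    return (cp >= 0x0300 && cp <= 0x036F) ||  /* Combining Diacritical Marks */
           (cp >= 0x1AB0 && cp <= 0x1AFF) ||  /* ... Extended */
           (cp >= 0x1DC0 && cp <= 0x1DFF) ||  /* ... Supplement */
           (cp >= 0x20D0 && cp <= 0x20FF) ||  /* ... for Symbols */
           (cp >= 0xFE20 && cp <= 0xFE2F);    /* Combining Half Marks */
}

/* Given a cut point n on a code point boundary in s, move it back so a
 * base character is never kept without the combining marks after it. */
static size_t cluster_safe_cut(const char *s, size_t n)
{
    const unsigned char *u = (const unsigned char *)s;

    /* While the first dropped code point is a combining mark, step
     * back one whole code point and check again. */
    while (n > 0 && u[n] != 0 && is_combining_mark(utf8_decode(u + n))) {
        n--;
        while (n > 0 && (u[n] & 0xC0) == 0x80)
            n--;
    }
    return n;
}
```

With "ae\xCC\x81" ("a", then "e" followed by U+0301 COMBINING ACUTE ACCENT), a
cut requested at byte 2 moves back to byte 1, so the truncated string is "a"
rather than "ae" with its accent silently dropped.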


