Lemonodor: Unicode in OpenMCL

April 24, 2007

Unicode in OpenMCL

The openmcl-devel mailing list has recently had an educational discussion of different approaches to implementing Unicode in Common Lisp, mostly in the threads “The case for UTF-16” and “how many angels can dance on a unicode character?”

Posts like this are one reason that Gary Byers is my favorite Lisp implementer:

My knowledge of ancient Sumero/Akkadian is extremely limited, so I hope that people will forgive me if I lapse into cuneiform here ... I think that what I want to say here can best be summed up by the well-known expression "X, Y, and <CUNEIFORM SIGN A>", where by <CUNEIFORM SIGN A> I'm really referring to the Unicode character with code #x12000, aka #\U+12000.

If I was using a clay tablet and stylus, it'd probably be easier to write that message than it is with a computer (I'm not familiar with Sumero/Akkadian input methods ...). It's not hard to construct that string procedurally in OpenMCL:
? (concatenate 'string "X, Y, and " (string #\U+12000))
If I execute that code in "openmcl64 -K utf-8" running under Terminal.app, I see some mixture of ASCII text and cuneiform. (Well, I would have if I'd remembered to install a cuneiform font.)

If I wanted to exchange the first and last characters in that string, I might use something (stupid) like:
(defun exchange-first-and-last-characters (string)
  (let* ((len (length string)))
    (when (> len 1)
      (let* ((temp (char string (1- len))))
        (setf (char string (1- len)) (char string 0)
              (char string 0) temp)))
    string))
and if I printed the result, the cuneiform character would precede the others. (For those who are beginning to suspect as much: yes, this is a contrived example.)
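To try the exchange outside OpenMCL, here is a minimal sketch. The #\U+12000 reader syntax is specific to OpenMCL, so the sketch builds the character with CODE-CHAR instead, and it assumes an implementation whose CHAR-CODE-LIMIT is greater than #x12000. The function is repeated from above so that the block runs on its own; the variable names are illustrative only.

```lisp
;; Assumption: CHAR-CODE-LIMIT > #x12000, as in OpenMCL per the post.
(defun exchange-first-and-last-characters (string)
  (let* ((len (length string)))
    (when (> len 1)
      (let* ((temp (char string (1- len))))
        (setf (char string (1- len)) (char string 0)
              (char string 0) temp)))
    string))

;; CONCATENATE returns a fresh string, so it is safe to modify destructively.
(defparameter *message*
  (concatenate 'string "X, Y, and " (string (code-char #x12000))))

(exchange-first-and-last-characters *message*)

(char-code (char *message* 0))              ; #x12000: the cuneiform sign now comes first
(char *message* (1- (length *message*)))    ; #\X has moved to the end
```

The string still has the same length after the exchange, because each element of the string is a whole character.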

The cuneiform character #\U+12000 isn't any different from the other characters in that string; it's not a CL:STANDARD-CHAR, but it is a CL:BASE-CHAR in OpenMCL. We can treat it pretty much the same way that we'd treat other CHARACTERs: we can store it in strings, ask for its CHAR-CODE, and use CL character functions on it.
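A short sketch of what "pretty much the same way" looks like in practice. It assumes an implementation where (CODE-CHAR #x12000) returns a character. The BASE-CHAR result is implementation-dependent, so the sketch only notes what the post says about OpenMCL.

```lisp
;; Assumption: CHAR-CODE-LIMIT > #x12000.
(defparameter *cuneiform-a* (code-char #x12000))

(char-code *cuneiform-a*)          ; #x12000, the code we started from
(standard-char-p *cuneiform-a*)    ; false: not one of the 96 standard characters
(typep *cuneiform-a* 'base-char)   ; implementation-dependent; true in OpenMCL per the post

;; Store it in an ordinary string and read it back.
(defparameter *buffer* (make-string 3 :initial-element #\a))
(setf (char *buffer* 1) *cuneiform-a*)

(length *buffer*)                            ; 3
(char= (char *buffer* 1) *cuneiform-a*)      ; true
(char< #\a *cuneiform-a*)                    ; true: 97 < #x12000
```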

[Aside: it is true that some of those CL character functions might give meaningless results. (CL:CHAR< a b) is defined to be true exactly when (< (CHAR-CODE a) (CHAR-CODE b)) is true, and even if that's well-defined, just about any assignment of character codes to characters will give answers that are meaningless/useless for some characters in some locale(s). I don't know if the concept of alphabetic case applies to cuneiform, but it does apply to other characters that aren't STANDARD-CHARs. Even if CHAR-UPCASE and CHAR-DOWNCASE were extended to apply to all applicable Unicode characters, there are ... cases ... where STRING-UPCASE/STRING-DOWNCASE would need to change the number of characters in their arguments in order to comply with local conventions, and it seems that we need a set of character/string functions that are "useful" and not necessarily equivalent to CL character/string functions.

It's also the case that some characters are intended to be "composed with" adjacent characters (or are the results of such composition). One may need to be aware of this in certain contexts, but I don't think that it's any more meaningful to think of two adjacent combinable characters in a string as being "one character" than it is to think of a carriage return followed by a graphic character as being "one character", even though the effect of rendering those characters might be identical to the effect of rendering a single character.]
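Both points in the aside can be illustrated with a small sketch. FULL-STRING-UPCASE and its one-entry table are hypothetical helpers written for this example, not part of CL or OpenMCL; the only mapping listed is U+00DF (sharp s) to "SS", which is the conventional uppercase form. The second half shows that a base letter followed by a combining accent is two characters as far as LENGTH and STRING= are concerned.

```lisp
;; Hypothetical one-to-many uppercase table; only U+00DF -> "SS" is listed.
(defparameter *special-upcase*
  (list (cons (code-char #xDF) "SS")))

(defun full-string-upcase (string)
  "Return an uppercased copy of STRING that may be longer than STRING."
  (with-output-to-string (out)
    (loop for ch across string
          for special = (cdr (assoc ch *special-upcase*))
          do (if special
                 (write-string special out)
                 (write-char (char-upcase ch) out)))))

(defparameter *word*
  (concatenate 'string "stra" (string (code-char #xDF)) "e"))

(length *word*)                     ; 6
(length (string-upcase *word*))     ; 6: CL maps character by character
(full-string-upcase *word*)         ; "STRASSE", 7 characters

;; "e" followed by U+0301 COMBINING ACUTE ACCENT: two characters.
(defparameter *decomposed*
  (coerce (list #\e (code-char #x301)) 'string))

;; U+00E9 LATIN SMALL LETTER E WITH ACUTE: one character.
(defparameter *precomposed* (string (code-char #xE9)))

(length *decomposed*)                   ; 2
(length *precomposed*)                  ; 1
(string= *decomposed* *precomposed*)    ; false, although they may render identically
```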

Back to Sumero/Akkadian: the fact that we can treat this character (#\U+12000) as a first-class object is related to the decision to make CHAR-CODE-LIMIT #x110000 and to use UTF-32/UCS-4 "encoding" internally (e.g., 32-bit strings.) [As another aside, it's not completely out of the question to use a 24-bit string representation, lessening the memory impact a bit.]
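The arithmetic behind the 24-bit aside can be sketched as follows. STRING-PAYLOAD-BYTES is a hypothetical helper for this example; it counts only the character data and ignores any per-object header an implementation adds.

```lisp
;; Bits needed to hold the largest code below a CHAR-CODE-LIMIT of #x110000.
(integer-length #x10FFFF)            ; 21, so 24 bits per character would suffice

(defun string-payload-bytes (n-chars bits-per-char)
  "Bytes of character data for N-CHARS characters, ignoring any object header."
  (* n-chars (/ bits-per-char 8)))

(string-payload-bytes 1000000 32)    ; 4000000
(string-payload-bytes 1000000 24)    ; 3000000
(string-payload-bytes 1000000 16)    ; 2000000
```

On these assumptions a 24-bit representation would save a quarter of the character data relative to 32 bits, and a 16-bit representation would save half.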

If OpenMCL were to use UCS-2 internally (recall that UCS-2 is basically a subset of UTF-16 that doesn't use surrogate pairs), we would have no way of communicating in cuneiform (or at least no way of being understood by other programs if we did so.) We would still have a well-defined notion of what a CHARACTER was, and we could still access and modify STRINGs in constant/unit time. It'd be slightly easier to copy UCS-2 strings to external UTF-16-encoded memory than it is to copy 32-bit strings, but I really, really think that this is fairly far down the list of considerations that should affect the decision of how characters and strings are represented internally. This scheme would take about half the memory for strings as the current scheme does, and I do think that that's an important consideration.

Suppose we were to instead say that - formally or not - these 16-bit strings were really UTF-16-encoded; we could allow the use of surrogate pairs inside 16-bit strings. If we did this "informally", functions like SCHAR would either return true CHARACTER objects or the high or low half of a surrogate pair. Since we aren't inventing a new language, the values returned by CHAR and SCHAR would have to be CHARACTERs, even though they aren't "real": we can't ask ICU or anything else what the uppercase version of such a pseudo-character is in some locale. (This situation is pretty much exactly like what you get with CFString/NSString's characterAtIndex: operations: sometimes, those functions return characters and sometimes they return halves of surrogate pairs.) Does anyone really look at something like this and not see a mess (where the best way to avoid the mess is to only use the intersection of UCS-2 and Unicode?) To be fair, a lot of whatever mess one sees there is probably there for backward compatibility; I'm sure that NeXTSTEP supported Unicode for a long time in the days when all Unicode characters fit in 16 bits, and changing things will break existing code. In the modern (Sumero/Akkadian) world, there are different constraints and issues.
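To make the surrogate-pair problem concrete, here is a sketch of the standard UTF-16 encoding rule. UTF-16-CODE-UNITS is a hypothetical helper for this example, not an OpenMCL function, and it does not check that its argument is a valid code point.

```lisp
(defun utf-16-code-units (code)
  "Return a list of the 16-bit code units that encode the code point CODE."
  (if (< code #x10000)
      (list code)
      (let ((offset (- code #x10000)))
        (list (+ #xD800 (ash offset -10))             ; high (leading) surrogate
              (+ #xDC00 (logand offset #x3FF))))))    ; low (trailing) surrogate

(utf-16-code-units (char-code #\X))   ; (88)
(utf-16-code-units #x12000)           ; (#xD808 #xDC00)

;; An "informal" 16-bit string holding "X" followed by the cuneiform sign:
(defparameter *units*
  (coerce (append (utf-16-code-units (char-code #\X))
                  (utf-16-code-units #x12000))
          'vector))

(length *units*)    ; 3 elements for 2 characters
(aref *units* 1)    ; #xD808: half of a surrogate pair, not a character
```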

A "formal" use of UTF-16 might recognize that a string is composed of characters (not just 16-bit code elements). Naively, this would mean that things like SCHAR and AREF and LENGTH might need to scan the string from the beginning, treating each non-surrogate-pair element and each surrogate pair as a single (logical) character. It's not hard to think of schemes that cache information about a UTF-16 encoded string that would make these access operations reasonably fast (e.g., cache the logical length in characters as well as the physical length in elements, keep track of whether there are in fact any surrogate pairs in the string, cache the location of some element and character positions so that scans don't have to start at the beginning of the string, probably other things.) I'd assume that programming environments that use UTF-16 internally and provide "sane" access to characters in strings do something like this some of the time.
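The naive scanning approach described above might look like the following sketch. It works on a vector of 16-bit code units that is assumed to be well formed, and it does none of the caching the paragraph mentions. All function names are hypothetical.

```lisp
(defun high-surrogate-p (unit)
  (<= #xD800 unit #xDBFF))

(defun utf-16-logical-length (units)
  "Count code points in UNITS, a well-formed vector of 16-bit code units."
  (let ((count 0)
        (i 0)
        (n (length units)))
    (loop while (< i n)
          do (incf i (if (high-surrogate-p (aref units i)) 2 1))
             (incf count))
    count))

(defun utf-16-code-point-at (units index)
  "Return the code point at logical INDEX, scanning from the start: O(n)."
  (let ((i 0))
    ;; Skip INDEX logical characters, stepping over surrogate pairs as one.
    (loop repeat index
          do (incf i (if (high-surrogate-p (aref units i)) 2 1)))
    (let ((unit (aref units i)))
      (if (high-surrogate-p unit)
          (+ #x10000
             (ash (- unit #xD800) 10)
             (- (aref units (1+ i)) #xDC00))
          unit))))

;; "X", U+12000, "Y" as UTF-16 code units.
(defparameter *encoded* (vector 88 #xD808 #xDC00 89))

(length *encoded*)                    ; 4 physical elements
(utf-16-logical-length *encoded*)     ; 3 logical characters
(utf-16-code-point-at *encoded* 1)    ; #x12000
(utf-16-code-point-at *encoded* 2)    ; 89, the code of #\Y
```

Every access walks the vector from the beginning, which is the cost that the caching schemes are meant to hide.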

One way in which CL differs from many other programming environments is the fact that CL strings are mutable ((SETF CHAR), destructive sequence operations, lots of things in the reader and INTERN and elsewhere expect to be able to perform cheap destructive operations on strings.) A destructive operation on a string - changing an ASCII character to a cuneiform character, for instance - might change the number of code elements needed to represent the string's characters in a variable-length encoding like UTF-16. A "simple" (SETF SCHAR) - or something like EXCHANGE-FIRST-AND-LAST-CHARACTERS above - could involve significant memory allocation and copying and rebuilding of some or all of the cached information that makes access viable, and I don't know how to explain how undesirable this is to anyone who says that they want this but to say "no, you don't." (I tried to explain this in the discussion last year; a few days after I did so, someone proposed using UTF-8 internally.) I Sometimes Feel Like I'm Just Not Getting Through To These Kids.
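A sketch of why a "simple" store is not simple under a variable-length encoding. UTF-16-REPLACE-FIRST is a hypothetical function for this example that replaces only the first character of a well-formed, non-empty vector of code units. When the old and new characters need the same number of code units it overwrites in place; otherwise it has to allocate a new vector and copy the rest.

```lisp
(defun utf-16-replace-first (units code)
  "Return a vector like UNITS with its first code point replaced by CODE.
Assumes UNITS is well formed and non-empty."
  (let* ((old-width (if (<= #xD800 (aref units 0) #xDBFF) 2 1))
         (new (if (< code #x10000)
                  (vector code)
                  (let ((offset (- code #x10000)))
                    (vector (+ #xD800 (ash offset -10))
                            (+ #xDC00 (logand offset #x3FF)))))))
    (if (= old-width (length new))
        ;; Same width: overwrite in place and return the same object.
        (progn (replace units new) units)
        ;; Width changed: allocate a new vector and copy the tail.
        (concatenate 'vector new (subseq units old-width)))))

(defparameter *xy* (vector 88 89))      ; "XY" as code units

(eq *xy* (utf-16-replace-first *xy* 90))    ; true: #\X -> #\Z fits in place
(utf-16-replace-first *xy* #x12000)         ; a new 3-element vector: #xD808 #xDC00 89
```

With 32-bit strings the equivalent operation is always a single element store.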

This all leads me to the conclusion that the only really viable options for internal string representation are (a) the current scheme, in which all Unicode 5.0 characters are representable, string operations are cheap and sane, but there's significant memory overhead that could be reduced somewhat by using a 24-bit string type and (b) a 16-bit scheme that would allow the direct representation of "most characters used in modern languages" - equivalent to Unicode 3.x - but which would not allow the representation of other characters without creating a lot of confusion and inconsistency. The latter scheme would not allow us to use cuneiform (unless we were willing to accept confusion and inconsistency that I don't think we want to accept.)

You might be tempted to say "well, that's fine. Personally, I only use cuneiform in contrived examples; it'd be fine to stick to characters in the Basic Multilingual Plane, and that would offer a significant space saving relative to the current scheme and incidentally make UTF-16 encoding and decoding simpler." I'd agree with that (though I think that the encoding/decoding issue is less significant than other people may believe), and I confess that it's been a long time since I've even thought about printing cuneiform characters in OpenMCL. (Seems like forever, in fact.)

Let's agree that the percentage of possible users interested in doing cuneiform I/O in OpenMCL ("best thing since a clay tablet!") is small. Other characters that can't be represented in a 16-bit encoding include around 40,000 "mostly historical, but some modern" Chinese ideographs, musical symbols, characters from other historical or obscure languages ... I don't know exactly what the percentage of possible users interested in using some subset of those relatively new (to Unicode) characters is, but I suspect that it's large enough that I don't feel comfortable dismissing potential needs of such users as irrelevant.

Posted by jjwiseman at April 24, 2007 04:45 PM

Comments

How does using UTF-8 differ from using UTF-16? Both need complex handling of 'higher' characters and take up less memory than UCS-4.
UTF-8 tends to be smaller than UTF-16 for a lot of uses in the western world, and could scale when Unicode grows yet again.

Posted by: Bart on December 23, 2007 02:17 AM