Lemonodor: April 2007 Archives

April 28, 2007

Apex Electronics and Norton Sales: Adventures in Space Surplus

Today Maciej and I visited two surplus stores, Norton Sales and Apex Electronics, looking for relics of space travel.

We found an Apollo rocket motor, autopilots, a bee nest inside a vacuum chamber, huge gauges, a glove box, jet canopies, a satellite, napalm bombs, and JoAnne.

Update: Maciej's photos are online.

Posted by jjwiseman at 11:06 PM | Comments (0) | TrackBack

April 26, 2007

Spam Attack

“Can robots use chopsticks?”

Comments are only sometimes enabled right now due to a spam attack. Sorry, I'm working on it.

Posted by jjwiseman at 05:28 PM | Comments (0) | TrackBack

April 25, 2007

Arc: Now Conses Less

Arc improvements for a faster news.ycombinator.com [via gavin and LtU]:

We just launched a new version that is 2-3x faster, thanks to Robert Morris, who rewrote some of the innards of Arc not to cons so much.

PG says “It ‘compiles’ into mzscheme. But the compilation is more like macroexpansion.”

I used to think people working on new Lisp-like languages were misguided, but after internalizing the idea that Common Lisp is static, and therefore dead, I'm a lot more sympathetic.

Posted by jjwiseman at 01:53 PM | Comments (1) | TrackBack

April 24, 2007

Unicode in OpenMCL

The openmcl-devel mailing list has recently had an educational discussion of different approaches to implementing Unicode in Common Lisp, mostly in the threads “The case for UTF-16” and “how many angels can dance on a unicode character?”

Posts like this are one reason that Gary Byers is my favorite Lisp implementer:

My knowledge of ancient Sumero/Akkadian is extremely limited, so I hope that people will forgive me if I lapse into cuneiform here ... I think that what I want to say here can best be summed up by the well-known expression "X, Y, and <CUNEIFORM SIGN A>", where by <CUNEIFORM SIGN A> I'm really referring to the Unicode character with code #x12000, aka #\U+12000.

If I was using a clay tablet and stylus, it'd probably be easier to write that message than it is with a computer (I'm not familiar with Sumero/Akkadian input methods ...). It's not hard to construct that string procedurally in OpenMCL:
? (concatenate 'string "X, Y, and " (string #\U+12000))
If I execute that code in "openmcl64 -K utf-8" running under Terminal.app, I see some mixture of ASCII text and cuneiform. (Well, I would have if I'd remembered to install a cuneiform font.)

If I wanted to exchange the first and last characters in that string, I might use something (stupid) like:
;; Destructively swap the first and last characters of STRING; return STRING.
(defun exchange-first-and-last-characters (string)
  (let* ((len (length string)))
    (when (> len 1)
      (let* ((temp (char string (1- len))))
        (setf (char string (1- len)) (char string 0)
              (char string 0) temp)))
    string))
and if I printed the result, the cuneiform character would precede the others. (For those who are beginning to suspect as much: yes, this is a contrived example.)

The cuneiform character #\U+12000 isn't any different from the other characters in that string; it's not a CL:STANDARD-CHAR, but it is a CL:BASE-CHAR in OpenMCL. We can treat it pretty much the same way that we'd treat other CHARACTERs: we can store it in strings, ask for its CHAR-CODE, and use CL character functions on it.
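A minimal sketch of the point above, not part of Gary's message: it uses only standard CL functions and assumes an implementation whose CHAR-CODE-LIMIT is greater than #x12000 (as in OpenMCL's 32-bit scheme). The names *SIGN-A* and DESCRIBE-CHAR are hypothetical, chosen for illustration. Whether the character is a BASE-CHAR is implementation-dependent, so the sketch doesn't check that.

```lisp
;; Assumption: (code-char #x12000) returns a character, which requires
;; CHAR-CODE-LIMIT > #x12000. CODE-CHAR is used instead of the #\U+12000
;; reader syntax so the file itself stays plain ASCII.
(defparameter *sign-a* (code-char #x12000))

(defun describe-char (ch)
  "Return a plist of facts about CH using only standard CL functions."
  (list :code (char-code ch)
        :standard-p (typep ch 'standard-char)
        ;; Store CH in a string and read it back.
        :round-trips-p (char= ch (char (string ch) 0))))
```

Calling (describe-char *sign-a*) treats the cuneiform sign exactly as it would treat #\X: it gets a code, it can be stored in a string, and it compares with CHAR=.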

[Aside: it is true that some of those CL character functions might give meaningless results. (CL:CHAR< a b) is defined to be true exactly when (< (CHAR-CODE a) (CHAR-CODE b)) is true, and even if that's well-defined, just about any assignment of character codes to characters will give answers that are meaningless/useless for some characters in some locale(s). I don't know if the concept of alphabetic case applies to cuneiform, but it does apply to other characters that aren't STANDARD-CHARs. Even if CHAR-UPCASE and CHAR-DOWNCASE were extended to apply to all applicable Unicode characters, there are ... cases ... where STRING-UPCASE/STRING-DOWNCASE would need to change the number of characters in their arguments in order to comply with local conventions, and it seems that we need a set of character/string functions that are "useful" and not necessarily equivalent to CL character/string functions.

It's also the case that some characters are intended to be "composed with" adjacent characters (or are the results of such composition). One may need to be aware of this in certain contexts, but I don't think that it's any more meaningful to think of two adjacent combinable characters in a string as being "one character" than it is to think of a carriage return followed by a graphic character as being "one character", even though the effect of rendering those characters might be identical to the effect of rendering a single character.]
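To make the aside concrete, here is a small sketch (again not from the original message) of two properties of the standard functions: CHAR< ordering by character code, and STRING-UPCASE never changing a string's length. The function names and *STRASSE* are hypothetical. The example string contains U+00DF (sharp s), which German convention uppercases to the two letters "SS"; it assumes CHAR-CODE-LIMIT is greater than #xDF.

```lisp
(defun char<-matches-codes-p (a b)
  "True when CHAR< gives the same answer as comparing the CHAR-CODEs of A and B."
  (eq (not (char< a b))
      (not (< (char-code a) (char-code b)))))

(defun upcase-preserves-length-p (string)
  "True when STRING-UPCASE returns a string as long as STRING."
  (= (length (string-upcase string)) (length string)))

;; "stra" + U+00DF + "e", built with CODE-CHAR to keep the source ASCII.
(defparameter *strasse*
  (concatenate 'string "stra" (string (code-char #xDF)) "e"))
```

Because the CL function works character by character, (upcase-preserves-length-p *strasse*) holds, which is exactly why it cannot produce the seven-letter result that local convention expects.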

Back to Sumero/Akkadian: the fact that we can treat this character (#\U+12000) as a first-class object is related to the decision to make CHAR-CODE-LIMIT #x110000 and to use UTF-32/UCS-4 "encoding" internally (e.g., 32-bit strings.) [As another aside, it's not completely out of the question to use a 24-bit string representation, lessening the memory impact a bit.]
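A rough sketch of the arithmetic behind that trade-off, not taken from the message: a CHAR-CODE-LIMIT of #x110000 means every code fits in 21 bits, so 24 bits per character would suffice. STRING-DATA-BYTES is a hypothetical helper that counts character data only and ignores any object header or alignment an implementation adds.

```lisp
(defun bits-needed-for-limit (limit)
  "Bits needed to hold every code below LIMIT."
  (integer-length (1- limit)))

(defun string-data-bytes (n-chars bits-per-char)
  "Bytes of character data for N-CHARS fixed-width characters.
Headers and alignment are ignored (an assumption of this sketch)."
  (* n-chars (/ bits-per-char 8)))
```

For a 1000-character string this gives 4000 bytes at 32 bits, 3000 at 24 bits, and 2000 at 16 bits.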

If OpenMCL were to use UCS-2 internally (recall that UCS-2 is basically a subset of UTF-16 that doesn't use surrogate pairs), we would have no way of communicating in cuneiform (or at least no way of being understood by other programs if we did so.) We would still have a well-defined notion of what a CHARACTER was, and we could still access and modify STRINGs in constant/unit time. It'd be slightly easier to copy UCS-2 strings to external UTF-16-encoded memory than it is to copy 32-bit strings, but I really, really think that this is fairly far down the list of considerations that should affect the decision of how characters and strings are represented internally. This scheme would take about half the memory for strings as the current scheme does, and I do think that that's an important consideration.

Suppose we were to instead say that - formally or not - these 16-bit strings were really UTF-16-encoded; we could allow the use of surrogate pairs inside 16-bit strings. If we did this "informally", functions like SCHAR would either return true CHARACTER objects or the high or low half of a surrogate pair. Since we aren't inventing a new language, the values returned by CHAR and SCHAR would have to be CHARACTERs, even though they aren't "real": we can't ask ICU or anything else what the uppercase version of such a pseudo-character is in some locale. (This situation is pretty much exactly like what you get with CFString/NSString's characterAtIndex: operations: sometimes, those functions return characters and sometimes they return halves of surrogate pairs.) Does anyone really look at something like this and not see a mess (where the best way to avoid the mess is to only use the intersection of UCS-2 and Unicode?) To be fair, a lot of whatever mess one sees there is probably there for backward compatibility; I'm sure that NextStep supported Unicode for a long time in the days when all Unicode characters fit in 16 bits, and changing things will break existing code. In the modern (Sumero/Akkadian) world, there are different constraints and issues.
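For readers who haven't met surrogate pairs, here is a minimal sketch of the standard UTF-16 encoding rule, written for illustration and not part of the message. UTF16-CODE-UNITS is a hypothetical name; the function assumes its argument is a valid code point outside the surrogate range #xD800-#xDFFF and does no validation.

```lisp
(defun utf16-code-units (code)
  "Return the list of 16-bit code units that encode the code point CODE.
Assumes CODE is a valid, non-surrogate code point below #x110000."
  (if (< code #x10000)
      ;; Basic Multilingual Plane: one code unit, the code itself.
      (list code)
      ;; Supplementary planes: split the 20-bit offset into two halves.
      (let ((offset (- code #x10000)))
        (list (+ #xD800 (ash offset -10))           ; high surrogate
              (+ #xDC00 (logand offset #x3FF))))))  ; low surrogate
```

The cuneiform sign at #x12000 comes out as the two units #xD808 and #xDC00. An "informal" SCHAR on a 16-bit string would hand back one of those halves, neither of which is a character in its own right.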

A "formal" use of UTF-16 might recognize that a string is composed of characters (not just 16-bit code elements). Naively, this would mean that things like SCHAR and AREF and LENGTH might need to scan the string from the beginning, treating each non-surrogate-pair element and each surrogate pair as a single (logical) character. It's not hard to think of schemes that cache information about a UTF-16 encoded string that would make these access operations reasonably fast (e.g., cache the logical length in characters as well as the physical length in elements, keep track of whether there are in fact any surrogate pairs in the string, cache the location of some element and character positions so that scans don't have to start at the beginning of the string, probably other things.) I'd assume that programming environments that use UTF-16 internally and provide "sane" access to characters in strings do something like this some of the time.
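The naive scanning approach described above might look something like the following sketch. It is an illustration only, with hypothetical function names: it assumes UNITS is a vector of well-formed UTF-16 code units, does no bounds or validity checking, and includes none of the caching that would make it reasonably fast.

```lisp
(defun high-surrogate-p (unit)
  "True if UNIT is the first half of a surrogate pair."
  (<= #xD800 unit #xDBFF))

(defun utf16-logical-length (units)
  "Count logical characters in UNITS by scanning from the beginning."
  (let ((i 0) (count 0) (n (length units)))
    (loop while (< i n)
          do (incf i (if (high-surrogate-p (aref units i)) 2 1))
             (incf count))
    count))

(defun utf16-char-code-at (units index)
  "Code point of logical character INDEX, found by a scan from element 0."
  (let ((i 0))
    (dotimes (k index)
      (declare (ignorable k))
      ;; Skip one logical character: two units for a pair, else one.
      (incf i (if (high-surrogate-p (aref units i)) 2 1)))
    (let ((unit (aref units i)))
      (if (high-surrogate-p unit)
          (+ #x10000
             (ash (- unit #xD800) 10)
             (- (aref units (1+ i)) #xDC00))
          unit))))
```

Both functions cost time proportional to the position being accessed, which is the price of treating a surrogate pair as one character without any cached index.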

One way in which CL differs from many other programming environments is the fact that CL strings are mutable ((SETF CHAR), destructive sequence operations, lots of things in the reader and INTERN and elsewhere expect to be able to perform cheap destructive operations on strings.) A destructive operation on a string - changing an ASCII character to a cuneiform character, for instance - might change the number of code elements needed to represent the string's characters in a variable-length encoding like UTF-16. A "simple" (SETF SCHAR) - or something like EXCHANGE-FIRST-AND-LAST-CHARACTERS above - could involve significant memory allocation and copying and rebuilding of some or all of the cached information that makes access viable, and I don't know how to explain how undesirable this is to anyone who says that they want this but to say "no, you don't." (I tried to explain this in the discussion last year; a few days after I did so, someone proposed using UTF-8 internally.) I Sometimes Feel Like I'm Just Not Getting Through To These Kids.
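A small sketch of why a destructive update is awkward under a variable-length encoding; it is not from the message and the names are hypothetical. Strings are modeled as sequences of code points, assumed valid and outside the surrogate range, and the functions only count how many 16-bit code units a UTF-16 representation would need before and after one element is replaced.

```lisp
(defun utf16-units-needed (codes)
  "Total 16-bit code units needed to encode the code points in CODES."
  (reduce #'+ codes
          :key (lambda (code) (if (< code #x10000) 1 2))))

(defun growth-from-set (codes index new-code)
  "Extra code units required if element INDEX of CODES becomes NEW-CODE.
CODES itself is left unmodified."
  (let ((after (copy-seq codes)))
    (setf (elt after index) new-code)
    (- (utf16-units-needed after) (utf16-units-needed codes))))
```

Replacing "X" with the cuneiform sign in a two-character string gives a growth of 1, so the backing storage would have to be reallocated or shifted; with a fixed-width 32-bit representation the same (SETF CHAR) is a single store.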

This all leads me to the conclusion that the only really viable options for internal string representation are (a) the current scheme, in which all Unicode 5.0 characters are representable, string operations are cheap and sane, but there's significant memory overhead that could be reduced somewhat by using a 24-bit string type and (b) a 16-bit scheme that would allow the direct representation of "most characters used in modern languages" - equivalent to Unicode 3.x - but which would not allow the representation of other characters without creating a lot of confusion and inconsistency. The latter scheme would not allow us to use cuneiform (unless we were willing to accept confusion and inconsistency that I don't think we want to accept.)

You might be tempted to say "well, that's fine. Personally, I only use cuneiform in contrived examples; it'd be fine to stick to characters in the Basic Multilingual Plane, and that would offer a significant space saving relative to the current scheme and incidentally make UTF-16 encoding and decoding simpler." I'd agree with that (though I think that the encoding/decoding issue is less significant than other people may believe), and I confess that it's been a long time since I've even thought about printing cuneiform characters in OpenMCL. (Seems like forever, in fact.)

Let's agree that the percentage of possible users interested in doing cuneiform I/O in OpenMCL ("best thing since a clay tablet!") is small. Other characters that can't be represented in a 16-bit encoding include around 40,000 "mostly historical, but some modern" Chinese ideographs, musical symbols, characters from other historical or obscure languages ... I don't know exactly what the percentage of possible users interested in using some subset of those relatively new (to Unicode) characters is, but I suspect that it's large enough that I don't feel comfortable dismissing potential needs of such users as irrelevant.

Posted by jjwiseman at 04:45 PM | Comments (1) | TrackBack

April 23, 2007

Lisp and Python Syntax

Just typing the title above is making me preemptively cringe at the idiots it's going to attract.

Amit on Lisp and Python syntax:

Lisp seems to be optimized for writing code; Python seems to be optimized for reading it. Which you prefer may depend on how often you write new code vs. read unfamiliar code; I'm not entirely sure. What bothers me the most though is not that these two languages do different things, but that the people who argue about it seem to think that there is one “best” answer, and don't see that this is a tradeoff. When I'm writing code I prefer Lisp; when I'm reading code I prefer Python. I think this is an inherent tradeoff—any added flexibility for the writer means an added burden for the reader, and there is no answer that will be right for everyone.

I think even after 13 years of doing Lisp and 3 or 4 years of Python, I agree: I prefer writing Lisp, but Python is easier to read. Which is not necessarily a bad thing (I completely agree with Amit on how annoying it is when people see these things as absolutes instead of tradeoffs).

Posted by jjwiseman at 10:29 AM | Comments (4) | TrackBack

April 21, 2007

Esotouric

Project Moonbase is Heinlein's other 50s sci fi movie.

My friends Richard (CRACL organizer, among other things) and Kim (1947 Projector and popular/unpopular music writer) have started a strange bus tour company, Esotouric.

Our routes veer off into fascinating, neglected neighborhoods. Our expert guides are passionate, brainy and hilarious. Our tour themes are provocative and complex, but never dry, mixing crime and social history, rock and roll and architecture, literature and film, fine art and urban studies into a simmering stew of original research and startling observations.

Even our snack stops are unique: a Chinese dumpling picnic in a garden of concrete sea monsters, homemade mint lemonade and cookies at the site of the first UFO sighting in the Southland, or Black Dahlia and Black Death flavored gelato at Scoops in East Hollywood.

When you climb aboard for an Esotouric bus adventure, you're guaranteed an intelligent, unpredictable ride into the secret heart of the city we love. These tours are recommended for natives, tourists and anyone who likes to dig a little deeper and discover the world beyond the everyday. Come ride and see for yourself.

Posted by jjwiseman at 05:17 PM | Comments (0) | TrackBack

April 05, 2007

Lisp on Metafilter

Metafilter had a Lisp thread yesterday.

Posted by jjwiseman at 01:40 AM | Comments (1) | TrackBack

April 04, 2007

The ILC Killed My Baby

Every studio/office should have (i) a globe, (ii) a drafting table, and (iii) a metal model of an airplane on a stand.

Lemonodor has been down since Sunday night, and we're glad to be back.

Posted by jjwiseman at 04:26 PM | Comments (0) | TrackBack