January 18, 2005
Always With The Fucking Ampersands

Christ. Why is it always the ampersands?

It started when I noticed that one of my posts looked a little funny in Planet Lisp.

Patrick Collison, 16, of Castletroy College beat off a record number of entries in the 2005 competition with his winning project, ?CROMA: a new dialect of LISP?.

Why the question marks? My HTML:

Patrick Collison, 16, of Castletroy College beat off a record number of entries in the 2005 competition with his winning project, &#8220;CROMA: a new dialect of LISP&#8221;.

That looks cool. So I figured it was a bad link in the chain of tools Zach uses to generate Planet Lisp. But then I checked my XML and noticed that it wasn't right.

Patrick Collison, 16, of Castletroy College beat off a record number of entries in the 2005 competition with his winning project, &#8220;CROMA: a new dialect of LISP&#8221;.

It should have looked like this:

Patrick Collison, 16, of Castletroy College beat off a record number of entries in the 2005 competition with his winning project, &amp;#8220;CROMA: a new dialect of LISP&amp;#8221;.

I tracked it down to what I think is a bug in Movable Type. From Util.pm:

## Encode any & not followed by something that looks like
## an entity, numeric or otherwise.
$html =~ s/&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)/&amp;/g;

Apparently this behavior was intentional. Intentionally buggy. Maybe they had some reason for not escaping ampersands (if I had to guess, they did it to appease buggy RSS readers), but it's wrong. Incorrect handling of ampersands in order to work around buggy handling of ampersands.
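
To make the behavior concrete, here is a rough Python transliteration of that pattern. It is a sketch, not Movable Type code, and the helper names `encode_unless_entity` and `encode_always` are made up for illustration. The first leaves anything that already looks like an entity alone; the second escapes every ampersand, which is what text embedded in XML needs.

```python
import re

# Python transliteration of the Util.pm pattern: match an ampersand that is
# NOT followed by something shaped like an entity (e.g. "#8220;" or "amp;").
NOT_AN_ENTITY = re.compile(r"&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)")


def encode_unless_entity(text):
    """Hypothetical helper: escape only ampersands that don't start an entity."""
    return NOT_AN_ENTITY.sub("&amp;", text)


def encode_always(text):
    """Hypothetical helper: escape every ampersand, entity or not."""
    return text.replace("&", "&amp;")


if __name__ == "__main__":
    html = "AT&T says &#8220;hi&#8221;"
    print(encode_unless_entity(html))  # the numeric entities pass through as-is
    print(encode_always(html))         # every ampersand gets escaped
```

With the lookahead version, the `&#8220;` in my HTML goes into the feed untouched, so whatever decodes the feed sees a character reference instead of the literal text I typed.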

I don't know Perl, but it wasn't hard to fix:

--- lib/MT/Util.pm.orig	Wed May 28 09:48:41 2003
+++ lib/MT/Util.pm	Wed Jan 19 01:54:08 2005
@@ -249,13 +249,7 @@
         if ($Have_Entities && !MT::ConfigMgr->instance->NoHTMLEntities) {
             $html = HTML::Entities::encode_entities($html);
         } else {
-            if ($can_double_encode) {
-                $html =~ s!&!&amp;!g;
-            } else {
-                ## Encode any & not followed by something that looks like
-                ## an entity, numeric or otherwise.
-                $html =~ s/&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)/&amp;/g;
-            }
+            $html =~ s!&!&amp;!g;
             $html =~ s!"!&quot;!g;
             $html =~ s!<!&lt;!g;
             $html =~ s!>!&gt;!g;
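
As a sanity check on the patched branch, here is a small Python sketch of the same four substitutions. The function name `encode_html` is hypothetical and this is an approximation of the idea, not the Perl itself. The point is that once every ampersand is escaped, decoding one level gives back exactly the text that went in.

```python
import html


def encode_html(text):
    """Sketch of the patched branch: escape & first, then the other characters."""
    # Ampersand must go first; otherwise the & in "&lt;" would be escaped again.
    text = text.replace("&", "&amp;")
    text = text.replace('"', "&quot;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


if __name__ == "__main__":
    source = 'project, &#8220;CROMA&#8221; <a href="x?a=1&b=2">'
    encoded = encode_html(source)
    print(encoded)
    # Decoding one level of escaping gives back exactly what went in.
    assert html.unescape(encoded) == source
```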

Then I thought, as I so often do, “Hey, I'm going to post this to lemonodor! I'll post a diff and it'll look hardcore technical and cool and make people think I actually write code sometimes.” My first thought was to IM the diff from the machine I was on at the moment to my PowerBook, which has my preferred weblog editing tool. But then I remembered that gaim doesn't handle encoding so very well itself, expecting users to be chatting with each other in HTML (besides this questionable design choice, gaim has specifically had problems with ampersands in URLs sent to it--and the gaim team's approach to handling these bug reports was to simply mark them as invalid the next time they released a new version without checking whether it was fixed).

i hate gaim, but i use it anyway
sent to ichat by gaim, notice the “>!>”, etc.

My next clever idea was to try to use the Pasta paste service. Unfortunately that didn't work, either.

That's when I went insane. HTML isn't that hard.

Today's bonus: a patch that fixes CLiki's handling of HTML:

Index: edit-handler.lisp
===================================================================
RCS file: /cvs/cliki/edit-handler.lisp,v
retrieving revision 1.25
diff -u -r1.25 edit-handler.lisp
--- edit-handler.lisp   1 Nov 2004 11:20:05 -0000       1.25
+++ edit-handler.lisp   22 Nov 2004 07:48:34 -0000
@@ -50,7 +50,10 @@
                           element-num)
                   (format stream "<TEXTAREA rows=~A cols=80 name=E~A>~%"
                           (min 15 (floor (* cr 1.5))) element-num)
-                  (write-sequence buf stream)
+                  (let ((str (make-instance 'escape-html-stream
+                                            :output-stream stream)))
+                    (write-sequence buf str)
+                    (force-output str))
                   (format stream "</TEXTAREA>~%")
                   (incf element-num) )))
             (output (c) (write-char c acc-stream)))

I guess you'll need the implementation of escape-html-stream, too.
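
The idea behind an escaping stream is simple enough to sketch. This is a Python approximation, not CLiki's actual escape-html-stream; the class name `EscapeHtmlStream` and its methods are assumptions for illustration. Everything written through the wrapper gets escaped before it reaches the real output stream, so page text containing `</TEXTAREA>` or a literal `&amp;` can't break the edit form.

```python
import io


class EscapeHtmlStream:
    """Hypothetical wrapper: escapes HTML-special characters on the way out."""

    _ESCAPES = {"&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;"}

    def __init__(self, output_stream):
        self.output_stream = output_stream

    def write(self, text):
        # Escape character by character so each input character is examined
        # exactly once and already-written escapes are never touched again.
        self.output_stream.write(
            "".join(self._ESCAPES.get(ch, ch) for ch in text)
        )

    def flush(self):
        self.output_stream.flush()


if __name__ == "__main__":
    page = io.StringIO()
    buf = "see </TEXTAREA> & &amp;"
    page.write("<TEXTAREA rows=15 cols=80 name=E0>\n")
    escaped = EscapeHtmlStream(page)
    escaped.write(buf)   # page text goes through the escaping wrapper
    escaped.flush()
    page.write("</TEXTAREA>\n")  # markup is written to the raw stream
    print(page.getvalue())
```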

This post is geeky, annoying and curmudgeonly. No apologies.

P.S. Guess what. When I first posted this article, everything looked great. Then I edited it to add the line above. You know what happened? Someone, either ecto or (more likely) Movable Type FUCKED UP ALL MY AMPERSANDS, forcing me to go back and re-en-fucking-code them again. And there's a lot of ampersands in this post. In going back to fix them, I probably missed some and others probably were fucked up by this software. If I'm really lucky, my patch introduced some new mode of ampersand fuckup.

Posted by jjwiseman at January 18, 2005 06:18 PM
Comments

LOL. This post was really fucking funny.

Posted by: geoff on January 18, 2005 06:39 PM

Dude, bitten by both interpreted smilies in IM and ugly blogging software. Curses!

This post is geeky, annoying and curmudgeonly. No apologies.

That reminds me, I miss reading USENET...

Posted by: Michael Hannemann on January 18, 2005 08:06 PM

It would be fun to go through the chain of things to fix. I knew someone whose life's goal was to carry a screwdriver wherever he went and break the system down gradually, and you'd probably have a nice rivalry going when you eventually meet him.

Posted by: Tayssir John Gabbour on January 18, 2005 09:24 PM

I feel your pain.

Posted by: Michael Hudson on January 19, 2005 05:51 AM

On the bright side, you provided another test case for NewsGator's RSS parser!

Posted by: Gordon Weakliem on January 19, 2005 08:43 AM

This stuff is laying waste to the best minds of our generation...

http://intertwingly.net/blog/2004/10/20/Attractive-Nuisance/

Posted by: Ben Hyde on January 19, 2005 08:51 AM