Lemonodor: Validation

September 21, 2002

Validation

&&&&&&&&&&&&

You know what is particularly dumb? That many (and some days it feels like possibly most) url strings are not valid html. What the hell was the W3C thinking?

At least 75% of the html errors that slip by me when I write a post but get caught by the validator are due to pasted urls containing '&', and it's the W3C's fault.
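To make the complaint concrete, here is a minimal sketch of what has to happen to a pasted URL before it is valid inside an HTML attribute. The URL and the variable names are made up for illustration; `html.escape` is the Python standard library function.

```python
from html import escape

# Hypothetical URL of the kind that gets pasted into a post.
url = "http://example.com/search?q=lisp&lang=en"

# Inside an href attribute the bare '&' must be written as '&amp;'.
href = escape(url, quote=True)
link = f'<a href="{href}">search</a>'

print(link)
```

The query string itself is untouched; only its spelling inside the markup changes, and that is the step that is easy to forget when pasting by hand.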

Let's go for a chicken ride
Photo by Harry Whittier Frees

Posted by jjwiseman at September 21, 2002 08:52 AM

Comments

Exactly!

Being forced to substitute &amp; for & is the *most* retarded thing about XML (technically this isn't valid HTML either).

The rationale for this is that & is for entities. Talk about lazy parsing. First of all, entities are stupid in and of themselves. There are better solutions. Second, there is a required syntax for entities. They have to be ampersand + [some number of characters] + semicolon. So a plain ampersand should not break this rule. An ampersand next to some characters not followed by a semicolon should not break this rule. An ampersand which is next to some characters and is followed by a semicolon but which refers to an undefined entity should not meet this rule.
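As a rough sketch of that rule (not how any real HTML or XML parser behaves), the following treats an ampersand as the start of an entity only when it is followed by a name or numeric reference, then a semicolon, and the name is a known one. The function name `escape_bare_ampersands` and the regular expression are hypothetical; the table of known names comes from Python's standard `html.entities` module.

```python
import re
from html.entities import name2codepoint

# '&' + name or numeric reference + ';' (the "required syntax" for an entity).
ENTITY = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

def escape_bare_ampersands(text):
    """Escape every '&' that does not start a well-formed, defined entity."""
    out = []
    pos = 0
    while True:
        i = text.find("&", pos)
        if i == -1:
            out.append(text[pos:])
            break
        out.append(text[pos:i])
        m = ENTITY.match(text, i)
        if m and (m.group(1).startswith("#") or m.group(1) in name2codepoint):
            # Well-formed and defined: leave the entity alone.
            out.append(m.group(0))
            pos = m.end()
        else:
            # Plain '&', missing semicolon, or undefined name: escape it.
            out.append("&amp;")
            pos = i + 1
    return "".join(out)

print(escape_bare_ampersands("http://example.com/?a=1&b=2"))
print(escape_bare_ampersands("AT&T &amp; co, &foo; and x & y"))
```

This is still a heuristic: a URL parameter that happens to look like a defined entity followed by a semicolon would be left unescaped and read as that entity.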

Having to escape > and < is similarly stupid, though somewhat harder to work around.

Posted by: Luke Francl on September 21, 2002 01:27 PM

A troll but still a troll.

The problem in this case is not with the standard but with the implementation. Why isn't your implementation (the software you are using) able to escape these characters when you copy and paste?

When you write a Perl program, you have reserved characters too, like # for example.

Posted by: karl on April 2, 2003 04:18 AM

The problem isn't that there *are* characters with special meanings ("reserved", as you put it), it's the W3C's *choice* of characters. They decided that ampersands would have a special meaning and would often be used in the mini-language used to write URLs (which is used primarily embedded in HTML), and they also decided that ampersands would have a different special meaning and be commonly used in HTML. They designed the conflict right in from the start.

It's a little like choosing the space character as an escape character in string literals for a programming language--it's a bad choice that needlessly complicates things when trying to embed the commonly used category of mini-languages known as "most human natural languages" into those string literals.

"Know\ what\ I\ mean?"

Your solution of having (heuristic) workarounds in every piece of software that uses HTML and URLs is... unsatisfying. I know you just clicked on a link and found this page at random, but just try to think a little bit before going to the trouble of commenting.

Posted by: John Wiseman on April 3, 2003 10:10 AM