February 14, 2007

crossing the ohio near louisville

Earlier today, when I was searching for discussion of the Evolution Robotics deal, I found something weird at Technorati. The first page of search results for “ERSP” is filled with German pages that don't contain the term ERSP.

But the posts do contain “erspüren”, “erspäht”, “erspähe”, “erspüre”, “erspähen”, etc.

No, they're not doing substring matching. Yes, they seem to be tokenizing words very poorly for any language that isn't using straight ASCII.
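I can't see their code, so this is only a guess at the failure mode: a minimal Python sketch of a tokenizer that treats every non-ASCII character as a word separator, next to a Unicode-aware one. The function names and regexes are mine, not anything Technorati or Sphere actually runs.

```python
import re
import unicodedata

# Hypothetical ASCII-only tokenizer: any character outside A-Z, a-z, 0-9
# (including ü and ä) is treated as a word boundary.
ASCII_TOKEN = re.compile(r"[A-Za-z0-9]+")

# Unicode-aware alternative: for str patterns in Python 3, \w also matches
# non-ASCII letters such as ü and ä.
UNICODE_TOKEN = re.compile(r"\w+")


def ascii_tokens(text):
    """Split text the (assumed) broken way, on anything that is not ASCII alphanumeric."""
    return [t.lower() for t in ASCII_TOKEN.findall(text)]


def unicode_tokens(text):
    """Split text on Unicode word characters, after composing combining marks (NFC)."""
    text = unicodedata.normalize("NFC", text)
    return [t.lower() for t in UNICODE_TOKEN.findall(text)]


for word in ["erspüren", "erspäht", "erspähe"]:
    print(word, ascii_tokens(word), unicode_tokens(word))
```

With the ASCII-only version, each of those words yields a stray “ersp” token, which would explain why a search for “ERSP” matches them without any substring matching going on.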

Neverending shoddiness over there.

Later: Ha! Sphere tokenizes in the same wrong way.

Posted by jjwiseman at February 14, 2007 12:55 AM

Hahah. I laugh at their inability to deal with codes near the "alt-666" range.

Ü rÜlz!

Posted by: token on February 15, 2007 11:38 PM

I also love how the title above the photo here is splitting the ß.

Is that the font or my browser? I was able to paste it in from there back to the solid "ß". Oh, Unicode, I lurve you.

Posted by: É®ßπ on February 15, 2007 11:43 PM

It's Unicode and the German language: the correct capitalization of ß is SS.

Posted by: Engelke on February 16, 2007 05:09 AM

Yes, but the point is that different fonts "split" it or not, even when it is capitalized in both.

Posted by: split-ar on March 23, 2007 09:47 AM