March 16, 2006
Catherine Chalmers, Electrocution
Lucene is a very popular open source search engine library written in Java. Ferret is a Ruby port of Lucene. Montezuma is my in-progress Lisp port of Ferret.
The idea is that if you have a Lisp application in which you'd like to be able to do text-based search, Montezuma will handle it. One search engine flexible enough to be used for project directories, Blog engines, wikis, IRC bots and paste services. Custom match scoring, phrase queries, wildcard queries, word proximity queries, date range searching and fielded (e.g., title, author, date, contents, channel) searching should be significantly more powerful, and easier for users than throwing a regex input field up on a page. If you want, you can write your own tokenizers and document classes.
I don't even remember what burning desire for Lisp-based search functionality originally motivated me to start writing Montezuma, but let the post hoc explanation include the following:
- Learning Ruby. It's been interesting comparing the original Ruby version with my Lisp translation.
- Learning about the inner workings of search engines. Though I've been surprised at how far I've been able to get without a complete understanding of Lucene-style index management.
- Bringing Lucene home to Lisp.
Lucene seems to have Text Database (TDB) in its heritage, which was a Common Lisp text search engine developed at Xerox PARC in the late 80s/early 90s. “An Object-Oriented Architecture for Text Retrieval” is a very nice introduction to the principles on which TDB was built.
This paper presents a software implementation architecture for text retrieval systems that facilitates (a) functional modularization (b) mix-and-match combination of module implementations and (c) definition of inter-module protocols. We show how an object-oriented approach easily accommodates this type of architecture. The design principles are exemplified by code examples in Common Lisp. Taken together these code examples constitute an operational retrieval system. The design principles and protocols implemented have also been instantiated in a large scale retrieval prototype in our research laboratory.
For some reason I'm surprised by the modern style of the code examples in that paper. It's in lower case, and doesn't use tagbody!
(defmethod relevance-search ((app t) query &optional (threshold 10))
(let ((terms ())
(map-tokens #'(lambda (token) (pushnew token terms :test #'token-string=)) query)
(dolist (term terms)
(let ((weight (/ 1.0 (+ (get-term-frequency app (token-image term)) .001))))
(dolist (freq-pair (get-frequency-postings app (token-image term)))
(let* ((id (car freq-pair))
(freq (cdr freq-pair))
(score-pair (assoc id scores)))
(setq score-pair (cons id 0.0) scores (cons score-pair scores)))
(incf (cdr score-pair) (* weight freq))))))
(mapcar #'car (subseq (sort scores #'> :key #'cdr) 0 (min threshold (length scores))))))
(Speaking of search, the way I found the Lisp file containing the paper's code, which I typed in by hand, was with a little mdfind "org_lisp_definitions == '*relevance-search*'". Maybe that Lisp Spotlight plugin turned out to be somewhat useful, after all.)
Montezuma is probably about 50% complete. Searching (as opposed to parsing documents and updating indexes) and query parsing remain to be written.
It might seem absurd to say that the only part left undone in a search engine is the part that searches, but there's a lot of machinery that had to be put in place first:
I can't wait to see how it fares performance-wise to the Ruby and Java versions.
Posted by jjwiseman at March 16, 2006 02:59 PM
I can't help thinking to myself that Lucene is a popular technology that developers with strong language inclinations just *must* port. I mean considerering Lucene, Ferret, Lucene.NET, Beagle, PyLucene, CLucene etc. What is the magic attraction to porting Lucene?
I know! I actually have some ambivalence about doing this port. How many people will use it? What is the value? I guess it's more of an exercise for me than anything.
Because it's fun! Ever since I wrote my own TFxIDF search engine for that class, I've seen various nails for which Lucene sounds like a fine hammer. That recipe project Jim and I wrote was going to be a GOFAI parser, but we realized that a lightweight implementation of a search engine would be more robust and produce results that were as good or better for that particular project.
You're not the only one I know dabbling in Ruby right now. It's on my list -- well, after Python and playing Disgaea.
John, this sounds like great fun. More power to you!
Who would use it? I would! I've been working on a project that requires the kind of full-text search capabilities that Lucene provides. I thought about porting it myself, but didn't due to a lack of time. In the end, I wrote some FLI wrappers around the parts of CLucene that I needed. A Lisp version would be a much nicer solution (and keep me from having to drop into C++ land).
Great project John! What is the url for svn checkout?
Looks like I'm the first to ask -- what's the story behind the name?
Mostly I like the call of the Montezuma Oropendola, a Yucatean bird (Lori clued me in to it when I was looking for a new cell phone ringtone--see the front page of the wiki for an mp3).
But I have to admit that its sounding like a powerful Aztec emperor made it a little more fun to work on the project.
I'd realy like to se this app in it's full power. Search is imo everlasting need.
I'd use it for Planet Lisp. Since I have but don't display the full text of every Planet Lisp blog post in the archive, it's hard to google for something you thought you saw on Planet Lisp that one time. Having a local search would be very helpful.
I don't have a public svn repository yet. It's not like the code does anything right now other than run its unit tests (and just browsing the source isn't too hard via the web interface). But yeah, eventually I'll set something up.
John - please keep us Lucene developers posted about your progress, at least for 2 reasons:
1) If you are serious about it, perhaps there is room for Montezuma under Lucene @ Apache.
2) Erik and I are starting to work on Lucene in Action 2.0, which includes a port about Lucene chapters, and it might be good to at least mention, if not cover Montezuma the same way other ports are covered.
Hi John, until reading this I'd been working on my own port of Lucene to CL. I'm not as far along as you are though, so I'll probably abandon it and find some other project (or maybe go outdoors this weekend for a change.)
Also, "hi" to Otis Gospodnetic and a big thanks for "Lucene in Action". A really nice and lucid book on the subject.
I'm really looking forward to testing your port. Please let me know when something would be available. I'm using LispWorks on Windows (or unix). I'm sure others are anxious to try it as well so please don't think of this as only an exercise!
Duane, my plan right now is to finish up query result sorting and then make a nice example or two, then release something. This could happen by the end of the week.