Lemonodor: Montezuma Begins

March 16, 2006

Montezuma Begins

Catherine Chalmers, Electrocution

Lucene is a very popular open source search engine library written in Java. Ferret is a Ruby port of Lucene. Montezuma is my in-progress Lisp port of Ferret.

The idea is that if you have a Lisp application in which you'd like to be able to do text-based search, Montezuma will handle it. One search engine flexible enough to be used for project directories, Blog engines, wikis, IRC bots and paste services. Custom match scoring, phrase queries, wildcard queries, word proximity queries, date range searching and fielded (e.g., title, author, date, contents, channel) searching should be significantly more powerful, and easier for users than throwing a regex input field up on a page. If you want, you can write your own tokenizers and document classes.

I don't even remember what burning desire for Lisp-based search functionality originally motivated me to start writing Montezuma, but let the post hoc explanation include the following:

Learning Ruby. It's been interesting comparing the original Ruby version with my Lisp translation.
Learning about the inner workings of search engines. Though I've been surprised at how far I've been able to get without a complete understanding of Lucene-style index management.
Bringing Lucene home to Lisp.

Lucene seems to have Text Database (TDB) in its heritage, which was a Common Lisp text search engine developed at Xerox PARC in the late 80s/early 90s. “An Object-Oriented Architecture for Text Retrieval” is a very nice introduction to the principles on which TDB was built.

This paper presents a software implementation architecture for text retrieval systems that facilitates (a) functional modularization (b) mix-and-match combination of module implementations and (c) definition of inter-module protocols. We show how an object-oriented approach easily accommodates this type of architecture. The design principles are exemplified by code examples in Common Lisp. Taken together these code examples constitute an operational retrieval system. The design principles and protocols implemented have also been instantiated in a large scale retrieval prototype in our research laboratory.

For some reason I'm surprised by the modern style of the code examples in that paper. It's in lower case, and doesn't use tagbody!

(defmethod relevance-search ((app t) query &optional (threshold 10))
  (let ((terms ())
	(scores ()))
    (map-tokens #'(lambda (token) (pushnew token terms :test #'token-string=)) query)
    (dolist (term terms)
      (let ((weight (/ 1.0 (+ (get-term-frequency app (token-image term)) .001))))
	(dolist (freq-pair (get-frequency-postings app (token-image term)))
	  (let* ((id (car freq-pair))
		 (freq (cdr freq-pair))
		 (score-pair (assoc id scores)))
	    (unless score-pair
	      (setq score-pair (cons id 0.0) scores (cons score-pair scores)))
	    (incf (cdr score-pair) (* weight freq))))))
    (mapcar #'car (subseq (sort scores #'> :key #'cdr) 0 (min threshold (length scores))))))

(Speaking of search, the way I found the Lisp file containing the paper's code, which I typed in by hand, was with a little mdfind "org_lisp_definitions == '*relevance-search*'". Maybe that Lisp Spotlight plugin turned out to be somewhat useful, after all.)

Montezuma is probably about 50% complete. Searching (as opposed to parsing documents and updating indexes) and query parsing remain to be written.

It might seem absurd to say that the only part left undone in a search engine is the part that searches, but there's a lot of machinery that had to be put in place first:

lucene architecture

I can't wait to see how it fares performance-wise to the Ruby and Java versions.

Posted by jjwiseman at March 16, 2006 02:59 PM

Comments

I can't help thinking to myself that Lucene is a popular technology that developers with strong language inclinations just *must* port. I mean considerering Lucene, Ferret, Lucene.NET, Beagle, PyLucene, CLucene etc. What is the magic attraction to porting Lucene?

Posted by: Scott Bell on March 16, 2006 04:31 PM

I know! I actually have some ambivalence about doing this port. How many people will use it? What is the value? I guess it's more of an exercise for me than anything.

Posted by: John Wiseman on March 16, 2006 04:38 PM

Because it's fun! Ever since I wrote my own TFxIDF search engine for that class, I've seen various nails for which Lucene sounds like a fine hammer. That recipe project Jim and I wrote was going to be a GOFAI parser, but we realized that a lightweight implementation of a search engine would be more robust and produce results that were as good or better for that particular project.

You're not the only one I know dabbling in Ruby right now. It's on my list -- well, after Python and playing Disgaea.

Posted by: Michael Hannemann on March 16, 2006 05:03 PM

John, this sounds like great fun. More power to you!

Posted by: Gary King on March 16, 2006 05:11 PM

Who would use it? I would! I've been working on a project that requires the kind of full-text search capabilities that Lucene provides. I thought about porting it myself, but didn't due to a lack of time. In the end, I wrote some FLI wrappers around the parts of CLucene that I needed. A Lisp version would be a much nicer solution (and keep me from having to drop into C++ land).

Posted by: Jon Israelson on March 16, 2006 05:17 PM

Great project John! What is the url for svn checkout?

Posted by: Wally Cash on March 16, 2006 05:33 PM

Looks like I'm the first to ask -- what's the story behind the name?

Posted by: Patrick Collison on March 16, 2006 05:52 PM

Mostly I like the call of the Montezuma Oropendola, a Yucatean bird (Lori clued me in to it when I was looking for a new cell phone ringtone--see the front page of the wiki for an mp3).

But I have to admit that its sounding like a powerful Aztec emperor made it a little more fun to work on the project.

Posted by: John Wiseman on March 16, 2006 05:57 PM

I'd realy like to se this app in it's full power. Search is imo everlasting need.

Posted by: Damir on March 16, 2006 11:34 PM

I'd use it for Planet Lisp. Since I have but don't display the full text of every Planet Lisp blog post in the archive, it's hard to google for something you thought you saw on Planet Lisp that one time. Having a local search would be very helpful.

Posted by: Zach Beane on March 17, 2006 06:18 AM

I don't have a public svn repository yet. It's not like the code does anything right now other than run its unit tests (and just browsing the source isn't too hard via the web interface). But yeah, eventually I'll set something up.

Posted by: John Wiseman on March 17, 2006 11:29 AM

John - please keep us Lucene developers posted about your progress, at least for 2 reasons:
1) If you are serious about it, perhaps there is room for Montezuma under Lucene @ Apache.
2) Erik and I are starting to work on Lucene in Action 2.0, which includes a port about Lucene chapters, and it might be good to at least mention, if not cover Montezuma the same way other ports are covered.

Posted by: Otis Gospodnetic on March 17, 2006 04:03 PM

Hi John, until reading this I'd been working on my own port of Lucene to CL. I'm not as far along as you are though, so I'll probably abandon it and find some other project (or maybe go outdoors this weekend for a change.)

Also, "hi" to Otis Gospodnetic and a big thanks for "Lucene in Action". A really nice and lucid book on the subject.

Posted by: Peter Eddy on March 19, 2006 08:13 AM

http://wiki.apache.org/jakarta-lucene/LucyProposal

Doug Cutting, Dave Balmain, Marvin Humphrey have got ASF approval for a C backend to lucene for all dynamic languages (Perl, Ruby, Java, et al.) FYI.

Posted by: Ben on June 4, 2006 08:07 PM

John,

I'm really looking forward to testing your port. Please let me know when something would be available. I'm using LispWorks on Windows (or unix). I'm sure others are anxious to try it as well so please don't think of this as only an exercise!

Thanks,
-- Duane

Posted by: Duane Searsmith on June 19, 2006 11:51 AM

Duane, my plan right now is to finish up query result sorting and then make a nice example or two, then release something. This could happen by the end of the week.

Posted by: John Wiseman on June 19, 2006 12:03 PM