May 23, 2006
Montezuma Charges Along

taking blubber samples for DNA analysis using hollow-tipped arrows
Photo by Peter Menzel

Montezuma is coming along pretty well. Indexing is complete and seems mostly bug free, and Gary King and I are finishing off the support for searches. We've got simple term queries, phrase queries, wildcard queries, range queries and boolean queries pretty much working.

For testing, I grabbed 1000 pastes from the paste service at paste.lisp.org. Each paste consists of a number (the unique paste ID), the user who made the paste, the date, the IRC channel, the title of the paste, the contents of the paste, and any annotations that may have been added later (I ignored the annotations).

I built a list in memory containing 1000 paste structs with the information listed above. Given that, the following code will allow you to create a Montezuma index on disk:

(defun index-pastes (pastes)
  (let ((index (make-instance 'index
			      :path (merge-pathnames
				     (make-pathname :directory '(:relative "pasteindex"))
				     *corpus-path*)
			      :default-field "contents")))
  (dolist (paste pastes)
    (index-paste index paste))
  (optimize index)
  index))

(defun index-paste (index paste)
  (let ((doc (make-instance 'document)))
    (let ((number   (make-field "id"   (format nil "~S" (paste-number paste))
				:index :untokenized :stored T))
	  (user     (make-field "user"     (paste-user paste)
				:index :untokenized :stored T))
	  (date     (make-field "date"     (date-string (paste-date paste))
				:index :untokenized :stored T))
	  (channel  (make-field "channel"  (paste-channel paste)
				:index :untokenized :stored T))
	  (title    (make-field "title"    (paste-title paste)
				:index :tokenized :stored T))
	  (contents (make-field "contents" (paste-contents paste)
				:stored NIL :index :tokenized)))
      (add-field doc number)
      (add-field doc user)
      (add-field doc date)
      (add-field doc channel)
      (add-field doc title)
      (add-field doc contents)
      (add-document-to-index *paste-index* doc))))

(defun date-string (universal-time)
  (multiple-value-bind (second minute hour date month year)
      (decode-universal-time universal-time)
    (format nil "~D-~2,'0D-~2,'0D ~2,'0D:~2,'0D:~2,'0D"
	    year month date hour minute second)))

Currently, indexing the 1000 pastes (about 2 MB of data) takes about 100 seconds on a 2.4 GHz P4. Once you've done that, you'll want to be able to perform queries against the index.

The search-pastes function below makes it convenient to perform simple queries. You can pass it the field to query along with a query string or list of strings, which will be turned into a query object. A list of strings will become a boolean query where every term must appear. Wildcards (“*”, “?”) can be used in any query terms.

If you pass a pre-built query object as the second argument it will be used unchanged, and the field argument will be ignored.

(defun search-pastes (field query &optional options)
  (etypecase query
    (list
     ;; Make a boolean query where each clause is a wildcard query
     ;; that MUST occur.
     (let ((words query))
       (setf query (make-instance 'boolean-query))
       (dolist (word words)
	 (add-query query
		    (make-instance 'wildcard-query
				   :term (make-term field word))
		    :must-occur))))
    (string
     ;; Make a single-term wildcard query.
     (let ((word query))
       (setf query (make-instance 'wildcard-query
				  :term (make-term field word)))))
    (query
     ;; Don't need to do anything, use it as-is.
     ))
  ;; Perform the search
  (let ((num-results 0))
    (search-each *paste-index* query
		 #'(lambda (doc score)
		     (when (= num-results 0)
		       (format T "~&~5A ~19A ~5A ~15A" "Score" "Date" "#" "User")
		       (format T "~&-------------------------------------------"))
		     (incf num-results)
		     (print-result doc score))
		 options)
    (format T "~&~%~S results displayed." num-results)))

(defun print-result (doc score)
  (let ((paste (get-document *paste-index* doc)))
    (format T "~&~5,2F ~A ~A ~15A~&~vt~A"
	    score
	    (field-data (document-field paste "date"))
	    (field-data (document-field paste "id"))
	    (field-data (document-field paste "user"))
	    10
	    (field-data (document-field paste "title")))))

Here's a simple first example, which finds pastes I pasted:

MONTEZUMA> (search-pastes "user" "lemonodor")
Score Date                #     User           
-------------------------------------------
 6.30 2006-05-08 18:53:03 19817 lemonodor      
          opaque sbcl warning
 6.30 2006-05-05 16:07:11 19726 lemonodor      
          primitive montezuma example
 6.30 2006-04-25 21:07:59 19371 lemonodor      
          flet declarations
 6.30 2006-04-14 12:04:38 18990 lemonodor      
          search-and-replace

4 results displayed.

Now we'll search for pastes with titles containing any word with “sbcl” as a substring, which demonstrates wildcard queries:

MONTEZUMA> (search-pastes "title" "*sbcl*")
Score Date                #     User           
-------------------------------------------
 2.88 2006-04-15 18:19:31 19045 malsyned__     
          My sbclrc
 1.44 2006-04-22 15:37:08 19253 Shine          
          incrementally building sbcl.exe still doesn't work "the party is over"
 1.31 2006-04-21 21:42:52 19239 Shine          
          sbcl backtrace
 1.31 2006-04-20 21:47:40 19212 technomancy    
          sbcl error
 1.05 2006-05-08 18:53:03 19817 lemonodor      
          opaque sbcl warning
 1.05 2006-05-04 23:28:53 19701 interferon     
          SBCL disassembly example
 1.05 2006-05-04 23:28:35 19700 interferon     
          SBCL disassembly example
 1.05 2006-04-30 17:27:51 19551 dabaR          
          Error with SBCL
 1.05 2006-04-27 12:00:59 19428 death          
          sbcl 0.9.11 (room) bugged
 1.05 2006-04-27 03:34:17 19419 arbscht        
          sbcl sarge24 deb failure

10 results displayed.

There are more than 10 pastes that match the above query, but by default only the top 10 results are returned. Let's find all of them:

MONTEZUMA> (search-pastes "title" "*sbcl*" '(:num-docs 1000))
Score Date                #     User           
-------------------------------------------
 2.88 2006-04-15 18:19:31 19045 malsyned__     
          My sbclrc
 1.44 2006-04-22 15:37:08 19253 Shine          
          incrementally building sbcl.exe still doesn't work "the party is over"

[...]

 0.65 2006-04-10 07:24:06 18808 antifuchs      
          gna. nested errors with sbcl 0.9.11.26 and slime
 0.52 2006-04-17 10:12:04 19083 tcr            
          compiling latest SWANK with flags (speed 0) (safety 3) (debug 3) w/ SBCL 0.9.11

22 results displayed.

This query matches pastes containing both “sbcl” and “bug*” from May 2006 on (an example of a range query):

MONTEZUMA> (let ((q (make-instance 'boolean-query)))
	     (add-query q (make-instance 'term-query
					 :term (make-term "contents" "sbcl"))
			:must-occur)
	     (add-query q (make-instance 'wildcard-query
					 :term (make-term "contents" "bug*"))
			:must-occur)
	     (add-query q (make-instance 'range-query
					 :field "date"
					 :lower-term "2006-05"
					 :upper-term nil
					 :include-lower-p T
					 :include-upper-p NIL)
			:must-occur)
	     (search-pastes nil q))
Score Date                #     User           
-------------------------------------------
 0.49 2006-05-09 13:36:33 19832 tritchey       
          this is from a clean build
 0.49 2006-05-09 13:36:33 19832 tritchey       
          this is from a clean build
 0.49 2006-05-09 13:36:33 19832 tritchey       
          this is from a clean build
 0.49 2006-05-09 13:36:33 19832 tritchey       
          this is from a clean build
 0.49 2006-05-09 13:36:33 19832 tritchey       
          this is from a clean build
 0.49 2006-05-09 13:36:33 19832 tritchey       
          this is from a clean build

6 results displayed.

Oops. Looks like there's a bug that's giving us duplicate results. Sorry!

Now look for pastes containing “sbcl” made by someone whose user name does not begin with a “t”:

MONTEZUMA> (let ((q (make-instance 'boolean-query)))
	     (add-query q (make-instance 'term-query
					 :term (make-term "contents" "sbcl"))
			:must-occur)
	     (add-query q (make-instance 'wildcard-query
					 :term (make-term "user" "t*"))
			:must-not-occur)
	     (search-pastes nil q))
Score Date                #     User           
-------------------------------------------
 0.87 2006-04-27 03:34:17 19419 arbscht        
          sbcl sarge24 deb failure
 0.80 2006-04-23 14:38:32 19290 antifuchs      
          mp and fix
 0.75 2006-04-28 11:46:44 19469 Xof            
          see also http://www-jcsu.jesus.cam.ac.uk/~csr21/ed-in-climacs.png
 0.57 2006-04-22 15:37:08 19253 Shine          
          incrementally building sbcl.exe still doesn't work "the party is over"
 0.53 2006-04-26 19:43:19 19408 pfdietz        
          misc.640
 0.53 2006-04-21 18:25:36 19238 waddletron2k   
          my .emacs
 0.53 2006-04-14 16:30:07 18997 jsnell         
          code for lemonodor
 0.52 2006-04-19 17:41:25 19160 slyrus         
          tinaa docs aver any-keyp error
 0.46 2006-04-28 09:02:10 19462 nyef           
          I can't tell if this is progress or not.
 0.46 2006-04-19 15:30:26 19158 Shine          
          problem with SBCL on Windows, compiled with Cygwin

10 results displayed.

For the last example, we'll create a phrase query corresponding to “svn co”:

MONTEZUMA> (defparameter *q*
	     (make-instance 'phrase-query))
*Q*
MONTEZUMA> (add-term-to-query *q* (make-term "contents" "svn"))
#<PHRASE-QUERY field:"contents" terms: "svn":0 >
MONTEZUMA> (add-term-to-query *q* (make-term "contents" "co"))
#<PHRASE-QUERY field:"contents" terms: "svn":0 "co":1 >
MONTEZUMA> (search-pastes nil *q*)
; Evaluation aborted
MONTEZUMA> *q*
#<PHRASE-QUERY field:"contents" terms: "svn":0 "co":1 >
MONTEZUMA> (search-pastes nil *q*)
There is no applicable method for the generic function
  #<STANDARD-GENERIC-FUNCTION CLOSE (32)>
when called with arguments
  (NIL).
   [Condition of type SIMPLE-ERROR]

Well, it seems that despite passing some unit tests, phrase queries are falling down on this larger index. Time to fire up emacs and fix that.

Posted by jjwiseman at May 23, 2006 07:30 PM
Comments

It's looking great so far! Awesome.

2 MB of data in a minute and a half... eh, okay. Is it indexing the entire paste, or just the first n words? A common search engine technique is to only index the beginning of a document, on the assumption that if the doc is about that thing, it'll come up in the beginning.


On a side note, I'm using SVN now at this place I'm consulting at, and it's great. I love the built-in wiki, bug tracking, code review tools, &c.

Posted by: Michael Hannemann on May 23, 2006 08:11 PM

Yeah, the indexing speed isn't too great yet. I'm pretty sure there's a lot of low-hanging optimization fruit, though.

I think the default is to index the first 10000 terms in a document--I had the limit set to 1000 and noticed that there were a couple pastes that went over that limit.

Posted by: John Wiseman on May 23, 2006 11:43 PM

Out of curiousity, how much memory do you have on that P4? Not that it really matters - optimization is definitely not the first thing to work on. =)

Like I said, this looks great! And I like the full-circle nature of the project, too.

Posted by: Michael Hannemann on May 24, 2006 06:04 AM

IMHO the following would be more DRY:

(add-fields doc number
user
date
channel
title
contents)

anyway, its a great job.

keep the H.W.

Posted by: alaa on May 24, 2006 12:03 PM
Post a comment
Name:


Email Address:


URL:




Unless you answer this question, your comment will be classified as spam and will not be posted.
(I'll give you a hint: the answer is “lisp”.)

Comments:


Remember info?