Lemonodor: Low Hanging Montezuma Fruit

June 02, 2006

From Mr. Jalopy's childhood-in-a-jar.

Testing Montezuma with the Planet Lisp archives:

CL-USER> (defparameter *index*
           (montezuma::index-all-post-files
            (make-instance 'montezuma::index
                           :path (merge-pathnames
                                  (make-pathname :directory '(:relative "postindex"))
                                  montezuma::*corpus-path*)
                           :min-merge-docs 5000)))
Indexed 3819 posts in 96.884 seconds, now optimizing...
Optimization completed in 412.086 seconds.

509 seconds total, for a rate of 7.50 posts/second.

After optimizing a couple I/O routines:

CL-USER> (defparameter *index*
           (montezuma::index-all-post-files
            (make-instance 'montezuma::index
                           :path (merge-pathnames
                                  (make-pathname :directory '(:relative "postindex"))
                                  montezuma::*corpus-path*)
                           :min-merge-docs 5000)))
Indexed 3819 posts in 78.397 seconds, now optimizing... 
Optimization completed in 190.193 seconds.

269 seconds total, for a rate of 14.2 posts/second, which is an 89% speedup.
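The rates quoted above can be rechecked from the REPL output. This is only a sketch: `*post-count*` and `posts-per-second` are names made up for the check, not part of Montezuma, and the timings are copied from the two transcripts.

```lisp
;; Hypothetical helper names; the numbers come from the transcripts above.
(defparameter *post-count* 3819)

(defun posts-per-second (index-seconds optimize-seconds)
  "Overall throughput: posts divided by indexing plus optimizing time."
  (/ *post-count* (+ index-seconds optimize-seconds)))

(let* ((before (posts-per-second 96.884 412.086))
       (after  (posts-per-second 78.397 190.193)))
  (format t "before: ~,2F posts/s~%" before)
  (format t "after:  ~,2F posts/s~%" after)
  ;; Speedup here means the relative gain in throughput.
  (format t "throughput gain: ~,1F%~%" (* 100 (- (/ after before) 1))))
```

Using the unrounded totals, the gain comes out a little over 89%, in line with the figure above.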

Posted by jjwiseman at June 02, 2006 04:51 PM

Comments

is there any place where i can checkout montezuma files?

Posted by: alaa on June 3, 2006 03:48 AM

Greetings! So, I am incredibly interested in montezuma, as IR is one of my unfortunately many "fields of interest" (and it's lisp! good god this is great.) But I have a question... you are indexing 3819 posts at 14.2 posts/second... do you have a baseline as to how long it takes to index the same corpus in lucene or the ruby equivalent?

Posted by: ed on June 3, 2006 07:20 AM

John, can you give any details about your I/O optimizations, like contrasting before/after code, how you benchmarked, etc.?

Posted by: michaelw on June 3, 2006 08:21 AM

I just made a couple of small, simple changes[1]: making write-bytes more efficient than the naive call-write-byte-in-a-loop implementation, and changing my default buffer size from 10 to 128 for RAM-based files and 4096 for disk-based files. Like I said, low hanging fruit.
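For readers who want to try the write-bytes idea themselves, here is a minimal sketch of the general contrast. It is not the code from the changeset: `naive-write-bytes`, `bulk-write-bytes`, and `time-writer` are hypothetical names, and the sketch assumes a plain binary file stream with element type `(unsigned-byte 8)` rather than Montezuma's own buffered streams.

```lisp
;; Hypothetical names throughout; only WRITE-BYTE, WRITE-SEQUENCE and
;; the timing functions are standard Common Lisp.

(defun naive-write-bytes (bytes stream)
  "Write BYTES one element at a time: one WRITE-BYTE call per byte."
  (loop for b across bytes
        do (write-byte b stream)))

(defun bulk-write-bytes (bytes stream)
  "Hand the whole vector to WRITE-SEQUENCE in a single call."
  (write-sequence bytes stream))

(defun time-writer (writer path &key (chunks 1000) (chunk-size 4096))
  "Write CHUNKS vectors of CHUNK-SIZE bytes to PATH with WRITER.
Returns the elapsed wall-clock time in seconds."
  (let ((bytes (make-array chunk-size
                           :element-type '(unsigned-byte 8)
                           :initial-element 42))
        (start (get-internal-real-time)))
    (with-open-file (out path
                         :direction :output
                         :element-type '(unsigned-byte 8)
                         :if-exists :supersede)
      (dotimes (i chunks)
        (funcall writer bytes out)))
    (float (/ (- (get-internal-real-time) start)
              internal-time-units-per-second))))
```

Something like `(time-writer #'naive-write-bytes "/tmp/naive.bin")` versus `(time-writer #'bulk-write-bytes "/tmp/bulk.bin")` gives a rough comparison (the paths are placeholders). How large the difference is will depend on the Lisp implementation and on how its file streams are buffered.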

ed: OK, so I wrote some Java code so I could compare against Lucene directly:

Lucene: 3.9 s indexing, 1.4 s optimizing.
Montezuma: 10.7 s indexing, 22.3 s optimizing.

Plenty of room for improvement.

[1] http://projects.heavymeta.org/montezuma/changeset/259

Posted by: John Wiseman on June 3, 2006 09:34 AM

wow, an answer! excellent. well, the actual indexing seems to be a small constant factor away, but the optimizing, I'd argue that's an exponential away... is this an algorithm/data structure issue? because it's almost as if somewhere java is using some O(n) where montezuma is using some O(n^2)... i really do want to jump in and start helping, let's exchange emails and see if maybe i can start trying to find hotspots for optimization?

Posted by: ed on June 3, 2006 09:50 PM