Testing Montezuma with the Planet Lisp archives:
CL-USER> (defparameter *index*
           (montezuma::index-all-post-files
            (make-instance 'montezuma::index
                           :path (merge-pathnames
                                  (make-pathname :directory '(:relative "postindex"))
                                  montezuma::*corpus-path*)
                           :min-merge-docs 5000)))
Indexed 3819 posts in 96.884 seconds, now optimizing...
Optimization completed in 412.086 seconds.
509 seconds total, for a rate of 7.50 posts/second.
After optimizing a couple I/O routines:
CL-USER> (defparameter *index*
           (montezuma::index-all-post-files
            (make-instance 'montezuma::index
                           :path (merge-pathnames
                                  (make-pathname :directory '(:relative "postindex"))
                                  montezuma::*corpus-path*)
                           :min-merge-docs 5000)))
Indexed 3819 posts in 78.397 seconds, now optimizing...
Optimization completed in 190.193 seconds.
269 seconds total, for a rate of 14.2 posts/second, which is an 89% speedup.
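The rates and the speedup follow directly from the two transcripts. A minimal sketch of the arithmetic (the function name posts-per-second is made up for illustration and is not part of Montezuma); it should come out at about 7.5 and 14.2 posts/second, a speedup of roughly 89%:

```lisp
;; Hypothetical helper, not part of Montezuma: posts divided by the
;; combined time of all phases (indexing plus optimizing).
(defun posts-per-second (posts &rest phase-seconds)
  (/ posts (reduce #'+ phase-seconds)))

;; Timings copied from the two REPL transcripts above.
(let ((before (posts-per-second 3819 96.884 412.086))
      (after  (posts-per-second 3819 78.397 190.193)))
  (format t "before:  ~,2F posts/second~%" before)
  (format t "after:   ~,2F posts/second~%" after)
  ;; Speedup expressed as the percentage increase in throughput.
  (format t "speedup: ~,1F%~%" (* 100 (- (/ after before) 1))))
```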
Posted by jjwiseman at June 02, 2006 04:51 PM
is there any place where i can checkout montezuma files?
Posted by: alaa on June 3, 2006 03:48 AM
Greetings! So, I am incredibly interested in montezuma, as IR is one of my unfortunately many "fields of interest" (and it's lisp! good god this is great.) But I have a question... you are indexing 3819 posts at 14.2 posts/second... do you have a baseline as to how long it takes to index the same corpus in lucene or the ruby equivalent?
Posted by: ed on June 3, 2006 07:20 AM
John, can you give any details about your I/O optimizations, like contrasting before/after code, how you benchmarked, etc.?
I just made a couple of small, simple changes[1]: making write-bytes more efficient than the naive call-write-byte-in-a-loop implementation, and changing my default buffer size from 10 to 128 for RAM-based files and to 4096 for disk-based files. Like I said, low-hanging fruit.
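To illustrate the first kind of change, here is a rough sketch in plain Common Lisp, not the actual Montezuma code (that is in the changeset at [1]). The function names, the test path and the use of WRITE-SEQUENCE as the faster variant are all assumptions made for this example; the buffer size change is not shown.

```lisp
;; Illustrative only: every name here is hypothetical, not Montezuma's.

(defun write-bytes-naive (stream bytes count)
  "Write the first COUNT elements of BYTES with one WRITE-BYTE call each."
  (dotimes (i count)
    (write-byte (aref bytes i) stream)))

(defun write-bytes-bulk (stream bytes count)
  "Write the first COUNT elements of BYTES with a single WRITE-SEQUENCE call."
  (write-sequence bytes stream :start 0 :end count))

(defun time-writer (writer path bytes repetitions)
  "Return the wall-clock seconds WRITER needs to write BYTES to PATH
REPETITIONS times. PATH is overwritten if it already exists."
  (with-open-file (out path :direction :output
                            :element-type '(unsigned-byte 8)
                            :if-exists :supersede)
    (let ((start (get-internal-real-time)))
      (dotimes (i repetitions)
        (funcall writer out bytes (length bytes)))
      (float (/ (- (get-internal-real-time) start)
                internal-time-units-per-second)))))
```

Calling (time-writer #'write-bytes-naive "/tmp/naive.bin" bytes 1000) and then the same with #'write-bytes-bulk, where bytes is an (unsigned-byte 8) vector, gives a crude before/after comparison. How large the difference is will depend on the Lisp implementation and its stream buffering.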
ed: OK, so I wrote some Java code so I could compare against Lucene directly:
Lucene: 3.9 s indexing, 1.4 s optimizing.
Montezuma: 10.7 s indexing, 22.3 s optimizing.
Plenty of room for improvement.
[1] http://projects.heavymeta.org/montezuma/changeset/259
Posted by: John Wiseman on June 3, 2006 09:34 AM
wow, an answer! excellent. well, the actual indexing seems to be a small constant factor away, but the optimizing, I'd argue that's an exponential away... is this an algorithm/data structure issue? because it's almost as if somewhere java is using some O(n) where montezuma is using some n^2... i really do want to jump on and start helping, let's exchange emails and see if maybe i can start trying to find hotspots for optimization?
Posted by: ed on June 3, 2006 09:50 PM