Tuesday, January 17, 2012

Whoosh Search Indexing

I've been using Apache Lucene to build e-commerce search engines for the last six years.  The search engines are built using the Java version of Lucene and include quite a bit of custom functionality for filtering, blocking, sorting, and faceted navigation.

I've been aware of the Whoosh library for Python for a year or so, and I know it provides much of the same functionality as Lucene with similar internals, but until now I had not built any projects with it.

Due to some annoyances with version control, I've decided to jump in and give Whoosh a trial run.  I've often wanted to search through commit messages at work to find commits relating to specific defects or commits made by specific developers.

I decided to start simply and build a library using the following requirements:

  • Each project (tag or branch) in Subversion would be a separate index
  • Each index would use a basic schema to encapsulate Subversion commit messages
  • Both full indexing and incremental indexing (via post-commit hooks) should be supported
  • Searching should be available on commit message, author, revision, and date

With my first attempt, I've implemented searching and full indexing.  I haven't bothered with the incremental indexing yet but expect to build that in the future.

The similarities between Whoosh and Lucene are obvious and not unexpected - Whoosh claims Lucene as one of its "ancestors" in the Introduction page.  Because of my previous experience with Lucene, learning Whoosh was mostly a matter of making connections between Java concepts in Lucene and the analogous modules in Whoosh.

Overall, I'm pretty pleased with the simplicity of the Whoosh library - it provides the same power as Lucene but in a much friendlier (to me) Pythonic model.  In the future, I'll be checking out more advanced concepts such as localization, faceted search, and incremental indexing to see how Whoosh compares to Lucene in those regards.  I expect the major difference between Whoosh and Lucene will be in search performance.  For this project, though, that should not be a concern.

The source for the main library is shown below or you can follow the project at https://github.com/khill/svnsearch.
