8.01.2009

Lucene - Lessons Learned

Over the last 3 years I have kind of taken the role at work as the Lucene expert. Any enhancements that require search components either come to me directly, or the developer on the project is told to run their ideas by me or chat with me about the project as a bare minimum. This has proven to be a very valuable role in my career and as such has given me the opportunity to lead my team on other experimental projects and concepts with things like JMX, JMS, Security, etc.

As great of a product as Apache Lucene is, it simply amazes me how a product that has been around for so long, and that is used by so many people around the world, has so little documentation. Googling lucene issues will often answer your questions, but rarely do I find the answer to a question I have on any sites that are directly associated with Lucene.

That being said, most of what I know about Lucene has been learned by trial and error, and looking at the source. Last week I was tasked with increasing the relevancy of our search results on some of the search components that I had developed. I was going to be experimenting with boosting score for matches in particular fields and also tuning the fuzzy search to provide as accurate results as could be obtained without completely rewriting the data that backed the search.

Enter the Lucene QueryParser - a mysterious and from what I can gather, not very well understood but extremely powerful tool in the Lucene framework. The QueryParser takes a 'google-style' text query and turns it into a Query object that can be used to search your index. For example, the following string:

name:Chris and name:Schmidt

Would be turned into a Query containing 2 BooleanQuery objects. There are some modifiers that can be added to queries to alter the way that Lucene builds the query objects and this adds a great amout of flexibility to simple searches.

The first one that I will be talking about is the Boost modifier (^). This allows you to specify that for matches found in a particular field, the relevancy should be boosted by a factor of X (or some derivitave therein, since rarely have I seen the boost that I specify as the actual boost applied to the score). To expand on the above example:

name:Chris^1.5 name:Schmidt

This interprets to the same query as above, with the exception that if 10 Schmidts are found, Chris would be the most relevant Schmidt that I am searching for, so if there is a Chris Schmidy result, he should be moved up in relevancy by a factor of 1.5. This can be a pretty handy tool, but it is extremely easy to overdo it on the boosting which can completely destroy the relevancy of your results. A good rule of thumb is to start by boosting the field that you think will apply as the most relevant for the context of the search being performed and boost it in small increments only. a boost of 1.5 may not seem like much until you see how it actually affects your results.

Another good rule of thumb with boosts, is to apply them to things that will be exact keyword matches, applying boost to a fuzzy search will greatly reduce the relevancy of the results you are returning.

Now let's move on to the next modifier, the fuzzy search (~) modifier. This is another one that if used incorrectly can greatly reduce the relevancy of the results that a query returns, and a side effect of using fuzzy searches is that it will return exponentially more results that a standard keyword search will. The fuzzy search uses the Levenshtein Edit Distance Algorythm to calculate what a user actually meant to search for when they fat=finger or mispell a search term.

If you are unfamiliar with the Levenshtein Edit Distance concept, it is a basically a mathematical formula to calculate the # of edits that would need to be applied to a word to transform it into another word. This is a very popular algorythm used by spell checkers and similar applications. An example would be:

C H R I S
C O O L

The edit distance between the 2 words presented above is 4.

To transform Chris into Cool the following edits would have to be made:
1. Change H -> O
2. Change R -> O
3. Change I -> L
4. Drop the S

Lucene usees this Algorythm to calculate word similarity. Although the implementation of this algorythm in Lucene is far more complex ( FuzzyTermEnum - Line 168 ) the basic's are that Lucene calculates the Edit distance between the two words, and divides the edit distance by the length of the shorter term which provides the similarity between the two words.

By default, the Fuzzy search will default to a value of 0.5 similarity as it's threshold but this has always seemed pretty aggressive for most fuzzy searches to me as it basically insinuates that half of the letters in the term can be different and it will consider this a match.

I generally have gone to starting with a baseline of 0.85 and incrementing or decrementing by 0.05 until I reach the sweet-spot where I am finding common mispellings of the terms that I am tuning for but not overdoing it. A good example of where overdoing a fuzzy search can be detrimental is at ServiceMagic where I work we index the names of Home Improvement tasks. There are 2 examples I can think of off the top of my head that have bit us with fuzzy searching.

SIDING
SHINGLE
SLIDE

PLUMBER
PLAYER (DVD)

As you can tell, the tasks that were matched with fuzzy searches have no contextual attachment to eachother. Someone who is looking to get new Siding on there house is probably not looking for someone to repair the roof on their house, or build a playground in the backyard. Along the same lines, someone who has a clogged drain is more than likely not looking for someone to help them install and configure their Blu-Ray DVD Player.

Both of these Modifiers are extremely powerful ways to make your search results great, but they both have drawbacks and when used incorrectly can ruin your results. There is another gotcha with Fuzzy searches that I want to cover quickly. I will probably go into more depth on this subject in a subsequent blog post, however, this bit me hard last week and I think it is worthwhile to share.

There are a great deal of other things that can be done both during indexing and searching to make your results better and one of those is using the PorterStemFilter which takes words that are passed into it and transforms them into what it believes to be the 'root' word of that word. An example of this would be if you had the term Writing or Writes the root word that would be returned after filtering would be Write. This is something that happens in the Analyzer stage of both indexing and queryparsing and as such the following is important to remember when using both fuzzy searches and stemming. If you pass in a query like writing~0.085 to the QueryParser, you would probably assume that the parsed query might look like write~0.085 however, the PorterStemFilter will not stem words that are modified with the Fuzzy search modifier. Where this becomes important is where you are stemming during the indexing phase and doing fuzzy searches in the searching phase. The keyword writing will not actually be indexed anywhere and the by-product of this is that you may not match any documents that you would expect to be matching with this query.

If you are using Stemming and Fuzzy queries, the answer I have found is to generate queries that look similar to this:

keywords:writing~0.085 keywords:writing^1.10 (Match to terms in the keyword field with .085 similarity or better without stemming but apply a 1.10 boost to matches of the stemmed keyword)

It may seem redundant, but when using Stemming the final parsed query will actually be:

keywords:writing~0.085 keywords:write^1.10

This will drastically improve the quality of results when using both of these things to tune your results and indexes.

I would be happy to answer any questions that anyone has about this so feel free to comment or ask away.

3 comments:

  1. Hello!I am working on LUcene 3.6. I wanted to know how I could query a field A and get the results for corresponding fields B and C. So far, I have made functions for getting corresponding B for A, but what should I use for getting multiple fields?
    THankyou!

    ReplyDelete
  2. Your coach factory outlet own areas including South America Brazil sexy bikinis are usually just as arriving revealing simply because levels. Although coach outlet store online Brazil sexy bikinis are usually regarded as put on regarding sailing, your own South America Brazil sexy bikinis are coach outlet usually almost all superbly recognized regarding sunbathing utilize.

    Ladies might coach outlet store online regarded as among around three variants regarding many of these swimwear tops-triangle, halter, together with bandeau. Your coach purses outlet own triangle protects your own breasts location utilizing a set of teeny-tiny portions which coach outlet will claim that a few might cope with an inferior section of your own boxes.

    A coach online outlet brand new halter perfect is performed utilizing to some extent extra sheet and it is cinched through the acoustic guitar neck of the guitar to keep your own bosom by coach outlet way of rapidly declining from your match. Your own bandeau is like the actual flexible television which will techniques within the coach outlet online accompany nestles through the upper body muscle tissue location, without neckties using the acoustic guitar neck of the guitar in addition to back.

    ReplyDelete
  3. Nice post, great insights into Lucene. We're working on fuzziness and I liked your tips on the fuzz factor.

    ReplyDelete