8.06.2009
Twitter DDoS'd - Not related to recent activities?
Let me start by saying, "Yeah right, Twitter!"
Here's the problem: nobody, least of all Twitter, really knows the extent of the information that was acquired by Hacker Croll a few weeks ago. There is only speculation as to how deep into Twitter's infrastructure he got, and only he knows.
Now, just a couple of weeks after the Hacker Croll incident, Twitter suffers a massive DDoS attack. There are two types of DDoS attacks: those meant to bring a network down completely, and those meant to divert the corporate IT team's attention for a period of time while the real work is done on a target service that isn't being attacked.
If I were Twitter, that's where I would be focusing my attention at this very minute: which services didn't suffer from the DDoS, and who accessed those services while the DDoS was happening. Any attacker with any experience in the field will have erased their tracks long before anyone thought to focus on the stuff that didn't go down, so whatever the real purpose of the DDoS was, Twitter will likely have to sit on their hands until it is revealed or the person behind it slips up.
So, you might find yourself asking, "Well what should you do in a DDoS situation?"
Others will have different opinions, I am sure, but my answer is simple: focus 50% of your resources on the services that are down and the rest on the services that aren't visibly affected.
It is always possible this was just some group of $kiddies with a network of zombies pulling a prank, but given the amount of news around Twitter lately, and the high-profile hacks that have infected their media coverage, I find that highly unlikely.
We will see if I am right soon enough, I suppose. But at a bare minimum, if I were at Twitter, I would be focusing a lot of attention on a full site audit right now: taking inventory of every machine that has access to the internal network, and auditing every employee in the organization who was involved, directly or indirectly, with the fiasco a couple of weeks ago.
What are your thoughts?
Labels:
ddos,
hack,
internet security,
twitter
8.05.2009
The State of Internet Security - Revisited
In June I wrote a blog post on the state of security on the net, and I keep hearing the experts saying the same things I have said. In an interview with Dan Kaminsky about a recent SSL and DNS vulnerability, Kaminsky put it the same way I did:
"This is our best technology for doing authentication and it failed," he said. "We'll fix it, but it's another sign that we need to revisit how we do the basics; how we do authentication on the internet."
That's exactly it: we need to go back to the drawing board. Why don't we spend some time and money and put all of these experts, and I mean the real experts, the ones who are breaking protocols and smashing the stack every day because they enjoy it, in one place? Why don't we give them a digital whiteboard and all the food they can handle, and let them design a system that works?
Granted, there is no such thing as a completely secure system, but I'll bet that armed with the knowledge we have today, the tools, and a budget, we could come up with something a lot closer than a system that was designed before XSS and SQL injection were even a twinkle in some $kiddie's parent's eye.
I feel a little bit better now after that rant. What really irks me is that everyone has thought it; most of us have even said it aloud: the system doesn't work. We keep trying to hack fixes into decades-old code to account for these new bugs, but it's like putting a brand new Hemi into a 1982 Toyota Corolla - it just doesn't work.
"This is our best technology for doing authentication and it failed," he said. "We'll fix it, but it's another sign that we need to revisit how we do the basics; how we do authentication on the internet."
That's exactly it, we need to go back to the drawing board. Why don't we spend some time and money and put all of these experts, and I mean the real experts, the ones who are breaking protocols and smashing the stack every day because they enjoy it, get them all in one place. Why don't we give them a digital whiteboard, all the food they can handle, and let them design a system that works!
Granted, there is no such thing as a completely secure system, but I'll bet that armed with the knowledge that we have today, the tools and a budget, we could come up with something that is a lot closer than a system that was designed before XSS and SQL Injection on the internet were even a twinkle in some $kiddie's parent's eye.
I feel a little bit better now after that rant. What really irks me is that everyone has thought it, most of us have even said it aloud! The system doesn't work. We keep trying to hack fixes into decades old code to account for these new bugs, but it's like putting a brand new Hemi into a 1982 Toyota Corolla - it just doesn't work.
Labels:
dns,
internet security,
Kaminsky,
ssl
8.04.2009
Synchronizing the HttpSession
This is something I have heard a great deal of debate about over the last two years. The servlet spec was somewhat recently amended to clarify that there is no guarantee that multiple calls to HttpServletRequest.getSession() or HttpServletRequest.getSession(boolean) will return the same object. This holds especially true in containers like Tomcat that return a facade object wrapping the actual HttpSession you are working with.
Why would you want to synchronize a session anyway?
The answer is pretty simple actually. Consider the following theoretical block of code:
import java.io.IOException;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;

import org.owasp.esapi.ESAPI;
import org.owasp.esapi.Logger;
import org.owasp.esapi.User;
import org.owasp.esapi.errors.AccessControlException;

public class WithdrawFundsServlet extends HttpServlet {

    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        User u = ESAPI.authenticator().getCurrentUser();
        String withdrawAmt = request.getParameter("withdrawAmt");
        float amt;

        // Account and AccountFacade are hypothetical domain classes for this example.
        HttpSession session = request.getSession();
        Account acct = (Account) session.getAttribute("acct_" + u.getAccountName());

        try {
            amt = Float.parseFloat(withdrawAmt);
        } catch (Throwable t) {
            ESAPI.log().info(Logger.SECURITY_FAILURE, "Non-numeric value passed as withdraw amount");
            try {
                ESAPI.httpUtilities().sendForward(request, response, "/error");
            } catch (AccessControlException ignored) { }
            // Bail out so withdraw() is never called with an unassigned amount.
            return;
        }

        // Calling withdraw will queue a check to be printed and mailed to the customer.
        AccountFacade.withdraw(acct, amt);

        try {
            ESAPI.httpUtilities().sendForward(request, response, "/success");
        } catch (AccessControlException ignored) { }
    }
}
Now there are a couple of things I am sure you will notice if you are paying attention. The first is that yes, this example uses the ESAPI. Call it a shameless plug :). The second is that I am ignoring AccessControlExceptions. This is purely to keep the example short and to the point; in production code you would never want to do this. There would also be some validation code in there as well.
Aside from those things, it looks innocent enough right? Well let's consider this for a second with a scenario.
Joe needs to have a check cut to him from his account at SomeBodiesBank. So he gets online and hits the form for the above servlet. Joe is not that savvy a computer user and, like most novice internet users, he has a tendency to double-click on everything. He fills out the form to withdraw $500 from his account and double-clicks the submit button. Somewhere on the backend, we'll say in the AccountFacade.withdraw method, the software validates that Joe has enough money to cover the check; it discovers he has $750 in his checking account, so everything looks good. But wait a minute, Joe double-clicked, remember?
Do you know what happens when you double-click the submit button on a form? Two requests get submitted, one right after the other. Hmmmm. So now I have two requests entering this method at practically the same time; both requests check Joe's balance and discover that he has $750 in his account, so they both queue up a request to print a check for the requested amount. There's only one problem: these are cashier's checks. The bank has withdrawn $1000 (or, in some circumstances, maybe only the original $500) from his account, but Joe ended up with $1000 in cashier's checks!
The checks show up in the mail, and Joe, being the responsible individual he is, reports this to the bank. The bank will likely write it off as an anomaly, and the bug will remain until one day when Joe is down on his luck and remembers it. He finds a program called JMeter and submits 1000 requests to the servlet, as fast as he can, for $1000 withdrawals. When his $1,000,000 in cashier's checks arrives, he promptly leaves the country and disappears into the backwoods of New Zealand, never to be heard from again.
So the moral of the story is that this problem could have been easily avoided simply by adding thread-safety measures to the code. Granted, the example is extreme and the consequence even more extreme, but I can promise you that something similar has already happened, and even more so, I can guarantee that something similar will happen again.
So, with this knowledge, what is the correct way to add thread safety around manipulating the session? It's quite simple, actually. Something like:
// The interned session ID is the same String instance for every request in this
// session, so all of them contend on the same lock.
final Object lock = request.getSession().getId().intern();
synchronized (lock) {
    AccountFacade.withdraw(acct, amt);
}
Would do the trick in this simple example.
It's important when using synchronization to always lock on immutable objects. It is also important to use the same lock when locking in multiple places where you are working with the same data. Thread safety is an entire subject on its own that is well beyond the scope of this blog post, so I will cut to the chase here.
This is incorrect, and not guaranteed:
synchronized (request.getSession()) {
// do stuff
}
While this method is proven and works:
synchronized (request.getSession().getId().intern()) {
// do stuff
}
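If you want to see the difference for yourself, here is a tiny sketch (my own illustration, not from the original example) that you could drop into any servlet's doGet; it assumes an HttpServletRequest named request and simply compares what two getSession() calls hand back:
// Two calls to getSession() for the same logical session...
HttpSession first = request.getSession();
HttpSession second = request.getSession();

// ...are NOT guaranteed to return the same object in every container
// (some hand out a fresh facade per call), so locking on the session itself is unreliable:
System.out.println("same object? " + (first == second));

// The session ID identifies the same session either way, and interning it
// yields one shared String instance to lock on:
System.out.println("same id?     " + first.getId().equals(second.getId()));
System.out.println("same lock?   " + (first.getId().intern() == second.getId().intern()));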
Some interesting stats to close out with:
A Google Code Search turned up approximately 4,000 occurrences of different ways of writing 'synchronized(session)'.
The scary part is that this was with only the first five search patterns I came up with.
Labels:
application security,
ESAPI,
J2EE,
java,
Servlets,
thread safety
8.03.2009
Eric Schmidt, Google, and Apple
It appears that my long-lost relative, Eric Schmidt, has left the Apple BoD. What do I think about that? Simply that I really need to get in touch with Eric and see if he wants to loan his long-lost relative, me, a couple million. Other than that, I think this will probably resolve the whole Apple/Google thing for the most part, but I really don't think that Apple wants to get themselves into an Apple vs. Google scenario. We will see what happens, but I imagine this all going away and the news shifting back to the Microsoft-vs.-the-world scene shortly.
Labels:
Apple,
Eric Schmidt,
Google,
random thoughts
8.02.2009
What is ESAPI?
I have recently gotten involved in the OWASP ESAPI project. I am on the team of developers working on v2.0 of the API, which will include updating the API to take advantage of the features that Java 5 brought to the table, increasing the performance of the reference implementation, and improving thread safety throughout the entire codebase. It has thus far been a great experience, and there are some very smart people behind the project.
So what exactly is the OWASP ESAPI?
Well, let's start with, what exactly is OWASP?
OWASP is the Open Web Application Security Project. It is a non-profit organization made up of people from all over the world with the single goal of providing a central repository of information and tools for writing secure and reliable web applications.
The ESAPI is a small part of the overall goal of OWASP, but is a great example of what OWASP stands for and has set out to do.
ESAPI stands for Enterprise Security API, and it is just that: an API. There is a reference implementation included in the distribution that can be dropped into a new or existing application and configured, but the real power of the ESAPI is that it defines a standard interface for providing secure implementations of standard API methods that are not secure.
That is a pretty broad statement, but it is probably the best way to explain it. See, the ESAPI is not an application by itself; it is not even really a framework. It is a toolkit. It provides you with an API that is self-documenting and gives developers a central set of methods to access information, log data, authenticate users, and much more.
The ESAPI is distributed for Java and .NET, and there are more implementations in the works for PHP, Python, and others, I am sure.
So let's have a quick overview of what the ESAPI provides to developers:
1. Authentication - Provides a good reference implementation and a well-documented authentication mechanism that can be used on top of the standard J2EE security model (the standard user/role mechanism).
2. Logging - Provides a central entry point for logging in your application. The Java API uses either standard Java logging or Log4J by default, but you can provide your own logging by implementing the Logger interface.
3. Validation - Provides a powerful set of input validation classes that not only validate but also filter user input, taking the responsibility for input filtering out of the hands of your application developers (see the sketch just after this list for a taste of validation and encoding together).
4. Encoding/Decoding - A full toolset of encoders and decoders, including UTF-8, Base64, and much more.
5. Web Application Firewall - WAFs are easily one of the most argued-about topics in the realm of AppSec, but there are several of them out there, and the ESAPI makes it easy to implement your own WAF where it makes the most sense to me: at the application layer. The WAF works on the same principles as most, where a set of rules and reactions are defined, but keeping it in the application layer allows your enterprise security architects, or even your regular old developers, to create complex WAF rules based on logic driven by the state of the application itself. This is a very powerful tool for large web applications.
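To give a taste of what the validation and encoding pieces look like in the Java API, here is a minimal sketch. ESAPI.validator(), ESAPI.encoder(), getValidInput, and encodeForHTML come from the reference implementation; the GreetingBuilder class, the parameter values, and the choice of the default "SafeString" rule are just assumptions made for illustration.
import org.owasp.esapi.ESAPI;
import org.owasp.esapi.errors.ValidationException;

public class GreetingBuilder {

    // Validate a user-supplied name against the "SafeString" rule from
    // ESAPI.properties, then HTML-encode it before it is echoed back.
    public String buildGreeting(String rawName) {
        try {
            String name = ESAPI.validator().getValidInput(
                    "greeting name",   // context string used in log/error messages
                    rawName,           // the untrusted input
                    "SafeString",      // validation rule defined in ESAPI.properties
                    50,                // maximum length
                    false);            // null not allowed
            return "Hello, " + ESAPI.encoder().encodeForHTML(name) + "!";
        } catch (ValidationException e) {
            // Reject anything that does not match the rule.
            return "Hello, guest!";
        }
    }
}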
These are the 5 "main" parts of the ESAPI. Now let's get to the REAL power of the ESAPI.
In a normal web application, your security constraints and controls are defined across your entire codebase, wherever they are used. This creates a couple of problems. The larger your application becomes, the more difficult this becomes to maintain. Developers will start coding their own solutions to security concerns instead of using the one that is used everywhere else, simply because they may not know that the problem they are trying to solve has already been solved. So now you have two different ways to solve the same problem. Sound like a maintenance nightmare waiting to happen?
The biggest feature of the ESAPI, in my mind, is that it allows your developers to focus on writing the code they are good at. Not everyone is a security expert, but your developers are probably really good at their jobs; that is why you hired them. Your security expert (whether it's the guy who used to hack websites for fun or a genuine enterprise security architect) can define the rules and requirements of your application's security, implement them once, and your developers will know that to authenticate a user they just use:
ESAPI.authenticator().login(request, response);
Sounds pretty easy right? It is!
I strongly recommend that anyone starting a new application look into the ESAPI for their application. There is a ton of information in general on Web Application Security on the OWASP site.
ESAPI Links:
ESAPI Homepage
ESAPI on Google Code
ESAPI .Net
8.01.2009
Lucene - Lessons Learned
Over the last three years I have sort of taken on the role of Lucene expert at work. Any enhancements that require search components either come to me directly, or the developer on the project is told to run their ideas by me or, at a bare minimum, chat with me about the project. This has proven to be a very valuable role in my career, and it has given me the opportunity to lead my team on other experimental projects and concepts with things like JMX, JMS, security, etc.
As great a product as Apache Lucene is, it simply amazes me how a product that has been around for so long, and that is used by so many people around the world, has so little documentation. Googling Lucene issues will often answer your questions, but rarely do I find the answer to a question I have on any of the sites directly associated with Lucene.
That being said, most of what I know about Lucene has been learned by trial and error and by reading the source. Last week I was tasked with increasing the relevancy of our search results on some of the search components I had developed. I was going to experiment with boosting the score for matches in particular fields, and with tuning fuzzy search to give results as accurate as could be obtained without completely rewriting the data backing the search.
Enter the Lucene QueryParser: a mysterious and, from what I can gather, not very well understood but extremely powerful tool in the Lucene framework. The QueryParser takes a 'Google-style' text query and turns it into a Query object that can be used to search your index. For example, the following string:
name:Chris and name:Schmidt
Would be turned into a BooleanQuery containing two TermQuery clauses. There are modifiers that can be added to queries to alter the way Lucene builds the query objects, and this adds a great amount of flexibility to simple searches.
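As a concrete illustration (mine, not from the post), this is roughly how you would feed such a string through the QueryParser. The two-argument QueryParser constructor and the no-argument StandardAnalyzer constructor assume the 2.x-era API that was current at the time (newer versions take a Version argument and the parser moved to its own module), and the uppercase AND is the query syntax's boolean operator:
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class QueryParserDemo {
    public static void main(String[] args) throws ParseException {
        // "name" is the default field for terms without an explicit field prefix.
        QueryParser parser = new QueryParser("name", new StandardAnalyzer());

        Query q = parser.parse("name:Chris AND name:Schmidt");

        // Prints the rewritten query, e.g. "+name:chris +name:schmidt".
        System.out.println(q.toString());
    }
}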
The first one I will talk about is the boost modifier (^). This allows you to specify that, for matches found in a particular field, the relevancy should be boosted by a factor of X (or some derivative thereof, since rarely have I seen the boost I specify applied verbatim to the score). To expand on the above example:
name:Chris^1.5 name:Schmidt
This is interpreted as the same query as above, except that if 10 Schmidts are found, Chris is the most relevant Schmidt I am searching for, so if there is a Chris Schmidt result, it should be moved up in relevancy by a factor of 1.5. This can be a pretty handy tool, but it is extremely easy to overdo the boosting, which can completely destroy the relevancy of your results. A good rule of thumb is to start by boosting the field you think will be most relevant for the context of the search being performed, and to boost it in small increments only. A boost of 1.5 may not seem like much until you see how it actually affects your results.
Another good rule of thumb with boosts is to apply them to things that will be exact keyword matches; applying a boost to a fuzzy search will greatly reduce the relevancy of the results you are returning.
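For comparison, the same boost can be applied when building the query programmatically instead of through the QueryParser. A minimal sketch against the 2.x/3.x-era API, reusing the field and terms from the example above:
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class BoostDemo {
    public static void main(String[] args) {
        // Programmatic equivalent of the parsed query "name:chris^1.5 name:schmidt".
        TermQuery chris = new TermQuery(new Term("name", "chris"));
        chris.setBoost(1.5f); // boost matches on the first name

        TermQuery schmidt = new TermQuery(new Term("name", "schmidt"));

        BooleanQuery query = new BooleanQuery();
        query.add(chris, BooleanClause.Occur.SHOULD);
        query.add(schmidt, BooleanClause.Occur.SHOULD);

        System.out.println(query.toString()); // name:chris^1.5 name:schmidt
    }
}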
Now let's move on to the next modifier: the fuzzy search modifier (~). This is another one that, if used incorrectly, can greatly reduce the relevancy of the results a query returns, and a side effect of fuzzy searches is that they return far more results than a standard keyword search will. The fuzzy search uses the Levenshtein edit distance algorithm to work out what a user actually meant to search for when they fat-finger or misspell a search term.
If you are unfamiliar with Levenshtein edit distance, it is basically a way to calculate the number of edits that would need to be applied to one word to transform it into another word. It is a very popular algorithm used by spell checkers and similar applications. An example:
C H R I S
C O O L
The edit distance between the 2 words presented above is 4.
To transform Chris into Cool the following edits would have to be made:
1. Change H -> O
2. Change R -> O
3. Change I -> L
4. Drop the S
Lucene uses this algorithm to calculate word similarity. Although the implementation in Lucene is more involved ( FuzzyTermEnum - Line 168 ), the basics are that Lucene calculates the edit distance between the two words, divides it by the length of the shorter term, and subtracts that ratio from one to get the similarity between the two words.
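Here is a small sketch of that calculation applied to the CHRIS/COOL example above; this is my own illustration of the formula as just described, not Lucene's actual implementation:
public class SimilarityDemo {

    // Plain dynamic-programming Levenshtein edit distance.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        String a = "chris", b = "cool";
        int distance = editDistance(a, b); // 4 edits, as walked through above
        float similarity = 1.0f - ((float) distance / Math.min(a.length(), b.length()));
        System.out.println(distance + " edits, similarity " + similarity); // 4 edits, similarity 0.0
    }
}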
By default, a fuzzy search uses a similarity threshold of 0.5, but that has always seemed pretty aggressive to me for most fuzzy searches, as it basically means that half of the letters in the term can be different and it will still be considered a match.
I have generally gone to starting with a baseline of 0.85 and incrementing or decrementing by 0.05 until I reach the sweet spot where I am catching common misspellings of the terms I am tuning for but not overdoing it. A good example of where overdoing a fuzzy search can be detrimental: at ServiceMagic, where I work, we index the names of home improvement tasks. There are two examples I can think of off the top of my head that have bitten us with fuzzy searching:
SIDING matched SHINGLE and SLIDE
PLUMBER matched PLAYER (DVD)
As you can tell, the tasks that were matched with fuzzy searches have no contextual attachment to each other. Someone who is looking to get new siding on their house is probably not looking for someone to repair the roof or build a playground in the backyard. Along the same lines, someone who has a clogged drain is more than likely not looking for someone to help them install and configure their Blu-ray DVD player.
Both of these modifiers are extremely powerful ways to make your search results great, but they both have drawbacks, and when used incorrectly they can ruin your results. There is another gotcha with fuzzy searches that I want to cover quickly. I will probably go into more depth on this subject in a subsequent blog post, but it bit me hard last week and I think it is worthwhile to share.
There are a great many other things that can be done, both during indexing and searching, to make your results better, and one of them is the PorterStemFilter, which takes the words passed into it and transforms them into what it believes to be their 'root' word. For example, for the terms Writing or Writes, the root word returned after filtering would be Write. This happens in the Analyzer stage of both indexing and query parsing, and as such, the following is important to remember when using fuzzy searches and stemming together.
If you pass a query like writing~0.85 to the QueryParser, you would probably assume that the parsed query would look like write~0.85; however, the PorterStemFilter will not stem words that carry the fuzzy search modifier. This matters when you stem during the indexing phase and do fuzzy searches in the searching phase: the keyword writing will not actually be indexed anywhere, and the by-product is that you may not match any of the documents you would expect this query to match.
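For context, this is roughly what a stemming analyzer looks like with the 2.x-era Analyzer API; the choice of LowerCaseTokenizer is just an assumption for the sketch:
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;

// An analyzer that lowercases, tokenizes on letters, and stems each token,
// so "writing" and "writes" are both indexed as "write".
public class StemmingAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new PorterStemFilter(new LowerCaseTokenizer(reader));
    }
}
If an analyzer like this is used at index time, the index only ever contains the stemmed forms, which is exactly the situation described above.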
If you are using Stemming and Fuzzy queries, the answer I have found is to generate queries that look similar to this:
keywords:writing~0.85 keywords:writing^1.10 (match terms in the keywords field with 0.85 similarity or better, without stemming, and apply a 1.10 boost to matches of the stemmed keyword)
It may seem redundant, but when using stemming the final parsed query will actually be:
keywords:writing~0.85 keywords:write^1.10
This will drastically improve the quality of results when using both of these things to tune your results and indexes.
I would be happy to answer any questions that anyone has about this so feel free to comment or ask away.
Labels:
java,
Lucene,
Tips and Tricks