In a humble moment of geekiness, I must admit I love tinkering with algorithms and data structures. I'd also like to point out that this article is not some attempt to educate you as to the best algorithm for any purpose, it is simply a documentation of an engineering thought process (through all its wrongs and rights) in coding up one, tailored algorithm. If you'd like a far more comprehensive treatise on string searching, this book is quite good as is this paper and this one.
A simple pen1s problem
Mailinator gets a fair bit of email - it varies wildly but 185 per second is a pretty common rate. It needs to filter some of it, flag some of it as evil, or simply perform some categorization on it. A lot of it is junk and it clogs the system. Like I've said over and over - Mailinator is not against receiving spam. If you want to get your email about porn or viagra or latest stock tips at Mailinator please feel free. But if anyone sends a billion emails about mortgage refinancing or weener enlargement to my server, it might get smushed and it is probably going to try to thwart them. It's not about the content (although unlawful or hate stuff can stay away thanks) - it's about handling the load.
So one way to do this is to scan every email that comes in for some keywords to determine it's an email we want to flag (i.e. maybe reject, maybe quarantine, whatever) in some way. This simple method can be surprisingly fruitful and it's the idea we'll consider here.
As a real example, let's say we wanted to watch for any emails that come in containing the word "pen1s". Pardon the word choice, but it is seriously a glaring use-case of mine - in fact, that's well into the tame-side of what I see. Its also a pretty good word because spammers often use it to avoid word filters that try to filter out the real word "penis" but it backfires in a sense - it identifies their email as spam. And they use it a LOT.
Searching for a given word in a constant stream of incoming data is sort of the opposite problem that search engines have. They know the search data at all times (i.e. all web pages) - but they don't know what search term you're going to walk up and ask for. Mailinator in contrast knows the search terms at all times (i.e. "pen1s", "v14gr4", "d0nkeyL0ve" (no, I'm not sure what that means either)) but gets new search data continuously.
So if you're a Java programmer, you can simply think that for any incoming email you do:
boolean containsPen1s = (emailBodyText.indexOf("pen1s") > -1);
Right on. This should work. And if you try to guess the obvious way the indexOf method works, you're probably right. It has a for loop that scans through emailBodyText looking for a "p". If it finds a "p", it then checks the next letter to be an "e", and so on. (If you're an algorithmy type person and wondering if I mis-spoke and if indexOf actually uses some smarter algorithm like Boyer-Moore or Knuth-Morris-Pratt, I didn't and it doesn't).
So, lets step back and do some quick math on how much work this is. As we said, we have 185 emails per second. From very strong empirical evidence I can tell you that emails in the system average about 2k each (some much bigger, some smaller, but that's an average).
So.. if we have 2048 bytes per email and 185 emails/sec .. that means we have 378,880 bytes per second.
Now the normal case, we only find the word "pen1s" rarely. By far most emails don't contain it. A near-worst case might be if all our emails contained "pen1pen1pen1pen1.." but that would probably be even more rare.
If we had no false starts at all (i.e. no letter 'p' anywhere) we would have to do 378,880 comparisons per second. That is, we must compare every byte in every incoming email we get to the letter 'p'. In truth, we'll do more (false 'p' matches). In fact the worst case is (the number of bytes of email) * (the number of bytes of the search term). Concretely in our case thats 378,880 * 5. In my use case however, that multiplier is quite unlikely, although a factor 1.5 or so might not be. Regardless, for simplicity lets use 378,880 with the understanding that in practice, it will be some small constant factor higher than that.
Note that you might be well asking right now why I'm not using regular expressions. Good question! I tried that. A state machine should do this comparison in O(n) every time - false starts be damned. But Java regular expressions sure do not. I found them very much slower in every case to simple indexOf. This recent article agrees. Regular expression matching in Java might be convenient, but it sure isn't fast. A nice discussion of state machines, string matching, and a bit of talk about why Java's implementation is slow can be found here (very worth a read).
Great.. So 378,880 comparisons per second is annoying but not too bad. As best as I can measure it, this takes my Athlon 4200 nicely under 1 millisecond (leaving us at least 999 other milliseconds to other server work!).
Enlarging your word list
Unfortunately, that doesn't solve my whole problem - my actual needs grow everyday. And really, I need to search for a lot more words than just "pen1s". In fact my current list is around 100 different words (and growing). So to use indexOf, I'd need to change my search statement to:
boolean containsBadWord = (emailBodyText.indexOf("pen1s") > -1) ||
(emailBodyText.indexOf("weenypie") > -1) ||
(emailBodyText.indexOf("captaincheeseburger") > -1) ||
....and so on
Things are starting to get hairy. On a normal case where we find no matches, we do 378,880 comparisons per search term. If I have a 100 search terms that's 37,888,000 total comparisons. My benchmark now takes about 60milliseconds. This simple task is now taking 6% of every second. And as we add search terms, it keeps growing. What if I get to 500 search terms? or 1000?
One improvement we can do right away is to use an algorithm called Boyer-Moore (which has many flavors now - see the papers I referenced above). These guys had a simple but very important idea.
Why, in fact, do we search our string from the front? Why not instead, start searching from the back?
So let's say someone sends us an email with 1000 X's. No other letters. And we dutifully go ahead and search for "pen1s" (which we won't find of course).
The naive way (like indexOf), we do 1000 comparisons checking "p" against each position (which are all X's). But instead, let's say we start off by hopping ahead 5 positions and instead look for the "s" at the end of "pen1s". Of course we don't find it, but then instead of checking the next spot, we again skip ahead 5 and look for "s". In fact, because we'll get zero matches, we can skip ahead 5 each time.
^ ^ ^ ....
s? s? s? ...
if we don't find the 's' in any of those positions, that immediately rules out the possibility that "pen1s" precedes it.
So instead of 1000 comparisons - we only do 200 !
This is a constant speedup, but it can be quite dramatic. The algorithm has more complexity to it than I've illustrated (like what happens if you encounter an 'e' in one of the hops?) and it won't often run in its "best case" as I discussed above, but thats the gist. The worst case is actually still just as bad as our original search, but again, thats not the likely either.
So in the supremely wonderful case for this example, Boyer-Moore-esque searches cuts our work down by a factor of 5, making it 75,776 comparison for each term or 7,577,600 for the whole 100 (i.e. 12 milliseconds in my benchmark). The scaling problem is just as bad however - every time we add a word, we still add a 75,776 more comparisons.
Most people simply assume that an O(n^2) algorithm is "slower" than an O(n) algorithm. Thats usually true in practice but it doesn't have to be. Big-Oh notation is really an illustration of growth, not speed. If you increment n by one, the O(n^2) will have to do n more operations to complete, the O(n) algorithm will only have to do 1 more.
Boyer-Moore in the best case is O(n/m) where n is the number of bytes being searched and m is the length of the search term (which in our example was 5). We know from algorithms class that O(n/5) is effectively the same as O(n) - because they both grow linearly. Yes, the O(n/5) will run five times faster than the O(n) algorithm - but as n grows, they both grow relatively the same. So if we can handle 100 search terms with indexOf at O(n) then we can probably handle 500 using Boyer-Moore. But we've really just delayed the problem. Further increases will still grow our algorithm out of reach.
Improvement here is based on 2 things:
1) How important the speed here is to your application and/or how big your library of search terms become.
2) How much of a kick you get out of improving, coding, and tinkering with algorithms just for fun even though it really might not be worth it when you probably should be out picking up girls (or guys), skiing, playing Battlefield2 or something.
In truth, our original indexOf is good enough for many applications. And Boyer-moore handles even more. But I, unfortunately often subscribe to rule #2.. so on we trudge...
The lack of scalability starts to bite us. The number of comparisons grows by 378,880 each time we add a new search term for indexOf and by 75,776 for Boyer-Moore.
Ok.. how about another idea. How about if we organize our data better. We know all our search terms up front, what if we could build a structure that was much easier to search.
One such structure is called a Trie (pronounced like "try" even though it supposedly comes from the center of the word "reTRIEval"). So again, lets say our search terms are:
"pen1s", "darn", "golly", and "dart"
We construct a trie as a graph of connected nodes where each node (or edge really) represents each letter as so:
/ | \
d g p
/ | \
a o e
/ | \
r l n
/ \ | \
n t l 1
So each time we need to search we start at the root node and see if we can head down. If we encounter "darp", we'll track down 'd', then 'a', then 'r', but find no 'p' node. Hence we know we do not have a match.
How many comparisons must we do? 378,880. The running time now is back to our single-term indexOf method and is at O(n*m). But notice, it isn't dependent upon how many search terms we have.
Think about that - that's crazy - in effect, you can search for a (theoretical) infinite number of search terms simultaneously all for the cost of ONE indexOf ! (This is of course, the best case scenario).
In other words, we're back to doing 378,880 comparisons but now we can search all 100 search terms with just those 378,880 comparisons. And you want to add 100 more? Sure, still 378,880. 1000 more? no problem. Still 378,880.
The only thing that gets hairy with tries is storage as those nodes can be rather bulky (as you'll see if you try to implement one).
One thing you might notice is that although we gained the ability to search for any number of terms at once, we lost the Boyer-Moored-ness. We must search every byte. That was only a constant speedup, but still it was rather nice.
The algorithm used in Mailinator actually tries to put the Boyer-Moored-ness back (with interesting results). In other words, it tries to search any number of terms while only doing O(n/m) comparisons in the best case (where n is the number of bytes to be searched and m is the length of the shortest search term).
This post is plenty long enough. To be continued....
(Thanks to Frank Yellin for reviewing this post.)