Monday, January 21, 2008

How to search for the word "pen1s" in 185 emails every second

SubTitle: How Mailinator searches every incoming email for a few hundred strings efficiently (in Java).

In a humble moment of geekiness, I must admit I love tinkering with algorithms and data structures. I'd also like to point out that this article is not some attempt to educate you as to the best algorithm for any purpose; it is simply a documentation of an engineering thought process (through all its wrongs and rights) in coding up one tailored algorithm. If you'd like a far more comprehensive treatise on string searching, this book is quite good, as is this paper and this one.

A simple pen1s problem


Mailinator gets a fair bit of email - it varies wildly but 185 per second is a pretty common rate. It needs to filter some of it, flag some of it as evil, or simply perform some categorization on it. A lot of it is junk and it clogs the system. Like I've said over and over - Mailinator is not against receiving spam. If you want to get your email about porn or viagra or the latest stock tips at Mailinator, please feel free. But if anyone sends a billion emails about mortgage refinancing or weener enlargement to my server, the server might get smushed, so it's probably going to try to thwart them. It's not about the content (although unlawful or hate stuff can stay away, thanks) - it's about handling the load.

So one way to do this is to scan every email that comes in for some keywords to determine whether it's an email we want to flag (i.e. maybe reject, maybe quarantine, whatever) in some way. This simple method can be surprisingly fruitful and it's the idea we'll consider here.

As a real example, let's say we wanted to watch for any emails that come in containing the word "pen1s". Pardon the word choice, but it is seriously a glaring use-case of mine - in fact, that's well into the tame side of what I see. It's also a pretty good word because spammers often use it to avoid word filters that try to filter out the real word "penis", but it backfires in a sense - it identifies their email as spam. And they use it a LOT.

Searching for a given word in a constant stream of incoming data is sort of the opposite problem that search engines have. They know the search data at all times (i.e. all web pages) - but they don't know what search term you're going to walk up and ask for. Mailinator in contrast knows the search terms at all times (i.e. "pen1s", "v14gr4", "d0nkeyL0ve" (no, I'm not sure what that means either)) but gets new search data continuously.

So if you're a Java programmer, you can simply think that for any incoming email you do:


boolean containsPen1s = (emailBodyText.indexOf("pen1s") > -1);


Right on. This should work. And if you try to guess the obvious way the indexOf method works, you're probably right. It has a for loop that scans through emailBodyText looking for a "p". If it finds a "p", it then checks the next letter to be an "e", and so on. (If you're an algorithmy type person and wondering if I mis-spoke and if indexOf actually uses some smarter algorithm like Boyer-Moore or Knuth-Morris-Pratt, I didn't and it doesn't).
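
To make that concrete, here's a rough sketch of the kind of brute-force scan indexOf does. This is just an illustration of the idea, not the actual JDK source:


// Naive scan: try every starting position, bail out on the first mismatch.
static int naiveIndexOf(String text, String term) {
    outer:
    for (int i = 0; i <= text.length() - term.length(); i++) {
        for (int j = 0; j < term.length(); j++) {
            if (text.charAt(i + j) != term.charAt(j)) {
                continue outer;   // false start - slide ahead one position and retry
            }
        }
        return i;                 // every character of the term matched
    }
    return -1;                    // no match anywhere
}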

So, let's step back and do some quick math on how much work this is. As we said, we have 185 emails per second. From very strong empirical evidence I can tell you that emails in the system average about 2k each (some much bigger, some smaller, but that's an average).

So.. if we have 2048 bytes per email and 185 emails/sec .. that means we have 378,880 bytes per second.

Now the normal case, we only find the word "pen1s" rarely. By far most emails don't contain it. A near-worst case might be if all our emails contained "pen1pen1pen1pen1.." but that would probably be even more rare.

If we had no false starts at all (i.e. no letter 'p' anywhere) we would have to do 378,880 comparisons per second. That is, we must compare every byte in every incoming email we get to the letter 'p'. In truth, we'll do more (false 'p' matches). In fact the worst case is (the number of bytes of email) * (the number of bytes of the search term). Concretely in our case that's 378,880 * 5. In my use case however, that multiplier is quite unlikely, although a factor of 1.5 or so might not be. Regardless, for simplicity let's use 378,880 with the understanding that in practice, it will be some small constant factor higher than that.

Note that you might well be asking right now why I'm not using regular expressions. Good question! I tried that. A state machine should do this comparison in O(n) every time - false starts be damned. But Java regular expressions sure do not. I found them very much slower in every case than simple indexOf. This recent article agrees. Regular expression matching in Java might be convenient, but it sure isn't fast. A nice discussion of state machines, string matching, and a bit of talk about why Java's implementation is slow can be found here (very worth a read).
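
For reference, the regex version would look something like this sketch (the real word list is much longer, of course). Even with the Pattern compiled once up front, it was the slower option in my tests:


import java.util.regex.Pattern;

// Compile the alternation once and reuse it for every incoming email.
static final Pattern BAD_WORDS = Pattern.compile("pen1s|v14gr4|d0nkeyL0ve");

static boolean containsBadWord(String emailBodyText) {
    return BAD_WORDS.matcher(emailBodyText).find();
}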

Great.. So 378,880 comparisons per second is annoying but not too bad. As best as I can measure it, this takes my Athlon 4200 nicely under 1 millisecond (leaving us at least 999 other milliseconds for other server work!).

Enlarging your word list


Unfortunately, that doesn't solve my whole problem - my actual needs grow every day. And really, I need to search for a lot more words than just "pen1s". In fact my current list is around 100 different words (and growing). So to use indexOf, I'd need to change my search statement to:


boolean containsBadWord = (emailBodyText.indexOf("pen1s") > -1) ||
                          (emailBodyText.indexOf("weenypie") > -1) ||
                          (emailBodyText.indexOf("captaincheeseburger") > -1) ||
                          ... and so on

Things are starting to get hairy. In the normal case where we find no matches, we do 378,880 comparisons per search term. If I have 100 search terms that's 37,888,000 total comparisons. My benchmark now takes about 60 milliseconds. This simple task is now taking 6% of every second. And as we add search terms, it keeps growing. What if I get to 500 search terms? Or 1000?
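
(In practice that chain of ||'s would more likely be written as a loop over the list of terms, but the cost is exactly the same - one full indexOf scan of the email per term. Something like:)


// One indexOf scan per term: the work grows linearly with the size of the word list.
static boolean containsBadWord(String emailBodyText, String[] badWords) {
    for (String word : badWords) {
        if (emailBodyText.indexOf(word) > -1) {
            return true;
        }
    }
    return false;
}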

One improvement we can do right away is to use an algorithm called Boyer-Moore (which has many flavors now - see the papers I referenced above). These guys had a simple but very important idea.

Why, in fact, do we search our string from the front? Why not instead, start searching from the back?

So let's say someone sends us an email with 1000 X's. No other letters. And we dutifully go ahead and search for "pen1s" (which we won't find of course).

The naive way (like indexOf) does 1000 comparisons, checking for 'p' at each position (which are all X's). But instead, let's say we start off by hopping ahead 5 positions and look for the 's' at the end of "pen1s". Of course we don't find it, but then instead of checking the next spot, we again skip ahead 5 and look for 's'. In fact, because we'll get zero matches, we can skip ahead 5 each time.


XXXXXXXXXXXXXXXXXXXXXXX
    ^    ^    ^  ....
    s?   s?   s?  ...

If we don't find an 's' at any of those positions, that immediately rules out the possibility that "pen1s" ends at any of them.

So instead of 1000 comparisons, we only do 200!

This is a constant speedup, but it can be quite dramatic. The algorithm has more complexity to it than I've illustrated (like what happens if you encounter an 'e' in one of the hops?) and it won't often run in its "best case" as I discussed above, but that's the gist. The worst case is actually still just as bad as our original search, but again, that's not likely either.
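
To show the hopping idea in code, here's a minimal sketch in the style of Boyer-Moore-Horspool (a simplified cousin of full Boyer-Moore). It assumes the search terms are plain ASCII; a real implementation would handle the edge cases more carefully:


// Compare from the back of the term and hop ahead based on what we saw.
static int horspoolIndexOf(String text, String term) {
    int n = text.length(), m = term.length();
    if (m == 0 || m > n) return (m == 0) ? 0 : -1;

    // skip[c] = how far we may safely jump when the character under the
    // end of the window is c (default: the whole term length).
    int[] skip = new int[256];
    java.util.Arrays.fill(skip, m);
    for (int i = 0; i < m - 1; i++) {
        skip[term.charAt(i)] = m - 1 - i;
    }

    int pos = 0;
    while (pos <= n - m) {
        int j = m - 1;
        while (j >= 0 && text.charAt(pos + j) == term.charAt(j)) {
            j--;                              // match from the back, like the 's' in "pen1s"
        }
        if (j < 0) return pos;                // the whole term matched
        char last = text.charAt(pos + m - 1);
        pos += (last < 256) ? skip[last] : m; // hop ahead
    }
    return -1;
}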

So in the supremely wonderful case for this example, Boyer-Moore-esque searches cut our work down by a factor of 5, making it 75,776 comparisons for each term or 7,577,600 for the whole 100 (i.e. 12 milliseconds in my benchmark). The scaling problem is just as bad however - every time we add a word, we still add 75,776 more comparisons.

Most people simply assume that an O(n^2) algorithm is "slower" than an O(n) algorithm. That's usually true in practice but it doesn't have to be. Big-Oh notation is really an illustration of growth, not speed. If you increment n by one, the O(n^2) algorithm will have to do on the order of n more operations to complete; the O(n) algorithm will only have to do 1 more.

Boyer-Moore in the best case is O(n/m) where n is the number of bytes being searched and m is the length of the search term (which in our example was 5). We know from algorithms class that O(n/5) is effectively the same as O(n) - because they both grow linearly. Yes, the O(n/5) will run five times faster than the O(n) algorithm - but as n grows, they both grow relatively the same. So if we can handle 100 search terms with indexOf at O(n) then we can probably handle 500 using Boyer-Moore. But we've really just delayed the problem. Further increases will still grow our algorithm out of reach.

Improvement here is based on 2 things:
1) How important the speed here is to your application and/or how big your library of search terms becomes.
or,
2) How much of a kick you get out of improving, coding, and tinkering with algorithms just for fun even though it really might not be worth it when you probably should be out picking up girls (or guys), skiing, playing Battlefield2 or something.

In truth, our original indexOf is good enough for many applications. And Boyer-Moore handles even more. But I, unfortunately, often subscribe to rule #2.. so on we trudge...

The lack of scalability starts to bite us. The number of comparisons grows by 378,880 each time we add a new search term for indexOf and by 75,776 for Boyer-Moore.

Trie this


Ok.. how about another idea? What if we organize our data better? We know all our search terms up front, so what if we could build a structure that is much easier to search?

One such structure is called a Trie (pronounced like "try" even though it supposedly comes from the center of the word "reTRIEval"). So again, let's say our search terms are:

"pen1s", "darn", "golly", and "dart"

We construct a trie as a graph of connected nodes where each node (or edge, really) represents a letter, like so:


      root-node
       / | \
      d  g  p
     /   |   \
    a    o    e
   /     |     \
  r      l      n
 / \     |       \
n   t    l        1
         |         \
         y          s


So each time we need to search, we start at the root node and see how far down we can go. If we encounter "darp", we'll track down 'd', then 'a', then 'r', but find no 'p' node. Hence we know we do not have a match starting at that position, so we move to the next position in the email and try again from the root.

How many comparisons must we do? 378,880. The running time now is back to our single-term indexOf method and is at O(n*m) (where m is now the length of the longest search term). But notice, it isn't dependent upon how many search terms we have.

Think about that - that's crazy - in effect, you can search for a (theoretically) infinite number of search terms simultaneously, all for the cost of ONE indexOf! (This is, of course, the best case scenario.)

In other words, we're back to doing 378,880 comparisons but now we can search all 100 search terms with just those 378,880 comparisons. And you want to add 100 more? Sure, still 378,880. 1000 more? no problem. Still 378,880.

The only thing that gets hairy with tries is storage as those nodes can be rather bulky (as you'll see if you try to implement one).
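
For the curious, here's a minimal sketch of such a trie in Java. It is not the Mailinator implementation, just an illustration - and note how each node carries its own map of children, which is exactly where the bulk comes from:


import java.util.HashMap;
import java.util.Map;

// Build once from the search terms, then scan each email position by position.
class Trie {
    private static class Node {
        Map<Character, Node> children = new HashMap<Character, Node>();
        boolean terminal;   // true if a search term ends at this node
    }

    private final Node root = new Node();

    void add(String term) {
        Node node = root;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            Node child = node.children.get(c);
            if (child == null) {
                child = new Node();
                node.children.put(c, child);
            }
            node = child;
        }
        node.terminal = true;
    }

    // True if any added term occurs anywhere in text - one pass covers all terms.
    boolean containsAny(String text) {
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int i = start; i < text.length(); i++) {
                node = node.children.get(text.charAt(i));
                if (node == null) break;        // false start - try the next position
                if (node.terminal) return true; // hit one of the terms
            }
        }
        return false;
    }
}


Adding a word is just trie.add("pen1s"), and the per-email check is one call to trie.containsAny(emailBodyText), no matter how many words are loaded.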

One thing you might notice is that although we gained the ability to search for any number of terms at once, we lost the Boyer-Moored-ness. We must search every byte. That was only a constant speedup, but still it was rather nice.

The algorithm used in Mailinator actually tries to put the Boyer-Moored-ness back (with interesting results). In other words, it tries to search any number of terms while only doing O(n/m) comparisons in the best case (where n is the number of bytes to be searched and m is the length of the shortest search term).

This post is plenty long enough. To be continued....



(Thanks to Frank Yellin for reviewing this post.)

15 comments:

HashiDiKo said...

Great post but it's got me wondering if you've implemented your trie using linked nodes or if you found a more [memory] efficient method.

Anonymous said...

please enable full feeds...

Or I will have to use my pipes power!

Paul Tyma said...

Sorry about that...

enabled.

SeanB said...

I enjoyed that, thanks!

Egwor said...

hashdiko - what do you mean about a more memory efficient method?

heartsjava said...

Very cool. I was actually digging around in the String.java source today looking at indexOf, and wondering to myself if there was a better way of searching for a String in every web page we serve out. Now I have a few options to look at.

Matt Nordhoff said...

I'm not much of an algorithms person, but it might be worth mentioning the developer of Adblock Plus's blog post on the subject. ABP has to see if the user's filters match the dozens or hundreds of image/script/whatever URLs in the page. It's written in JavaScript though.

Investigating filter matching algorithms

HashiDiKo said...

@egwor
Linked node (i.e. LinkedList) based data structures consume more memory than array based data structures because of the overhead associated w/ the node itself. In the case of Java, it's at least the cost of an Object.

In an unadorned trie the maximum size of the data being stored in each node is either 8bits, 16bits, or 32bits, depending on if you decided to store each letter as a byte, char, or int. For example, if you only care about ASCII, you can store each letter as a byte and save some RAM. The efficiency issue comes from the fact that for every letter (8, 16, or 32 bits) you need at least 64bits to store it. So that's 8, 4, and 2 times, respectively, as large as the data being stored. In the case where a node is a parent node the overhead is at least 128bits. 64bits for the parent and another 64 for the child. Generally, array based data structures don't suffer from these kinds of overhead issues, though they can waste space (i.e. Hashtable).

Now I have never actually tried implementing a trie using an array. I've just always thought it might be possible because of the way heap based data structures work. So it's partly an intellectual curiosity and partly a practical concern. The practical concern comes from the understanding that compact data structures are generally more [CPU] cache friendly than less compact data structures. And as I'm sure everybody knows, cache utilization has real performance consequences.

Anonymous said...

Quick hypothetical:

Let's say you are trying to filter DARP, DART, GOLLY, LYPOSUCTION, PEN1S, and OVARIES

And you get an e-mail that contains the string:

GOVARIES
or GOLYPOSUCTION
or GOLYPOSUCTIOVARIES

Of course, this is gibberish, but what kind of havoc would this wreak? In the second case, it would start on the branch for GOLLY, and fail on the second L... but then does it have to go ahead and start a branch for the O to make sure ovaries doesn't appear? Wouldn't this actually cause MORE comparisons? You'd have to navigate the tree for every single character.

Am I missing something? Seriously interested.

Other than that question, great post.

Anonymous said...

Paul, you have described the Aho-Corasick algorithm -
http://en.wikipedia.org/wiki/Aho-Corasick_algorithm

Maybe some of the other algorithms at http://en.wikipedia.org/wiki/String_searching_algorithm could provide better performance.

Tom said...

Check out the ragel state machine compiler (use Google) to do this even quicker.

simoncpu said...

In Soviet Russia, pen1ses will algorithmically grope you!

Mikhail Koryak said...

Codes or it didn't happen!

Anonymous said...

first love the service, second starting to love the blog!

hmm.. so did you put the search terms in a trie or possibly two tries (rather than the message contents)? the first trie would be based on all search terms, but would contain at most (n) letters where (n) is the length of the shortest search term. the partial words in the trie would be placed in reverse order. (this would accommodate the boyer-moore reverse search through the message and w/o overhead of loading the message in a trie.) the second trie would have the words longer than (n), but would start at (either n or n+1 i haven't thought it through yet). the words in the second trie would be in "forward order". the second trie would only need to be searched if the first trie contained a full match (and was not the shortest word(s)).

-dave

der Graph said...

I've enjoyed reading this post, and I somewhat admire the cliffhanger. But I'm horribly curious, and I'm still waiting for the post to be continued.