Monday, January 28, 2008

I'm vandalizing Wikipedia again

The "Problems for Users" section of Mailinator on Wikipedia reads:

Domains from services like Mailinator are banned from many websites for the reasons stated above. Mailinator provides alternate domains which work around this ban in many cases.

Another problem for users is that mail sent to Mailinator often disappears for unknown reasons. Before using Mailinator a user should first send it some test e-mails to determine the system's reliability for themselves.

Mailinator strips many headers from incoming e-mail. This means that users wanting to verify the e-mail sender or server may be unable to do so. And according to their FAQ "Plain text is best, html is filtered. Images, attachments, and fancy stuff are simply stripped away." Unfortunately this often results in incomplete or garbled messages.

The "Domains from services like Mailinator..." paragraph is accurate. Some sites do ban it but so far no sites I know of have kept up with our alternate domains (see our home page) for a recent list. I'm still baffled why websites send-away customers but I guess that's their business. If I'm not sure I'm going to want or use some webservice, I test the waters with mailinator address. If they don't let me, I usually then just don't bother testing it. I'm sure other folks then resolve to give them their real address, or go waste 5 minutes signing up for a fake yahoo address, but for me its rarely worth it.

The next paragraph describes that emails "disappear for unknown reasons". Um, I thought I was pretty clear in the FAQ about how Mailinator handles email. Emails live for a few hours - not a day, not 1 hour - a few. It's completely dependant on how much email its getting at the time. (As a data point, I notice right now emails are sticking around 183 minutes).

Although I fully agree with the latter part, if you want to test mailinator out - please do !!! Send anything you like :)

Finally the last paragraph states that we remove some headers. This is nuts. The code has never removed any headers whatsoever. And on every page showing an email, there is a "text-view" link where you can see the email PRECISELY as mailinator received it. Headers, dirty words, mime-encoded images, you name it. In other words, that paragraph is dead wrong.

I haven't seen any garbled emails for a long time either (probably was when the parsing code was originally written). We support most encodings, (japanese and russian emails look cool), I suppose if you send an email with some odd encoding that Java doesn't support or mis-specify an encoding things might look garbled. Regardless - if you see a garbled email - please let me know!

I tried to replace the last paragraph with one explaining that the "text-view" sort of proves its not true, but my change was undone immediately (so fast I'd guess it was automated) because it was "Wikipedia Vandalism". I'm not too fond of the paragraph that says emails "disappear" given that its pretty misleading, but God forbid I try to fix that one.

If you're a common Wikipedia poster - please feel free to fix this entry. All I'm hoping to get up there is (something closer to) the truth.

Monday, January 21, 2008

How to search for the word "pen1s" in 185 emails every second

SubTitle: How Mailinator searches every incoming email for a few hundred strings efficiently (in Java).

In a humble moment of geekiness, I must admit I love tinkering with algorithms and data structures. I'd also like to point out that this article is not some attempt to educate you as to the best algorithm for any purpose, it is simply a documentation of an engineering thought process (through all its wrongs and rights) in coding up one, tailored algorithm. If you'd like a far more comprehensive treatise on string searching, this book is quite good as is this paper and this one.

A simple pen1s problem

Mailinator gets a fair bit of email - it varies wildly but 185 per second is a pretty common rate. It needs to filter some of it, flag some of it as evil, or simply perform some categorization on it. A lot of it is junk and it clogs the system. Like I've said over and over - Mailinator is not against receiving spam. If you want to get your email about porn or viagra or latest stock tips at Mailinator please feel free. But if anyone sends a billion emails about mortgage refinancing or weener enlargement to my server, it might get smushed and it is probably going to try to thwart them. It's not about the content (although unlawful or hate stuff can stay away thanks) - it's about handling the load.

So one way to do this is to scan every email that comes in for some keywords to determine it's an email we want to flag (i.e. maybe reject, maybe quarantine, whatever) in some way. This simple method can be surprisingly fruitful and it's the idea we'll consider here.

As a real example, let's say we wanted to watch for any emails that come in containing the word "pen1s". Pardon the word choice, but it is seriously a glaring use-case of mine - in fact, that's well into the tame-side of what I see. Its also a pretty good word because spammers often use it to avoid word filters that try to filter out the real word "penis" but it backfires in a sense - it identifies their email as spam. And they use it a LOT.

Searching for a given word in a constant stream of incoming data is sort of the opposite problem that search engines have. They know the search data at all times (i.e. all web pages) - but they don't know what search term you're going to walk up and ask for. Mailinator in contrast knows the search terms at all times (i.e. "pen1s", "v14gr4", "d0nkeyL0ve" (no, I'm not sure what that means either)) but gets new search data continuously.

So if you're a Java programmer, you can simply think that for any incoming email you do:

boolean containsPen1s = (emailBodyText.indexOf("pen1s") > -1);

Right on. This should work. And if you try to guess the obvious way the indexOf method works, you're probably right. It has a for loop that scans through emailBodyText looking for a "p". If it finds a "p", it then checks the next letter to be an "e", and so on. (If you're an algorithmy type person and wondering if I mis-spoke and if indexOf actually uses some smarter algorithm like Boyer-Moore or Knuth-Morris-Pratt, I didn't and it doesn't).

So, lets step back and do some quick math on how much work this is. As we said, we have 185 emails per second. From very strong empirical evidence I can tell you that emails in the system average about 2k each (some much bigger, some smaller, but that's an average).

So.. if we have 2048 bytes per email and 185 emails/sec .. that means we have 378,880 bytes per second.

Now the normal case, we only find the word "pen1s" rarely. By far most emails don't contain it. A near-worst case might be if all our emails contained "pen1pen1pen1pen1.." but that would probably be even more rare.

If we had no false starts at all (i.e. no letter 'p' anywhere) we would have to do 378,880 comparisons per second. That is, we must compare every byte in every incoming email we get to the letter 'p'. In truth, we'll do more (false 'p' matches). In fact the worst case is (the number of bytes of email) * (the number of bytes of the search term). Concretely in our case thats 378,880 * 5. In my use case however, that multiplier is quite unlikely, although a factor 1.5 or so might not be. Regardless, for simplicity lets use 378,880 with the understanding that in practice, it will be some small constant factor higher than that.

Note that you might be well asking right now why I'm not using regular expressions. Good question! I tried that. A state machine should do this comparison in O(n) every time - false starts be damned. But Java regular expressions sure do not. I found them very much slower in every case to simple indexOf. This recent article agrees. Regular expression matching in Java might be convenient, but it sure isn't fast. A nice discussion of state machines, string matching, and a bit of talk about why Java's implementation is slow can be found here (very worth a read).

Great.. So 378,880 comparisons per second is annoying but not too bad. As best as I can measure it, this takes my Athlon 4200 nicely under 1 millisecond (leaving us at least 999 other milliseconds to other server work!).

Enlarging your word list

Unfortunately, that doesn't solve my whole problem - my actual needs grow everyday. And really, I need to search for a lot more words than just "pen1s". In fact my current list is around 100 different words (and growing). So to use indexOf, I'd need to change my search statement to:

boolean containsBadWord = (emailBodyText.indexOf("pen1s") > -1) ||
(emailBodyText.indexOf("weenypie") > -1) ||
(emailBodyText.indexOf("captaincheeseburger") > -1) ||
....and so on

Things are starting to get hairy. On a normal case where we find no matches, we do 378,880 comparisons per search term. If I have a 100 search terms that's 37,888,000 total comparisons. My benchmark now takes about 60milliseconds. This simple task is now taking 6% of every second. And as we add search terms, it keeps growing. What if I get to 500 search terms? or 1000?

One improvement we can do right away is to use an algorithm called Boyer-Moore (which has many flavors now - see the papers I referenced above). These guys had a simple but very important idea.

Why, in fact, do we search our string from the front? Why not instead, start searching from the back?

So let's say someone sends us an email with 1000 X's. No other letters. And we dutifully go ahead and search for "pen1s" (which we won't find of course).

The naive way (like indexOf), we do 1000 comparisons checking "p" against each position (which are all X's). But instead, let's say we start off by hopping ahead 5 positions and instead look for the "s" at the end of "pen1s". Of course we don't find it, but then instead of checking the next spot, we again skip ahead 5 and look for "s". In fact, because we'll get zero matches, we can skip ahead 5 each time.

    ^    ^    ^  ....
    s?   s?   s?  ...

if we don't find the 's' in any of those positions, that immediately rules out the possibility that "pen1s" precedes it.

So instead of 1000 comparisons - we only do 200 !

This is a constant speedup, but it can be quite dramatic. The algorithm has more complexity to it than I've illustrated (like what happens if you encounter an 'e' in one of the hops?) and it won't often run in its "best case" as I discussed above, but thats the gist. The worst case is actually still just as bad as our original search, but again, thats not the likely either.

So in the supremely wonderful case for this example, Boyer-Moore-esque searches cuts our work down by a factor of 5, making it 75,776 comparison for each term or 7,577,600 for the whole 100 (i.e. 12 milliseconds in my benchmark). The scaling problem is just as bad however - every time we add a word, we still add a 75,776 more comparisons.

Most people simply assume that an O(n^2) algorithm is "slower" than an O(n) algorithm. Thats usually true in practice but it doesn't have to be. Big-Oh notation is really an illustration of growth, not speed. If you increment n by one, the O(n^2) will have to do n more operations to complete, the O(n) algorithm will only have to do 1 more.

Boyer-Moore in the best case is O(n/m) where n is the number of bytes being searched and m is the length of the search term (which in our example was 5). We know from algorithms class that O(n/5) is effectively the same as O(n) - because they both grow linearly. Yes, the O(n/5) will run five times faster than the O(n) algorithm - but as n grows, they both grow relatively the same. So if we can handle 100 search terms with indexOf at O(n) then we can probably handle 500 using Boyer-Moore. But we've really just delayed the problem. Further increases will still grow our algorithm out of reach.

Improvement here is based on 2 things:
1) How important the speed here is to your application and/or how big your library of search terms become.
2) How much of a kick you get out of improving, coding, and tinkering with algorithms just for fun even though it really might not be worth it when you probably should be out picking up girls (or guys), skiing, playing Battlefield2 or something.

In truth, our original indexOf is good enough for many applications. And Boyer-moore handles even more. But I, unfortunately often subscribe to rule #2.. so on we trudge...

The lack of scalability starts to bite us. The number of comparisons grows by 378,880 each time we add a new search term for indexOf and by 75,776 for Boyer-Moore.

Trie this

Ok.. how about another idea. How about if we organize our data better. We know all our search terms up front, what if we could build a structure that was much easier to search.

One such structure is called a Trie (pronounced like "try" even though it supposedly comes from the center of the word "reTRIEval"). So again, lets say our search terms are:

"pen1s", "darn", "golly", and "dart"

We construct a trie as a graph of connected nodes where each node (or edge really) represents each letter as so:

       / | \
      d  g  p
     /   |   \
    a    o    e
   /     |     \
  r      l      n
 / \     |       \
n   t    l        1
         |         \
         y          s

So each time we need to search we start at the root node and see if we can head down. If we encounter "darp", we'll track down 'd', then 'a', then 'r', but find no 'p' node. Hence we know we do not have a match.

How many comparisons must we do? 378,880. The running time now is back to our single-term indexOf method and is at O(n*m). But notice, it isn't dependent upon how many search terms we have.

Think about that - that's crazy - in effect, you can search for a (theoretical) infinite number of search terms simultaneously all for the cost of ONE indexOf ! (This is of course, the best case scenario).

In other words, we're back to doing 378,880 comparisons but now we can search all 100 search terms with just those 378,880 comparisons. And you want to add 100 more? Sure, still 378,880. 1000 more? no problem. Still 378,880.

The only thing that gets hairy with tries is storage as those nodes can be rather bulky (as you'll see if you try to implement one).

One thing you might notice is that although we gained the ability to search for any number of terms at once, we lost the Boyer-Moored-ness. We must search every byte. That was only a constant speedup, but still it was rather nice.

The algorithm used in Mailinator actually tries to put the Boyer-Moored-ness back (with interesting results). In other words, it tries to search any number of terms while only doing O(n/m) comparisons in the best case (where n is the number of bytes to be searched and m is the length of the shortest search term).

This post is plenty long enough. To be continued....

(Thanks to Frank Yellin for reviewing this post.)

Your own private Mailinator

As you probably know, Mailinator has a always-growing list of alternate domains. However, those are just the domains sponsored by us.

In truth - Mailinator accepts ANY email you fling at it*. In other words - you can setup your OWN personal domains simply by setting your MX record (i.e. through your registrar or DNS server) to point your mail delivery to Mailinator!

Simply put - if you own, you can simply change that domain's MX record to point to and all email you send to will then end up in Mailnator.

You can publicize this fact - or you can keep your personal alternate domain a secret and have your own private Mailinator !

*pursuant to normal draconian Mailinator email rejection criteria including but not limited to hate email, unlawful email, etc.

Saturday, January 12, 2008

New Engine, New Features

A few months back I finally hit a wall enough with trying to add features to my original design of Mailinator. As with any piece of code - it had aged. I don't immediately impune such code as being bad or poorly designed, its just that code needs TLC every now and then -and- developers learn more and more about their domain as time goes on.

Anyway, a few months back I set out to do a full engine rewrite. Partially to redesign things to support some future features I want to do, but also to streamline the whole thing and fix a pile of bugs that had been floating around

One thing the old engine relied on a lot was Java Strings. And that would all be fine and dandy except that at the core, most servers are really about bytes, not Strings. Front ends are about strings, browsers are about Strings, but at the server, strings are just programmer abstractions for thinking about the data. The data itself is just bytes.

So the new engine is bytes.

I also did a lot of testing with how Java handles its networking. And how the 80-bazillion smtp servers out there send their data (and let me tell you, the RFCs which spell out the rules are most definitely just "guidelines" - those rules are broken all the time). I got pretty nitty-gritty and as I am wont to do when I'm writing code for fun, I wrote a ton of stuff from scratch.

Anyway - the good news is that the server can, according to my tests, handle about 1.5times the load it could before. I'm not really sure what the true abilities are as its hard to loadtest without a "real" few thousand machine hitting the server.

We're still running on one server (altho its beefed up a bit only because ServerBeach had some hard-to-pass-up deals) and I'm guessing it will scale easily to 30million emails a day. The max I've seen is 16million in one day and that was the old server code so I'm pretty conservative on the 30million (I expect the network to bottleneck before the server actually).

Hopefully, users won't really notice any of what I've mentioned so far - that is, hopefully mailinator will just work as great as before.

Things you will notice is that we have better language support - its seriously cool seeing all the Japanese and Russian spam fly by. It also supports encodings such as quoted-printable and base64 much better.

And finally.. I added a quickie feature that was suggested to me literally years ago. You can now access mailboxes right from your browser. Instead of going to and typing in your mailbox name into the form, just type:

And it will web-a-port you right to that mailbox. Mailinator is about removing inertia from getting email - and this feature is just another feather in its cap. : Anatomy of a Spammy Campaign

Mailinator is a popular disposable email service. It's also become a great tool for QA Teams to test email receipt, acknowledgment, au...