Monday, December 8, 2008

Dear World, email addresses are not identity

Its no secret that part of putting up any website or service is the consideration of security measures to stop people from abusing the system. And as you can imagine, those particular issues are probably doubled or tripled with a site like Mailinator (and of course Talkinator). Needless to say, I've gotten pretty good at unorthodox security.

Anonymity does indeed breed bravery.

The normal Mailinator use case is that you have a need for a quick email address to sign-up for some web service. Some people however, use Mailinator as an actual primary email repository. With the RSS feed, it acts as a convenient dropbox for newsletters or other semi-private email needs.

Unfortunately, there are some wacky funsters out there that think it might be fun to sign-up for some website over and over and over and over (and over). Often this is for some site that has "vote for your favorite band" or "sign up for a free gift" or some such thing.

The primary flaws of these systems is not Mailinator - its that these websites equate the idea of identity with email addresses. Seriously, long before Mailinator existed I had 3 or 4 email accounts that I actively used and another 10 I had probably abandoned. I'm not sure what those site designers were thinking - Mailinator or not, email addresses are FREE.

If you give someone something that has any value at all in exchange for free email addresses, they're going to ask for lots. I'm probably in a unique position to view this, but I see this idea as incredibly broken.

One option I had was to simply ignore this idea. Let crafty script-writers create systems that sign up for Wink accounts 1000 times an hour. (As I write this, I'm watching some automated system using several hundred IPs trying to sign-up for using Mailinator - and watching the Mailinator system dutifully send each request to the abuse page).

The problem with this is that someday, will catch on. And they'll ban Mailinator. This is sadly, a wonderfully broken solution to a still existing broken site design.

The problem for me is that I likely have legitimate users that want to sign up for Wink - and I want them able to do so (and I imagine Wink might want more users too, so by extension they'll lose some or all of the ones I lost). What's insanely broken for sites banning Mailinator is that there are tens of Mailinator-copy-cat disposable web services out there. Or even worse, someone with access to a server and a domain, who can install sendmail and create a few thousand accounts. Simply put, banning mailinator is like catching a single mouse and thinking you've solved the mouse problem.

You stop the bad guys, but for about a day until they implement a new system.

I had an interesting discussion with an acquaintance recently. During the conversation I described Mailinator to him. His mouth gaped open and told me he would look into it and probably ban it from his site. I asked what he would be banning it "from". He said he had a trial piece of software that people could sign-up for and download. And he wanted their real information to email them later (i.e. I did my best not to say that he was sending "spam") to see if they wanted to buy.

I noted that sometimes when I download software to try, I do want to enter my real email. I'm interested enough to want to be registered. But other times, I'm just in browsing mode. If given the chance I'd download and check it out, but if you give me too much impedance I'll probably just go check out his competitor.

In those cases when I'm just browsing, I'll use mailinator.

In other words, there are 3 types of potential customers. Those that don't care about his software. Those who really love the idea of trying his software and will do anything to do so. And those who are on the fence.

For obvious reasons, Mailinator is my "on-the-fence" tool of choice. If he banned it, he'd be refusing some subset of those potential customers. So it basically comes down to the question - whats better?

1) Definitely get user information you can spam later - or
2) get your product in front of as many eyeballs as possible.

Also noting the fact that NO email insures any relation to an actual person whatsoever (including yahoo, gmail, hotmail, etc.) - whats the point?

We continued our discussion and agreed that from a marketing perspective, you actually don't want to remove the email sign-up altogether. It actually brings value to some customers. If you remove it or make it optional, most everyone will skip it just to get to the goodies. But by leaving it and knowing that some people, using Mailinator or Yahoo or whatever, will give you temporary email addresses, you're maximizing your potential customer base.

It didn't hurt my argument to mention a few other disposable email services that he'd have to ban too. I sure don't know them all - they seem to come and go a lot. And that surely doesn't count ones that run semi-privately. Basically, it would be a fulltime job to keep up.

Oh. So, back to our script kiddies above. Mailinator includes a system to stop scripts from signing up for websites over and over. I love fun algorithms/data-structures so your homework can be to design something like Mailinator's abuse trigger system - a key-value datastruct that ages with time and is refreshed by lookups that come in at some notable (and tweaked) rate (in the same ballpark as a LRU cache, but definitely more dynamic).

Its unlikely a human will set-off the triggers but its possible. The sad part (for script writers) is that the algorithm doesn't trigger until their script gets going, so its probably a bit heart-breaking to spend a few hours perfecting a script to scrape Mailinator and then have Mailinator detect it only once it gets going and shut it down hard.

The first level is the Abuse Page. If you push it, Mailinator will ban IP addresses - but only under certain conditions. That's rather an imprecise way of stopping abuse. In addition, it looks for patterns of mailbox usage regardless of IP. An obvious one is that if one subject "Welcome to Wink!" shows up a lot in the read emails. Sadly, its difficult to distinguish valid users trying to sign-up for wink amongst the botnet hitting right now - so they'll probably get the abuse page too for the time being.

Potential site abusers taught me a lot and hardened the site considerably. Abuse attempts are still a common occurrence but far less normal than a few years ago. I assume many scripters went to less caring disposable services.

I often get asked if I care if sites ban Mailinator. I don't really. In some cases its prudent if you really do need to email people that use your service. In most cases however, its simply a knee-jerk reaction attempting to patch an otherwise flawed system. Not only is it a sure way to eliminate some potential customers, the flaw will show up again soon when the abusers shift to another method - and probably another method without Mailinator's facilities to stop scripts.

In the end, there is no real identity on the Internet. At least none past an IP address and a subpoena. At best, email is optional identity. And prudently, it should probably be treated that way.

Thursday, December 4, 2008

Reserved Names in Talkinator

Talkinator now includes a beta feature (ok, all features are beta) to allow reserved names. Each room can have only one and the person owning the site (and cut and pasting the code) designates the reserved name and the password to access it.

If you already have a Talkinator in your site, just click the </> icon (i.e. get the code!) in the upper right of any talkinator chatbox and create your Talkinator with your reserved name!

Please let us know how it works and especially if you find any bugs! (email to

Tuesday, November 11, 2008

So, Mailinator's core-duo server now gets 3 Terabyte a month

Its been awhile since I've updated my article The Architecture of Mailinator. That was a long time ago and really and tons has changed. To save you reading that article if you're in a hurry, the crux was that Mailinator was originally based on the unix email program sendmail, which pretty much died at around 800,000 emails a day. And then it discusses the design of the fully custom SMTP server designed for mailinator.

In that article, Mailinator was running on a single AMD cpu and processing 3 million emails a day or so. Well, somewhere in there I got a deal with my hosting company and was running a dual dual-core opteron machine. Mailinator ran on that for about a year as our email load doubled a time or two.

Eventually I needed a server for another project and realized that giving Mailinator that dual-opteron was pretty overkill. So I "downgraded" the machine to a core-duo. Which is where it runs now. In the aforementioned article we were getting 3 million emails a day - now I sometimes see 2 million an hour.

For the geeks out there, this is still of course a highly-multithreaded java implementation often running 4000 simultaneous threads. Pure Java I/O (not that stinky and slow NIO.

Today I got an email from my hosting server that I "surpassed my allotted bandwidth". I'm pretty used to getting subpoena's and such, but not so many of these.

Apparently, Mailinator received 3.1 Terabytes of email in the last 4 weeks.

Thats (ballpark) 480 million emails or so in that 4 weeks. As I've said before, that doesn't compare to any bigger email service, but then again, I bet they have more than one server :) I'll also emphasize that numbers I usually throw up here were measured or calculated by me - that 3T number comes from my provider so I hope its accurate as they want money for it !

Thats a LOT of spam (sure there's web traffic in there which might be hefty in other circumstances, but it doesn't hold a candle against the mail traffic).

I've priced things out and although my server still has plenty of computing capacity left, I'm clearly hitting an artificial (i.e. pay for more) bandwidth limit. It makes far more economical sense to simply get another server (per all pricing schemes I find) than to buy more bandwidth on this one.

So, someday I'll get around to writing the next "Architecture of Mailinator" article and although it almost sounds funny to me after all this time, I'll likely be talking about "both" Mailinator servers. Not just our trust little core duo.

Monday, November 3, 2008

The Revolutionary Fuzzy-Logical Super-Duper Mailinator Captcha System(tm)

So, after many years of resistance for various reasons, Mailinator has finally implemented a DELETE button. That's right. The number one frequently-asked-question that isn't in the FAQ has been answered.

The reasons for its delay are many. In the early days I was too worried that someone would write a script that would attack the system and delete every email out there. Then after a year or two of people writing scripts to attack Mailinator in every other way - I had enough AI built into the system to stop most of it. (As always, I define "attack" as something that could bring the system down or ruin its use for other users - spamming Mailinator is really not on the radar, its not our place to define what is spam in this context).

Subsequently, I looked at implementing it a few times but tended to back off. If you've read this blog awhile you know that Mailinator and Talkinator at least partially existed as a testbed for implementing server ideas (some earlier articles: The Architecture of Mailinator and Benchmarking Talkinator).

And while implementing a delete function might sound simple - it had some interesting complexities in the architecture. In any case I'm happy to say - its finally here.

And here's the best news. To thwart all those rascally script-kiddies (who probably have seen the rather hidden but sometimes funny Mailinator Abuse Page) we're proud to announce the "Revolutionary Fuzzy-Logical Super-Duper Mailinator Captcha System(tm)" !!!!!

Thats right. The great service and ironclad spam (fighting/loving/tolerating, your call) abilities of Mailinator are now in a REVOLUTIONARY captcha system! We could have used a pre-written captcha system but that would have been sadly inferior and very non-mailinatory!

Just look at it and you'll know that this is going to take some major AI, image processing and statistical analysis to crack past this sucker.

Oh. And one more thing.... (got that from Steve Jobs)

We at Mailinator are happy to announce that this Captcha system is just TOO GOOD not to share with the world. So, you guessed it, we're releasing it as a web service !

You can now include The Revolutionary Fuzzy-Logical Super-Duper Mailinator Captcha System(tm) in your websites too !!!

Here's the highly-technical, low-level URL:

(edit: url momentarily removed due to immense user respo.. ok.. just a bug... be bak soon)

(where yourDestination.html is where you want the captcha to go after very securely and precisely verifying the humanness (and also non-computerness) of your user).

Put this in your web pages and sleep well!

(Disclaimer: This captcha system, although highly-advanced and chock-full of the latest advances in modern artificial intelligence does a better-than-expected job of verifying humans are humans. However we've noticed an eensy bug that seems to prevent it from sometimes (or "usually" really (actually its probably more like "mostly" (sometimes its pretty "always" tho))) figuring out that computers aren't humans. So consider this and "alpha" release.. or maybe "pre-alpha" or something like that. A "prototype", yeah, its a "prototype").

PS: Have your own idea for a captcha? Make the image 80x24 pixels and send it to !

PPS: The delete function is in beta. Please let us know of any bugs.

Thursday, October 23, 2008

The Developer's Mailinator

I get a surprising number of emails from developers testing some sort of emailing system and wondering if its "ok" to send those emails to Mailinator.

Of course - the answer is yes!

In fact, thanks to an email from one of those developers named Dave, I've actived port 587 on Mailinator to be an active receiving port (in addition to port 25).

Mailinator has always accepted email addressed to any domain - (i.e. have a domain name? point your MX record to the mailinator server and it will arrive just fine) so it really is a pretty solid testbed. Remember, you can see the raw email (headers and all) by clicking the "text" link on any mailinator show-email page.

The only caveat here is that Mailinator's defenses don't come down ever - even for testing. If you want to send a few thousand emails to Mailinator, it very well might start throwing emails away. As is sort of our goal - we don't mind spam (actually - given the nature of Mailinator - who are we to even define what spam is?). But we ARE interested in the Mailinator service staying up and running. Any legal (and nice - please read terms of service) use is fine by us - but if Mailinator thinks you're being overzealous, we'll probably side with it :)

Tuesday, September 2, 2008

Mailinator for your IGoogle Homepage

Frank Ellermann has created a very pretty integration of Mailinator into the iGoogle homepage. It now installs with just a click!

Thanks Frank!

See it here: Mailinator iGoogle.

Wednesday, July 30, 2008

Bob Cringely on Talkinator

Bob Cringely, well-known computer industry guy wrote a nice article about the "Five percent solution". He uses Talkinator as prime example!

I like this idea - I remember many years ago someone telling me about Google and I thought they were nuts - "I have Altavista - why do I need this Google thing? (silly name by the way)". Or even a bigger one for me was when I first heard of YouTube - so some little guy is going against an already established Google Video with their only real distinguishing feature is that they only allow "short videos"?

Bob's right - sometimes a very established idea simply needs a 5% twist to take it to the next level.

Thanks Bob !

Thursday, July 24, 2008

Talkinator notes

When Mailinator came out, people simply couldn't get over the fact that there was no login - I mean, how can you have email that doesn't have a login and password? It was a very strong paradigm shift for some folks.

For Talkinator, this paradigm shift has so far been the shock of "all of a sudden" being in a live discussion in a Talkinator room. Again, they didn't sign in or realize it would happen. Its rather common for people to not believe its actually live (often they assume its some sort of advertisement).

It took awhile for people to get-it as far as how Mailinator works. Paradigm shifts come slow - but they're often a lot of fun.

Saturday, April 12, 2008

Mobile Mailinator

Great idea - a nice write-up on how to use Mailinator on your mobile phone!

Sunday, March 30, 2008

How fast is Java Volatile? or Atomic? or Synchronized?

This is a question I've wondered awhile. And I've gone ahead and made a stab at finding out. One question that I had was I had always heard that AMD chips have an on-cpu memory controller and Intel's don't (at least not until they're upcoming Nehalem line). Now I'm first to admit I'm talking above my knowledge (please correct profusely).

But the idea was that some operations (notably I've heard volatile reads which I didnt test here) are very similar regardless of volatility.

I spent a few hours making up a benchmark. I'm publishing the benchmark below for one important reason - to get it right. This is big disclaimer: Don't trust these results - at least with any precision. I do conclude that contended synchronization is expensive - thats a pretty expected result so hopefully we're somewhere on the right track.

However, Java benchmarks are very hard to right. So hard in fact that they do a hardness-overflow and end up looking easy - which is precisely what makes them so hard to get right. Hence - given that only I have seen this benchmark, its pretty likely to be broken some (and hence why its published here for public review to help fix).

With all that...

What I've tested here are storage operations. Specifically:

Storage to a local variable (mostly used as a control)
Storage to an instance variable
Storage to a static variable
Storage to a static volatile variable
Storage to a static AtomicLong variable
Storage to a static variable protected by syncrhonization

I tested this on 3 different CPUs, all with JDK1.6 and variously on Linux and Windows. One very important point is do *not* compare the absolute numbers in the graphs directly. In no case are comparing the same CPU across 2 operating systems. The CPUs are wildly different in capability.

So we can explain more with a graph:

This is the run on a dual-core dual-Opteron machine. Note the blue-bar is local variable storage. As you'd expect it simply destroys everything else. This is very probably because the smart JVM went in and realized it could optimize the assignment out of the loop. Note the blue-bar just gets happier and happier as we add cores. The threads are completely unaffected by each other and we get actual linear increases in power. Even at 5 threads (note we have 4 cores) we get some speedup.

The orange and yellow bars are instance variable and static variable stores respectively. Some level of indirection or inability to optimize jumps in and makes things a bit more sane. Surprising we get little or no improvement as we add threads.

The three thread-visible methods of store are so slow they barely show up. This brings up an important point, let's assume you wrote an application and made it correctly thread safe. Using synchronization or volatiles or whatever. You *cannot* optimize away synchronization in the name of performance. It simply can't be done. (conversely, if you could then it wasn't correctly thread-safe to start). You can however, at times, replace one type of synchronization primitive with another.

That being said, lets take a closer look at the same graph with just the barrier memory store operations.

Volatile and Atomic are neck-and-neck. Synchronization is still a big loser, especially after we contend giving it two threads.

One other anecdote - while running the non-memory-barrier benchmarks, my CPU meter showed 100% user space usage. With these benchmarks however, it went to 30-40% kernel cpu usage.

Ok, jumping to the Core2Quad Extreme CPU. Definitely a faster processor but with a different memory architecture.

Local store again goes flying. Oddly the instance store doesn't do much but the static store increases nicely with the cores. And, once again, we can't even fricken see the barriered writes.. so here they are.

Look how great the JVM does knowing that just one thread is out there. It can really optimize the code to elide or at least marginalize the impact. Surprisingly, synchronization still gets nailed across the board.

Note that according to what I have so far, a synchronized static store is something 50 times slower than a simple static store. And 10 or so times slower than a volatile static store.

One more.. and its a fun one. Windows XP running on a Dothan. (Funny, I realized I hadn't directly known that Dothan was single core, but I just assumed it after I saw the graphs.)

Crazy. Add a few threads and performance doesn't even budge. Of course, with only one core a system can completely avoid all the complex architecture keeping cores in sync. Why its local writes are worse as compared to instance beats me.

Also, although I said don't compare graphs, this little single core Dothan beats the Core2Quad in all barrier-ed writes after 2 threads. Note that on the local writes, the Core2Quad is something like 50 time faster. But even on the simple static volatile store - at 2 threads, the Dothan is now twice as fast.

So. Pardon my informalness here - I'm actually quite expecting feedback to break this code in grand ways and force me to redo all the runs and rewrite all the text (which I'll be happy to do if we get a clean bench out of this). I have no intention of making very specific claims once this benchmark is firmed up - but I would like to have a "sense" of the costs of these different operations.

Source code for the Benchmark.

Thursday, March 27, 2008

Introducing: Alternate Inbox Names

Mailinator helps thousands of people every day avoid spam. It's been a great success.

One point that always sort of bothered me however was that by giving out a Mailinator address, you were basically telling people not only how to email you, but also how they can check *your* email. If I say "Email me at", the whole world knows where to send and read my email!

Well, this is no longer a problem. Today we added "Alternate Inboxes" to Mailinator. We've always had alternate domains, but this is quite different.

Now, if you check any inbox on Mailinator (say, binkypop), listed on the page is the alternate inbox name. All alternate inbox names start with "M8R-". For example, for binkypop, the alternate inbox name is

So simply put, if you email binkypop@mailinator OR it doesn't matter. Both emails would end up in the binkypop inbox (and nothing in the
M8R-yg1hkn inbox). The only way to find out an alternate inbox name is to check the inbox here first - thus if you give out the alternate address, there is no way for anyone to guess that it actually goes to binkypop.

So.. pick yourself a favorite Mailinator inbox name (make it big, long, and hard-to-guess please!), go check the inbox to find out the alternate inbox name, and hand it out all over the web (of course, alternate domains work on alternate inboxes too!).

Then check the box (or RSS it) knowing only you know thats the real destination. Also, keep in mind - you don't have to remember the alternate inbox name ever - you just have to remember the real destination address. You can always go to Mailinator and get the alternate anytime just by checking the inbox. And of course, the regular think-up-on-the-fly-and-use Mailinator you know and love hasn't changed a bit. This feature is purely optional (I'm surprised how fast people started using it!)

(This feature is in beta and we might end up changing the alternate address scheme some)

This is cool. Alternate inboxes really do up the ante in Mailinator's spam fighting abilities. Enjoy !

Tuesday, March 4, 2008

Fun Mailinator server stats

All averages have some fuzziness in them. Max numbers were observed but are unlikely the actual max.

100% custom SMTP Server written in Java using blocking I/O, multiple (at times thousands) of threads, and liberal use of non-blocking data structures.

Average daily incoming emails: 6.5 million
Average daily incoming bandwidth: 18.4G
Average # of threads running at any time: ~450
load average: (uptime cmd) 0.04, 0.05, 0.00

Average email size: 2194 bytes
Ram devoted to email storage (compressed): 400M-500M
Average number of stored messages: 218,675
Average number of active inboxes: 94.575
Average memory allocated to Java VM: 880M

ax observed emails in one hour: 1,000,085
Max observed emails in one second: 1,274

Monday, February 18, 2008

Kill the myth please. NIO is *not* faster than IO

I'm giving 3 talks at the Software Developer's Conference West in Santa Clara, CA on March 3-7.

One of them is a tutorial on how to interview in Silicon Valley (fun stuff).

Another is how to write high-performance servers in Java. Specifically, why NIO is really not the best way anymore (post linux 2.6 kernel and NPTL) and multithreaded I/O is the new old-way of doing things. Its a fun talk and largely discusses the internals of Mailinator and how it runs a few thousand simultaneous threads without breaking a sweat.

On top of that, as it turns out, in pure throughput, IO smokes NIO in all tests I tried. And I'm not alone - Rahul Bhargava of Rascal Systems did a very nice analysis of this and posted it, sadly, in some forums at

The post is solid gold and I'm posting it below just for posterity as I'd hate to see those forums come down some day and I lose access to that post. Incidentally, I disagree with Rahul that "fewer threads are easier to debug". It only takes 2 to make a deadlock or race condition.

The original URL for this post is:

And like I said, it was written by Rahul Bhargava of Rascal Systems. Here it is:


I have been benchmarking Java NIO with various JDKs on Linux. Server is
running on a 2 CPU 1.7 GHz, 1GB RAM, Ultra160 SCSI 36GB disk

With Linux kernel 2.6.5 (Gentoo) I had NPTL turned on and support for
epoll compiled in. The server application was designed to support
multiple disptach models :

1. Reactor with Iterative Disptach with multiple selector threads. Essentially
the accepted connections were load-balanced between varying number of
selector threads. The benchmark then applied a step function to experimentally
determine the optimal # of threads and connection per selector ratio.

2. Also a simple concurrent blocking disptach model was supported. This is
essentially a reader thread per connection model.

Client application opens concurrent persistent connections to the server
and starts blasting messages. Server just reads the messages and does
basic un-marshalling to ensure message is ok.

Results were interesting:

1. With NPTL on, Sun and Blackwidow JVM 1.4.2 scaled easily to 5000+ threads. Blocking
model was consistently 25-35% faster than using NIO selectors. Lot of techniques suggested
by EmberIO folks were employed - using multiple selectors, doing multiple (2) reads if the first
read returned EAGAIN equivalent in Java. Yet we couldn't beat the plain thread per connection model
with Linux NPTL.

2. To work around not so performant/scalable poll() implementation on Linux's we tried using
epoll with Blackwidow JVM on a 2.6.5 kernel. While epoll improved the over scalability, the
performance still remained 25% below the vanilla thread per connection model. With epoll
we needed lot fewer threads to get to the best performance mark that we could get out of NIO.

Here are some numbers:

(cc = Concurrent Persistent Connections, bs = Is blocking server mode on Flag,
st = Number of server threads, ct = Connections handled per thread,
thruput = thruput of the server )

cc, bs,st,ct, thruput

As you can see the last line indicates vanilla blocking server (thread per connection)
produced the best thruput even with 1700 threads active in the JVM.

With epoll, the best run was with 2 threads each handling around 850 connections in
their selector set. But the thruput is below the blocking server thruput by 25%!

Results shows that the cost of NIO selectors coupled with OS polling mechanism (in
this case efficient epoll VS selector/poll) has a significant overhead compared to
the cost of context switching 1700 threads on an NPTL Linux kernel.

Without NPTL of course it's a different story. The blocking server just melts at 400 concurrent
connections! We have run the test upto 10K connections and the blocking server outperformed
NIO driven selector based server by same margin. Moral of the story - NIO arrives at the scene
a little too late - with adequate RAM and better threading models (NPTL), performance gains
of NIO don't show up.

Sun's JVM doesn't support epoll() so we couldn't use epoll with it. Normal poll() based
selector from Sun didn't perform as well. We needed to reduce the number of connections
per thread to a small number (~ 6-10) to get comprabale numbers to epoll based selector.
That meant running lot more selector threads kind of defeats the purpose of multiplexed IO.
The benchmarks also dispell the myth created by Matt Welsh et al (SEDA) that a single
threaded reactor can keep up with the network. On a 100Mbps ethernet that was true: network
got saturated prior to server CPUs but with > 1Gbps network, we needed multiple selectors
to saturate the network. One single selector's performance was abysmal (5-6x slower than
concurrent connections)

For application that want to have fewer number of threads for debuggability etc, NIO may be
the way to go. The 25-35% performance hit may be acceptable to many apps. Fewer threads
also means easier debugging, it's a pain to attach a profiler or a debugger to a server hosting
1000+ threads :-) . Bottom line with better MT support in kernels (Linux already with NPTL), one
needs to re-consider the thread per connection model

Rahul Bhargava
CTO, Rascal Systems

Thursday, February 7, 2008

Mailinator allows 831375028249471605232943589 trillion different email addresses. Please choose a good one.

Thats just an approximation really (allowing for 36 possible characters (there's more really) and email names of up to 25 characters).

Regardless, there's a lot of possibilities. So just a friendly reminder - please use Mailinator !!! But choose an offbeat name.
There's plenty to choose from :)

Update: math error, number was too low

Friday, February 1, 2008

More on the "pen1s" post

In my post about searching 185 mailinator emails per second in Mailinator, I think I failed to adequately explain the goal of it. Its intent was to document a thought process - it was not intended nor tried to be a taxonomy of search algorithms (I gave some very nice links if you're looking for that). I really don't think the world needs another taxonomy of string searching algorithms, or at least not one written by me.. Nor does it need another academic analysis of any existing algorithms - there are plenty of those out there (and I linked Wikipedia in that post too where appropriate to describe specific algorithms).

Over a long history of reading about algorithms, I realized that I rarely found many analysis of algorithms from an engineer viewpoint. I am a recovering academician, but engineering is where you get to tinker with things. I've read many times Boyer-moore's best case is O(n/m) - but whats the *real* case? The usual case? How can that influence me in using any of these algorithms? Many searching algorithms display their worst case when you search for stuff like "AAAA" and your search string is "AAAABAAAABAAAAAB" - well guess what, I don't often search for that kind of thing in practice.

Also many people don't truly understand Big O notation (my great aunt made incredible homemade donuts, but ask her about algorithmic analysis? not so much). Considering some examples in the concrete, especially realistic cases can provide insight. For example, tell your favorite non-algorithmic person that there is an algorithm called "bubble sort", and it sorts numbers for you. But in order to sort 100 numbers, it has to look at each of those numbers on the order of 100 times. Effectively it must do on the order of 10,000 comparisons to sort 100 numbers - they'd look at you like you're crazy.

That post (and future ones in the series if i write them) attempts to detail an evolutionary engineering thought process on designing an algorithm for a specific purpose. The article started with the canonical "naive" but undeniably correct solution. When considering any problem, correctness is in my opinion, an excellent place to start.

I did not "forgot" to mention any other searching algorithms. They simply didn't not yet arise in this thought process (several people mentioned Aho-Corasick, which I didnt mention by name but certainly alluded to its features when I mentioned the finite state machines of regular expressions). A Trie is a basis for AC, but its definitely not AC (which others seemed confused about).

As I said, we're by far mostly interested in the "usual case" running time (which future installments will empirically demonstrate). I was also pretty clear in the last installment that what I had presented so far was *not* was I was using in Mailinator. Some people seemed to miss that (they people seem so excited to comment, they must have not actually read much of the post). I'm not sure how far this thread of blog posts will go - again, I'm documenting my thinking.

And very importantly, I was also clear (and trying to be clearer) that I am in some sense over-engineering this problem. I know that - that is basically a goal here. I likely don't need the performance I ultimately will end up with. This is not science - its fun.

So to concisely re-state the purpose of the blog entry - I'm documenting the thought process that coincided with evolving needs for an over-engineered multi-term string searching algorithm. Thought processes almost by nature go down wrong avenues - which isn't bad, its educational. The interesting parameters are such that we have a hundred or so words (expecting that to grow) and rarely find any of them. Engineering analysis is limited as compared to academic analysis in that you give specific cases and values - you are really only able to look at one instance of a big problem. But again, that all I'm trying to do here.

Thus, If you're looking for a in-depth explanation of Boyer-Moore (which some people who didn't read the sentence in the last post where I said I wasn't providing that) seemed to want, or you want a nice taxonomy of effective string searching algorithms, then you should probably go read something like that. Because this.. wasn't that.

As other interesting facts, I chose 185 emails/second because it was a relatively representative rate for Mailinator. I truly don't know Mailinator's max rate - I've seen 1200 emails/second and I've seen 10 emails/second. The number 185 was chosen as representative, not as the only possibility.

Finally, as I said, I'm aware of algorithms such as Rabin-Karp, Aho-Corasick, and Wu-Manber (in fact, I see Udi most days at work). From a storytelling standpoint, it wouldn't be much of a tale if I didn't get an interesting result in the end. Aho-Corasick is linear on the search space and Wu-Manber is nicely sub-linear. Not to give it away, but Mailinator's algorithm is also sub-linear. It isn't Wu-Manber and all truth be told, Wu-Manber is notably better in memory use and worst case running time.

If you're asking then why should you use such a thing? I'm not advocating you should. Again, the idea here isn't so much the final algorithm (which is really just the punchline). Its really way more about how we get there.

Monday, January 28, 2008

I'm vandalizing Wikipedia again

The "Problems for Users" section of Mailinator on Wikipedia reads:

Domains from services like Mailinator are banned from many websites for the reasons stated above. Mailinator provides alternate domains which work around this ban in many cases.

Another problem for users is that mail sent to Mailinator often disappears for unknown reasons. Before using Mailinator a user should first send it some test e-mails to determine the system's reliability for themselves.

Mailinator strips many headers from incoming e-mail. This means that users wanting to verify the e-mail sender or server may be unable to do so. And according to their FAQ "Plain text is best, html is filtered. Images, attachments, and fancy stuff are simply stripped away." Unfortunately this often results in incomplete or garbled messages.

The "Domains from services like Mailinator..." paragraph is accurate. Some sites do ban it but so far no sites I know of have kept up with our alternate domains (see our home page) for a recent list. I'm still baffled why websites send-away customers but I guess that's their business. If I'm not sure I'm going to want or use some webservice, I test the waters with mailinator address. If they don't let me, I usually then just don't bother testing it. I'm sure other folks then resolve to give them their real address, or go waste 5 minutes signing up for a fake yahoo address, but for me its rarely worth it.

The next paragraph describes that emails "disappear for unknown reasons". Um, I thought I was pretty clear in the FAQ about how Mailinator handles email. Emails live for a few hours - not a day, not 1 hour - a few. It's completely dependant on how much email its getting at the time. (As a data point, I notice right now emails are sticking around 183 minutes).

Although I fully agree with the latter part, if you want to test mailinator out - please do !!! Send anything you like :)

Finally the last paragraph states that we remove some headers. This is nuts. The code has never removed any headers whatsoever. And on every page showing an email, there is a "text-view" link where you can see the email PRECISELY as mailinator received it. Headers, dirty words, mime-encoded images, you name it. In other words, that paragraph is dead wrong.

I haven't seen any garbled emails for a long time either (probably was when the parsing code was originally written). We support most encodings, (japanese and russian emails look cool), I suppose if you send an email with some odd encoding that Java doesn't support or mis-specify an encoding things might look garbled. Regardless - if you see a garbled email - please let me know!

I tried to replace the last paragraph with one explaining that the "text-view" sort of proves its not true, but my change was undone immediately (so fast I'd guess it was automated) because it was "Wikipedia Vandalism". I'm not too fond of the paragraph that says emails "disappear" given that its pretty misleading, but God forbid I try to fix that one.

If you're a common Wikipedia poster - please feel free to fix this entry. All I'm hoping to get up there is (something closer to) the truth.

Monday, January 21, 2008

How to search for the word "pen1s" in 185 emails every second

SubTitle: How Mailinator searches every incoming email for a few hundred strings efficiently (in Java).

In a humble moment of geekiness, I must admit I love tinkering with algorithms and data structures. I'd also like to point out that this article is not some attempt to educate you as to the best algorithm for any purpose, it is simply a documentation of an engineering thought process (through all its wrongs and rights) in coding up one, tailored algorithm. If you'd like a far more comprehensive treatise on string searching, this book is quite good as is this paper and this one.

A simple pen1s problem

Mailinator gets a fair bit of email - it varies wildly but 185 per second is a pretty common rate. It needs to filter some of it, flag some of it as evil, or simply perform some categorization on it. A lot of it is junk and it clogs the system. Like I've said over and over - Mailinator is not against receiving spam. If you want to get your email about porn or viagra or latest stock tips at Mailinator please feel free. But if anyone sends a billion emails about mortgage refinancing or weener enlargement to my server, it might get smushed and it is probably going to try to thwart them. It's not about the content (although unlawful or hate stuff can stay away thanks) - it's about handling the load.

So one way to do this is to scan every email that comes in for some keywords to determine it's an email we want to flag (i.e. maybe reject, maybe quarantine, whatever) in some way. This simple method can be surprisingly fruitful and it's the idea we'll consider here.

As a real example, let's say we wanted to watch for any emails that come in containing the word "pen1s". Pardon the word choice, but it is seriously a glaring use-case of mine - in fact, that's well into the tame-side of what I see. Its also a pretty good word because spammers often use it to avoid word filters that try to filter out the real word "penis" but it backfires in a sense - it identifies their email as spam. And they use it a LOT.

Searching for a given word in a constant stream of incoming data is sort of the opposite problem that search engines have. They know the search data at all times (i.e. all web pages) - but they don't know what search term you're going to walk up and ask for. Mailinator in contrast knows the search terms at all times (i.e. "pen1s", "v14gr4", "d0nkeyL0ve" (no, I'm not sure what that means either)) but gets new search data continuously.

So if you're a Java programmer, you can simply think that for any incoming email you do:

boolean containsPen1s = (emailBodyText.indexOf("pen1s") > -1);

Right on. This should work. And if you try to guess the obvious way the indexOf method works, you're probably right. It has a for loop that scans through emailBodyText looking for a "p". If it finds a "p", it then checks the next letter to be an "e", and so on. (If you're an algorithmy type person and wondering if I mis-spoke and if indexOf actually uses some smarter algorithm like Boyer-Moore or Knuth-Morris-Pratt, I didn't and it doesn't).

So, lets step back and do some quick math on how much work this is. As we said, we have 185 emails per second. From very strong empirical evidence I can tell you that emails in the system average about 2k each (some much bigger, some smaller, but that's an average).

So.. if we have 2048 bytes per email and 185 emails/sec .. that means we have 378,880 bytes per second.

Now the normal case, we only find the word "pen1s" rarely. By far most emails don't contain it. A near-worst case might be if all our emails contained "pen1pen1pen1pen1.." but that would probably be even more rare.

If we had no false starts at all (i.e. no letter 'p' anywhere) we would have to do 378,880 comparisons per second. That is, we must compare every byte in every incoming email we get to the letter 'p'. In truth, we'll do more (false 'p' matches). In fact the worst case is (the number of bytes of email) * (the number of bytes of the search term). Concretely in our case thats 378,880 * 5. In my use case however, that multiplier is quite unlikely, although a factor 1.5 or so might not be. Regardless, for simplicity lets use 378,880 with the understanding that in practice, it will be some small constant factor higher than that.

Note that you might be well asking right now why I'm not using regular expressions. Good question! I tried that. A state machine should do this comparison in O(n) every time - false starts be damned. But Java regular expressions sure do not. I found them very much slower in every case to simple indexOf. This recent article agrees. Regular expression matching in Java might be convenient, but it sure isn't fast. A nice discussion of state machines, string matching, and a bit of talk about why Java's implementation is slow can be found here (very worth a read).

Great.. So 378,880 comparisons per second is annoying but not too bad. As best as I can measure it, this takes my Athlon 4200 nicely under 1 millisecond (leaving us at least 999 other milliseconds to other server work!).

Enlarging your word list

Unfortunately, that doesn't solve my whole problem - my actual needs grow everyday. And really, I need to search for a lot more words than just "pen1s". In fact my current list is around 100 different words (and growing). So to use indexOf, I'd need to change my search statement to:

boolean containsBadWord = (emailBodyText.indexOf("pen1s") > -1) ||
(emailBodyText.indexOf("weenypie") > -1) ||
(emailBodyText.indexOf("captaincheeseburger") > -1) ||
....and so on

Things are starting to get hairy. On a normal case where we find no matches, we do 378,880 comparisons per search term. If I have a 100 search terms that's 37,888,000 total comparisons. My benchmark now takes about 60milliseconds. This simple task is now taking 6% of every second. And as we add search terms, it keeps growing. What if I get to 500 search terms? or 1000?

One improvement we can do right away is to use an algorithm called Boyer-Moore (which has many flavors now - see the papers I referenced above). These guys had a simple but very important idea.

Why, in fact, do we search our string from the front? Why not instead, start searching from the back?

So let's say someone sends us an email with 1000 X's. No other letters. And we dutifully go ahead and search for "pen1s" (which we won't find of course).

The naive way (like indexOf), we do 1000 comparisons checking "p" against each position (which are all X's). But instead, let's say we start off by hopping ahead 5 positions and instead look for the "s" at the end of "pen1s". Of course we don't find it, but then instead of checking the next spot, we again skip ahead 5 and look for "s". In fact, because we'll get zero matches, we can skip ahead 5 each time.

    ^    ^    ^  ....
    s?   s?   s?  ...

if we don't find the 's' in any of those positions, that immediately rules out the possibility that "pen1s" precedes it.

So instead of 1000 comparisons - we only do 200 !

This is a constant speedup, but it can be quite dramatic. The algorithm has more complexity to it than I've illustrated (like what happens if you encounter an 'e' in one of the hops?) and it won't often run in its "best case" as I discussed above, but thats the gist. The worst case is actually still just as bad as our original search, but again, thats not the likely either.

So in the supremely wonderful case for this example, Boyer-Moore-esque searches cuts our work down by a factor of 5, making it 75,776 comparison for each term or 7,577,600 for the whole 100 (i.e. 12 milliseconds in my benchmark). The scaling problem is just as bad however - every time we add a word, we still add a 75,776 more comparisons.

Most people simply assume that an O(n^2) algorithm is "slower" than an O(n) algorithm. Thats usually true in practice but it doesn't have to be. Big-Oh notation is really an illustration of growth, not speed. If you increment n by one, the O(n^2) will have to do n more operations to complete, the O(n) algorithm will only have to do 1 more.

Boyer-Moore in the best case is O(n/m) where n is the number of bytes being searched and m is the length of the search term (which in our example was 5). We know from algorithms class that O(n/5) is effectively the same as O(n) - because they both grow linearly. Yes, the O(n/5) will run five times faster than the O(n) algorithm - but as n grows, they both grow relatively the same. So if we can handle 100 search terms with indexOf at O(n) then we can probably handle 500 using Boyer-Moore. But we've really just delayed the problem. Further increases will still grow our algorithm out of reach.

Improvement here is based on 2 things:
1) How important the speed here is to your application and/or how big your library of search terms become.
2) How much of a kick you get out of improving, coding, and tinkering with algorithms just for fun even though it really might not be worth it when you probably should be out picking up girls (or guys), skiing, playing Battlefield2 or something.

In truth, our original indexOf is good enough for many applications. And Boyer-moore handles even more. But I, unfortunately often subscribe to rule #2.. so on we trudge...

The lack of scalability starts to bite us. The number of comparisons grows by 378,880 each time we add a new search term for indexOf and by 75,776 for Boyer-Moore.

Trie this

Ok.. how about another idea. How about if we organize our data better. We know all our search terms up front, what if we could build a structure that was much easier to search.

One such structure is called a Trie (pronounced like "try" even though it supposedly comes from the center of the word "reTRIEval"). So again, lets say our search terms are:

"pen1s", "darn", "golly", and "dart"

We construct a trie as a graph of connected nodes where each node (or edge really) represents each letter as so:

       / | \
      d  g  p
     /   |   \
    a    o    e
   /     |     \
  r      l      n
 / \     |       \
n   t    l        1
         |         \
         y          s

So each time we need to search we start at the root node and see if we can head down. If we encounter "darp", we'll track down 'd', then 'a', then 'r', but find no 'p' node. Hence we know we do not have a match.

How many comparisons must we do? 378,880. The running time now is back to our single-term indexOf method and is at O(n*m). But notice, it isn't dependent upon how many search terms we have.

Think about that - that's crazy - in effect, you can search for a (theoretical) infinite number of search terms simultaneously all for the cost of ONE indexOf ! (This is of course, the best case scenario).

In other words, we're back to doing 378,880 comparisons but now we can search all 100 search terms with just those 378,880 comparisons. And you want to add 100 more? Sure, still 378,880. 1000 more? no problem. Still 378,880.

The only thing that gets hairy with tries is storage as those nodes can be rather bulky (as you'll see if you try to implement one).

One thing you might notice is that although we gained the ability to search for any number of terms at once, we lost the Boyer-Moored-ness. We must search every byte. That was only a constant speedup, but still it was rather nice.

The algorithm used in Mailinator actually tries to put the Boyer-Moored-ness back (with interesting results). In other words, it tries to search any number of terms while only doing O(n/m) comparisons in the best case (where n is the number of bytes to be searched and m is the length of the shortest search term).

This post is plenty long enough. To be continued....

(Thanks to Frank Yellin for reviewing this post.)

Your own private Mailinator

As you probably know, Mailinator has a always-growing list of alternate domains. However, those are just the domains sponsored by us.

In truth - Mailinator accepts ANY email you fling at it*. In other words - you can setup your OWN personal domains simply by setting your MX record (i.e. through your registrar or DNS server) to point your mail delivery to Mailinator!

Simply put - if you own, you can simply change that domain's MX record to point to and all email you send to will then end up in Mailnator.

You can publicize this fact - or you can keep your personal alternate domain a secret and have your own private Mailinator !

*pursuant to normal draconian Mailinator email rejection criteria including but not limited to hate email, unlawful email, etc.

Saturday, January 12, 2008

New Engine, New Features

A few months back I finally hit a wall enough with trying to add features to my original design of Mailinator. As with any piece of code - it had aged. I don't immediately impune such code as being bad or poorly designed, its just that code needs TLC every now and then -and- developers learn more and more about their domain as time goes on.

Anyway, a few months back I set out to do a full engine rewrite. Partially to redesign things to support some future features I want to do, but also to streamline the whole thing and fix a pile of bugs that had been floating around

One thing the old engine relied on a lot was Java Strings. And that would all be fine and dandy except that at the core, most servers are really about bytes, not Strings. Front ends are about strings, browsers are about Strings, but at the server, strings are just programmer abstractions for thinking about the data. The data itself is just bytes.

So the new engine is bytes.

I also did a lot of testing with how Java handles its networking. And how the 80-bazillion smtp servers out there send their data (and let me tell you, the RFCs which spell out the rules are most definitely just "guidelines" - those rules are broken all the time). I got pretty nitty-gritty and as I am wont to do when I'm writing code for fun, I wrote a ton of stuff from scratch.

Anyway - the good news is that the server can, according to my tests, handle about 1.5times the load it could before. I'm not really sure what the true abilities are as its hard to loadtest without a "real" few thousand machine hitting the server.

We're still running on one server (altho its beefed up a bit only because ServerBeach had some hard-to-pass-up deals) and I'm guessing it will scale easily to 30million emails a day. The max I've seen is 16million in one day and that was the old server code so I'm pretty conservative on the 30million (I expect the network to bottleneck before the server actually).

Hopefully, users won't really notice any of what I've mentioned so far - that is, hopefully mailinator will just work as great as before.

Things you will notice is that we have better language support - its seriously cool seeing all the Japanese and Russian spam fly by. It also supports encodings such as quoted-printable and base64 much better.

And finally.. I added a quickie feature that was suggested to me literally years ago. You can now access mailboxes right from your browser. Instead of going to and typing in your mailbox name into the form, just type:

And it will web-a-port you right to that mailbox. Mailinator is about removing inertia from getting email - and this feature is just another feather in its cap. : Anatomy of a Spammy Campaign

Mailinator is a popular disposable email service. It's also become a great tool for QA Teams to test email receipt, acknowledgment, au...