Tuesday, February 21, 2012

How Mailinator compresses email by 90%

Given the title of this article, the first thing that should pop into your mind is probably - "well, use a compression algorithm - right?".

Right! Well, yes, well, not exactly. Read on.

Your second thought might also have been - "Why bother? Just buy more disks."  Which in the big picture is also not a bad answer. But for Mailinator that doesn't work - if you have read previous Mailinator tech articles you might know that Mailinator stores all it's email in RAM.

There were good reasons for that when Mailinator started. One was the use case - which was always disposable email that lasts a few hours (rather longer nowadays). Secondly, when Mailinator started, disks and datastores weren't as sophisticated/fast as they are now.

Also, Mailinator is/was always a free service so keeping costs down was always important. To this day, Mailinator runs on a single server. It averages about 4-5Terabytes of bandwidth a month and the peak incoming email rate I've seen is about 3500 emails/sec (this is just a production observation, server limit is bandwidth, not CPU).

And finally - last but not least - to me, much of web and application development today is utterly devoid of any fun algorithms. I spend a non-trivial amount of time in interpreted/dynamic scripting languages that do a fantastic job of hiding (or at least lure me away from thinking about) algorithmic complexity. I've probably inadvertently written more n^3 algorithms than, um, (n^3)-for-some-large-value-of-n.

Mailinator has always been my test bed for trying fun ideas, algorithms, and datastructures. In other words - I probably didn't need to do all the work I'm writing about here - but I definitely did have fun doing it (probably should have been out talking to girls, but alas).

Compression

Ok - so back to 90% compression.

So to start testing, I grabbed a few hundred megs of the Mailinator stream and ran it through several compressors. Mostly just stuff I had on hand 7z, bzip, gzip, etc. Venerable zip reduced the file by 63%. Not bad. Then I tried the LZMA/2 algorithm (7z) which got it down by 85% !

Well. OK! Article is over! Everyone out! 85% is good enough.

Actually - there were two problems with that result. One was that, LZMA, like many compression algorithms build their dictionary based on a fixed dataset. As it compresses it builds a dictionary of common sequences and improves and uses that dictionary to compress everything thereafter.

That works great on static files - but Mailinator is not a static file. Its a big, honking, several gigabyte cache of ever changing email.  If I compressed a million emails, and then some user wanted to read email #502,922 - I'd have to "seek" through the preceding half-million or so to build the dictionary in order to decompress it. That's probably not feasible. And, as I said, the Mailinator cache is constantly throwing out old emails and putting in new ones.

In other words, an algorithm that relies on previous entries to build a dictionary can't work given that we keep purging the front of the stream never to be seen again.

Hence, we cannot compress emails "together". But we can compress them individually. Sadly, this hurts our compression ratio - and by a lot. The algorithm now must start building a new dictionary with each email. And emails are small so the dictionary isn't very mature by the time we're done compressing in many cases.

We can help this situation by giving the compression algorithm a pre-built dictionary. That is, scan a typical piece of data to be compressed, find common sequences and create a list of them. Then we give that dictionary to the compressor/decompressor as it takes off.

Woopsie. Again, the Mailinator stream is a living and breathing entity that's always changing. One minute might be a few million viagra spams, the next minute might be all about fake rolex watches. In other words, there is no "typical piece of data" -  a static dictionary built off a sample of emails will be obsolete in relatively short order.

So, the first idea was to build a sliding dictionary builder. Each email is scanned for string occurrences and we keep a count of them. Then every so often (minutes or hours), the compressor switches to using the most recently constructed dictionary. Every compressed email is given a reference to its dictionary so when/if it needs to be decompressed, it knows what dictionary to give the decompressor. Many thousands of emails share the same dictionary so RAM to store dictionaries isn't particularly significant.

Well, that's great and does restore LZMA back to about 60-70% but remember I mentioned I had  another problem with LZMA? Speed.

The C++ version of LZMA by Igor Pavlov compresses at about 1.7MB/s per CPU core  on my test machine. Um. no. Firstly, Mailinator can pull down tens of MB per sec at times. Secondly, no component of our processing pipeline can be allowed to take up this much CPU (my rule, not yours). We need our CPU for other things when large volumes of mail arrive. (The java version by the way was about the same speed).

Simply - LZMA is pretty awesome - but it's too slow for this purpose.

So the for the moment, I fell back to using a fast but simpler compression (zlib/LZW) on individual emails - and we sink down to about 40-50% savings from compression.

A Bigger Idea of a "Dictionary"

The next step for me was to think about email composition. We get lots of different types of email - but we get lots of the same types too. For example, we get lots of newsletters (people send them to Mailinator then read them via POP or RSS).

The nice thing for us is that a newsletter email blast could be 10,000 emails that are, all the same. Well, ok, not exactly - no two emails are ever the "same" because headers have times, dates, message-id's, etc. within them. But if we remove the headers, you can get 10,000 emails going into 10,000 different inboxes that all have the same message "body". Are you thinking what I'm thinking?

Right - store each email with it's own headers plus a pointer to ONE system-wide byte-array containing the newsletter body. What's the "compression" ratio of that? Well over 90%. And just to be a snot we can then apply compression to that byte array to eek out another few percent. We're reusing memory here so it's not exactly "compression" but we are reducing the size of the data sent to by some fantastic amount for this happy use case.

This isn't a revolutionary idea (online music libraries do the same thing) but it does fit pretty nicely in the Mailinator paradigm. Sadly apart from newsletters, not many other email sets, spam or otherwise have email bodies that are identical. In fact, spammers specifically change the subject line and destination url of every email they send for tracking and spam-detection-thwarting purposes. So what you get is something like this (headers omitted):

Email 1:
Buy vi4gra now!
http://rrr4.somerandomthing.com/?3jwow33oo
Happy man are you will be!

Email 2:
Buy vi4gra now!
http://1rr220.somerandomthing.com/?ajo200kko
Happy man are you will be!

So much for simply detecting identical email bodies. And this goes for less nefarious things too. Sign-up emails from websites will contain the same surrounding text with different names and validation urls inside.

What we could use here is a Longest Common Substring (LCS) algorithm. Basically, it would compare the two email bodies and be able to break them up as:

Common string 1:
Buy vi4gra now!\r\nhttp://

Disparate strings:
rrr4.somerandomthing.com/?3jwow33oo
1rr220.somerandomthing.com/?ajo200kko

Common string 2:
\r\nHappy man are you will be!

Nice .. each email is stored as 3 (compressed) byte arrays where 2 of those can be shared.

Unfortunately, classic LCS algorithms are expensive. Comparing two sequences is an O(nm) algorithm. And we're not interested in comparing two sequences, we're interested in comparing each new sequence (er.. each new email) with the few million that preceded it. Also, the LCS algorithm is also very memory expensive in the creation of trie datastructures - again, scaling to millions of emails just doesn't fit in our parameters.

Generally speaking, there are a lot of tricks I've noticed in analyzing algorithms. A few off the top of my head are: if you see an easy O(n^2) algorithm, it's rather likely there's an O(nlogn) one hiding in there somewhere. In contrast, if your dataset is small, you might be better off sticking to algorithms that make your CPU's cache and instruction pipeline happy instead of worrying about algorithmic running time (i.e. bubblesort > quicksort for small data). Lastly - if you can make assumptions about your data, you can often short-cut the classic algorithm with an good approximation.

Caching Lines

Cool, so let's assume something about the data. For emails, as it turns out, disparate parts of emails often occur on line boundaries (as you see in lines 1 & 3 above). A few same lines, a different one, a few more same. Instead of looking for common sequences based on individual characters, we can treat individual lines as units. Then we can attempt to find multiple occurrences of those lines. It cannot be as precise as LCS proper as in our above example (we would not find the identical portion "http://" in line 2) but we're basically settling for a greedy approximation, and one that works pretty well.

How do we store it though? LCS's tries would kill us. I know - let's use an LRU cache. Those darn things work for everything!  We can use an LRU cache that caches full email-lines. It will inherently flush out old email lines as the spam stream evolves (nice!) and will provide quick look- ups to compares thousands of lines at once (happy!). Specifically in Java, an LRU-cache is a synchronized LinkedHashMap with true as the last constructor parameter and an overridden removeEldestEntry.

So we store a few 10's of thousands of email lines in an LRU cache and then as each new email comes in, we check to see if that line is in the cache. If it is, we reuse the one in the cache instead of creating new storage for this email. By assuming all common sequences are bounded at newlines, we remove the boundary-discovery work LCS must do. Strictly speaking, we're cheating and losing some opportunity, but it's a good enough guess for this type of data.

This had a dramatic effect on our "compression" (again, it's slighty dubious to call it compression but, as you consider the big picture, our entire machinery of the LRU cache and bastardized LCS-in-spirit algorithm is creating a reuse-dictionary, it might not actually be compression - but it goes through several of the motions).

Caching Multi-lines

Caching lines is great - but what about caching multi-lines? Say we have a few emails - for brevity, assume each character in the following examples are email "lines":

Email 1:
ABC1

Email 2:
ABC2

Email 3:
ABC3

Email 4:
ABC4

So the first 3 lines are all the same in each email (ABC), the 4th lines are numbers which are not the same. Our algorithm:

1) Load a LINE and see if it's in the cache (if no more lines, quit)
2) .. if it's not there, put LINE in the cache, and store LINE in the email - GOTO 1
3) .. If it IS there:
4) .... see if LINE + NEXT_LINE is in the cache
5) .... if its not there, put LINE + NEXT_LINE into the cache and store LINE (which is a cache hit) in our email - GOTO 1
6) .... if it IS there, LINE = LINE + NEXT_LINE, - GOTO 4;

So if we run our 4 emails above through this algorithm. We get the following:

Running through all of email 1 - we get:
- Cache HITS stored in email: none
- Cache MISSES stored in email: A,B,C,1
- Cache contents afterwards (lru order): 1,C,B,A

Running through all of email 2 - we get:
- Cache HITS stored in email: A,B,C
- Cache MISSES stored in email: 2
- Cache contents afterwards (lru order): 2,C2,C,BC,B,AB,A,1

(notice how '1' (which didn't cache hit) has worked itself to the end)

Running through all of email 3 - we get:
- Cache HITS stored in email: AB,C
- Cache MISSES stored in email: 3
- Cache contents afterwards (lru order): 3,C3,ABC,AB,A,2,C2,BC,B,1

Running through all of email 4 - we get:
- Cache HITS stored in email: ABC   <-- very cool result, note coolness
- Cache MISSES stored in email: 4
- Cache contents afterwards (lru order): 4,ABC,AB,A,3,C3,C,2,C2,BC,B,1

So what happened? The system has realized that ABC is cacheable and is now pointing to that. All subsequent emails with the set-of-lines ABC will reuse the same memory. Note that the disparate lines 1,2,3, and 4 will always be stored separately, but the algorithm will then pick-up any common line-sets later in the email too (if there were any).

This elaborate system to find equal email lines and reuse them drags out compression of the entire flowing email stream down to about 80%. What about 90%?  Well.. one more trick.

Back to LZMA

Remember LZMA from above that we abandoned because it was too slow to happen inline? As you'd guess, the biggest impact it had was on bigger emails. And although it's a CPU hog, we do actually have a few cores laying around. So let's give it one (but seriously, just one).

We setup one core (i.e. thread) to trail behind and scan incoming email for ones that are over some size (say 20k) and re-compress those using the sliding dictionary LZMA we mentioned earlier. While 3 of our cores average 5-10% utilization by receiving, analyzing, and storing incoming email - the 4th core sits at 100% re-compressing emails where it will find benefit. If it gets too far behind, it simply leaps ahead and leaves some compression on the table.

(Note that empirically, LZMA is an order of magnitude faster decompressing than compressing, otherwise that would have been a new problem as it could take too long when someone wanted to read an email)

Voila. 90%. (Two notes: 1: that's a reasonable average at least... sometimes better, sometimes worse and 2: I realize I'm not exactly sure what "Voila" means, looking that up now).

There are also some other important notes. Storing a byte array in Java costs something. The pointer alone (64bit) is 8bytes. Then there is the byte length field, padding, etc. In other words, I limited the system to never store email lines under 64 bytes. Small lines get concatenated together straight away.

Second, there are more email-idiomatic tweaks we can do to improve the situation. Base64-encoded attachments are effectively un-cacheable, so we pass over those.

Third, although from our cheeky example it may seem like we're finding optimal line sets (i.e. ABC). We're not. We could end up caching ABC and destroying an opportunity for a more optimal BCDXYZ or something. I'm guessing this doesn't happen often but would be an interesting future consideration.

Edit: Wow, sincere thanks to an Anonymous commenter for making me reconsider the above algorithm. I had originally stated it was O(n^2). My first version was indeed O(n^2) (which wasn't written about) and after a few changes it became O(n) and I failed to see that. I find its very easy to find tech reviewers once an article hits Hackers News, before then though - not so much. :)   My apologies for the error.

So for the end-user, this whole diatribe simply means little except their emails are sticking around longer. They have no idea that when they click to read an email we may be LZW or LZMA decompressing tens of byte arrays shared by thousands of emails with a custom-sliding dictionary built by scanning emails that arrived hours ago and then catenating them together so they can be shown on their webpage all in a few milliseconds. And they likely don't care, they're probably too busy signing up for Minecraft or something.

But that's ok. I know.

And if you got this far, you know too.

Ok.. now back to real work. What was I doing again? Oh yeah, writing some slick one-liners in Ruby. No clue on the running times - probably like O(n^4) or something, but if I fiddle with it a bit more - I bet I can cut the character count of the code by half!