Wednesday, December 18, 2019

Mailinator Private Domains

The public Mailinator system has always had a privacy policy along the lines of "There is No Privacy" (see our Privacy Policy page for the more comprehensive version). It's no wonder, then, that we call that part of the system "Public Mailinator" - the design of the system itself doesn't allow privacy. Every inbox is available to anyone.

People often think of that system as "disposable email", which isn't particularly wrong, but we think of it more as a website that accepts email. Forget any notion you have of email "accounts" or "ownership". In the public system, every email and every inbox is public for everyone.

This is why, a few years ago, we created the concept of Mailinator Private Domains and the Mailinator Private system. Many companies use Mailinator to test their email workflow, whether that's automated testing of their sign-up system or load testing the deliverability (and speed) of their next marketing campaign. Mailinator can give them trillions of different email addresses on tap.

There's no need to create accounts: anytime an email arrives at Mailinator (public or private), the inbox is created to house that incoming email. And that goes for Mailinator Private Domains too.

Subscribers to the system get their own private domain - and that's private as in completely private to their team. The system assigns a new subscriber a private domain immediately (i.e., it will most definitely be something other than @mailinator.com). That said, subscribers can change their private domain to any domain they own - e.g. yourPrivateDomain.com.

In either case you can use any inbox you want @yourPrivateDomain.com.

Having your own private domain affords several big advantages:

  • Unlike the public Mailinator system, you don't just see one inbox at a time - you can see ALL the inboxes in your domain, in real time, on one web page. Some subscribers send at pretty high speeds, and it's amazing to watch what looks like a normal email inbox scrolling with incoming messages at high speed.

  • An API. You can access all your private domain email via the API (which happens to work for the public system as well).
  • Message Rules. It's your domain - it's your rules. You can say "all email that arrives at joe@yourPrivateDomain.com should be auto-forwarded to my gmail account". Or "all the links in the email should be automatically clicked". Or "the email should be converted to JSON and forwarded to this webhook". It's a pretty powerful system - a sketch of a minimal webhook receiver follows below.
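To give a flavor of the webhook rule, here's a minimal sketch of the receiving end, using the JDK's built-in HTTP server. The port and path are made up, and we make no assumptions here about the exact JSON payload shape - the handler just logs whatever arrives.

import com.sun.net.httpserver.HttpServer;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class WebhookReceiver {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        // Hypothetical path - point your Message Rule's webhook at this URL.
        server.createContext("/mailinator-hook", exchange -> {
            try (InputStream in = exchange.getRequestBody()) {
                String json = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                System.out.println("email received: " + json); // inspect or store as needed
            }
            exchange.sendResponseHeaders(200, -1); // acknowledge: 200 with empty body
            exchange.close();
        });
        server.start();
    }
}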
If you're using Mailinator for testing your email workflow, it's worth your time to try out the Private system. The advantages are real. Try out a Free Trial today.

Monday, November 11, 2019

Mailinator Updates: new API, SMS messaging, and Routing Rules!



SMS numbers now available

Need to add SMS workflow testing to your QA process? Mailinator offers private SMS inboxes for all your testing needs. SMS messages are retrievable via the Web, the API, and Rules - just like all incoming email.
 

Fully Active Rule System

Set up Rules to automatically react to emails arriving at your Private Domain. Emails can be forwarded via SMTP or WebHook, have links within them automatically "clicked", or be routed to an "email blackhole" for high-speed testing.
 
The Mailinator API lets you pull emails; the Rule system tells Mailinator to push them back to you!

New API

We've redesigned our API. Apart from a more logical REST flow, the new API has some new capabilities:
  • List and retrieve attachments of any MIME type
  • You can now inject emails into the system using HTTP POST. This is very handy for testing message flows when you want instant delivery.
New format:
GET /v2/domains/:domain/inboxes/:inbox/messages/:message_id
 
(Note: The old API will remain active. You do not need to migrate to the new API at this time.)
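As a rough sketch of calling the new format from Java - the base URL and auth header here are assumptions, so check the documentation linked below for the exact host and authentication scheme:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FetchMessage {
    public static void main(String[] args) throws Exception {
        // Hypothetical values - substitute your own domain, inbox, message id, and token.
        String url = "https://mailinator.com/api"  // base URL is an assumption; see the docs
                + "/v2/domains/yourPrivateDomain.com/inboxes/joe/messages/some-message-id";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("Authorization", "YOUR_API_TOKEN")  // auth scheme is an assumption
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());  // the message as JSON
    }
}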

View All the Mailinator System Documentation here:

https://manybrain.github.io/m8rdocs/#mailinator

Sunday, July 15, 2018

Mailinator.com : Anatomy of a Spammy Campaign



Mailinator is a popular disposable email service. It's also become a great tool for QA Teams to test email receipt, acknowledgment, authentication loops, and formatting. But in the beginning (um, wow - 15 years ago? really?) it was a tool to help people avoid spam.

Mailinator gets many millions of emails per day. At peak, it can be thousands of emails per second. Over time it has become ever more efficient at dealing with the deluge: processing, compressing, storing, and analyzing it all. Once someone uses a Mailinator address, they tend to abandon it and use a different one next time. But the Internet never forgets. It seems like every Mailinator address ever used has found its way onto a marketing list somewhere.

Here's a picture of the last 6 days of email entering the system:


Six days of frequent Subjects coming to Mailinator

This graph (click on the image for a bigger version) shows a subset of email entering the Mailinator system. Each vertical bar represents a 10-minute interval. For each interval, we count emails whose subject line arrived 10 or more times within it. In each bar, a colored segment represents emails with a single subject line. Blasts of email that come as one-offs aren't on this graph, so this isn't so much a representation of Mailinator's total volume as a segment of our flow that feels spammy.
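For the curious, the aggregation behind a graph like this is simple to sketch. Here's a rough version (not our production code) that buckets arrival times into 10-minute intervals and counts identical subject lines per bucket:

import java.util.HashMap;
import java.util.Map;

public class SubjectBuckets {
    // bucket index -> (subject line -> count)
    private final Map<Long, Map<String, Integer>> counts = new HashMap<>();

    // Bucket arrival times into 10-minute intervals and count identical
    // subject lines per bucket; subjects hitting 10+ are what get plotted.
    void record(long epochMillis, String subject) {
        long bucket = epochMillis / (10 * 60 * 1000); // 10-minute bucket index
        counts.computeIfAbsent(bucket, b -> new HashMap<>())
              .merge(subject, 1, Integer::sum);
    }
}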

Also, we're only selecting from email sent to the Public Mailinator system. Our subscribers get a separate, private (persistent) email system where they can send their test emails. None of those emails are represented here - and they wouldn't be anyway since they're not spam.

Look at that spike of 190k emails on July 4th at 12:40pm in the center of the graph. You might also notice that most of that bar is one color (green). That shows the many tens of thousands of emails with the same subject line that Mailinator received in that 10-minute interval.

Consider also the corresponding graph:


Six days of sending IP addresses

This graph shows the IP addresses that sent emails in the same time period. Notice that the spike is still there, but the graphs don't match up. As with the first graph, only IP addresses that sent 10 or more emails within a 10-minute interval appear. Except for a few spots, you don't really see the same incidence of colored bands - the IP addresses are far more spread out. The implication is that many, many IP addresses are responsible for sending the same email to Mailinator. It's not uncommon to see thousands of IP addresses involved in one spam blast.

Even so, you can still see that spike at 7/4/2018 12:40pm. That was a company sending a 'marketing campaign' from one IP address. We won't mention who sent those (what they're doing isn't particularly interesting, and is far from unique), but it's likely that IP address was blacklisted in Gmail and other more discriminating email systems. Clearly, Mailinator isn't particularly discriminating.

Here's a closer look at the top graph:


Spam Campaign at 5am

Here we've zoomed in on the "Top Subject Lines" graph for 7/5/2018 3:10am to 7/5/2018 8:47pm. The first thing we notice is that spammers are prompt! That campaign starts exactly at 5am.

Look at the overlay. From 6:40 to 6:50 we got just over 60,000 emails that fit our spammy profile. There are three spam campaigns listed. The first one is for Michael Kors Bags. Classic Style! 90% off! What a steal! Well, I'm sure they're very nice and convey an exclusive feeling. They sent our system 15,400 exclusive offers in that 10 minute interval.

Next in line looks to be (just guessing) a phishing campaign for people who might happen to have a Netgear router.

The third one is a combination of several subject lines all coming in around 4.4k, 4.5k, and 4.6k (if we mouse-over other intervals the Netgear campaign is all over the place, but this one is consistently 4.5k).

Notice we've highlighted an email with the subject line "Welcome to our company:". Let's consider a graph that shows all the IP addresses that sent us an email with that subject line in that time interval - and we'll collect them all, not just the ones that sent 10 or more:


One Email - thousands of Sending IPs


So many different colors! That means a lot of different IP addresses.

Check out the mouseover for that 10 minute interval. For all those emails, the most we got from a single IP address was 3. That's a pretty well distributed spam network! Good job guys.

That email and its siblings ("New offer", "Good day!", and "Hello!") are all part of the same campaign, sent from many thousands of computers. From what we can see, it's been going on for more than a week.

For a while now, Mailinator has used this type of information to better store the emails that people are actually requesting. Remember, Mailinator doesn't have user accounts associated with inboxes.

The public inboxes here don't belong to anyone. The very nature of Mailinator is that inboxes are completely public. A fair characterization of our privacy policy is: "At Mailinator, there is none!" This is a good time to remind people that they should never send private information to Mailinator's public system (our FAQ used to say that if you sent super-sensitive private stuff here then you were a stupidhead - but now we're more polite).

[But still, please, don't do it]

We think we can help provide pretty good anonymity in some cases. The public system doesn't really want to know who you are.

Mailinator populates inboxes when email arrives. One way to think about our system is that all inboxes (trillions of them *) already exist, it's just that most aren't currently being used. Many users have found us to be a very handy tool for receiving a one-off email. More and more people are using us for email testing and that's very encouraging.

The sort of analysis we talk about here has helped us make some useful and interesting architectural choices, and it has informed the way we deal with the enormous volume of email that Mailinator has evolved to receive.

It's also got us thinking - what sort of information lurks in data that flows our way every day? We have tools and a back-end infrastructure that allows a very fine-grained look at a really large volume of spam. Now that we have the ability, we're really excited to dig a little deeper into this pile of potted meat product.

Thanks for using Mailinator and keep sending it spam. It makes our graphs look pretty cool.




* It's actually much more than trillions. We looked at the RFCs, and with 64 characters in the <local-part> of an address, the weird mix of rules for allowable characters, and even weirder rules for when and where some characters are and aren't allowed, it's a bit convoluted. But with a crayon and the back of an old envelope, we figure a decent approximation is something like 2.4 x 10^111 inboxes for a domain.

If your crayon is sharper than ours, we'd love to hear your estimate.


Also, Mailinator can handle 212 characters in the <local-part> - which makes the number considerably larger.


You might need two crayons.
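If your crayon needs a starting point, here's one way to redo the envelope math. The "effective alphabet" of 55 characters is our assumption (it happens to reproduce the 2.4 x 10^111 figure):

import java.math.BigInteger;

public class CrayonMath {
    public static void main(String[] args) {
        // Assume ~55 usable characters per position once the weird
        // allowability rules shake out - this is a guess, not the RFC.
        BigInteger alphabet = BigInteger.valueOf(55);
        System.out.println(alphabet.pow(64));   // 64-char local-part: ~2.4 x 10^111
        System.out.println(alphabet.pow(212));  // 212 chars: bring the second crayon
    }
}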




Friday, July 28, 2017

Introducing the Mailinator Real-Time Inbox

Mailinator is changing the way email works.

For most users, an inbox is an email’s final destination on a journey through the internet. It’s an endpoint, your endpoint, that holds your emails as they come to you. Checking your email, or refreshing your browser, is when your inbox asks if there’s anything new for you.

But Mailinator doesn’t see it that way because inboxes @mailinator.com don’t really belong to anyone. The public Mailinator system gets many (oh so many) millions of emails per day - and you can read any of them. So can anyone else. With Mailinator, you don’t even have to create an inbox. You can just start sending email to it.

One way to think about it is that all inboxes already exist. Another way to think about it is that Mailinator creates inboxes on-the-fly as email arrives.




From Mailinator’s point of view the email flow is relentless and constant. We call it the Mailinator Stream. Previously, inboxes got new mail the way all inboxes do - periodically they asked the server if there was any new email to display.

But now we’ve hooked you up to the stream. Directly. To every inbox. In real time.

Welcome to the Mailinator Real-time Inbox. 

The moment someone sends an email to Mailinator, it'll show up on your screen (just as fast as the internet allows). There's no delay. You'll never again have to hit 'refresh' on a Mailinator inbox. (Go ahead and give it a try - pick an inbox @mailinator.com, send an email to it, and watch it arrive an instant later. Be sure to turn off the "undo" feature of Gmail or it'll wait 30 seconds before sending.)

Now, with Mailinator’s inbox page tapped into the stream, an inbox is a filter. Everything that matches your filter will show up in real time.

For most users, that won’t make much of a difference. You’ll get your anonymous, no-sign-up emails like always, except now they'll arrive instantly. An inbox will still act like an inbox.

But paid subscribers with private domains can query the stream at a deeper level. (For those who didn't know, you can upgrade your Mailinator account to get your own private, not-auto-deleting version of Mailinator pointing at your own domain that your whole team can access simultaneously.)

Subscribers can create more complex filters and persistent queries that capture more than just a single inbox from the stream. You can tap into multiple inboxes at the same time. You can also use a trailing asterisk as a wildcard: asking your inbox to capture testing* will fill it with all the emails sent to testing1, testing99, testing-xyz, and anything else that matches.
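Conceptually, the trailing-asterisk match is just a prefix test. A minimal sketch (not the actual stream-matching code):

// Sketch: trailing-asterisk wildcard matching for inbox filters.
static boolean matchesFilter(String pattern, String inboxName) {
    if (pattern.endsWith("*")) {
        String prefix = pattern.substring(0, pattern.length() - 1);
        return inboxName.startsWith(prefix);  // testing* matches testing1, testing99, ...
    }
    return inboxName.equals(pattern);         // no wildcard: exact match
}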

Mailinator has always been about providing public email addresses for anyone to use (no signup required), and now it does it in real-time with some great new features. Interested in your own private Mailinator where every inbox you can think up is yours (instantly and wildcard searchable)? Request a Trial today.


Thanks for using Mailinator!

Thursday, May 4, 2017

Mailinator and the Recent Google Docs Phishing Attack

Yesterday (Wednesday, May 3rd), someone launched a clever phishing attack against Google users.

They wrote a little application and falsely named it "Google Docs." Given the chance, it compromises your Gmail account and reads your contacts list. Then, using Google’s own email servers, it sends itself to all your contacts.

This kind of attack lives or dies by how successful it is at getting people to believe it and click through. If no one clicks, or if no one grants it the permissions it asks for, then it doesn't spread and nothing happens. But every time someone clicks through and grants the app the access it wants, the attack multiplies by the number of times it propagates.

The attack was very successful at getting people to click through and grant it permissions, and it spread very quickly.

But it did so in a quirky way, one that made Mailinator one of its victims. It emailed itself from one victim's account to the next set of targets by BCC (blind carbon-copy). If you received the email, it came to you that way. Your email address wasn't in the TO field, it was in the BCC field.

That's a little bit clever. Since the BCC: is blind, the victim doesn't see themselves included in a long list of recipients (many of whom they don't recognize), which is often a big clue that the email is nefarious (or some dumb joke that a clueless relative has forwarded to everyone they know). Any one attack email might have been BCC'd to dozens of new victims.


When you send an email, you can't just BCC people. To make it legit, there has to be something in the TO field. Since all the victim's contacts were in the BCC field, the address the emails are TO isn't part of the attack. The attacker just needs a dumping ground. Something that looks just-so to the target; not a name that stands out for being unfamiliar, but also a real address that works and won’t bounce. That will improve deliverability and believability. So who did the attacker choose to send these emails to?

hhhhhhhhhhhhhhhh@mailinator.com

What is the significance of that inbox? As far as we can tell, there isn't one. Since all addresses already exist at Mailinator, using your finger to hold down a key for a bit generates a completely innocuous - but usable - inbox name. The significance of using mailinator.com is clear: our well-deserved reputation for successfully handling high volumes of incoming email.

In effect, they were relying on Mailinator’s proven ability to receive lots of email.

From the attacker's perspective, it doesn't matter much what is in the TO field. They just need to make sure that the emails get out the door. Mailinator is very good at receiving emails, the attack spread very quickly, and very soon the hhhhhhhhhhhhhhhh inbox at Mailinator was getting thousands of emails.

We noticed the activity early on, and shut down the inbound stream of emails to the hhhhhhhhhhhhhhhh inbox. Unfortunately, this did nothing to stop the attack. That's because nothing about the attack was happening via Mailinator. Mailinator was simply another recipient of the email. The attack could not propagate itself via Mailinator but it sure could send us email when other people propagated it. By the time we shut down the inbound stream, hundreds of thousands of emails to hhhhhhhhhhhhhhhh had shoved their way through our system. Any one inbox at Mailinator only shows 50 emails, and that inbox was (at peak) receiving a few thousand emails per second. Because the volume was so large, it wasn’t possible to read any of these emails at Mailinator. Before anyone could have clicked on one to read it, Mailinator had expired it to make room for thousands more that were pouring in.

The emails to hhhhhhhhhhhhhhhh were of no consequence to the attack. Even the name was irrelevant.

Strangely, we've seen a few articles mentioning that that inbox was the sender. Most even showed a screenshot on the same page contradicting that idea with the Mailinator address clearly in the TO field and one of the victim's contacts as the sender.

Just to be clear: the Gmail phishing attack sent email to a victim’s contacts using Google’s email servers and all emails were FROM another Google user. Each time the attack propagated, it also emailed TO Mailinator as a receiver, nothing more. None of those emails came from Mailinator, they came from Gmail.

It's true that all of the attack emails were TO: hhhhhhhhhhhhhhhh@mailinator.com, but it seems pretty clear that if you're looking for where the emails came from, you'd want to look at the FROM: field.

Additionally, the emails can't be from Mailinator because Mailinator can't send email. It’s a receive-only service. The Mailinator system was not part of the attack - it was just a recipient like all the other victims. The only difference between us and the other recipients was that we received hundreds of thousands of these emails in a very short time.

If you visit that inbox now (here's a link to hhhhhhhhhhhhhhhh inbox - don't worry, it's safe to click, it will take you right to Mailinator), you’ll notice no such phishing emails. We turned the inbound stream of attack emails off very soon after the phishing scam started.

Service to our legitimate users was uninterrupted.

Mailinator remains, as always, the best place to get a free, disposable email. We can’t prevent people from sending email to us (receiving email is the whole point of the service!) and we still love our regular users: thousands of QA Teams that send us millions of test emails, and you - whenever you want to protect your real email address from spam.

Thursday, August 27, 2015

Mailinator's API : Closing the Loop for QA

Mailinator's API provides programmatic access to all email within the Mailinator system. If your QA department has a need for delivery testing, this is an unprecedented way to have immediate access to thousands of email inboxes.

Mailinator API docs

Since we launched the API a year ago, we've been constantly surprised by the uses people have found for it. Testing signup systems (with a confirmation email) seems to be a popular choice, but other customers test marketing campaigns and automated alert systems, sending us thousands of emails each day.

The API works on both the public Mailinator system and the private domain system. For the public system you have instant access to any of Mailinator's millions of emails, obviously including the emails you send there.

If you're looking for more privacy, you can set up a private domain, and only you (and your API token) can see emails that arrive for that domain.
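A common QA pattern is to trigger an action in the system under test, then poll the inbox until the expected email shows up. Here's a hypothetical sketch of that loop - fetchInbox stands in for whatever API call you use to list an inbox's messages:

import java.util.function.Supplier;

public class AwaitEmail {
    static String awaitEmail(Supplier<String> fetchInbox, String subjectFragment)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + 30_000;   // wait up to 30 seconds
        while (System.currentTimeMillis() < deadline) {
            String inboxJson = fetchInbox.get();               // hypothetical API call
            if (inboxJson.contains(subjectFragment)) {         // crude; parse JSON in real tests
                return inboxJson;
            }
            Thread.sleep(1_000);                               // poll once per second
        }
        throw new AssertionError("expected email never arrived");
    }
}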


If you're interested in accessing emails via API - Sign up at Mailinator today and try out the API!

Monday, September 22, 2014

Playing Wasteland 2? Find the Mailinator gun!

Mailinator has always been about vanquishing spam - and now, in Wasteland 2, you can vanquish some Wasteland Wolves using "The Mailinator" gun!

Mailinator is proud to be a backer of the newly released Wasteland 2. If you're playing the game (and you should be!), be sure to look out for the gun called "The Mailinator".

(hint: you might consider having one of your rangers specialize in Heavy Weapons!)

Saturday, November 9, 2013

Increasing Brand-Value of Mailinator.com with a Web Comic

[This is a guest post by Marketing/Growth-Hacking consultant Martha Andrews - contact her directly at martha@manybrain.com.  Follow Martha on Twitter @another_martha]

We have been investigating ways to increase brand awareness for Mailinator.com.

Mailinator has always been a useful site providing free, disposable email - but analysis of Mailinator's usage patterns indicated that users tended to use it on a monthly or bi-weekly basis rather than weekly or daily. In some ways this makes sense - the site provides a utility, but the general public doesn't necessarily need that utility daily. When it is needed, however, our goal is to make sure you think of Mailinator.com!

In order to raise usage levels of the site, we focused on public awareness, expecting that increased DAU (and MAU) would be a natural side effect. I've frequented many trainings in the last year, and I cannot tell you how many times I've heard, "Create interesting and useful content and people will flock to your site... It's all about providing content."

But what content could we give to our users?  We already give away the service – we have a useful tool, and want more folks to know about it and use it.

Before creating content, something to consider is who is in our user base. Web analytics told us the primary user base is male, ages 18-40 - think of the average Reddit user. So how do we engage 18-40 year old men, and potentially other demographic groups? How do we get them to think of Mailinator.com more often? What could we provide that would be of use to them? Getting folks to think of the site more often, and giving them content and reasons to visit, would likely expand their current usage pattern.

We tried targeted ads through traditional web and social channels - specifically, we experimented with Google Adwords, Reddit.com, Facebook, and Twitter. The ads provided momentary increases in traffic to varying degrees, but nothing long lasting. We can get more users to Mailinator any day of the week by spending some money, but that cannot be sustained given that mailinator.com does not generate direct user revenue.

During a recent site redesign we incorporated a new mascot of sorts – an amiable looking caped crusader we dubbed “Mailinator Guy” - the Anti-Spam Superhero. 
The more we lived with Mailinator Guy, the more often we started asking: what if we launched a web comic based on him? Would a comic be appropriate content? If you've ever read Mailinator's FAQ, you know the tone of the site is already quite tongue-in-cheek - we at least amuse ourselves. Frankly, the thought of bringing Mailinator Guy to life was simply too fun an experiment not to try.

ComicRocket alone has over 35,000 web comics indexed, so competition for user attention is stiff. However, we found Facebook ads for the comic highly successful in engaging a new audience. We think they work well because they are demographically targetable, colorful, eye-catching, and fun. A thumbnail or portion of the strip is always displayed in the posts and ads.

We also engage our existing audience through the Mailinator Facebook page (with >30,000 likes), where there is an announcement whenever a new comic is published. Additionally, the colorful thumbnail version of the strip is shown on the landing page and on mailbox pages at mailinator.com. We have found that by releasing a comic each week, we are slowly habituating our audience to check mailinatorguy.com weekly - and if they follow us on Twitter or are Facebook fans, we have an excuse to contact them.

The feedback loop between the two sites is strong. Mailinator.com provides the free disposable email utility that brings in new users by itself. The site's design reinforces the Mailinator Guy character and encourages our primary demographic (arguably also the primary demographic for web comics) to follow the comic. The MailinatorGuy.com site provides a weekly enticement for a visit, and that visit keeps the Mailinator brand fresh in our users' minds - closing the loop.

Of course some users of Mailinator will never care about the comic. And some users of the comic will never care about mailinator.com.  And that's just fine - both sites have inherent stand-alone value.

At the time of this writing we have published six episodes of The Adventures of Mailinator Guy. Our advertising efforts, both paid and via direct referrals from mailinator.com, have driven traffic from an average of 1,500 unique weekly visitors over the first three weeks to over 3,000 over the following three weeks. Traffic peaks on Mondays, when we have usually just posted and announced a new episode. Along with the increase in users, producing the comic has become more efficient and economical - and it's still pretty darn fun.

Importantly - traffic to mailinator.com has increased 15% in the 6 weeks since the launch of the comic and continues to grow.

These numbers are very encouraging but honestly the most important outcome is the increase in Mailinator brand recognition. The site's brand value will continue to grow and that's the real goal of this exercise.  

At the end of the day, everyone is happy. More people know about the Mailinator brand - which makes us happy - and they have either avoided some spam by using our free disposable email service, had a laugh at the expense of Mailinator Guy and his sidekick (you'll have to read the strip to find out about him), or both.

If you're up for following the adventures of Mailinator Guy yourself - here's the link to get you started:

The Adventures of Mailinator Guy! 






Wednesday, May 2, 2012

How to get your resume "Silicon Valley Ready" - Part I

Per my last post, I've been given the opportunity to review a nice pile of resumes. As I am prone to do, this got me to obsess a tad over how the resumes were put together and, more importantly, what each told me.

What I perceived as issues are, in retrospect, my fault, not the resume owners'. That's because, per the entire point of my last post, the start-up environment is radically different from the corporate IT department. And the latter is where many of these resumes came from (which is exactly what I wanted and asked for).

In many cases, I received a resume that included the regular set of data - experience, education, skills, etc. But the ones that got me excited were the ones where the person included, in the email, links to the websites or mobile apps they had built. As I've said, the number one selling point for you as an engineer looking to get a job in Silicon Valley is that you love this stuff. There's an age-old conundrum of new grads who say, "Employers want me to have experience before they'll hire me - but how do I get experience if I can't get a job?"

In our business I'm happy to say - that problem does not exist. Simply because you don't need anyone to give you a job to build something. A website. A mobile app. Heck, a program that finds smaller sets of strings in larger ones.

I realized that's probably the number one thing I'm looking for. You can show me, with no doubts, that as a software engineer - you can build something. Start to finish.

Interestingly, I like to think that I also don't put that much weight on whether a project was a commercial success. If it was, that's nice - maybe it's because you're not only a great engineer but also have an awesome product sense - who knows (it just might mean you were lucky too). And unless it wasn't a commercial success because it was poorly engineered, that's not really the point. The point is that you built it. Or at least some non-trivial part of it.

With that in mind, this article sprang forth about what I like to see in resumes. I'll point out that this isn't very different from what I looked for when I was on a hiring committee at Google (so there are at least some current Google engineers who are partially there because of these thoughts).

First - I propose a new section to resumes - at least for software engineers. In addition to Experience, Education, Skills, Interest, and References (not suggesting we remove any of those) - I propose we add Cool Stuff I Have Built.

If your resume is going to go over one page (which, personally, I don't mind) - I'm hoping it's because of this section.

Any project you did solo or had a major hand in - whether paid or not, a million users or just your mom - I'd love to know about. Websites, iPhone apps, Android apps, desktop apps, open-source projects, GitHub accounts - you name it. Solo or as part of a team (indicate which). But it has to be, in one way or another, "finished". Even if your iPhone app was rejected by the App Store, you can point me to a link to see it. It doesn't have to be a product either - maybe it's an open-source project. The bottom line is that it's something you "finished". You executed. Your idea became a living, breathing application or piece of code that, in some way, somehow, you could show to people.

The section might be broken up into individual projects with bullet points about each. For example:

Project name: Mailinator
Technologies used: Java, Tomcat, (no database)
Team size: 5'11", 175lbs (haha)
Implementation details/challenges: Custom server architecture built as an experimental test-bed for highly-multithreaded server design. Custom SMTP server. No database as emails are stored in RAM with a custom compression scheme.
Notable metrics: up to 25MM emails per day, ~20k users per day, runs on a single server (on purpose, as part of a personal experiment to optimize the system).
Relevant links: www.mailinator.com, http://mailinator.blogspot.com/2012/02/how-mailinator-compresses-email-by-90.html


Surely you could add other bullet points too (suggestions welcome - leave a comment on this post). But you get the idea.

My previous post resonated strongly with some people - that is, people in "IT departments" feeling like they weren't growing technically. And as you can imagine, a resume telling me you built a payroll system is great - but that's not what (most) start-ups are building. But what if you haven't built anything, and your "Cool Stuff I Have Built" section is empty?

Well .. fix that.

No one is stopping you from building something. No one said you're ready to transition out of your current job today, and, as with much of life, it's up to you to take yourself to the next level. But nearly anyone who already writes software, with a penchant for learning and some ambition, can spend the next few months of nights and weekends learning and building. (And it's absolutely possible that your day-job accomplishments belong in that cool-stuff section too.)

So if by day you're a payroll guy, but by night you're an iPhone ninja - you've got my attention. Not only because you have the skills I'm looking for, but because in your spare time you're out doing great things. And instead of going out every night drinking with your friends, at least some of those nights you chose to stay home and learn and build cool stuff. And why would you do something like that? Simple - you love this stuff.

(My start-up is located in Palo Alto and I am right now interviewing for the initial engineering team. We're well-funded, building cool stuff, and plan to change, ya know, the world. No matter where you are - if you're a software engineer, willing to relocate to San Francisco/Silicon Valley (and of course, love to build great things) send me your resume. paul@refresh.io or check out www.refresh.io/jobs)

Tuesday, February 21, 2012

How Mailinator compresses email by 90%

Given the title of this article, the first thing that should pop into your mind is probably - "well, use a compression algorithm - right?".

Right! Well, yes, well, not exactly. Read on.

Your second thought might also have been - "Why bother? Just buy more disks." Which, in the big picture, is also not a bad answer. But for Mailinator that doesn't work - if you have read previous Mailinator tech articles, you might know that Mailinator stores all its email in RAM.

There were good reasons for that when Mailinator started. One was the use case - which was always disposable email that lasts a few hours (rather longer nowadays). Secondly, when Mailinator started, disks and datastores weren't as sophisticated/fast as they are now.

Also, Mailinator is (and always was) a free service, so keeping costs down has always been important. To this day, Mailinator runs on a single server. It averages about 4-5 terabytes of bandwidth a month, and the peak incoming email rate I've seen is about 3500 emails/sec (this is just a production observation; the server's limit is bandwidth, not CPU).

And finally - last but not least - to me, much of web and application development today is utterly devoid of any fun algorithms. I spend a non-trivial amount of time in interpreted/dynamic scripting languages that do a fantastic job of hiding (or at least luring me away from thinking about) algorithmic complexity. I've probably inadvertently written more n^3 algorithms than, um, (n^3)-for-some-large-value-of-n.

Mailinator has always been my test bed for trying fun ideas, algorithms, and datastructures. In other words - I probably didn't need to do all the work I'm writing about here - but I definitely did have fun doing it (probably should have been out talking to girls, but alas).

Compression

Ok - so back to 90% compression.

So to start testing, I grabbed a few hundred megs of the Mailinator stream and ran it through several compressors - mostly just stuff I had on hand: 7z, bzip2, gzip, etc. Venerable zip reduced the file by 63%. Not bad. Then I tried the LZMA/2 algorithm (7z), which got it down by 85%!

Well. OK! Article is over! Everyone out! 85% is good enough.

Actually - there were two problems with that result. One was that LZMA, like many compression algorithms, builds its dictionary from the data it has already seen: as it compresses, it collects common sequences into a dictionary and uses that dictionary to compress everything thereafter.

That works great on static files - but Mailinator is not a static file. It's a big, honking, several-gigabyte cache of ever-changing email. If I compressed a million emails together and then some user wanted to read email #502,922, I'd have to "seek" through the preceding half-million or so to build the dictionary needed to decompress it. That's not feasible. And, as I said, the Mailinator cache is constantly throwing out old emails and putting in new ones.

In other words, an algorithm that relies on previous entries to build a dictionary can't work given that we keep purging the front of the stream never to be seen again.

Hence, we cannot compress emails "together". But we can compress them individually. Sadly, this hurts our compression ratio - and by a lot. The algorithm must now start building a new dictionary with each email, and since emails are small, in many cases the dictionary isn't very mature by the time we're done compressing.

We can help this situation by giving the compression algorithm a pre-built dictionary. That is, scan a typical piece of data to be compressed, find common sequences and create a list of them. Then we give that dictionary to the compressor/decompressor as it takes off.

Woopsie. Again, the Mailinator stream is a living, breathing entity that's always changing. One minute might be a few million viagra spams; the next minute might be all about fake rolex watches. In other words, there is no "typical piece of data" - a static dictionary built off a sample of emails will be obsolete in relatively short order.

So the first idea was to build a sliding dictionary builder. Each email is scanned for string occurrences, and we keep a count of them. Then, every so often (minutes or hours), the compressor switches to the most recently constructed dictionary. Every compressed email is given a reference to its dictionary, so when (if) it needs to be decompressed, we know which dictionary to give the decompressor. Many thousands of emails share the same dictionary, so the RAM needed to store dictionaries isn't particularly significant.

Well, that's great, and it does restore LZMA to about 60-70% - but remember I mentioned I had another problem with LZMA? Speed.

The C++ version of LZMA by Igor Pavlov compresses at about 1.7MB/s per CPU core on my test machine. Um, no. Firstly, Mailinator can pull down tens of MB per second at times. Secondly, no component of our processing pipeline can be allowed to take up this much CPU (my rule, not yours) - we need our CPU for other things when large volumes of mail arrive. (The Java version, by the way, was about the same speed.)

Simply - LZMA is pretty awesome - but it's too slow for this purpose.

So for the moment, I fell back to using a fast but simpler compression (zlib/LZW) on individual emails - and we sink down to about 40-50% savings from compression.
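As a sketch of the per-email approach: java.util.zip supports preset dictionaries directly. The dictionary bytes here would be whatever the sliding dictionary builder most recently produced - this is illustrative, not our production code:

import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class DictZlib {
    // Compress one email body, priming zlib with a preset dictionary of
    // recently common sequences.
    static byte[] compress(byte[] input, byte[] dict) {
        Deflater d = new Deflater();
        d.setDictionary(dict);
        d.setInput(input);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!d.finished()) {
            out.write(buf, 0, d.deflate(buf));
        }
        d.end();
        return out.toByteArray();
    }

    // Decompress; zlib signals when it needs the (matching) dictionary.
    static byte[] decompress(byte[] input, byte[] dict) throws DataFormatException {
        Inflater inf = new Inflater();
        inf.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!inf.finished()) {
            int n = inf.inflate(buf);
            if (n == 0 && inf.needsDictionary()) {
                inf.setDictionary(dict);  // must be the compressor's dictionary
            } else {
                out.write(buf, 0, n);
            }
        }
        inf.end();
        return out.toByteArray();
    }
}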

A Bigger Idea of a "Dictionary"

The next step for me was to think about email composition. We get lots of different types of email - but we get lots of the same types too. For example, we get lots of newsletters (people send them to Mailinator then read them via POP or RSS).

The nice thing for us is that a newsletter email blast could be 10,000 emails that are all the same. Well, ok, not exactly - no two emails are ever the "same", because headers have times, dates, message-ids, etc. within them. But if we remove the headers, you can get 10,000 emails going into 10,000 different inboxes that all have the same message "body". Are you thinking what I'm thinking?

Right - store each email with its own headers plus a pointer to ONE system-wide byte array containing the newsletter body. What's the "compression" ratio of that? Well over 90%. And just to be a snot, we can then apply compression to that byte array to eke out another few percent. We're reusing memory here, so it's not exactly "compression", but we are reducing the size of the data we store by some fantastic amount for this happy use case.
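A sketch of the idea in modern Java, keying shared bodies by a content hash (and gleefully ignoring the astronomically unlikely hash collision):

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.concurrent.ConcurrentHashMap;

class BodyPool {
    // content hash -> the single shared copy of that body
    private final ConcurrentHashMap<String, byte[]> pool = new ConcurrentHashMap<>();

    // Return the canonical copy of this body; 10,000 identical newsletters
    // end up sharing one byte[].
    byte[] intern(byte[] body) {
        try {
            String key = Base64.getEncoder()
                    .encodeToString(MessageDigest.getInstance("SHA-256").digest(body));
            return pool.computeIfAbsent(key, k -> body);
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e); // SHA-256 is always available
        }
    }
}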

This isn't a revolutionary idea (online music libraries do the same thing), but it does fit pretty nicely into the Mailinator paradigm. Sadly, apart from newsletters, not many other email sets, spam or otherwise, have identical bodies. In fact, spammers specifically change the subject line and destination URL of every email they send, for tracking and spam-detection-thwarting purposes. So what you get is something like this (headers omitted):

Email 1:
Buy vi4gra now!
http://rrr4.somerandomthing.com/?3jwow33oo
Happy man are you will be!

Email 2:
Buy vi4gra now!
http://1rr220.somerandomthing.com/?ajo200kko
Happy man are you will be!

So much for simply detecting identical email bodies. And this goes for less nefarious things too: sign-up emails from websites will contain the same surrounding text with different names and validation URLs inside.

What we could use here is a Longest Common Substring (LCS) algorithm. Basically, it would compare the two email bodies and be able to break them up as:

Common string 1:
Buy vi4gra now!\r\nhttp://

Disparate strings:
rrr4.somerandomthing.com/?3jwow33oo
1rr220.somerandomthing.com/?ajo200kko

Common string 2:
\r\nHappy man are you will be!

Nice .. each email is stored as 3 (compressed) byte arrays where 2 of those can be shared.

Unfortunately, classic LCS algorithms are expensive. Comparing two sequences is an O(nm) affair, and we're not interested in comparing two sequences - we're interested in comparing each new sequence (er, each new email) with the few million that preceded it. The LCS algorithm is also very memory-expensive, given the trie datastructures it builds - again, scaling that to millions of emails just doesn't fit our parameters.

Generally speaking, there are a lot of tricks I've noticed in analyzing algorithms. A few off the top of my head: if you see an easy O(n^2) algorithm, it's rather likely there's an O(nlogn) one hiding in there somewhere. In contrast, if your dataset is small, you might be better off sticking to algorithms that make your CPU's cache and instruction pipeline happy instead of worrying about algorithmic running time (i.e. bubblesort > quicksort for small data). Lastly, if you can make assumptions about your data, you can often short-cut the classic algorithm with a good approximation.

Caching Lines

Cool, so let's assume something about the data. For emails, as it turns out, the disparate parts often fall on line boundaries (as you see in lines 1 & 3 above): a few identical lines, a different one, a few more identical ones. So instead of looking for common sequences of individual characters, we can treat individual lines as units and attempt to find multiple occurrences of those lines. It can't be as precise as LCS proper - in our example above, we would not find the identical portion "http://" in line 2 - but we're basically settling for a greedy approximation, and one that works pretty well.

How do we store it, though? LCS's tries would kill us. I know - let's use an LRU cache. Those darn things work for everything! We can use an LRU cache that caches full email lines. It will inherently flush out old email lines as the spam stream evolves (nice!) and will provide quick look-ups to compare thousands of lines at once (happy!). Specifically, in Java, an LRU cache is a synchronized LinkedHashMap with true as the last constructor parameter and an overridden removeEldestEntry.
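That looks something like this (the capacity here is arbitrary; wrap it with Collections.synchronizedMap for the synchronized part):

import java.util.LinkedHashMap;
import java.util.Map;

// An LRU cache of email lines: a LinkedHashMap in access-order mode
// that evicts its eldest entry once past a fixed capacity.
class LruLineCache extends LinkedHashMap<String, String> {
    private final int capacity;

    LruLineCache(int capacity) {
        super(16, 0.75f, true);  // initial capacity, load factor, accessOrder=true
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        return size() > capacity;
    }
}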

So we store a few tens of thousands of email lines in an LRU cache, and as each new email comes in, we check whether each of its lines is in the cache. If it is, we reuse the line in the cache instead of creating new storage for this email. By assuming all common sequences are bounded at newlines, we remove the boundary-discovery work LCS must do. Strictly speaking, we're cheating and losing some opportunity, but it's a good enough guess for this type of data.

This had a dramatic effect on our "compression". (Again, it's slightly dubious to call it compression - in the big picture, our entire machinery of the LRU cache and bastardized LCS-in-spirit algorithm is creating a reuse-dictionary, so it might not actually be compression, but it goes through several of the motions.)

Caching Multi-lines

Caching lines is great - but what about caching multi-lines? Say we have a few emails - for brevity, assume each character in the following examples is an email "line":

Email 1:
ABC1

Email 2:
ABC2

Email 3:
ABC3

Email 4:
ABC4

So the first 3 lines are the same in each email (ABC), while the 4th lines (the numbers) are not. Our algorithm:

1) Load a LINE and see if it's in the cache (if no more lines, quit)
2) .. if it's not there, put LINE in the cache, and store LINE in the email - GOTO 1
3) .. if it IS there:
4) .... see if LINE + NEXT_LINE is in the cache
5) .... if it's not there, put LINE + NEXT_LINE into the cache and store LINE (which is a cache hit) in our email - GOTO 1
6) .... if it IS there, LINE = LINE + NEXT_LINE - GOTO 4
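Here's a rough Java rendering of those steps, assuming the LruLineCache sketched above (a greedy approximation, as described):

import java.util.ArrayList;
import java.util.List;

// Dedupe an email's lines against the cache, greedily growing runs of
// consecutive lines that have been seen before.
static List<String> storeEmail(List<String> lines, LruLineCache cache) {
    List<String> stored = new ArrayList<>();    // what this email actually stores
    int i = 0;
    while (i < lines.size()) {
        String run = lines.get(i);
        if (cache.get(run) == null) {           // step 2: miss - cache it, store it
            cache.put(run, run);
            stored.add(run);
            i++;
            continue;
        }
        int j = i + 1;                          // steps 3-6: hit - try to extend the run
        while (j < lines.size()) {
            String extended = run + lines.get(j);
            if (cache.get(extended) == null) {  // step 5: longer run unseen - remember it
                cache.put(extended, extended);
                break;
            }
            run = extended;                     // step 6: longer run seen - keep growing
            j++;
        }
        stored.add(cache.get(run));             // store the shared, cached instance
        i = j;
    }
    return stored;
}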

So if we run our 4 emails above through this algorithm, we get the following:

Running through all of email 1 - we get:
- Cache HITS stored in email: none
- Cache MISSES stored in email: A,B,C,1
- Cache contents afterwards (lru order): 1,C,B,A

Running through all of email 2 - we get:
- Cache HITS stored in email: A,B,C
- Cache MISSES stored in email: 2
- Cache contents afterwards (lru order): 2,C2,C,BC,B,AB,A,1

(notice how '1' (which didn't get a cache hit) has worked its way to the end)

Running through all of email 3 - we get:
- Cache HITS stored in email: AB,C
- Cache MISSES stored in email: 3
- Cache contents afterwards (lru order): 3,C3,ABC,AB,A,2,C2,BC,B,1

Running through all of email 4 - we get:
- Cache HITS stored in email: ABC   <-- very cool result, note coolness
- Cache MISSES stored in email: 4
- Cache contents afterwards (lru order): 4,ABC,AB,A,3,C3,C,2,C2,BC,B,1

So what happened? The system has realized that ABC is cacheable and is now pointing to it. All subsequent emails with the set-of-lines ABC will reuse the same memory. Note that the disparate lines 1, 2, 3, and 4 will always be stored separately, but the algorithm will pick up any common line-sets later in the email too (if there are any).

This elaborate system of finding equal email lines and reusing them brings compression of the entire flowing email stream to about 80%. What about 90%? Well.. one more trick.

Back to LZMA

Remember LZMA from above, which we abandoned because it was too slow to run inline? As you'd guess, it had the biggest impact on bigger emails. And although it's a CPU hog, we do actually have a few cores lying around. So let's give it one (but seriously, just one).

We set up one core (i.e. one thread) to trail behind, scanning incoming email for messages over some size (say 20k) and re-compressing those with the sliding-dictionary LZMA we mentioned earlier. While 3 of our cores average 5-10% utilization receiving, analyzing, and storing incoming email, the 4th core sits at 100% re-compressing emails where it will find benefit. If it gets too far behind, it simply leaps ahead and leaves some compression on the table.

(Note that, empirically, LZMA decompresses an order of magnitude faster than it compresses - otherwise we'd have had a new problem, as it could take too long when someone wanted to read an email.)

Voila. 90%. (Two notes: 1: that's a reasonable average at least... sometimes better, sometimes worse and 2: I realize I'm not exactly sure what "Voila" means, looking that up now).

There are also some other important notes. Storing a byte array in Java costs something. The pointer alone (64-bit) is 8 bytes; then there is the byte length field, padding, etc. So I limited the system to never store email lines under 64 bytes - small lines get concatenated together straight away.

Second, there are more email-idiomatic tweaks we can do to improve the situation. Base64-encoded attachments are effectively un-cacheable, so we pass over those.

Third, although our cheeky example may make it seem like we're finding optimal line sets (i.e. ABC), we're not. We could end up caching ABC and destroying an opportunity for a more optimal BCDXYZ or something. I'm guessing this doesn't happen often, but it would be an interesting future consideration.

Edit: Wow, sincere thanks to an Anonymous commenter for making me reconsider the above algorithm. I had originally stated it was O(n^2). My first version was indeed O(n^2) (which wasn't written about); after a few changes it became O(n), and I failed to see that. I find it's very easy to find tech reviewers once an article hits Hacker News - before then, though, not so much. :) My apologies for the error.

So for the end user, this whole diatribe means little, except that their emails are sticking around longer. They have no idea that when they click to read an email, we may be LZW- or LZMA-decompressing tens of byte arrays shared by thousands of emails, with a custom sliding dictionary built by scanning emails that arrived hours ago, and then catenating them together so they can be shown on a webpage - all in a few milliseconds. And they likely don't care; they're probably too busy signing up for Minecraft or something.

But that's ok. I know.

And if you got this far, you know too.

Ok.. now back to real work. What was I doing again? Oh yeah, writing some slick one-liners in Ruby. No clue on the running times - probably like O(n^4) or something, but if I fiddle with it a bit more - I bet I can cut the character count of the code by half!

Tuesday, June 9, 2009

A Beautiful Race Condition

I recently gave a keynote at the ICIS 2009 conference in Shanghai. The topic was why multithreaded programming seems so easy, yet turns out to be so hard. The fun part was that I investigated (per my last post and this one) several old, personal concurrency demons I knew existed but wanted to know more about.

One of those was, indeed, my favorite race condition. It doesn't escape me that it's probably wholly unhealthy to even *have* a favorite race condition (akin to having a favorite pimple or something) - but nonetheless, the elegance of this one still makes my heart aflutter.

The scenario of this race is that we assume, not terribly inaccurately, that race conditions can at times cause corrupted data. However, what if we have a situation where we sort of don't mind some corrupted data? A "good enough" application, as it were.

The dangerous part of all this is if we assume (without digging in) what kind of data corruption can happen. As you'll see, you might just not get the type of data corruption you were hoping for (which is one of the sillier sentences I've ever written).

The particular instance of this kind of happy racing I've encountered is where someone uses a java.util.HashMap as a cache. I've never done such a thing myself, but I heard about this race - and thus this analysis. They may use it with a linked list or maybe just raw, but the baseline is that they figure a synchronized HashMap will be expensive - and in their case, a race condition inside the HashMap will just lose (or double up on) an entry now and then.

That is, a race condition between two (or more) threads might accidentally drop an entry, causing an extra cache miss - no biggie. Or it may cause one thread to re-cache an entry that didn't need it. Also no biggie. In other words, a slightly imprecise yet very fast cache is ok by them. (Of course, this assumption is dead wrong - don't do that - read on for why!)

So they set up a HashMap in some global manner and allow any number of nefarious threads to bang away on it. Let them put and get to their hearts' content.

Now, if you happen to know how HashMap works: when the size of the map exceeds a given threshold, it resizes the map. It does that by creating a new bucket array of twice the previous size and then putting every old element into that new bucket array.

Here's the core of the loop that does that resize:


1:  // Transfer method in java.util.HashMap -
2:  // called to resize the hashmap
3:  
4:  for (int j = 0; j < src.length; j++) {
5:    Entry e = src[j];
6:    if (e != null) {
7:      src[j] = null;
8:      do {
9:      Entry next = e.next;
10:     int i = indexFor(e.hash, newCapacity);
11:     e.next = newTable[i];
12:     newTable[i] = e;
13:     e = next;
14:   } while (e != null);
15:   }
16: }


Simply, after line 9, variable e points to a node that is about to be put into the new (double-wide) bucket array. Variable next holds a reference to the next node in the existing table (because in line 11, we'll destroy that relation).

The goal is that nodes in the new table get scattered around a bit. There's no attempt to keep any ordering within a bucket (nor should there be). HashMaps don't care about ordering; they care about constant-time access.

Graphically, let's say we start with the HashMap below. This one only has 2 buckets (the default of java.util.HashMap is 16) which will suffice for explanatory purposes (and save room).

As our loop starts, we assign e and next to A and B, respectively. The A node is about to be moved, the B node is next.



We have created a double-sized bucket array (in this case size=4) and migrate node A in iteration 1.



Iteration 2 moves node B, and iteration 3 moves node C. Note that next=null is the ending condition of our while loop for migrating any given bucket (read that again - it's important to the end of the story).



Also important to the story: note that the migration inverted the order of nodes A and B. This was incidental to the smart idea of inserting new nodes at the top of the list, instead of traversing to find the end each time and plunking them there. A normal put operation would still have to check that it's inserting (and not replacing), but given that a resize can't replace, this saves us a lot of "find the end" traversals.

Finally, after iteration 3, our new HashMap looks like this:



Our resize accomplished precisely the mission it set out to: it took our 3-deep bucket and morphed it into a 2-deep one and a 1-deep one.

Now, that's all well and good, but this article isn't about HashMap resizing (exactly); it's about a race condition.

So, let's assume that in our original happy HashMap (the one above with just 2 buckets) we have two threads. And both of those threads enter the map for some operation. And both of those threads simultaneously realize the map needs a resize. So, simultaneously they both go try to do that.

As an aside, the fact that this HashMap is unsynchronized opens it up to a scary array of unimaginable visibility issues, but that's another story. I'm sure that using an unsynchronized HashMap in this fashion can wreak evil in ways unlike any man has ever seen; I'm just addressing one possible race in one possible scenario.

Ok.. back to the story.

So two threads, which we'll cleverly name Thread1 and Thread2, are off to do a resize. Let's say Thread1 beats Thread2 by a moment. And let's say Thread1 gets to line 10 and stops. That's right - after executing line 9, Thread1 gets kicked out of the (proverbial) CPU. (By the way, the fun part about analyzing race conditions is that nearly anything can happen - so you can say "let's say" all darn day long and you'll probably be right!)



1:  // Transfer method in java.util.HashMap -
2:  // called to resize the hashmap
3:  
4:  for (int j = 0; j < src.length; j++) {
5:    Entry e = src[j];
6:    if (e != null) {
7:      src[j] = null;
8:      do {
9:      Entry next = e.next;
     // Thread1 STOPS RIGHT HERE
10:     int i = indexFor(e.hash, newCapacity);
11:     e.next = newTable[i];
12:     newTable[i] = e;
13:     e = next;
14:   } while (e != null);
15:   }
16: }


Since it passed line 9, Thread1 did get to set its e and next variables. The situation looks like this (I've renamed e and next to e1 and next1 to keep them straight between the two threads, as each thread has its own e and next).



Again, Thread1 didn't actually get to move any nodes (by this time in the code, it did allocate a new bucket array).

What happens next? Thread2, that's what. Luckily, what Thread2 does is simple - let's say it runs through the full resize. All the way. It completes.

We get this:



Note that e1 and next1 still point to the same nodes. But those nodes got shuffled around. And most importantly the next relation got reversed.

That is, when Thread1 started, it had node A with its next as node B. Now it's the opposite: node B has its next as node A.

Sadly (and paramount to the plot of this story), Thread1 doesn't know that. If you're thinking that the invertedness of Thread1's e1 and next1 is important, you're right.

Here are Thread1's next few iterations. We start with Thread2's bucket picture, because that's really the correct "next" relation for our nodes now.







Everything looks sort of ok.. except for our e and next at this point. The next iteration will plunk A into the front of the bucket-3 list (it is, after all, next). And it will assign A's next to whatever happens to already be there - that is, node B.



Woah. Thar she blows.

So right about now, Thread1 goes into what we like to call in the biz an "infinite loop". Any subsequent get operation that hits this bucket starts searching down the list and goes into - yep - an infinite loop. Any put operation that first needs to scan the nodes to see if it's doing a replace will, you guessed it, go into an infinite loop. Basically, this map is a pit of infinite loopeness.

If you remember, we noted that race conditions cause data corruption. Well, yeah, that's what we have here - just very unlucky data corruption on pointer structures. I'm the first to admit this stuff is tricky - if you find errors in my analysis, I'll happily fix this post.

Now, I had the happy fortune for a time of sharing an office with Josh Bloch, who wrote java.util.HashMap. When I innocently mentioned he had a bug in his code given this behavior, he did indeed (to use Josh's words) go non-linear on me.

And he was right. This is not a bug. HashMap is built specifically for its purpose, and this implementation is not intended to be threadsafe. There's a gaggle of ways to make it threadsafe, but in plain, vanilla (and very fast) form, it's not. And needless to say, you shouldn't be using it that way.
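For the record, the boring fixes work fine. Two of the standard ways to get a threadsafe map:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SaferCaches {
    // A synchronized view of a HashMap - coarse locking, but simple.
    static final Map<String, Object> syncMap =
            Collections.synchronizedMap(new HashMap<String, Object>());

    // Or a map designed for concurrent use - finer-grained and usually faster.
    static final Map<String, Object> concMap =
            new ConcurrentHashMap<String, Object>();
}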

Race conditions are nothing to mess with, and the worst ones are those that don't crash your program but let it wander down some unintended path. Synchronization isn't just for fun, you know.

And nefarious uses of HashMap aside, I still attest - this is, indeed, a beautiful race.

Addendum: I've been yelled at a few times for calling any race condition "beautiful". I'll defend myself by pointing to our apparently human nature to call intricate complexity beautiful (i.e. waves crashing on a shore, nature in general).

Most races end up being about data corruption. This one is data corruption that manifests as control-flow corruption. And it does so fantastically, without an error (infinite loops notwithstanding).

As the evolution analogy goes, if you drive a needle into a pocket watch - chances are you'll simply destroy it. But there's a tiny chance you'll actually make it a better watch (clearly, not the case here). And another tiny chance you'll simply make it something "different" - but still, per se, functioning.

Again, my use of "beautiful" might be more appropriate as "a complex mutation with surprising non-destruction" :)
