Tuesday, August 21, 2012

I Further Predict the Death of Your Web Framework

In my last post, I philosophized about how new technology is going to change the bottleneck of web (and other) systems (which, despite everything else, have remained surprisingly stable for a while).

This spelled the demise of many systems that relied on a given system bottleneck - specifically, slow runtime systems.

But there is another technological shift conspiring against many web frameworks - one focused not on performance but on "ease of use" - which in many cases may hit far closer to home.

That shift is the reorganization of MVC.

MVC stands for "model-view-controller" which loosely means you have a datastore/database (the model) which is retrieved and manipulated (by the controller) such that it can finally be shown to a user (the view).

That is pretty much always the flow. Well - kinda. The first thing you might notice is that "MVC" is an out-of-order acronym per the dataflow. In dataflow order it would be "MCV". And happily, given that dataflow is paramount to my story - I'll use that in the rest of this article (that might irk you if you're CDO (which is like "OCD", except in alphabetical order, like it SHOULD BE)).

A predecessor to MCV was the simpler idea of "client/server", where the client was the view and the server was the controller and model (or, some or all of the controller could be in the client too). However, the client in that case actually implied a real client - that is, a program that received the data and displayed it.

In the web, the browser is the client, but interestingly, in things like Rails, Jails, Nails, Grails, Struts, Play!, PHP, ASP.Net, and many others, the "view" is on the server, which renders HTML and sends that to the browser. As far as the programmer is concerned, the whole MCV is on the server. The browser is often just a dumb terminal.

In the last year or two however, the popularity of a new type of framework is changing all that.

That change is coming from libraries such as backbone.js and ember.js (and many, many others).

Those libraries allow you to render views (not just show, actually render) in the browser itself. In addition, they let you leverage a lot of JavaScript magic in the browser. This is pretty awesome for several reasons.

The computing power of rendering moves to the client's machine. Rendering probably isn't your biggest computing expense, but take that cost off your server (times every web request you get) and it's measurable.

And as you can imagine, if the "V" of MCV actually migrates to the client, all that's left on the server is "MC" (to be fair, sometimes even part of the "C" goes to the client).

What thousands and thousands of Rails developers discovered upon moving to backbone is that they no longer needed their fancy template views. Their backend became a system that pushed JSON over HTTP.

Very clean and very simple. At my new company Refresh (we're hiring!), our backend pushes the exact same JSON to our webpage as it does to our iOS app. And that same system will someday seamlessly become our API too.
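To make "push JSON" concrete, here's a minimal sketch of that kind of backend using the JDK's built-in HttpServer (the endpoint path and payload are invented for illustration - any HTTP stack works the same way):

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;

public class JsonBackend {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        // One endpoint, one representation - the browser, the iOS app,
        // and (someday) the public API all read the same JSON.
        server.createContext("/api/feed", exchange -> {
            byte[] json = "{\"items\":[{\"id\":1,\"title\":\"hello\"}]}".getBytes("UTF-8");
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, json.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(json);
            }
        });
        server.start();
    }
}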

For me, using Rails for webapps over Java (where I spent plenty of time years ago) was a simple decision. ActiveRecord was beautiful and elegant (especially compared to things like Java's Hibernate). Also, the view layer was simple, well-laid-out, and standardized. If anything, Java had too many choices.

But these days, I tend to use NoSQL on the backend. And ember on the front-end. All I need in the middle is something to manipulate and push JSON. Why was I paying the Rails tax? (Again, insert any language that runs a multiple slower than Java into that sentence.)

I'm not particularly picking on Rails - it is just a full MCV solution that I no longer need. There are plenty of those.

And if you're thinking this is a win for Node.js - you're probably right. With much more JavaScript coding entering your web framework as a whole, using Node on the backend is probably the winner of all this on the usability front. JavaScript on the server isn't the fastest, but it's pretty darn good at manipulating JSON (and thank you to whoever it was that shot XML dead).

So my not-so-amazing prediction is that in a few short years' time, full web frameworks - in any language - start to disappear. Node picks up some of that slack, but so do less feature-ful (and maybe more performant) frameworks. Even using no framework at all gets more common.

There's surely no mourning required here. Web frameworks change every few years no matter how you slice it. But between this post and my last, I see two converging fronts out to kill some of our most popular ones right now.

Personally, I'm hoping to never server-side render HTML again. I'll let your browser do my rendering while I sit back, chill, and push some JSON.

And yes, Mailinator is in rewrite now to use ember, much to the chagrin of web scraping programs everywhere! (but much to the happy of JSON receivers)

Monday, August 13, 2012

Your Bottleneck is Dead. Long Live Your Bottleneck.

There's an old joke that, if you think about it, you can apply directly to system bottlenecks.

Two hikers are walking through the woods when they come face-to-face with a pack of wolves. One of the hikers immediately drops to the ground and hastily changes from his hiking boots to the running shoes he had in his backpack.

The 2nd hiker says, "What are you doing?! You can't outrun those wolves!"

The 1st replies, "I don't have to outrun those wolves. I just have to outrun you."


Web developers tend to know their system's biggest bottleneck, but how often do you know the 2nd biggest one? Right - in one sense it doesn't matter, because the 2nd bottleneck doesn't get to be a bottleneck until it becomes the biggest one.

There is an economic model hidden within every complex system. This includes something as mundane as web system performance. Knuth famously said (or re-said) that "premature optimization is the root of all evil" which could be restated as - if you optimize before you know what needs it, you're optimizing (and probably breaking) the wrong thing.

Hence we don't (or aren't supposed to) optimize what doesn't need it. Seems obvious - but it has interesting ramifications.

When something doesn't need optimizing, we can afford to be (and often tend to be) lazy with it when it comes to performance. Concretely, if your code's database access is going to take 15 milliseconds, worrying that processing that data will take 20 microseconds because of your sloppy n^2 algorithm probably isn't worth much thought - that's roughly a tenth of a percent of the request time.

If that statement raises your ire, feel free to sit in your chair and pout - because there are thousands of websites that were happily coded in Notepad with interpreted and dynamic scripting languages that flagrantly use gotos and lists as if they were hashmaps. I've seen it. It's enough to turn your stomach. It's not pretty.

For the average, everyday web hardware ecosystem - we have CPU power to spare. And in the bigger business sense, if I can save development time by cutting performance-minded corners with no ramifications, all the better.

Web development largely started in scripting languages (e.g., Perl CGI, PHP). Again, for the same reason - CPU to spare as compared to other bottlenecks.

In fact, I'll go so far as to say that the popularity of scripting-language web frameworks required the condition that disks be some orders of magnitude slower than CPUs. That's right - I'm looking at you Rails, Grails, Nails, and Jails... (ok, not Jails, it's a Java web framework but it rhymed).

Java web frameworks added a lot of structure, verbosity, and performance that simply wasn't needed (and eventually, amazing bloat). If your bottleneck was the database/disk - your web processing simply had to not add significantly to that - and regardless of the language, that wasn't hard.

A simple definition of latency is the time it takes to get data back after requesting it. Similarly, from a program's point of view anyway, bandwidth could be viewed as how long it takes to get all the data requested (once it starts arriving).

Think of how that relates to code performance. If your latency is 3ms (a reasonable server harddisk seek time) - it doesn't matter if your code is hand-loved machine language or interpreted COBOL - it does nothing for that 3ms. In CPU time, 3ms is an eternity (at 3GHz, that's roughly nine million clock cycles).

As a general tendency however, the more data you receive, the more processing likely goes around it. Consider a few-megabyte JSON message - at a minimum it will likely be parsed. Possibly shoved into a map or an object.

Said another way - lowering latency and increasing bandwidth will tend to put more pressure on processing data (i.e. requiring more CPU/code performance).

So all this time we're happily and harshly slowed down by slow things like spinning harddisk drives and networks. Then, in walk Solid State Drives. Prices and capacities are both heading in the normal directions for new technology (down and up, respectively).

Latency goes from a standard spindle drive's 3ms seeks to (by varying reports) 100-microsecond seeks.

Argue the specifics if you will, but for some number of existing systems, installing an SSD will remove the database as the primary bottleneck. In fact, this is probably the cheapest way to improve your system's performance today.

What happens to the bottleneck in those systems? It will shift somewhere else (i.e. the SSD put on its running shoes). In many cases, it will shift to the CPU (CPU in this case is a polite way of saying "your code").

Every day across the world, there are meetings at companies complaining about the performance of their website. Today, many of those say "get the DBA in here".

In some of those meetings soon, the shift will be away from blaming the database. Some will push for code optimization (postmature), some for bigger hardware, and some for faster languages.

Keep in mind, this is a subtle, slow moving effect. Having your CPUs pegged all the time might not make you change anything today but may make you reconsider your architecture next time you build something.

Of course the network is a bottleneck too - at least for now. In places like Korea and Kansas City, that's less true. If you haven't heard, if you live in Kansas City you can get Google Fiber to your home. In other words, your internet speed will be 100 times faster than the average internet in the US. (In fact, if your machine has the common SATA2 disk drive interface, sending a file to your neighbor across town in Kansas City will only take about 3 times as long as storing it on your own disk just a few inches away.)
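(Back-of-the-envelope, if you want to check that claim: SATA2 tops out around 300MB/s after encoding overhead, while gigabit fiber moves roughly 125MB/s - about a 2.5x gap, call it 3x with protocol overhead.)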

Here's another prediction - in 5 years the phrase "downloading a movie" won't exist. (We used to say we were "downloading an image" which was preceded by us saying we were just "downloading").

If bandwidth drastically increases, it will change how we write code. We think nothing of loading a 1MB webpage now, which 10 years ago was offensive. In the future, we may think the same thing about a 100MB webpage.

Given that data expands to fill available bandwidth (a modified Parkinson's Law), our programs will tend to process much more data. Processing speed will matter more and more.

And the more often code becomes the bottleneck, the more often solutions to fix that will be considered.

Simply - your favorite bottlenecks might be changing. And for that to happen, your disk doesn't necessarily need to be able to outrun your CPU - it just has to be able to outrun your code. (And it wouldn't hurt if it could also outrun, you know, wolves too.)

My startup Refresh is looking for awesome iOS and front-end engineers. Join us! Email us at jobs@refresh.io

Wednesday, May 16, 2012

The love and hate of Node.js

I've been in the happy position lately of interviewing engineers for my new start-up. If you've read any of my previous articles or seen my talks (Talk Notes (warning: PDF download)) you know I love this stuff (Jobs page).

I'm always up for a spirited discussion about algorithms or languages with smart people, but I do consider too much technical religion to be a red flag and it seems to be a rather common affliction.

So when I started interviewing recently, I was immediately reminded of the blind loyalty sometimes given to pieces of technology. As if all other competing languages/frameworks/variable-naming-schemes are "crap", where "crap" is loosely defined as, well, I'm not sure - but something that the person saying "crap" sure doesn't like. It's probably safe to say that any popular technology has (or had) a useful purpose at one point, and it's also safe to say that same piece of technology is not always the right solution.

Along with the perpetual Hacker News debates, I right away ran into a "Node Guy". That is, a guy maybe just a tad overzealous about using Node.js for, you know, everything.

I asked him why he chose Node for his last project. The answer was "Because it scales super well".

I will say, the marketing hype around Node is pretty good. I am not saying his answer was wrong. It wasn't. But it's pretty similar to answering the question "Why do you want to buy a Porsche?" - with the answer "Because it's fast".

Likely true, but by no means a distinguishing feature.

It isn't hard to find discussions in the developer community defending the performance of Node. Node, at its core, is a C++ server. That part is likely competitively fast. But after that, the focus is on the performance of the JavaScript callouts. Someone told me that "microbenchmarks" aren't fair to Node and don't show the true performance. I think they're right in both cases - because microbenchmarks likely involve only a small amount of JavaScript. Truth is, the more JavaScript your Node application runs, the more it's going to lose against server frameworks built in faster languages. In other words, microbenchmarks are probably the best Node is ever going to look.

Google's V8 JavaScript engine literally enabled Node to exist at all.

There is, of course, another set of people (the "Node haters") who are nothing short of incensed by the idea of Node. To them, it feels rather silly to create a server framework in a language like JavaScript. I can relate to someone who has spent years eking out all possible performance from a C++ server only to watch someone write one in JavaScript and claim speediness.

In the early days of any field, science, invention, and engineering must overlap. That is, folks think up science, try it, and piece it together to see if it works - however rickety. Eventually, however, enough tools and best-practices exist to allow details to be abstracted.

When that happens, many more people can create cool things with existing and tested pieces (i.e. engineer them together). Simply, you need to worry less about the details of the science to get things done. People with no knowledge of the underlying science can glue widgets together to make something. Often amazing things - at that point, you might consider that the "science" is beginning its evolution towards being an "art".

Possibly the quintessential computer science course is something like "Algorithms & Data Structures". Do you need that course to develop apps these days? Again, by proof of existence - I think not. If you have a phone interview with me, I will ask you the difference between a Growable Array (aka Vector, ArrayList, etc.) and a Linked List. Both structures do arguably the same thing but with notably different performance characteristics.
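If you'd like that answer in code form, here's a tiny sketch (sizes and the timing harness are arbitrary) showing why the choice matters:

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class ListDifference {
    public static void main(String[] args) {
        List<Integer> array = new ArrayList<Integer>();
        List<Integer> linked = new LinkedList<Integer>();
        for (int i = 0; i < 100000; i++) { array.add(i); linked.add(i); }

        // Indexed access: O(1) for a growable array, O(n) for a linked
        // list, which walks nodes from the nearest end on every call.
        long t = System.nanoTime();
        for (int i = 0; i < 100000; i++) array.get(i);
        System.out.println("ArrayList gets:  " + (System.nanoTime() - t) + " ns");

        t = System.nanoTime();
        for (int i = 0; i < 100000; i++) linked.get(i);  // quadratic overall
        System.out.println("LinkedList gets: " + (System.nanoTime() - t) + " ns");

        // The trade runs the other way for inserts at the front:
        // O(1) for the linked list, an O(n) element shift for the array.
        linked.add(0, -1);
        array.add(0, -1);
    }
}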

It's quite hard to create any application without using some form of list, but as long as you have a "list" you know you can get the job done. I promise you there are thousands of sites and apps using the wrong list at the wrong time and incurring nasty linear-time searches or insertions in nasty places. Truth is, if that never shows up as a measurable bottleneck, then one could argue that despite the algorithmic transgression - that code is "good enough".

Happy or sad, "good enough" is getting "good enougher" all the time. CPUs are fast and getting faster, covering the tracks of slow code. We've never lived in a better time for algorithmic indifference. Comparatively, disks are slow, which makes databases slow, which makes the performance of the algorithms and languages you choose in your app less important than ever (not to be confused with "unimportant"). In fact, I'd argue that the entire resurgence of interpreted, dynamic languages can be traced back to the lackadaisical 5ms seek times of spinning disk drives. That's a bold statement and probably a whole other article - but if disks/databases are basically always the bottleneck (rather true in most web apps), who cares how fast your language runs.

(disclaimer: If you've read anything else I've ever written, you know I'm merely an observer of this trend, not a subscriber)

The controversy over Node is that it implies that developers from the client side are piercing into the server - a domain typically full of people who came up from the OS layer. Those people are asking: does it really make sense to write servers in a historically slow client language?

Further, and possibly a bit more personally, should people who only know client languages be writing servers at all? Server code is a unique skill, just like many other things. Dabbling in writing servers is like me dabbling in web design - trust me, it's not pretty. There are only so many lower levels left - would you want a JavaScript operating system?

On the notable other hand - people who only know or love that client language have been given a whole new freedom and ability. They'll argue (with a rather reasonable defense) that Node.js represents one of the easiest ways to create a server or web app. Even if they don't defend the performance, in many practical cases they don't need to - like it or not, at the right time it can be "good enough" (again, proof by existence). It's positively no wonder they defend Node. They are defending their newly found, wondrous abilities that can solve real problems. Wouldn't you?

So as my information-less friend said - Node will scale. But that is, indeed, information-less. So can Ruby, Rails, Java, C++, and COBOL - architectures scale, languages and servers don't. Just like most web apps, a Node application will probably be bottlenecked at its database. You can tell yourself that Node itself is "insanely fast" but you'd be fooling yourself (Java/Vert.x vs. Node, Java/Jetty vs. Node, Node vs. lots) and rest assured that despite scaling, some portion of your latency is baked into your language/framework performance. Your users are paying milliseconds for your choice of an interpreted and/or dynamic language.

Should your start-up use Node? That depends on a lot of things. If history is a teacher however, massive success will likely push you to something statically typed. Facebook "fixed" PHP by compiling PHP to C++, and Twitter painfully (after years of fail whales) migrated from Ruby to Scala/Java. Square has migrated to JRuby to improve performance; I'll be interested to watch whether it's enough (I'm feeling yet another article coming on the nefarious demons of drop-in replacing a global-interpreter-locked Ruby with a truly multithreaded one).

The fight over Node is, in truth, one of the least truly technical developer fights I've seen. People on both sides are simply defending their abilities and maybe their livelihoods - the technical points are rather obvious. I'd say Node is definitely a possible solution for some non-trivial set of problems - then again, I can think of plenty of situations where I'd veer away from it. But of course - I'm not very technically religious - and I'm definitely not a "Node Guy".

All this being said - I am seriously hiring across the stack (including mobile) at my new start-up. If you have a desire to argue with me about this sort of stuff on a daily basis - email me at paul@refresh.io and attach a resume.

This article was spawned from my own comment at William Edwards' blog.

Wednesday, May 2, 2012

How to get your resume "Silicon Valley Ready" - Part I

Per my last post, I've been given the opportunity to review a nice pile of resumes. As I am prone to, this got me to obsess a tad over how the resumes were put together and more importantly, what each told me.

What I perceived as issues are, in retrospect, my fault, not the resume owner's. That's because, per the entire point of my last post, the start-up environment is radically different than the corporate IT department. And the latter is where many of these resumes came from (which is exactly what I wanted and asked for).

In many cases, I received a resume from someone that included the regular set of data - experience, education, skills, etc. But the ones that got me excited were the ones where the person included in the email links to the websites or mobile-apps they had built. As I've said - the number one selling point for you as an engineer to get a job in Silicon Valley is that you love this stuff. There's an age-old conundrum of new grads who say "Employers want me to have experience before they'll hire me - but how do I get experience if I can't get a job?"

In our business I'm happy to say - that problem does not exist. Simply because you don't need anyone to give you a job to build something. A website. A mobile app. Heck, a program that finds smaller sets of strings in larger ones.

I realized that's probably the number one thing I'm looking for. You can show me, with no doubts, that as a software engineer - you can build something. Start to finish.

Interestingly, I like to think that I also don't put that much weight into whether a project was a commercial success. If it was, that's nice - maybe it's because you're not only a great engineer but also have an awesome product sense - who knows (it might just mean you were lucky too). And unless it wasn't a commercial success because it was poorly engineered, that's not really the point. The point is that you built it. Or at least some non-trivial part of it.

With that in mind - this article sprang forth on what I like to see in resumes. I'll point out that this isn't very different from what I looked for when I was on a hiring committee at Google (so there are at least some current Google engineers that are partially there because of these thoughts).

First - I propose a new section to resumes - at least for software engineers. In addition to Experience, Education, Skills, Interest, and References (not suggesting we remove any of those) - I propose we add Cool Stuff I Have Built.

If your resume is going to go over one page (which, personally, I don't mind) - I'm hoping it's because of this section.

Any project you did solo or had a major hand in - whether paid or not, a million users or just your mom - I'd love to know about. Websites, iPhone apps, Android apps, desktop apps, open-source projects, GitHub accounts - you name it. Solo or as part of a team (indicate that). But it has to be, in one way or another, "finished". Even if your iPhone app was rejected by the App Store - you can point me to a link to see it. It doesn't have to be a product either - maybe it's an open-source project. The bottom line is that it is something you "finished". You executed. Your idea became a living, breathing application or piece of code that in some way, somehow, you could show to people.

The section might be broken up into individual projects with bullet points about each. For example:

Project Name: Mailinator
Technologies used: Java, Tomcat, (no database)
Team size: 5'11", 175lbs (haha)
Implementation details/challenges: Custom server architecture built as an experimental test-bed for highly-multithreaded server design. Custom SMTP server. No database as emails are stored in RAM with a custom compression scheme.
Notable metrics: up to 25MM emails per day, ~20k users per day, runs on a single server (on purpose, as part of a personal experiment to optimize the system).
Relevant links: www.mailinator.com, http://mailinator.blogspot.com/2012/02/how-mailinator-compresses-email-by-90.html


Surely you could add other bullet points too (and suggestions welcome, leave a comment to this post). But you get the idea.

My previous post resonated strongly with some people - that is, they were in "IT departments" feeling like they weren't growing technically. And as you can imagine, a resume telling me you built a payroll system is great - but it's not what (most) start-ups are building. But what if you haven't built anything? And your "Cool Stuff I Have Built" section is empty?

Well .. fix that.

No one is stopping you from building something. No one said you're ready for a transition out of your current job today, and as with much of life, it's up to you to take yourself to the next level. But nearly any person who already writes software, with a penchant for learning and some ambition, can spend the next few months of nights and weekends learning and building. (And it's absolutely possible that your day-job accomplishments belong in that cool stuff section too.)

So if by day you're a payroll guy, but by night you're an iPhone ninja - you've got my attention. Not only because you have the skills I'm looking for - but because in your spare time, you're out doing great things. And instead of going out every night drinking with your friends - at least some of those nights, you chose to stay home and learn and build cool stuff. And why would you do something like that? Simple - you love this stuff.

(My start-up is located in Palo Alto and I am right now interviewing for the initial engineering team. We're well-funded, building cool stuff, and plan to change, ya know, the world. No matter where you are - if you're a software engineer, willing to relocate to San Francisco/Silicon Valley (and of course, love to build great things) send me your resume. paul@refresh.io or check out www.refresh.io/jobs)

Tuesday, April 24, 2012

Why you should join a start-up - and maybe why you shouldn't

I've recently been interviewing engineers for my new start-up (fyi, this is wholly separate from Mailinator). We're well-funded, have a world-changing idea, and as you can imagine, I plan to build an awesome engineering team. (Regardless of where you are, if you're a passionate developer, I'd love to hear from you. Check out the Job Description here and email me your resume at paul@refresh.io ).

I've been talking to engineers from all over, hearing their stories. There's really amazing talent everywhere and honestly, a non-trivial amount of it seems to be idling or even decaying in environments that aren't using its full potential. A bunch of moons ago I used to work for Dow Chemical in the dreaded "IT department". It was pretty clear to me then that I was not growing technically in that job. I left to start my Ph.D., but I vowed from then on that if I was going to be a software guy, I was going to work for companies whose business was creating software. In other words, at Dow I was an expense; I'd much rather be an asset.

Eventually, and with that goal in mind, I ended up at Google. Without reservation I can say it was a fantastic experience. I have said before, "if you're the smartest person where you work - quit". And trust me, nothing makes you realize how smart people can actually be like working at a place like Google. (To avoid any implications - I did eventually quit Google, but rest heavily assured, it was not because I was anywhere even close to being the smartest person - read on!)

I did over 200 interviews while at Google and it was actually a bit fun to interview someone who was coming from someplace where they were the smartest person (at least about tech). I could always tell. It's no surprise that if you're the smartest person somewhere for a long time, you get used to it. You get used to waiting for people to catch up to where you are.

By no fault of their own they walked into the interview with some attitude. An attitude of impatience if nothing else. After someone like that started at Google however, it didn't take long for them to realize the situation they were now in. It was humbling in many respects and I don't mean that negatively, simply they'd not recently (or ever) experienced a place where many of the people they met were at their level or better. Obviously, there are smart people everywhere but almost universally, smart people enjoy the company of others like them, the synergy makes them all better. This is why Silicon Valley is a magnet for them.

As I said, Google (and similarly Facebook, etc) are great places to work. At some point after working there however I thought to myself what a wonderfully steady and safe place to work it was. My responsibilities, expectations, and compensation package were well outlined. I was working with awesome people and learned a ton but I still felt it was far too big for me to have any real impact.

For a time, I worked on the Google Web Server, which I could best describe to non-techies as "well, sort of the thing you interact with when you do a search" (this is a bad definition at best). A woman I was dating thought about that answer a moment and condescendingly replied - "what do you mean you work on that - isn't that done?"

In one sense she was right. I worked on that darn thing every day, but to her it all worked the same. To her, I was having no impact.

It occurred to me that Google would be a fantastic place to work if what I wanted was a meaningful 9-5 job that after each day of work I could drive my minivan back to my home in the suburbs. But I didn't have a minivan. And I didn't own a home. And I didn't live anywhere near the suburbs. What the heck was I doing there? The smart-person environment was at start-ups too - I could get that there and even have some ownership of what I was building.

It's a relatively normal course of life in our sea of first-world problems that you'll have many chances to take risks early in life and those chances diminish as time goes on. Simply put, Google will always be there. And if Google isn't - the next big, awesome company will be. Every decade or so has a "company" (or two) where the greatest things and the greatest people are happening. At times it was Microsoft, Cisco, Apple, Google, Facebook, etc.

I left Google not because Google was in any way bad, but because I wasn't done swinging for the fence. And I still had the luxury of trying. If I ever got to the point where I wanted to realign my life's risk profile, Google (or Google-next) would be there. And this is a pretty common theme - places like Google and Facebook incubate some set of people into entrepreneurs who then go start their own start-ups. But with big ideas, agility, and impact. And they don't tend to fall far from the tree. You might think Google doesn't like this - but I doubt that's true. This is a constant stream of risk-takers that go try stuff for them that they can buy back if needed.

What gets me today is how vibrant Silicon Valley is right now. And even for Silicon Valley, this place is on fire. It seems cities around the world try to copy it, but that's really hard to do. The start-ups are here because the investors are here, and the investors are here because the start-ups are here. Guy Kawasaki wrote a great article several years ago partially about why Silicon Valley is Silicon Valley.

I am fully aware that Silicon Valley has a nasty habit of simply not being able to darn well shut up about Silicon Valley. Other cities are hotbeds for tech too (Austin, NYC, etc.), and truth be told, you could find a cadre of smart engineers doing a great start-up in a Des Moines - but it's not easy. There are LOTS of great companies in Silicon Valley that can take you to the next level.

We're in the midst of a huge wave. Depending on your risk profile, joining a start-up or joining "a Google" is the best way to put your chips in the game. Regardless of where you are - if you're a crack-shot engineer looking to change the world, you could do worse than coming here. Again it's all about your risk profile and what's keeping you where you are (which may be great reasons). Start-ups will not only pay to relocate you, we'll put you up for a few months (in the corporate crash-pad) while you find your own place. Joining a start-up now will get you experience both technically and start-up-wise that you can't get anywhere else.

I don't think the start-up life is for everyone. I can definitely see a point in my life, or some life-constraints, where I'll want my job to be a less important part of my life (probably because my life will be more about, well, you know - just "life"). But for me right now, and maybe for you - I'm swinging for the fence. And love or hate IT departments, I couldn't do that there.


Again, if you're a software engineer that loves what you do and lives in commuting distance to Palo Alto, CA or is willing to relocate, I'd really love to hear from you. We're well-funded and I'm literally building the first engineering team right now. It's a fantastic opportunity to get in on the ground-floor of a great start-up. Refresh.io jobs

Thursday, April 5, 2012

Mailinator sponsors Wasteland 2

Mailinator is a proud sponsor of the Kickstarter project Wasteland 2.

http://www.kickstarter.com/projects/inxile/wasteland-2

(Boy did I waste some weeks on Wasteland-1 a long time ago).

Anyway, we sponsored at the $1000 level giving us the ability to name an in-game weapon.

When you play - expect to find a (very) big gun called "The Mailinator". Let them eat spam!

(I hope I can get them to have it shoot spam!)

Tuesday, February 21, 2012

How Mailinator compresses email by 90%

Given the title of this article, the first thing that should pop into your mind is probably - "well, use a compression algorithm - right?".

Right! Well, yes, well, not exactly. Read on.

Your second thought might also have been - "Why bother? Just buy more disks." Which in the big picture is also not a bad answer. But for Mailinator that doesn't work - if you have read previous Mailinator tech articles, you might know that Mailinator stores all its email in RAM.

There were good reasons for that when Mailinator started. One was the use case - which was always disposable email that lasts a few hours (rather longer nowadays). Secondly, when Mailinator started, disks and datastores weren't as sophisticated/fast as they are now.

Also, Mailinator is/was always a free service, so keeping costs down was always important. To this day, Mailinator runs on a single server. It averages about 4-5 terabytes of bandwidth a month, and the peak incoming email rate I've seen is about 3500 emails/sec (this is just a production observation; the server's limit is bandwidth, not CPU).

And finally - last but not least - to me, much of web and application development today is utterly devoid of any fun algorithms. I spend a non-trivial amount of time in interpreted/dynamic scripting languages that do a fantastic job of hiding (or at least luring me away from thinking about) algorithmic complexity. I've probably inadvertently written more n^3 algorithms than, um, (n^3)-for-some-large-value-of-n.

Mailinator has always been my test bed for trying fun ideas, algorithms, and datastructures. In other words - I probably didn't need to do all the work I'm writing about here - but I definitely did have fun doing it (probably should have been out talking to girls, but alas).

Compression

Ok - so back to 90% compression.

So to start testing, I grabbed a few hundred megs of the Mailinator stream and ran it through several compressors. Mostly just stuff I had on hand: 7z, bzip2, gzip, etc. Venerable zip reduced the file by 63%. Not bad. Then I tried the LZMA/2 algorithm (7z), which got it down by 85%!

Well. OK! Article is over! Everyone out! 85% is good enough.

Actually - there were two problems with that result. One was that LZMA, like many compression algorithms, builds its dictionary from the data it compresses. As it compresses, it builds a dictionary of common sequences and uses that ever-improving dictionary to compress everything thereafter.

That works great on static files - but Mailinator is not a static file. It's a big, honking, several-gigabyte cache of ever-changing email. If I compressed a million emails, and then some user wanted to read email #502,922 - I'd have to "seek" through the preceding half-million or so to build the dictionary in order to decompress it. That's probably not feasible. And, as I said, the Mailinator cache is constantly throwing out old emails and putting in new ones.

In other words, an algorithm that relies on previous entries to build a dictionary can't work given that we keep purging the front of the stream never to be seen again.

Hence, we cannot compress emails "together". But we can compress them individually. Sadly, this hurts our compression ratio - and by a lot. The algorithm must now start building a new dictionary with each email. And emails are small, so in many cases the dictionary isn't very mature by the time we're done compressing.

We can help this situation by giving the compression algorithm a pre-built dictionary. That is, scan a typical piece of data to be compressed, find common sequences and create a list of them. Then we give that dictionary to the compressor/decompressor as it takes off.
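As a concrete illustration of the preset-dictionary mechanism - java.util.zip's zlib bindings support exactly this (LZMA's API differs, but the idea is the same; the dictionary contents below are invented):

import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PresetDictionary {
    public static void main(String[] args) throws Exception {
        // Common sequences scanned from recent mail (invented here).
        byte[] dict = "Content-Type: text/plain\r\nBuy now!\r\nhttp://".getBytes("UTF-8");
        byte[] email = "Buy now!\r\nhttp://example.com/x\r\n".getBytes("UTF-8");

        // Compress: hand the compressor the dictionary before the data.
        Deflater deflater = new Deflater();
        deflater.setDictionary(dict);
        deflater.setInput(email);
        deflater.finish();
        byte[] compressed = new byte[256];
        int len = deflater.deflate(compressed);
        deflater.end();

        // Decompress: the inflater signals that it wants that same dictionary.
        Inflater inflater = new Inflater();
        inflater.setInput(compressed, 0, len);
        byte[] restored = new byte[email.length];
        if (inflater.inflate(restored) == 0 && inflater.needsDictionary()) {
            inflater.setDictionary(dict);
        }
        inflater.inflate(restored);
        inflater.end();
        System.out.println(new String(restored, "UTF-8"));
    }
}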

Woopsie. Again, the Mailinator stream is a living and breathing entity that's always changing. One minute might be a few million Viagra spams; the next minute might be all about fake Rolex watches. In other words, there is no "typical piece of data" - a static dictionary built off a sample of emails will be obsolete in relatively short order.

So, the first idea was to build a sliding dictionary builder. Each email is scanned for string occurrences and we keep a count of them. Then every so often (minutes or hours), the compressor switches to using the most recently constructed dictionary. Every compressed email is given a reference to its dictionary so when/if it needs to be decompressed, it knows what dictionary to give the decompressor. Many thousands of emails share the same dictionary so RAM to store dictionaries isn't particularly significant.

Well, that's great and does restore LZMA to about 60-70% - but remember I mentioned I had another problem with LZMA? Speed.

The C++ version of LZMA by Igor Pavlov compresses at about 1.7MB/s per CPU core on my test machine. Um, no. Firstly, Mailinator can pull down tens of MB per second at times. Secondly, no component of our processing pipeline can be allowed to take up this much CPU (my rule, not yours). We need our CPU for other things when large volumes of mail arrive. (The Java version, by the way, was about the same speed.)

Simply - LZMA is pretty awesome - but it's too slow for this purpose.

So for the moment, I fell back to using a fast but simpler compression (zlib/LZW) on individual emails - and we sink down to about 40-50% savings from compression.

A Bigger Idea of a "Dictionary"

The next step for me was to think about email composition. We get lots of different types of email - but we get lots of the same types too. For example, we get lots of newsletters (people send them to Mailinator then read them via POP or RSS).

The nice thing for us is that a newsletter email blast could be 10,000 emails that are all the same. Well, ok, not exactly - no two emails are ever the "same" because headers have times, dates, message-ids, etc. within them. But if we remove the headers, you can get 10,000 emails going into 10,000 different inboxes that all have the same message "body". Are you thinking what I'm thinking?

Right - store each email with its own headers plus a pointer to ONE system-wide byte array containing the newsletter body. What's the "compression" ratio of that? Well over 90%. And just to be a snot, we can then apply compression to that byte array to eke out another few percent. We're reusing memory here so it's not exactly "compression", but we are reducing the size of the data stored by some fantastic amount for this happy use case.
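A sketch of what that storage might look like (all names invented; the real Mailinator internals surely differ):

import java.security.MessageDigest;
import java.util.concurrent.ConcurrentHashMap;

public class SharedBodyStore {
    // One copy of each distinct body, keyed by a digest of its bytes.
    private final ConcurrentHashMap<String, byte[]> bodies =
        new ConcurrentHashMap<String, byte[]>();

    // Each stored email keeps its own headers plus a reference to the
    // single system-wide copy of its body.
    public static class StoredEmail {
        final byte[] headers;
        final byte[] sharedBody;
        StoredEmail(byte[] headers, byte[] sharedBody) {
            this.headers = headers;
            this.sharedBody = sharedBody;
        }
    }

    public StoredEmail store(byte[] headers, byte[] body) throws Exception {
        // 10,000 identical newsletters -> 10,000 header blocks, 1 body.
        String key = digest(body);
        byte[] shared = bodies.putIfAbsent(key, body);
        return new StoredEmail(headers, shared == null ? body : shared);
    }

    private static String digest(byte[] data) throws Exception {
        byte[] d = MessageDigest.getInstance("SHA-1").digest(data);
        StringBuilder sb = new StringBuilder();
        for (byte b : d) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}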

This isn't a revolutionary idea (online music libraries do the same thing), but it does fit pretty nicely in the Mailinator paradigm. Sadly, apart from newsletters, not many other email sets, spam or otherwise, have email bodies that are identical. In fact, spammers specifically change the subject line and destination URL of every email they send for tracking and spam-detection-thwarting purposes. So what you get is something like this (headers omitted):

Email 1:
Buy vi4gra now!
http://rrr4.somerandomthing.com/?3jwow33oo
Happy man are you will be!

Email 2:
Buy vi4gra now!
http://1rr220.somerandomthing.com/?ajo200kko
Happy man are you will be!

So much for simply detecting identical email bodies. And this goes for less nefarious things too. Sign-up emails from websites will contain the same surrounding text with different names and validation URLs inside.

What we could use here is a Longest Common Substring (LCS) algorithm. Basically, it would compare the two email bodies and be able to break them up as:

Common string 1:
Buy vi4gra now!\r\nhttp://

Disparate strings:
rrr4.somerandomthing.com/?3jwow33oo
1rr220.somerandomthing.com/?ajo200kko

Common string 2:
\r\nHappy man are you will be!

Nice - each email is stored as 3 (compressed) byte arrays, where 2 of those can be shared.

Unfortunately, classic LCS algorithms are expensive. Comparing two sequences is an O(nm) algorithm. And we're not interested in comparing two sequences; we're interested in comparing each new sequence (er... each new email) with the few million that preceded it. The LCS algorithm is also very memory-expensive in its creation of trie datastructures - again, scaling to millions of emails just doesn't fit our parameters.

Generally speaking, there are a lot of tricks I've noticed in analyzing algorithms. A few off the top of my head: if you see an easy O(n^2) algorithm, it's rather likely there's an O(nlogn) one hiding in there somewhere. In contrast, if your dataset is small, you might be better off sticking to algorithms that make your CPU's cache and instruction pipeline happy instead of worrying about algorithmic running time (i.e. bubblesort > quicksort for small data). Lastly - if you can make assumptions about your data, you can often short-cut the classic algorithm with a good approximation.

Caching Lines

Cool, so let's assume something about the data. For emails, as it turns out, disparate parts often occur on line boundaries (as you see in lines 1 & 3 above). A few same lines, a different one, a few more same. Instead of looking for common sequences based on individual characters, we can treat individual lines as units. Then we can attempt to find multiple occurrences of those lines. It cannot be as precise as LCS proper - in our above example we would not find the identical portion "http://" in line 2 - but we're basically settling for a greedy approximation, and one that works pretty well.

How do we store it though? LCS's tries would kill us. I know - let's use an LRU cache. Those darn things work for everything! We can use an LRU cache that caches full email lines. It will inherently flush out old email lines as the spam stream evolves (nice!) and will provide quick look-ups against thousands of lines at once (happy!). Specifically in Java, an LRU cache is a synchronized LinkedHashMap with true as the last constructor parameter and an overridden removeEldestEntry.
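In code, that recipe looks like this (MAX_LINES is whatever budget you pick; the values are the lines themselves, so a hit hands back the shared copy):

import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class LineCache {
    static final int MAX_LINES = 50000;  // "a few tens of thousands"

    // An access-ordered LinkedHashMap ('true' as the last constructor
    // argument) with removeEldestEntry overridden is an LRU cache:
    // get() counts as an access, so hits float to the front, and the
    // stale end gets evicted once we're over budget.
    static final Map<String, String> CACHE = Collections.synchronizedMap(
        new LinkedHashMap<String, String>(MAX_LINES, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > MAX_LINES;
            }
        });
}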

So we store a few tens of thousands of email lines in an LRU cache, and then as each new email comes in, we check to see if each line is in the cache. If it is, we reuse the one in the cache instead of creating new storage for this email. By assuming all common sequences are bounded at newlines, we remove the boundary-discovery work LCS must do. Strictly speaking, we're cheating and losing some opportunity, but it's a good-enough guess for this type of data.

This had a dramatic effect on our "compression" (again, it's slightly dubious to call it compression, but in the big picture our entire machinery of the LRU cache and bastardized LCS-in-spirit algorithm is creating a reuse-dictionary - it might not actually be compression, but it goes through several of the motions).

Caching Multi-lines

Caching lines is great - but what about caching multi-lines? Say we have a few emails - for brevity, assume each character in the following examples is an email "line":

Email 1:
ABC1

Email 2:
ABC2

Email 3:
ABC3

Email 4:
ABC4

So the first 3 lines are all the same in each email (ABC); the 4th lines are numbers, which are not the same. Our algorithm (a code sketch follows the steps below):

1) Load a LINE and see if it's in the cache (if no more lines, quit)
2) .. if it's not there, put LINE in the cache, and store LINE in the email - GOTO 1
3) .. if it IS there:
4) .... see if LINE + NEXT_LINE is in the cache
5) .... if it's not there, put LINE + NEXT_LINE into the cache and store LINE (which is a cache hit) in our email - GOTO 1
6) .... if it IS there, LINE = LINE + NEXT_LINE - GOTO 4
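And here's the same algorithm as a Java sketch (names are hypothetical; 'cache' is an LRU map like the one above, and the returned list is the email's stored chunks, shared wherever the cache hit):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MultiLineCacher {
    // Steps 1-6 from above: grow a run of lines while ever-longer
    // runs keep hitting the cache, then store the longest hit.
    static List<String> storeEmail(List<String> lines, Map<String, String> cache) {
        List<String> stored = new ArrayList<String>();
        int i = 0;
        while (i < lines.size()) {                      // step 1: load a LINE
            String run = lines.get(i++);
            String hit = cache.get(run);                // get() also refreshes LRU order
            if (hit == null) {                          // step 2: miss - cache it, store it
                cache.put(run, run);
                stored.add(run);
                continue;
            }
            while (i < lines.size()) {                  // step 4: is LINE + NEXT_LINE cached?
                String extended = run + "\n" + lines.get(i);
                String longer = cache.get(extended);
                if (longer == null) {                   // step 5: cache the longer run,
                    cache.put(extended, extended);      // store the hit we already have
                    break;
                }
                hit = longer;                           // step 6: keep extending
                run = extended;
                i++;
            }
            stored.add(hit);                            // reuse the shared copy
        }
        return stored;
    }
}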

So if we run our 4 emails above through this algorithm, we get the following:

Running through all of email 1 - we get:
- Cache HITS stored in email: none
- Cache MISSES stored in email: A,B,C,1
- Cache contents afterwards (lru order): 1,C,B,A

Running through all of email 2 - we get:
- Cache HITS stored in email: A,B,C
- Cache MISSES stored in email: 2
- Cache contents afterwards (lru order): 2,C2,C,BC,B,AB,A,1

(notice how '1' (which never got a cache hit) has worked itself to the end)

Running through all of email 3 - we get:
- Cache HITS stored in email: AB,C
- Cache MISSES stored in email: 3
- Cache contents afterwards (lru order): 3,C3,ABC,AB,A,2,C2,BC,B,1

Running through all of email 4 - we get:
- Cache HITS stored in email: ABC   <-- very cool result, note coolness
- Cache MISSES stored in email: 4
- Cache contents afterwards (lru order): 4,ABC,AB,A,3,C3,C,2,C2,BC,B,1

So what happened? The system has realized that ABC is cacheable and is now pointing to that. All subsequent emails with the set-of-lines ABC will reuse the same memory. Note that the disparate lines 1,2,3, and 4 will always be stored separately, but the algorithm will then pick-up any common line-sets later in the email too (if there were any).

This elaborate system to find equal email lines and reuse them brings compression of the entire flowing email stream to about 80%. What about 90%? Well... one more trick.

Back to LZMA

Remember LZMA from above, which we abandoned because it was too slow to happen inline? As you'd guess, the biggest impact it had was on bigger emails. And although it's a CPU hog, we do actually have a few cores lying around. So let's give it one (but seriously, just one).

We set up one core (i.e. thread) to trail behind, scanning incoming email for messages over some size (say 20k), and re-compress those using the sliding-dictionary LZMA mentioned earlier. While 3 of our cores average 5-10% utilization receiving, analyzing, and storing incoming email, the 4th core sits at 100% re-compressing emails where it will find benefit. If it gets too far behind, it simply leaps ahead and leaves some compression on the table.
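A sketch of that trailing worker (class and method names are invented; the real system surely differs):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class Recompressor {
    static final int MIN_SIZE = 20 * 1024;     // only bother with bigger emails
    static final int MAX_BACKLOG = 10000;      // bounded: never let work pile up forever

    private final BlockingQueue<byte[]> backlog =
        new LinkedBlockingQueue<byte[]>(MAX_BACKLOG);

    public Recompressor() {
        Thread worker = new Thread(new Runnable() {
            public void run() {
                while (true) {
                    try {
                        recompressWithLzma(backlog.take());  // the slow, one-core part
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }
        });
        worker.setDaemon(true);
        worker.start();                        // exactly one thread = one core, tops
    }

    public void submit(byte[] email) {
        if (email.length < MIN_SIZE) return;
        // offer() drops silently when the queue is full - that's the
        // "leave some compression on the table" behavior.
        backlog.offer(email);
    }

    private void recompressWithLzma(byte[] email) {
        // hand off to the sliding-dictionary LZMA pass described earlier
    }
}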

(Note that empirically, LZMA is an order of magnitude faster at decompressing than compressing - otherwise that would have been a new problem, as it could take too long when someone wanted to read an email.)

Voila. 90%. (Two notes: 1: that's a reasonable average at least... sometimes better, sometimes worse and 2: I realize I'm not exactly sure what "Voila" means, looking that up now).

There are also some other important notes. Storing a byte array in Java costs something. The pointer alone (64-bit) is 8 bytes. Then there is the byte-length field, padding, etc. In other words, I limited the system to never store email lines under 64 bytes. Small lines get concatenated together straight away.

Second, there are more email-idiomatic tweaks we can do to improve the situation. Base64-encoded attachments are effectively un-cacheable, so we pass over those.

Third, although from our cheeky example it may seem like we're finding optimal line sets (i.e. ABC), we're not. We could end up caching ABC and destroying an opportunity for a more optimal BCDXYZ or something. I'm guessing this doesn't happen often, but it would be an interesting future consideration.

Edit: Wow, sincere thanks to an Anonymous commenter for making me reconsider the above algorithm. I had originally stated it was O(n^2). My first version was indeed O(n^2) (which wasn't written about), and after a few changes it became O(n) and I failed to see that. I find it's very easy to find tech reviewers once an article hits Hacker News - before then though, not so much. :)   My apologies for the error.

So for the end-user, this whole diatribe means little except that their emails are sticking around longer. They have no idea that when they click to read an email, we may be LZW- or LZMA-decompressing tens of byte arrays shared by thousands of emails with a custom sliding dictionary built by scanning emails that arrived hours ago, and then concatenating them together so they can be shown on their webpage, all in a few milliseconds. And they likely don't care; they're probably too busy signing up for Minecraft or something.

But that's ok. I know.

And if you got this far, you know too.

Ok.. now back to real work. What was I doing again? Oh yeah, writing some slick one-liners in Ruby. No clue on the running times - probably like O(n^4) or something, but if I fiddle with it a bit more - I bet I can cut the character count of the code by half!