Sunday, March 30, 2008

How fast is Java Volatile? or Atomic? or Synchronized?

This is a question I've wondered about for a while, and I've gone ahead and taken a stab at finding out. One thing I had always heard is that AMD chips have an on-CPU memory controller and Intel's don't (at least not until their upcoming Nehalem line). Now, I'm the first to admit I'm talking above my knowledge (please correct profusely).

But the idea was that some operations (notably, I've heard, volatile reads, which I didn't test here) perform very similarly whether or not the variable is volatile.

I spent a few hours making up a benchmark. I'm publishing the benchmark below for one important reason - to get it right. This is a big disclaimer: Don't trust these results - at least not with any precision. I do conclude that contended synchronization is expensive - that's a pretty expected result, so hopefully we're somewhere on the right track.

However, Java benchmarks are very hard to write. So hard in fact that they do a hardness-overflow and end up looking easy - which is precisely what makes them so hard to get right. Hence - given that only I have seen this benchmark, it's pretty likely to be somewhat broken (which is why it's published here for public review to help fix it).

With all that...

What I've tested here are storage operations. Specifically:

Storage to a local variable (mostly used as a control)
Storage to an instance variable
Storage to a static variable
Storage to a static volatile variable
Storage to a static AtomicLong variable
Storage to a static variable protected by synchronization
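
To make the six cases concrete, here is a minimal sketch of what each kind of store looks like in Java. This is not the published benchmark source; the class, field, and method names (StoreKinds, storeAll, LOCK, and so on) are made up for illustration, and there is no timing or warm-up logic here at all.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: hypothetical names, not the benchmark from this post.
public class StoreKinds {
    long instanceField;                                      // plain instance variable
    static long staticField;                                 // plain static variable
    static volatile long volatileField;                      // static volatile variable
    static final AtomicLong atomicField = new AtomicLong();  // static AtomicLong
    static long guardedField;                                // static, written only while holding LOCK
    static final Object LOCK = new Object();

    // Performs each of the six kinds of store once per iteration
    // and returns the last value written to the local variable.
    long storeAll(long iterations) {
        long local = 0;
        for (long i = 0; i < iterations; i++) {
            local = i;               // local store (a JIT may well remove most of these)
            instanceField = i;       // instance store
            staticField = i;         // static store
            volatileField = i;       // volatile store
            atomicField.set(i);      // AtomicLong store
            synchronized (LOCK) {    // store protected by synchronization
                guardedField = i;
            }
        }
        return local;
    }

    public static void main(String[] args) {
        StoreKinds s = new StoreKinds();
        long last = s.storeAll(1000000L);
        System.out.println("last value stored: " + last);
    }
}
```

A real benchmark would time each kind of store in its own loop, on one or more threads, rather than mixing them together like this.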

I tested this on 3 different CPUs, all with JDK 1.6 and variously on Linux and Windows. One very important point: do *not* compare the absolute numbers in the graphs directly. In no case are we comparing the same CPU across 2 operating systems, and the CPUs are wildly different in capability.

So we can explain more with a graph:

This is the run on a dual-core dual-Opteron machine. Note the blue-bar is local variable storage. As you'd expect it simply destroys everything else. This is very probably because the smart JVM went in and realized it could optimize the assignment out of the loop. Note the blue-bar just gets happier and happier as we add cores. The threads are completely unaffected by each other and we get actual linear increases in power. Even at 5 threads (note we have 4 cores) we get some speedup.

The orange and yellow bars are instance variable and static variable stores respectively. Some level of indirection or inability to optimize jumps in and makes things a bit more sane. Surprisingly, we get little or no improvement as we add threads.

The three thread-visible methods of store are so slow they barely show up. This brings up an important point. Let's assume you wrote an application and made it correctly thread-safe, using synchronization or volatiles or whatever. You *cannot* optimize away synchronization in the name of performance. It simply can't be done. (Conversely, if you could, then it wasn't correctly thread-safe to start with.) You can however, at times, replace one type of synchronization primitive with another.
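
As a small sketch of swapping one primitive for another, here is a shared counter written two ways: once with synchronized methods and once with an AtomicLong. Both are thread-safe, so either can stand in for the other. The names (CounterSwap, LockedCounter, AtomicCounter, hammer) are invented for this example, and nothing here measures which one is faster.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: hypothetical names, not code from the benchmark.
public class CounterSwap {
    interface Counter {
        void increment();
        long get();
    }

    // Thread-safe via the object's monitor.
    static final class LockedCounter implements Counter {
        private long value;
        public synchronized void increment() { value++; }
        public synchronized long get() { return value; }
    }

    // Thread-safe via an atomic read-modify-write, no monitor involved.
    static final class AtomicCounter implements Counter {
        private final AtomicLong value = new AtomicLong();
        public void increment() { value.incrementAndGet(); }
        public long get() { return value.get(); }
    }

    // Starts `threads` threads that each increment the counter `perThread`
    // times, waits for them all to finish, and returns the final count.
    static long hammer(final Counter c, int threads, final int perThread) {
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(new Runnable() {
                public void run() {
                    for (int i = 0; i < perThread; i++) {
                        c.increment();
                    }
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return c.get();
    }

    public static void main(String[] args) {
        System.out.println("synchronized: " + hammer(new LockedCounter(), 4, 100000));
        System.out.println("atomic:       " + hammer(new AtomicCounter(), 4, 100000));
    }
}
```

Both versions end up with every increment counted. Dropping the synchronized keyword from LockedCounter without putting something like the AtomicLong in its place would not be an optimization, it would just be a bug.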

That being said, let's take a closer look at the same graph with just the barrier memory store operations.

Volatile and Atomic are neck-and-neck. Synchronization is still a big loser, especially once we introduce contention by giving it two threads.

One other anecdote - while running the non-memory-barrier benchmarks, my CPU meter showed 100% user space usage. With these benchmarks however, it went to 30-40% kernel cpu usage.

Ok, jumping to the Core2Quad Extreme CPU. Definitely a faster processor but with a different memory architecture.

Local store again goes flying. Oddly the instance store doesn't do much but the static store increases nicely with the cores. And, once again, we can't even fricken see the barriered writes.. so here they are.

Look how great the JVM does knowing that just one thread is out there. It can really optimize the code to elide or at least marginalize the impact. Surprisingly, synchronization still gets nailed across the board.

Note that according to what I have so far, a synchronized static store is something like 50 times slower than a simple static store, and 10 or so times slower than a volatile static store.

One more.. and it's a fun one. Windows XP running on a Dothan. (Funny, I realized I hadn't directly known that Dothan was single core, but I just assumed it after I saw the graphs.)

Crazy. Add a few threads and performance doesn't even budge. Of course, with only one core a system can completely avoid all the complex architecture keeping cores in sync. Why its local writes are worse than its instance writes beats me.

Also, although I said don't compare graphs, this little single core Dothan beats the Core2Quad in all barriered writes after 2 threads. Note that on the local writes, the Core2Quad is something like 50 times faster. But even on the simple static volatile store - at 2 threads, the Dothan is now twice as fast.

So. Pardon my informalness here - I'm actually quite expecting feedback to break this code in grand ways and force me to redo all the runs and rewrite all the text (which I'll be happy to do if we get a clean bench out of this). I have no intention of making very specific claims once this benchmark is firmed up - but I would like to have a "sense" of the costs of these different operations.

Source code for the Benchmark.

Thursday, March 27, 2008

Introducing: Alternate Inbox Names

Mailinator helps thousands of people every day avoid spam. It's been a great success.

One point that always sort of bothered me however was that by giving out a Mailinator address, you were basically telling people not only how to email you, but also how they can check *your* email. If I say "Email me at binkypop@mailinator.com", the whole world knows where to send and read my email!

Well, this is no longer a problem. Today we added "Alternate Inboxes" to Mailinator. We've always had alternate domains, but this is quite different.

Now, if you check any inbox on Mailinator (say, binkypop), listed on the page is the alternate inbox name. All alternate inbox names start with "M8R-". For example, for binkypop, the alternate inbox name is M8R-yg1hkn@mailinator.com.

So simply put, if you email binkypop@mailinator.com OR M8R-yg1hkn@mailinator.com it doesn't matter. Both emails would end up in the binkypop inbox (and nothing in the M8R-yg1hkn inbox). The only way to find out an alternate inbox name is to check the inbox here first - thus if you give out the alternate address, there is no way for anyone to guess that it actually goes to binkypop.

So.. pick yourself a favorite Mailinator inbox name (make it big, long, and hard-to-guess please!), go check the inbox to find out the alternate inbox name, and hand it out all over the web (of course, alternate domains work on alternate inboxes too!).

Then check the box (or RSS it) knowing only you know that's the real destination. Also, keep in mind - you don't have to remember the alternate inbox name ever - you just have to remember the real destination address. You can always go to Mailinator and get the alternate anytime just by checking the inbox. And of course, the regular think-up-on-the-fly-and-use Mailinator you know and love hasn't changed a bit. This feature is purely optional (I'm surprised how fast people started using it!)

(This feature is in beta and we might end up changing the alternate address scheme some)

This is cool. Alternate inboxes really do up the ante in Mailinator's spam fighting abilities. Enjoy!



Tuesday, March 4, 2008

Fun Mailinator server stats

All averages have some fuzziness in them. Max numbers were observed but are unlikely to be the actual max.

100% custom SMTP Server written in Java using blocking I/O, multiple (at times thousands of) threads, and liberal use of non-blocking data structures.
------------------------------------------------------------

Average daily incoming emails: 6.5 million
Average daily incoming bandwidth: 18.4G
Average # of threads running at any time: ~450
load average: (uptime cmd) 0.04, 0.05, 0.00

Average email size: 2194 bytes
Ram devoted to email storage (compressed): 400M-500M
Average number of stored messages: 218,675
Average number of active inboxes: 94,575
Average memory allocated to Java VM: 880M

Max observed emails in one hour: 1,000,085
Max observed emails in one second: 1,274
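
For a rough sense of scale, the numbers above can be turned into per-second rates. The sketch below only does arithmetic on the figures listed here; the class and method names are made up, and since the inputs are fuzzy averages the results are ballpark at best. It works out to roughly 75 emails per second on average, about 278 per second during the busiest observed hour, and a busiest second that is roughly 17 times the daily average.

```java
// Illustrative only: back-of-the-envelope arithmetic on the stats above.
public class MailStats {
    static final long EMAILS_PER_DAY = 6500000L;   // average daily incoming emails
    static final long PEAK_PER_HOUR = 1000085L;    // max observed emails in one hour
    static final long PEAK_PER_SECOND = 1274L;     // max observed emails in one second

    // Average emails per second over a full day (86,400 seconds).
    static double averagePerSecond() {
        return EMAILS_PER_DAY / 86400.0;
    }

    // Average emails per second during the busiest observed hour.
    static double peakHourPerSecond() {
        return PEAK_PER_HOUR / 3600.0;
    }

    // How many times larger the busiest observed second is than the daily average.
    static double peakToAverageRatio() {
        return PEAK_PER_SECOND / averagePerSecond();
    }

    public static void main(String[] args) {
        System.out.println("average per second:       " + averagePerSecond());
        System.out.println("peak hour, per second:    " + peakHourPerSecond());
        System.out.println("peak second vs. average:  " + peakToAverageRatio());
    }
}
```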