Monday, February 18, 2008

Kill the myth please. NIO is *not* faster than IO

I'm giving 3 talks at the Software Developer's Conference West in Santa Clara, CA on March 3-7.

One of them is a tutorial on how to interview in Silicon Valley (fun stuff).

Another is how to write high-performance servers in Java. Specifically, why NIO is really not the best way anymore (post linux 2.6 kernel and NPTL) and multithreaded I/O is the new old-way of doing things. Its a fun talk and largely discusses the internals of Mailinator and how it runs a few thousand simultaneous threads without breaking a sweat.

On top of that, as it turns out, in pure throughput, IO smokes NIO in all tests I tried. And I'm not alone - Rahul Bhargava of Rascal Systems did a very nice analysis of this and posted it, sadly, in some forums at theserverside.com.

The post is solid gold and I'm posting it below just for posterity as I'd hate to see those forums come down some day and I lose access to that post. Incidentally, I disagree with Rahul that "fewer threads are easier to debug". It only takes 2 to make a deadlock or race condition.

The original URL for this post is:
http://www.theserverside.com/discussions/thread.tss?thread_id=26700

And like I said, it was written by Rahul Bhargava of Rascal Systems. Here it is:

-------------------------------------------------

I have been benchmarking Java NIO with various JDKs on Linux. Server is
running on a 2 CPU 1.7 GHz, 1GB RAM, Ultra160 SCSI 36GB disk

With Linux kernel 2.6.5 (Gentoo) I had NPTL turned on and support for
epoll compiled in. The server application was designed to support
multiple disptach models :

1. Reactor with Iterative Disptach with multiple selector threads. Essentially
the accepted connections were load-balanced between varying number of
selector threads. The benchmark then applied a step function to experimentally
determine the optimal # of threads and connection per selector ratio.

2. Also a simple concurrent blocking disptach model was supported. This is
essentially a reader thread per connection model.

Client application opens concurrent persistent connections to the server
and starts blasting messages. Server just reads the messages and does
basic un-marshalling to ensure message is ok.

Results were interesting:

1. With NPTL on, Sun and Blackwidow JVM 1.4.2 scaled easily to 5000+ threads. Blocking
model was consistently 25-35% faster than using NIO selectors. Lot of techniques suggested
by EmberIO folks were employed - using multiple selectors, doing multiple (2) reads if the first
read returned EAGAIN equivalent in Java. Yet we couldn't beat the plain thread per connection model
with Linux NPTL.

2. To work around not so performant/scalable poll() implementation on Linux's we tried using
epoll with Blackwidow JVM on a 2.6.5 kernel. While epoll improved the over scalability, the
performance still remained 25% below the vanilla thread per connection model. With epoll
we needed lot fewer threads to get to the best performance mark that we could get out of NIO.

Here are some numbers:

(cc = Concurrent Persistent Connections, bs = Is blocking server mode on Flag,
st = Number of server threads, ct = Connections handled per thread,
thruput = thruput of the server )

cc, bs,st,ct, thruput
1700,N,2,850,1379
1700,N,4,425,1214
1700,N,8,212,1240
1700,N,16,106,1140
1700,N,32,53,1260
1700,N,64,26,1115
1700,N,128,13,886
1700,N,256,6,618
1700,N,512,3,184
1700,Y,1700,1,1737

As you can see the last line indicates vanilla blocking server (thread per connection)
produced the best thruput even with 1700 threads active in the JVM.

With epoll, the best run was with 2 threads each handling around 850 connections in
their selector set. But the thruput is below the blocking server thruput by 25%!

Results shows that the cost of NIO selectors coupled with OS polling mechanism (in
this case efficient epoll VS selector/poll) has a significant overhead compared to
the cost of context switching 1700 threads on an NPTL Linux kernel.

Without NPTL of course it's a different story. The blocking server just melts at 400 concurrent
connections! We have run the test upto 10K connections and the blocking server outperformed
NIO driven selector based server by same margin. Moral of the story - NIO arrives at the scene
a little too late - with adequate RAM and better threading models (NPTL), performance gains
of NIO don't show up.

Sun's JVM doesn't support epoll() so we couldn't use epoll with it. Normal poll() based
selector from Sun didn't perform as well. We needed to reduce the number of connections
per thread to a small number (~ 6-10) to get comprabale numbers to epoll based selector.
That meant running lot more selector threads kind of defeats the purpose of multiplexed IO.
The benchmarks also dispell the myth created by Matt Welsh et al (SEDA) that a single
threaded reactor can keep up with the network. On a 100Mbps ethernet that was true: network
got saturated prior to server CPUs but with > 1Gbps network, we needed multiple selectors
to saturate the network. One single selector's performance was abysmal (5-6x slower than
concurrent connections)

For application that want to have fewer number of threads for debuggability etc, NIO may be
the way to go. The 25-35% performance hit may be acceptable to many apps. Fewer threads
also means easier debugging, it's a pain to attach a profiler or a debugger to a server hosting
1000+ threads :-) . Bottom line with better MT support in kernels (Linux already with NPTL), one
needs to re-consider the thread per connection model

Rahul Bhargava
CTO, Rascal Systems

Thursday, February 7, 2008

Mailinator allows 831375028249471605232943589 trillion different email addresses. Please choose a good one.

Thats just an approximation really (allowing for 36 possible characters (there's more really) and email names of up to 25 characters).

Regardless, there's a lot of possibilities. So just a friendly reminder - please use Mailinator !!! But choose an offbeat name.
There's plenty to choose from :)

Update: math error, number was too low

Friday, February 1, 2008

More on the "pen1s" post

In my post about searching 185 mailinator emails per second in Mailinator, I think I failed to adequately explain the goal of it. Its intent was to document a thought process - it was not intended nor tried to be a taxonomy of search algorithms (I gave some very nice links if you're looking for that). I really don't think the world needs another taxonomy of string searching algorithms, or at least not one written by me.. Nor does it need another academic analysis of any existing algorithms - there are plenty of those out there (and I linked Wikipedia in that post too where appropriate to describe specific algorithms).

Over a long history of reading about algorithms, I realized that I rarely found many analysis of algorithms from an engineer viewpoint. I am a recovering academician, but engineering is where you get to tinker with things. I've read many times Boyer-moore's best case is O(n/m) - but whats the *real* case? The usual case? How can that influence me in using any of these algorithms? Many searching algorithms display their worst case when you search for stuff like "AAAA" and your search string is "AAAABAAAABAAAAAB" - well guess what, I don't often search for that kind of thing in practice.

Also many people don't truly understand Big O notation (my great aunt made incredible homemade donuts, but ask her about algorithmic analysis? not so much). Considering some examples in the concrete, especially realistic cases can provide insight. For example, tell your favorite non-algorithmic person that there is an algorithm called "bubble sort", and it sorts numbers for you. But in order to sort 100 numbers, it has to look at each of those numbers on the order of 100 times. Effectively it must do on the order of 10,000 comparisons to sort 100 numbers - they'd look at you like you're crazy.

That post (and future ones in the series if i write them) attempts to detail an evolutionary engineering thought process on designing an algorithm for a specific purpose. The article started with the canonical "naive" but undeniably correct solution. When considering any problem, correctness is in my opinion, an excellent place to start.

I did not "forgot" to mention any other searching algorithms. They simply didn't not yet arise in this thought process (several people mentioned Aho-Corasick, which I didnt mention by name but certainly alluded to its features when I mentioned the finite state machines of regular expressions). A Trie is a basis for AC, but its definitely not AC (which others seemed confused about).

As I said, we're by far mostly interested in the "usual case" running time (which future installments will empirically demonstrate). I was also pretty clear in the last installment that what I had presented so far was *not* was I was using in Mailinator. Some people seemed to miss that (they people seem so excited to comment, they must have not actually read much of the post). I'm not sure how far this thread of blog posts will go - again, I'm documenting my thinking.

And very importantly, I was also clear (and trying to be clearer) that I am in some sense over-engineering this problem. I know that - that is basically a goal here. I likely don't need the performance I ultimately will end up with. This is not science - its fun.

So to concisely re-state the purpose of the blog entry - I'm documenting the thought process that coincided with evolving needs for an over-engineered multi-term string searching algorithm. Thought processes almost by nature go down wrong avenues - which isn't bad, its educational. The interesting parameters are such that we have a hundred or so words (expecting that to grow) and rarely find any of them. Engineering analysis is limited as compared to academic analysis in that you give specific cases and values - you are really only able to look at one instance of a big problem. But again, that all I'm trying to do here.

Thus, If you're looking for a in-depth explanation of Boyer-Moore (which some people who didn't read the sentence in the last post where I said I wasn't providing that) seemed to want, or you want a nice taxonomy of effective string searching algorithms, then you should probably go read something like that. Because this.. wasn't that.

As other interesting facts, I chose 185 emails/second because it was a relatively representative rate for Mailinator. I truly don't know Mailinator's max rate - I've seen 1200 emails/second and I've seen 10 emails/second. The number 185 was chosen as representative, not as the only possibility.

Finally, as I said, I'm aware of algorithms such as Rabin-Karp, Aho-Corasick, and Wu-Manber (in fact, I see Udi most days at work). From a storytelling standpoint, it wouldn't be much of a tale if I didn't get an interesting result in the end. Aho-Corasick is linear on the search space and Wu-Manber is nicely sub-linear. Not to give it away, but Mailinator's algorithm is also sub-linear. It isn't Wu-Manber and all truth be told, Wu-Manber is notably better in memory use and worst case running time.

If you're asking then why should you use such a thing? I'm not advocating you should. Again, the idea here isn't so much the final algorithm (which is really just the punchline). Its really way more about how we get there.