Wednesday, August 20, 2008

Benchmarking Talkinator

I really wrote Talkinator for one reason - to play around with some server architectures I had bouncing around in my head. Now, any programmer in almost any language will tell you that it would take just a few hours on a Saturday to write a chat server. Its really not that hard.

If you add in the fact of using Ajax and then emulating sessions over a stateless protocol, and then add in a mildly usable client, you're probably up to a weekend or two. But its still not rocket science.

The Talkinator server in reality took me over a year to write. Partially because it was a weekend project and I surely didn't give every weekend to it. Also, because I ripped out the architecture more than once. At first it was a Java applet that did nice pretty TCP communication. I made a false assumption that applets, maybe after 10 years, didn't suck anymore. But I was wrong. They still suck - and a lot.

Finally, I settled on the ajax client model it is on now. I also really didn't want to use asynchronous/NIO networking. The world seems to still fight me (without doing their own benchmarking) but NIO only makes sense when you can't do Thread-per-connection. On linux, threads (using NPTL and kernel 2.6) are cheap now. And Java IO (synchronous I/O) is measurably faster than Java NIO (asynchronous I/O). In fact, this has nothing to do with Java, I've had many great conversations with C++ folks who find the same thing.

This is where chat becomes mildly interesting. Chat is of course, a "broadcast" networking model. That is, one user sends a message and that message must be propagated to some number of other users. Clearly, many networking applications follow this model, but given that chat is just text - its a rather pure form. It has comparably few details and I could focus just on the data pushing part.

As I said, writing a chat server is easy. However, like many things, writing a chat server that scales is harder. And because of its combinatorial nature, it grows dangerously fast. For this reason, it seemed like a fun project to make a fast broadcast server.

I mentioned that asynchronous/NIO is still suitable when you cant do Thread-per-connection and for Java you soft-max at around 12000 threads (in my testing) and hard-max at 16383 (JVM just wont make more - although some JVMs goto 32767). A chat server is almost the canonical example of a server that has very many, low traffic connections. Its a perfect candidate for NIO.

The trick with synchronous I/O is that a thread will block on every read and every write to the clients. A chat server has the bandwidth to handle 10's of thousands of users, but not if every user equals a thread.

Although synchronous IO reads and writes are blocking, the sequence of conversation events (avoiding the word "protocol" here) in this scenario is clearly defined. Once a client opens a socket and sends a message (on a blocking read), the thread can basically abandon that client. Simply put the client in a data structure and go off and service other clients. The client connection then "hangs open" until some thread comes back and gives it some data (i.e. someone in the same room sent a message). Note that the thread does block on that initial read, but in essence, we know its coming right away. And if it doesn't, we be sure to have a fast socket timeout for this particular wait.

In effect, Talkinator is a synchronous IO system which can support tens of thousands of open connections, running on just a few hundred threads (on a LAN it only needs a few tens of threads).

The weird part to me is that this is a blocking-IO but non-thread-per-connection server.

Every incoming message wakes up a thread to do the data read. Now each incoming message causes the need for (on average lets say) 10 outgoing messages. So sending is a lot of work as compared to reading. But we only wake threads for incoming messages - so once a thread enters the system when you say "Hello there", it has a lot of work to do on the sending side. The threads however, aren't picky about who's sending work they do. A thread you woke might remain active for a long time sending messages in rooms you've never heard of. If one thread incites the need for 100 outgoing messages, a small army of threads spawned from previous incoming messages jump in to help with the sending work. Once the current mound of sending work subsides, threads start going inactive (until more messages arrive).

Messages are queued for each user. When the system isn't too busy, most messages go out by themselves with very low latency. But as the system gets busy, message queues might grow a bit. Thus when a send finally happens, all those messages get sent at once. In effect, its sort of a "dynamic batching" system. We trade latency for throughput (and memory) on a dynamic basis. Room merges cause 30 deep queues but those are rare. Even in busy times I don't see deep queues at all - this implies low latency.

We call many pieces of software out there "engines". But to me, this really is like an engine. The faster it goes - the faster it will go. It is its own chain reaction.

One final note on this engine - I use no blocking queues (in the synchronization sense, not the IO sense). To me, blocking queues are simply a forced thread context switch. I let the thread that did the accept then jump in and do the read and subsequent work. On linux, a thread timeslice is on the order of 10ms before its forcibly swapped out. You can service a lot of messages in 10ms. A context switch isn't hugely expensive but its not free either and I see no reason to force more than will already happen.

Also, a socket accept followed by an immediate read is also probably a context switch. Because chances are, the read data isn't here yet (although we know its coming) so I'm right now playing with the idea of a thread accepting, then going off to help other threads do some sending, then coming back to do its read. Hopefully giving the read a chance to have its data in incoming-buffers first.

I started the project with Tomcat but quickly moved away and wrote my own webserver. Of course, using blocking IO. I found quickly that my little webserver could serve at most about 25000 web requests per second. It turns out that this isn't a limitation of my webserver. Socket accepts in linux (and probably everywhere) are damn expensive.

I expanded Talkinator's webserver to use keepalive (a fundamental feature on every contemporary webserver). In my benchmarks now, on my quad-core desktop the talkinator server can push about 39000 messages per second. Keep in mind this is processed messages as in decoded, packaged, and queued for particular recipients. (I often see claims of people sending millions of messages per second on various non-chat systems - this is quite easy if you don't have an incoming request to parse and are sending one-to-one endpoints - simply fill a 10G buffer with pre-formatted messages, start the timer, and hit "send" - basically your CPU is doing nothing and all you're measuring is the speed of your network).

I'll note that pretty much the entire software stack of Talkinator is custom. The webserver, the RPC, the cometd mechanism, and even the message encoding. Apart from an exercise in performance and me having fun, there was no particularly good reason to re-invent all those wheels. Just "adding another server or 3" would have probably worked equally well.

The ajaxian portion of the system requires that a message is sent per user every so often regardless of activity. If in fact, if we make an aggressive estimate that each user gets a message every 10 seconds, then one talkinator server could handle 390,000 users. This is an interesting thought exercise but its my expectation that the bottleneck of the system will be bandwidth or memory long before we ever run into an actual CPU problem.

I'll specifically note here that this discussion was not about scalability at all. It was about per server throughput. The talkinator scalability model is orthogonal would probably warrant its own blog entry.

This server would be faster if written in C++ or C. But the threading techniques I employed here were only possible (for me I mean) in Java. If this performance seems out-of-whack to you in any way, simply write a server in your favorite language that reads but ignores the incoming request and returns a tiny html webpage as fast as it can. This will give you a feel for the "fastest possible webserver". After that, every feature is slowing you down.

I don't suppose there is a moral here other than this is an experiential report of one of my weekend projects. Seriously, if you still believe asynchrounous/NIO beats synchronous/IO (and it did when linux threads killed the system after just a few hundred were started) you might want to do some research. Threaded socket programming is back.. and with a vengeance.