Wednesday, August 20, 2008

Benchmarking Talkinator

I really wrote Talkinator for one reason - to play around with some server architectures I had bouncing around in my head. Now, any programmer in almost any language will tell you that it would take just a few hours on a Saturday to write a chat server. Its really not that hard.

If you add in the fact of using Ajax and then emulating sessions over a stateless protocol, and then add in a mildly usable client, you're probably up to a weekend or two. But its still not rocket science.

The Talkinator server in reality took me over a year to write. Partially because it was a weekend project and I surely didn't give every weekend to it. Also, because I ripped out the architecture more than once. At first it was a Java applet that did nice pretty TCP communication. I made a false assumption that applets, maybe after 10 years, didn't suck anymore. But I was wrong. They still suck - and a lot.

Finally, I settled on the ajax client model it is on now. I also really didn't want to use asynchronous/NIO networking. The world seems to still fight me (without doing their own benchmarking) but NIO only makes sense when you can't do Thread-per-connection. On linux, threads (using NPTL and kernel 2.6) are cheap now. And Java IO (synchronous I/O) is measurably faster than Java NIO (asynchronous I/O). In fact, this has nothing to do with Java, I've had many great conversations with C++ folks who find the same thing.

This is where chat becomes mildly interesting. Chat is of course, a "broadcast" networking model. That is, one user sends a message and that message must be propagated to some number of other users. Clearly, many networking applications follow this model, but given that chat is just text - its a rather pure form. It has comparably few details and I could focus just on the data pushing part.

As I said, writing a chat server is easy. However, like many things, writing a chat server that scales is harder. And because of its combinatorial nature, it grows dangerously fast. For this reason, it seemed like a fun project to make a fast broadcast server.

I mentioned that asynchronous/NIO is still suitable when you cant do Thread-per-connection and for Java you soft-max at around 12000 threads (in my testing) and hard-max at 16383 (JVM just wont make more - although some JVMs goto 32767). A chat server is almost the canonical example of a server that has very many, low traffic connections. Its a perfect candidate for NIO.

The trick with synchronous I/O is that a thread will block on every read and every write to the clients. A chat server has the bandwidth to handle 10's of thousands of users, but not if every user equals a thread.

Although synchronous IO reads and writes are blocking, the sequence of conversation events (avoiding the word "protocol" here) in this scenario is clearly defined. Once a client opens a socket and sends a message (on a blocking read), the thread can basically abandon that client. Simply put the client in a data structure and go off and service other clients. The client connection then "hangs open" until some thread comes back and gives it some data (i.e. someone in the same room sent a message). Note that the thread does block on that initial read, but in essence, we know its coming right away. And if it doesn't, we be sure to have a fast socket timeout for this particular wait.

In effect, Talkinator is a synchronous IO system which can support tens of thousands of open connections, running on just a few hundred threads (on a LAN it only needs a few tens of threads).

The weird part to me is that this is a blocking-IO but non-thread-per-connection server.

Every incoming message wakes up a thread to do the data read. Now each incoming message causes the need for (on average lets say) 10 outgoing messages. So sending is a lot of work as compared to reading. But we only wake threads for incoming messages - so once a thread enters the system when you say "Hello there", it has a lot of work to do on the sending side. The threads however, aren't picky about who's sending work they do. A thread you woke might remain active for a long time sending messages in rooms you've never heard of. If one thread incites the need for 100 outgoing messages, a small army of threads spawned from previous incoming messages jump in to help with the sending work. Once the current mound of sending work subsides, threads start going inactive (until more messages arrive).

Messages are queued for each user. When the system isn't too busy, most messages go out by themselves with very low latency. But as the system gets busy, message queues might grow a bit. Thus when a send finally happens, all those messages get sent at once. In effect, its sort of a "dynamic batching" system. We trade latency for throughput (and memory) on a dynamic basis. Room merges cause 30 deep queues but those are rare. Even in busy times I don't see deep queues at all - this implies low latency.

We call many pieces of software out there "engines". But to me, this really is like an engine. The faster it goes - the faster it will go. It is its own chain reaction.

One final note on this engine - I use no blocking queues (in the synchronization sense, not the IO sense). To me, blocking queues are simply a forced thread context switch. I let the thread that did the accept then jump in and do the read and subsequent work. On linux, a thread timeslice is on the order of 10ms before its forcibly swapped out. You can service a lot of messages in 10ms. A context switch isn't hugely expensive but its not free either and I see no reason to force more than will already happen.

Also, a socket accept followed by an immediate read is also probably a context switch. Because chances are, the read data isn't here yet (although we know its coming) so I'm right now playing with the idea of a thread accepting, then going off to help other threads do some sending, then coming back to do its read. Hopefully giving the read a chance to have its data in incoming-buffers first.

I started the project with Tomcat but quickly moved away and wrote my own webserver. Of course, using blocking IO. I found quickly that my little webserver could serve at most about 25000 web requests per second. It turns out that this isn't a limitation of my webserver. Socket accepts in linux (and probably everywhere) are damn expensive.

I expanded Talkinator's webserver to use keepalive (a fundamental feature on every contemporary webserver). In my benchmarks now, on my quad-core desktop the talkinator server can push about 39000 messages per second. Keep in mind this is processed messages as in decoded, packaged, and queued for particular recipients. (I often see claims of people sending millions of messages per second on various non-chat systems - this is quite easy if you don't have an incoming request to parse and are sending one-to-one endpoints - simply fill a 10G buffer with pre-formatted messages, start the timer, and hit "send" - basically your CPU is doing nothing and all you're measuring is the speed of your network).

I'll note that pretty much the entire software stack of Talkinator is custom. The webserver, the RPC, the cometd mechanism, and even the message encoding. Apart from an exercise in performance and me having fun, there was no particularly good reason to re-invent all those wheels. Just "adding another server or 3" would have probably worked equally well.

The ajaxian portion of the system requires that a message is sent per user every so often regardless of activity. If in fact, if we make an aggressive estimate that each user gets a message every 10 seconds, then one talkinator server could handle 390,000 users. This is an interesting thought exercise but its my expectation that the bottleneck of the system will be bandwidth or memory long before we ever run into an actual CPU problem.

I'll specifically note here that this discussion was not about scalability at all. It was about per server throughput. The talkinator scalability model is orthogonal would probably warrant its own blog entry.

This server would be faster if written in C++ or C. But the threading techniques I employed here were only possible (for me I mean) in Java. If this performance seems out-of-whack to you in any way, simply write a server in your favorite language that reads but ignores the incoming request and returns a tiny html webpage as fast as it can. This will give you a feel for the "fastest possible webserver". After that, every feature is slowing you down.

I don't suppose there is a moral here other than this is an experiential report of one of my weekend projects. Seriously, if you still believe asynchrounous/NIO beats synchronous/IO (and it did when linux threads killed the system after just a few hundred were started) you might want to do some research. Threaded socket programming is back.. and with a vengeance.

9 comments:

Anonymous said...

why not writing a twitter killer then? with 50 times less hardware. you seem like the correct guy ;)

Mariano said...

I don't really use any of those two services, but the blogentries detailing how you built them are very educational.

Anonymous said...

out of curiosity, did you consider using erlang?

Chris Dew said...

I can understand multiplexing the output to sockets across a small number of threads.

How do you handle synchronously waiting for input from clients, if you only have 100 threads for, say, 10,000 clients? Even if the socket timeout is a second, that's a minute of latency when sending a message.

Is each read only done in response to writing data to a socket?

A diagram, or pseudocode, would help...

Anonymous said...

I hope you read the comments Paul:

I have, probably like many other people, been using Mailinator for signup at various forums and sites to lessen the risk of future spam on my regular email.

The addition of alternative adresses for mailinator is a great way to add some more privacy to that usage - I can now re-use one mailinator (alternative) adress on many sites without worrying that a malicious admin on any one of those sites could read in on anything in my mailinator inbox.

At least I thought so. But then I realized that I had from my real mailinator inbox page been clicking on the "activate this account" links in the emails from these sites and forums. My question: when clicking like that, does the target site get the adress to my real inbox on mailinator as a referrer?

Gilles Dubuc said...

Very interesting strategy, as usual. I'm doing some research about solving a similar problem at the moment, which would be "real-time" notifications on a website (eg. "you have a new message in your inbox"). What appeals to me with low-latency solutions is to keep the same state on various tabs a user might have open. Especially since you might batch push messages by IP address.

I have an important question though, in your tests didn't you run into a limit in the number of sockets open at the same time? In my situation (which in this respect would be different from talkinator) the sockets will be very idle *but* there will be one socket kept open per user*number of tabs. I've tried to research the maximum number of sockets that could be open at the same time but I couldn't find any reliable info on this.

Someone mentioned erlang which is also an option I'm considering to tackle this problem, did you think about using it for talkinator at all?

Paul Tyma said...

Sockets: I found no limit - yet. But I haven't tried to push it. I have certainly gotten "Too many files open" on linux but thats almost expected these days and easily fixed with some config changes.

Erlang: I never considered Erlang because of its performance (disclaiming: I haven't tested this myself, but googling "Erlang performance benchmark" seems pretty clear).

Erlang allows threads to only consider their own world and send other threads (i.e. other threads) messages as necessary.

By default, Java lives in an opposite world - all threads in the same heap. However, You can code Java very often in Erlang style. Judicious use of volatility within Java can allow you manually (and surgically) trigger memory fences. That and idempotency can get a lot done without any synchronization.

LuSiD said...

Paul,

I have no idea what you are talking about, because I don't code.

I am a visual artist, VJ, and designer, and do inflatables.

I am just commenting to say that I am a fan of yours. Mailinator rocks, alway has, and I am looking fwd to checking out talkinator. I wasn't surprised when I read you were working with Google.

Later.

john said...

I'm currently writing a chat server and am looking at using xSockets, which uses NIO. Did you consider using that? I don't know what state it was in a year ago, might not have been usable.

How do you launch your server threads? i.e. how do you determine a new thread is needed?

Im hungry for more info.