Wednesday, June 2, 2010

How I sped up my server by a factor of 6

(with one linux command)

Subtitle: IO-bound and CPU-bound applications are common - here's maybe a "memory-contention-bound" app


I've written a lot of servers. If you've read the architecture of mailinator, benchmarking talkinator, or "blocking faster than non-blocking", you probably already have that impression.

I was recently working on a new server infrastructure I needed for a new product. The server has an internal architecture that's novel to me: it seeks to never voluntarily block threads or cause context switches, yet at the same time it's highly multi-threaded and creates new threads whenever it needs to.
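
To make "never voluntarily block" a bit more concrete, here's the flavor of the idea - just an illustrative sketch, not the actual server code: hand messages between threads through a non-blocking queue so neither side ever parks waiting on the other.

// Illustrative sketch only - not the real server code.
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class NonBlockingHandoff {
    private final Queue<byte[]> inbox = new ConcurrentLinkedQueue<byte[]>();

    // Producer side: offer() on a ConcurrentLinkedQueue never blocks.
    public void deliver(byte[] msg) {
        inbox.offer(msg);
    }

    // Consumer side: poll() returns null instead of blocking when empty,
    // so the thread can go do other useful work (or decide to spawn help).
    public void drainSome() {
        byte[] msg;
        while ((msg = inbox.poll()) != null) {
            handle(msg);
        }
    }

    private void handle(byte[] msg) {
        // collate and forward to interested consumers
    }
}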

The server's nature isn't totally important, but you can think of it along the lines of a twitter server. It gets messages from one place, collates them, and sends them to (potentially many) other places. I suppose it could be used directly as a "twitter server" but, of course, they already have those and my startup's focus is rather orthogonal to that.

I usually write servers in Java on linux so keep that in mind for the ideas raised here. At Google, I worked on servers in C++ and Java. And especially after that experience, I'll stay with Java. To hopefully avoid language wars, the #1 reason I write in Java is that I'm way better at Java than I am at C++ or Ruby or Python or whatever. In other words, it was a practical (and definitely reasonable) choice in this case.

Also, the benchmarking here isn't contrived - I'll save the details, but I truly am asking the server to do precisely what it would do in the wild except at a higher rate.

Now, given the nature of the server, I needed 2 things to help benchmark. A "producer" (a system that produces messages) and a "consumer" (a system that consumes them). The producer produces a sequenced message set, and the consumer verifies it receives every message intact *and* in-order.
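
For the curious, the consumer-side check is conceptually trivial. Here's a minimal sketch of the in-order verification (not my actual harness):

// Minimal sketch of the consumer's ordering check - illustrative only.
public class SequenceChecker {
    private long expected = 0;

    public void onMessage(long seq, byte[] payload) {
        if (seq != expected) {
            throw new IllegalStateException("expected msg " + expected + " but got " + seq);
        }
        expected++;
        // payload integrity gets checked here too (e.g. length/contents)
    }
}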

I chose a message size somewhat smaller than twitter-sized (msg size = 100 bytes) to induce a reasonable amount of activity in the server (the server's purpose isn't necessarily large messages, and large messages tend to just saturate bandwidth without causing contention or server CPU activity). My custom protocol adds a few bytes and the TCP/IP headers add about 40 bytes - so overall I'm guessing I'm averaging around 200 bytes per message, counting everything.

Producer  ->  Server  ->  Consumer


The server system is pretty flexible and in effect, location transient. That is, a server process might live on one machine today and find itself on another machine tomorrow. Or a few new server processes may join the mesh. Needless to say, the idea is to create a scalable, flexible system.

One ramification of that is that it's possible, and even very likely, that some producers, servers, and consumers could end up on the exact same machine. Socket communication over localhost changes a lot of assumptions. Firstly, TCP fusion can occur, reducing overhead significantly. Secondly, there's no effective bandwidth limit - there isn't a real network involved, just a virtual (and fast) one.

Running all 3 on the same machine, I would expect this benchmark to be CPU-bound, given that I/O is now virtual and effectively a CPU operation. Like I said, loopback is a real scenario I need, but I think I initially benchmarked on a single machine mostly as a matter of convenience.

So, with all 3 processes running on the same machine, I ran the test. The CPU was an Intel Core i7 920. It has 4 hyperthreaded cores that act like 8. (Note: I've tested all of this on a Core 2 Quad (non-i7) CPU and Core Duos, and the results are effectively the same.)

Here's the image of htop during the test with all 3 processes running. (if you run linux and use top, upgrading to htop is highly recommended).



Don't get hung up on reading the numbers or text. The graph in the upper left gives you a sense of how busy the CPUs are. In this picture, they're all at least a little busy. That's no surprise given we're running many tens of threads.

Notice that none of the cores are "maxed". This benchmark showed the server receiving and re-sending about 120,000 messages per second (that's 120k from producer to server and the same 120k from server to consumer - so 240k "message transmissions" but only 120k messages - analogous to queries-per-second for a webserver).

Why aren't the active cores maxed?

It occurred to me that I was running 3 CPU-bound processes on the same machine and that the processes might be stepping on each other's toes. It's possible that if the server is running on core 4 one second and the producer is running there the next, the level 1 cache of that core could be ruined for the server the next time it ambles over there.

The simple solution was to assign specific cores to the processes. In linux, you do this with the taskset command.

taskset takes a bitmask of the CPUs a process is allowed to run on, followed by the command to run - so I ran (in separate xterms):


taskset 0x3F java manyfeedServer
taskset 0x40 java theProducer
taskset 0x80 java theConsumer


and off they went. The server is highly multithreaded so I gave it 6 cores. The producer and consumer each got one.
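
The mask is just a bitmap of allowed CPUs: 0x3F is binary 111111, i.e. CPUs 0-5 for the server; 0x40 is CPU 6; 0x80 is CPU 7. If hex isn't your thing, taskset also accepts a plain CPU list via -c, which does the same thing:

taskset -c 0-5 java manyfeedServer
taskset -c 6 java theProducer
taskset -c 7 java theConsumer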

The result? 270,000 messages/second! Wow. If my cache assumption was right (and I'm not claiming to know if it was) - it REALLY worked. One way or another though - something worked - we got better than a 2x speedup.

So you might be thinking the moral of this story is:

1) If you're in linux, install HTOP
2) If you're sharing a computer amongst CPU-bound processes, isolating the processes might (and very well "might not") be beneficial.

And ok, those are fine ideas. But what bugged me was that htop still showed no CPUs maxed. Again, what was slowing my application down before it even ran out of CPU?

I then tried limiting the server to 2 (hyperthreaded) cores. (I also tried keeping the producer and consumer on the same hyperthreaded "core" and given that I had cores to spare, also tried separating them, but the result was the same).


taskset 0x03 java manyfeedServer


Now, we get 530,000 messages/second. Nice. Reducing the cores from 6 to 2 nearly doubles our msgs/sec again. Here's the htop now:



You can see that cores 1 & 2 are plenty busy. Cores 3,4,5,6 are idle as expected. Core 7 (the producer) is pretty busy and so is Core 8 (the consumer).

530k msgs/second is nothing to sneeze at but.. um.. again, no core is maxed. Why not? What's the bottleneck?

Obviously.. the last test is to throw the whole server on ONE cpu. Despite the fact that I very purposely and meticulously coded this server to be highly multi-threaded, fewer CPUs seem to make it happier. I am a conservative thread-safety-inducer. That is, I'm only really dangerous with firearms and synchronized blocks. But I'm by no means afraid to use the latter when needed.


taskset 0x01 java manyfeedServer


And finally, we hit 100% utilization on CPU 1 at 790,000 messages per second.



Here's the TL;DR of this blog post:

Some multi-threaded java applications apparently run faster on 1 core than on multiple cores.

Note the very non-committal phrasing in an attempt to make this a rather defensible statement. A graph showing the ManyFeed server's performance per number of cores:



So.. if you are a CPU or JVM guru and want to tell me your thoughts on what's going on, I'd love to hear it.

A long time ago, I benchmarked a bunch of different CPU types on contended memory stores. To my surprise, a little single-core Dothan processor did amazingly well in a few tests and I had no idea why. The discussion above is really the same idea. (By the way, I've since retested that benchmark on the Core i7 - and it has very different characteristics - probably a testament to the i7's new memory architecture.)

Note that if you have a pre-i7 Intel multi-core CPU, you can reproduce my "1 core runs faster" results using the code in that blog post (note: that code *is* a micro-benchmark, built to create extreme and unrealistic threading competition - for more details, read that blog post). This gives a clearer picture of what's going on. My theory is that with 2 cores and 2 threads, each thread gets its own core and every store operation competes for the memory bus. The threads spend lots of time competing and less time doing real work. On one core (still with 2 threads), only one thread runs at a time - so every time it tries to store to memory, it can.
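
If you don't want to dig through that old post, here's the flavor of the thing - a minimal sketch of contended static volatile stores (not the original benchmark code): a handful of threads all hammering the same static volatile field, so every store fights over the same cache line and memory bus.

// Minimal sketch of a contended static volatile store test - illustrative only,
// not the code from the earlier post. Run it with 1 and 2 threads, pinned to
// 1 or 2 cores via taskset, and compare.
public class VolatileStoreContention {
    static volatile long value;               // every thread writes this one field
    static final long STORES_PER_THREAD = 100000000L;

    public static void main(String[] args) throws InterruptedException {
        int threadCount = Integer.parseInt(args[0]);
        Thread[] workers = new Thread[threadCount];
        long start = System.nanoTime();
        for (int i = 0; i < threadCount; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    for (long j = 0; j < STORES_PER_THREAD; j++) {
                        value = j;            // volatile store: can't be hoisted into a register or coalesced away
                    }
                }
            });
            workers[i].start();
        }
        for (int i = 0; i < threadCount; i++) {
            workers[i].join();
        }
        long elapsedMs = (System.nanoTime() - start) / 1000000L;
        System.out.println(threadCount + " threads took " + elapsedMs + " ms");
    }
}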

The Core i7 is different (and maybe AMD CPUs too) - it doesn't suffer from this memory contention problem at all. (Although, given that my server still runs (way, way) better on 1 CPU even on the i7, maybe the contention is elsewhere in my server.)

Here's that benchmark running the Static Volatile Store test on a Core Duo CPU, once with 2 cores and again with 1 core. There's no performance loss on just one core; otherwise it's the exact same run.



(numbers along the bottom are the number of threads in that run - it's not until we have 2 threads that contention creeps in and hurts us).

Oh.. and how does my little nifty manyfeed server do on a real network? Testing on 1 CPU on my local 1Gbps LAN I get about 300,000 messages per second.

With some quick (and hopefully not silly) math, this looks about right.

1Gbps = 1,000,000,000 bits/sec = 125,000,000 bytes/sec

125,000,000 bytes/sec divided by 200 bytes/message = 625,000 messages/sec

Half of that bandwidth goes to the producer-to-server leg and half to the server-to-consumer leg = 312,500 messages/sec each way

On a real LAN, the server saturates bandwidth (i.e. becomes IO-bound) and that's no surprise (i.e. until I have access to 10Gbps networks, I don't need to speed up my server any more).

Interestingly, I tested this idea of "1-core'ing it" using Apache Tomcat and ApacheBench and saw no improvement. I also saw far fewer qps (loading a single, tiny web page over and over) and 100% core utilization even on multiple cores. My guess is that Tomcat isn't contention-bound but truly CPU-bound.


In other words, don't follow this path without testing it yourself. The good news is that it's extremely easy to try and you don't even have to change a line of code. Just try "1-core'ing" it.
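
One more convenience: if the process is already up and running, taskset can re-pin it in place by pid, so you don't even need to restart it:

taskset -p 0x01 <pid-of-the-server>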