Wednesday, June 2, 2010

How I sped up my server by a factor of 6

(with one linux command)

Subtitle: IO-bound and CPU-bound applications are common - here's maybe a "memory contention bound" app


I've written a lot of servers. If you've read "the architecture of mailinator", "benchmarking talkinator", or "blocking faster than non-blocking", you probably already know that.

I was recently working on a new server infrastructure I needed for a new product. The server has an internal architecture that's novel to me: it seeks to never voluntarily block threads or cause context switches, yet at the same time it is highly multi-threaded and creates new threads whenever it needs to.

The server's nature isn't totally important but you can think of it along the lines of a twitter server. It gets messages from one place, collates them, and sends them to (potentially many) other places. I suppose it could be used directly as a "twitter server" but of course, they already have those and my start up focus is rather orthogonal to that.

I usually write servers in Java on linux so keep that in mind for the ideas raised here. At Google, I worked on servers in C++ and Java. And especially after that experience, I'll stay with Java. To hopefully avoid language wars, the #1 reason I write in Java is that I'm way better at Java than I am at C++ or Ruby or Python or whatever. In other words, it was a practical (and definitely reasonable) choice in this case.

Also, the benchmarking here isn't contrived - I'll save the details, but I truly am asking the server to do precisely what it would do in the wild except at a higher rate.

Now, given the nature of the server, I needed two things to help benchmark it: a "producer" (a system that produces messages) and a "consumer" (a system that consumes them). The producer produces a sequenced message set, and the consumer verifies that it receives every message intact *and* in order.
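
A minimal sketch of what such a consumer-side check might look like (illustrative only, not the actual harness code - framing, sockets, and per-producer bookkeeping are omitted, and the names are invented):

class InOrderVerifier {
    private long expected = 0;

    // Assumes the producer writes a monotonically increasing sequence number
    // into the first 8 bytes of each ~100-byte message.
    void onMessage(java.nio.ByteBuffer msg) {
        long seq = msg.getLong(0);
        if (seq != expected) {
            throw new IllegalStateException(
                "lost or reordered message: expected " + expected + ", got " + seq);
        }
        expected++;
    }
}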

I chose a message size somewhat smaller than twitter's (msg size = 100 bytes) to induce a reasonable amount of activity in the server (the server's purpose is not necessarily large messages, and large messages tend to just start saturating bandwidth without causing contention or server CPU activity). My custom protocol adds a few bytes and TCP/IP adds roughly 40 bytes of headers - so overall I'm guessing that I'm averaging about 200 bytes per message counting everything.

Producer  ->  Server  ->  Consumer


The server system is pretty flexible and in effect, location transient. That is, a server process might live on one machine today and find itself on another machine tomorrow. Or a few new server processes may join the mesh. Needless to say, the idea is to create a scalable, flexible system.

One ramification of that is that it's possible and even very likely that some producers, servers, and consumers could be on the exact same machine. Socket communication over localhost changes a lot of assumptions. Firstly, TCP fusion can occur to reduce overhead significantly. Secondly, there is no effective bandwidth limitation - there isn't a real network involved, it's a virtual (and fast) one.

Running everything on the same machine, I would expect this benchmark to be CPU-bound, given that I/O is now virtual and effectively a CPU operation. Like I said, loopback is a real scenario I need, but I think I initially benchmarked on a single machine mostly as a matter of convenience.

So, with all 3 processes running on the same machine, I ran the test. The CPU was an Intel Core I7 920. It has 4 hyperthreaded cores that act like 8. (Note: I've tested all of this on a Core2Quad (non-I7) CPU and on Core Duos, and the results are effectively the same.)

Here's the image of htop during the test with all 3 processes running. (if you run linux and use top, upgrading to htop is highly recommended).



Don't get hung up on reading the numbers or text. The graph in the upper left gives you a sense of how busy the CPUs are. In this picture, they're all at least a little busy. That's no surprise given we're running many tens of threads.

Notice that none of the cores are "maxed". This benchmark showed the server receiving and re-sending about 120,000 messages per second (that's 120k from producer to server and the same 120k from server to consumer - so 240k "message transmissions" but only 120k messages - this would be analogous to queries-per-second for a webserver).

Why aren't the active cores maxed?

It occurred to me that I was running 3 CPU-bound processes on the same machine and that the processes might be stepping on each other's toes. It's possible that if the server is running on core 4 one second and the producer is running there the next, the level 1 cache of that core could be ruined for the server the next time it ambles over there.

The simple solution was to assign cores to the processes. In linux, you do this with the taskset command.

Taskset takes a bitmask of allowed CPUs - each bit corresponds to one logical CPU, so 0x3F (binary 00111111) covers the first six CPUs, 0x40 the seventh, and 0x80 the eighth. So I ran (in separate xterms):


taskset 0x3F java manyfeedServer
taskset 0x40 java theProducer
taskset 0x80 java theConsumer


and off they went. The server is highly multithreaded so I gave it 6 cores. The producer and consumer each got one.

The result? 270,000 messages/second! Wow. If my cache assumption was right (and I'm not claiming to know if it was) - it REALLY worked. One way or another though - something worked - we got better than a 2x speedup.

So you might be thinking the moral of this story is:

1) If you're on linux, install htop
2) If you're sharing a computer amongst CPU-bound processes, isolating the processes to their own cores might (and very well "might not") be beneficial.

And ok, those are fine ideas. But what bugged me was that htop showed that no CPU was maxed yet. Again, what was slowing my application down before it ran out of CPU power?

I then tried limiting the server to 2 (hyperthreaded) cores. (I also tried keeping the producer and consumer on the same hyperthreaded "core" and given that I had cores to spare, also tried separating them, but the result was the same).


taskset 0x03 java manyfeedServer


Now, we get 530,000 messages/second. Nice. Reducing the cores from 6 to 2 nearly doubles our msgs/sec again. Here's the htop now:



You can see that cores 1 & 2 are plenty busy. Cores 3,4,5,6 are idle as expected. Core 7 (the producer) is pretty busy and so is Core 8 (the consumer).

530k msgs/second is nothing to sneeze at but.. um.. again, no core is maxed. Why not? What's the bottleneck?

Obviously.. the last test is to throw the whole server on ONE cpu. Despite the fact that I very purposely and meticulously coded this server to be highly multi-threaded, fewer CPUs seem to make it happier. I am a conservative thread-safety-inducer. That is, I'm only really dangerous with firearms and synchronized blocks. But I'm by no means afraid to use the latter when needed.


taskset 0x01 java manyfeedServer


And finally, we hit 100% utilization on CPU 1 at 790,000 messages per second.



Here's the TL;DR of this blog post:

Some multi-threaded java applications apparently run faster on 1 core than on multiple cores.

Note the very non-committal phrasing in an attempt to make this a rather defensible statement. A graph showing the ManyFeed server's performance per number of cores:



So.. if you are a CPU or JVM guru and want to tell me your thoughts on what's going on, I'd love to hear it.

A long time ago, I benchmarked a bunch of different CPU types on doing contended memory stores. To my surprise, a little single-core Dothan processor did amazingly well in a few tests and I had no idea why. The discussion above is really the same idea. (I've retested that benchmark by the way on the Core I7 - and it has very different characteristics - probably a testament to the I7's new memory architecture).

Note that if you have a pre-I7 Intel multi-core cpu, you can reproduce my "1 core runs faster" results here using the code in that blog post (note: this code *is* a micro-benchmark, built to create a situation that causes extreme and unrealistic threading competition - for more details read that blog post). This gives a clearer picture of what's going on. My theory is that with 2 cores and 2 threads, each thread gets a core and every store operation competes for the memory bus in order to complete. Threads spend lots of time "competing" and less time doing real work. On one core (and 2 threads), only one thread runs at a time - so every time it tries to store to memory, it can.
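
The heart of that kind of micro-benchmark is nothing more than N threads hammering a single shared volatile field. Here's a minimal sketch of the idea (not the exact code from that post):

public class ContendedStore {
    static volatile long junk;   // the shared, contended field

    public static void main(String[] args) throws Exception {
        final int threads = args.length > 0 ? Integer.parseInt(args[0]) : 2;
        final long storesPerThread = 100000000L;
        Thread[] workers = new Thread[threads];
        long start = System.currentTimeMillis();
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    // every iteration is a volatile store to the same location
                    for (long j = 0; j < storesPerThread; j++) {
                        junk = j;
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        long elapsedMs = Math.max(1, System.currentTimeMillis() - start);
        System.out.println(threads + " threads: "
            + (threads * storesPerThread * 1000L / elapsedMs) + " stores/sec");
    }
}

Running it as "taskset 0x01 java ContendedStore 2" versus "taskset 0x03 java ContendedStore 2" is exactly the one-core-vs-two-core comparison in question.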

The Core-I7 is different (and maybe AMD cpus too) - it doesn't suffer from this memory contention problem at all. (Although, given that my server still runs (way, way) better on 1 cpu even on the I7, maybe the contention is elsewhere in my server.)

Here's that benchmark running for Static Volatile Store on a Core Duo CPU with 2 cores and again with 1 core. No performance loss on just one core, otherwise same exact run.



(numbers along the bottom are the number of threads in that run - it's not until we have 2 threads that contention creeps in and hurts us).

Oh.. and how does my little nifty manyfeed server do on a real network? Testing on 1 CPU on my local 1Gbps LAN I get about 300,000 messages per second.

With some quick (and hopefully not silly) math, this looks about right.

1Gbps = 1,000,000,000 bits/sec = 125,000,000 bytes/sec

125,000,000 bytes/sec divided by 200 bytes/message = 625,000 messages/sec

half of that for the producer's send (producer -> server) and half for the server's send (server -> consumer) = 312,500 messages/sec each

On a real LAN, the server saturates bandwidth (i.e. becomes IO-bound) and that's no surprise (so until I have access to 10Gbps networks, I don't need to speed up my server any more).

Interestingly, I tested this idea of "1-core'ing it" using Apache Tomcat and ApacheBench and saw no improvement. I also saw far fewer qps (loading a single, tiny web page over and over) and 100% core utilization even on multiple cores. My guess is that Tomcat isn't contention-bound but truly CPU-bound.


In other words, don't follow this path without testing it yourself. The good news is that it's extremely easy to try and you don't even have to change a line of code. Just try "1-core'ing" it.

55 comments:

sep332 said...

Aren't pairs of "logical" CPUs actually mapped to a single core? So running the producer & consumer on 7 & 8 means they're running on the same physical core, obviously competing for resources, and you'll never get 100% utilization on both at once. Same with running the multithreaded server on cores 1 & 2. Try using only odd-numbered or only even-numbered cores. E.g. server on 1 & 3, producer on 5, consumer on 7.

Chas Emerick said...

There's not a lot of info there to really know what's going on. Lock contention is a likely suspect, perhaps driven by GC. If you're really CPU-bound, you can peg 8 cores in a blink if you're building on stuff that has sane concurrency/parallelism semantics (e.g. use Clojure agents, or pmap, or similar).

If you're doing manual locking, you're just a little crazy these days given the better alternatives, IMO. Beyond that, peeking into what the GC is doing is worth a shot.

Tom said...

Interesting article! At the very least you got me using htop :)

Anonymous said...

Maybe you should check out Erlang for what you are doing?

http://erlang.org/

The telcos handle *millions* of concurrent processes with this language/vm.

matt said...

Multithreading has the overhead of switching between threads, which is quite expensive, so of course it is going to be slower. However, I bet if you leave it running on a bunch of cores it will scale much better.

matt said...

I also agree with using Erlang, even though I know you used Java because that's what you know :)

mikeh said...

Hi! I'd be really interested in seeing what happens if you eliminate hyperthreading from the mix by turning it off in the BIOS. It complicates things somewhat.

sep332 said...

Or you could just save yourself the confusion and just disable hyperthreading in the BIOS.

xsdg said...

Sort of echoing sep332's and mikeh's sentiments here. From my (incomplete) understanding, "hyperthreading" basically allows you to run overlapping bits of code on the same core. This can be a win if one bit of code is waiting on IO, or if you can use different units on the same core concurrently (arithmetic op + FLOP). If the pieces of code are doing similar or the same things, though, you're basically asking for contention.

Also, is it possible the latencies are asymmetric? Perhaps just run the simple benchmark pairwise on virtual cores 1, 2, and 3? (3! == 6 runs, total)

cpetit said...

Did you try to analyse and play with different GC strategies?

Anonymous said...

Actually Erlang is not very widely used by telcos. Ericsson uses other languages and tools.

One thing you should know if you program in Java is that unless you use the Solaris OS, Java threads are NOT native, i.e. they run on the same processor, not on different processors. That's one of the many shortcomings of the JVM.

Oh, and by the way, J2EE sucks huge donkey balls.

Jan Rychter said...

On the i7 once you limit yourself to one core you additionally gain from the core becoming faster — i7 will boost its speed if the other cores are idle.

It won't account for the entire difference you're seeing between 2 cores and 1 core, but it is significant.

SmarterAnonymous said...

Anonymous, was the last time you used Java in 1997? Most JVMs have native threads and have for over a decade.

Anonymous said...

Erlang was apparently dropped by Ericsson for a while (1998-2004), but then picked up again.



Java has been native threaded on Linux since 1.3 (May 2000).

Kristian Rosenvold said...

I have seen behaviour like this if your code basically locks all threads in hotspots. If you're starting a lot of threads and they basically just end up blocked/waiting you'll end up impeding total performance pretty quickly. Increasing performance with multithreading can be real hard.

You really *must* use a profiler to determine where you're blocking your threads. It's almost never where you think it is ;)

Eric said...

The problem is clearly **NOT** memory contention. The OS isn't aware of the time a thread spends waiting on physical memory, and so that time would end up counted as active CPU time, not idle time. (*Virtual* memory, if a hard page fault occurs, is another story.)

Best guess is that you have a lock contention issue. Even if your code isn't using locks explicitly, it's likely that the APIs you are using have locks under the covers, including the socket APIs for your server.

PeterH said...

I'm thinking you may be facing overhead of data being transferred between the L1 caches of cores used by different threads. If the server was written so that the thread that received a message handled sending it out you might avoid that overhead.

Parallax said...

I'd be interested to learn how your messaging server compares with http://www.rabbitmq.com/

-Tom

Paul Tyma said...

@sep332 - I mentioned above that I tested this on a non-hyperthreaded CoreQuad and the results were the same. But you're right, HT just clouded the issue - when I finally wrote the post it just happened to be the machine I was working on.

@chas - I have no need to optimize this, but you're right, peeking into the GC could yield some interesting info.

@ERLANGERs - As pointed out, nothing against Erlang. I have a project to finish and didn't want to be impeded by learning a new language.

@JanRychter - interesting. Did not know that.

@Eric - this is very insightful. I will be correcting the post (after a day of comment accumulation) to reflect your comment. Thanks.

@PeterH - as I said, I avoid unnecessary context switches. In the general case, I already do exactly what you suggest. In cases where there are high numbers of consumers, the primary thread will ask for help (but that did not occur in my testing above)

@Parallax - maybe I'll test that someday. Of course I looked at rabbitmq but I felt my needs were somewhat outside what RMQ accomplishes (then again, I may find myself wrong and wish I had used that :)

sep332 said...

@Paul Ah, thanks. I totally misread that graph for some reason.

loganb said...

Your issue is almost certainly not memory contention per se. A stalled load/store to memory surfaces itself as excessive CPU utilization. Conceptually, it is an assembly instruction that takes many hundreds of cycles to complete and individual instructions are not preempt-able. Thus, you only see <100% CPU usage if the process/thread is not actually runnable and instead the core is spinning in the OS's idle loop.

Thus, your threads must be blocking at a point where they explicitly yield to the OS, either when doing I/O or when waiting on a lock/synchronized block. Depending on the number of threads, it's possible the contention lies in the OS's I/O subsystem, but this seems unlikely. More likely is that when your app is running on many threads, there is contention for locks and threads are being put to sleep waiting for those locks. Then, possibly by fault of the OS, there is a lag time between when a lock is released by a thread and when those waiting for it are returned to the OS run queue.

In short, I think you need to dig into what locks you're taking and conditions you're signaling in the course of processing each message. That your CPUs are spending time idling indicates lock contention over memory bandwidth contention.

Anonymous said...

Wrt. the hyperthreading, it would be interesting to see how a bitmask of 01010101 performs, i.e. hiding hyperthreading from these apps and only allowing 4 real cores instead of 8 fake ones.

eyrieowl said...

Not the first person to suggest it, but it seems highly suggestive of context switching to me. This is a cost I think most people discount too easily. And it's not surprising to me that the fewer the cores, the higher the transaction rate, particularly if you have tens of threads. The way I see it, the more threads you have wanting to do work, the greater the likelihood that the next time your thread gets scheduled, it will end up on a 'random' core. However, if the number of cores is low, it improves the odds that it ends up being the same core it was running on before. Unless the rest of the threads are just hanging out twiddling their thumbs, they'll want some CPU time, and Java is not (yet) very good about trying to keep threads and data colocated. This is the Big Challenge for highly parallel systems, and it's why you see groups such as the one at Rice University working on the Habanero Project, where they are building core affinity into their language and VM. Particularly with Java, I think it becomes a challenge because even if you have only as many application threads as cores, you still have these other VM threads beyond your control which can knock one or more of your threads from their perch, which can cause everything to get thrown off so as to maintain some fairness. Java 7 is adding some NUMA logic to the GC to help optimize memory placement of objects, but I'm not sure what improvements there might be in thread placement/scheduling.

sandeep said...

Ever think of Scala? Akka actors + the JVM make me think it would be the perfect match for your use-case.

Lari Hotari said...

"Processor Adjacent Sector Prefetch" and "Processor Hardware Prefetcher" bios settings might also make a difference.

Quoting http://www.softpanorama.org/Internals/Performance_tuning/lecture02.shtml#Processor_performance_ :

"Both prefetch settings do decrease the miss rate for the L2/L3 cache when they are enabled but they consume bandwidth on the front-side bus which can reach capacity under heavy load. By disabling both prefetch settings, multi-core setups achieve generally higher performance and scalability."

AndyGlew said...

(1) Disable hyperthreading. Most supercomputers do.

(2) The comment that memory contention should still cause CPU utilization *should* be true. I can imagine ways in which it might break, but I don't think those brokenesses are in current kernels.

(3) I'm a CPU guy, so I would suggest using EMON - on Linux, accessible through tools such as lkml.org/lkml/2009/6/6/149, oprofile, perfctr and perfmon2.

I would want to compare number of cache misses (from all important cache levels: L1, L2 (MLC), L3 (LLC)), bus utilizations, etc.

E.g. if the "bus" (QPI on Nehalem) is above 80% utilized, then multiple processors probably won't help, and will probably hurt.

Paradoxically, if you have done a good job of parallelizing your code, making the threads NOT contend, then you may have shot yourself in the foot data-cache-wise - since it is possible that the footprint of 1 CPU's worth of threads fits in the cache, while the footprint of the larger number of concurrently running threads in a multiple-processor system does not. Your attempts to let each thread run as long as possible would exacerbate this.

Nimrod said...

Thanks for the interesting post and the insightful discussion.

Regarding hyperthreading - HT was specifically designed to maximize processor utilization when memory/cache cannot keep up with computation.

Whenever a thread stalls due to reading from memory (writes may be easier - due to write back queues), the processor automatically switches to another thread running on the 2nd virtual processor. So overall utilization of the single physical processor will be higher.

Frank said...

Why is the server multithreaded anyway? All it does is something like

Loop
If new message, get it and broadcast it

so if you spawn 20 of these threads they are all just trying to get the message, right?

Noel Grandin said...

You're obviously hitting some shared resource, which is triggering cache-line bouncing, and you're doing so little work outside of moving packets around, that the inter-CPU traffic which cache-line-sharing triggers is dominating the runtime.

Are you sure there are no shared resources? Maybe you're creating threads, in which case you're hitting java's thread create lock, or maybe you're synchronising on something?

I've found the new classes in java.util.concurrent to be extremely useful in writing good parallel code.

Anonymous said...

this is what happens when you post a highly subjective topic:

you get people who know more than you chiming in...

good one taskset though

OraclePerformance said...

The neat thing about messaging programs is that they are horizontally scalable. You can scale producers and consumers blindly.

What the benchmark is showing seems to be protocol overhead. If you want a really fast 1 producer to 1 consumer - just hand off without messaging. Otherwise - you now have evidence you have a protocol wall. Your next step is likely either to change the protocol or go into protocol analysis (which is a path to java api and jvm and truss/strace/dtrace hell).

You know how to validate this point - simply check the bare messaging without a payload.

Hiram Chirino said...

I ran into a similar scenario a few months ago when working on a messaging server. Cross-thread contention is amazingly difficult to reduce. I just posted Scaling Up With HawtDispatch, which outlines some of the techniques that HawtDispatch uses to help avoid the contention.

Regards,
Hiram Chirino

Zetan Drableg said...

Run 'kill -QUIT <pid>' to get a thread dump during the test. Look for threads in a BLOCKED state.
-zetan
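
(A minimal sketch of making that same check from inside the JVM with the standard ThreadMXBean API - illustrative only, not code from the manyfeed server; call it from a monitoring thread in the process under test:)

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class BlockedThreadReport {
    // Prints every thread currently blocked waiting to enter a synchronized
    // block, along with the lock it wants and the thread holding it.
    public static void report() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (ThreadInfo ti : mx.dumpAllThreads(true, true)) {
            if (ti.getThreadState() == Thread.State.BLOCKED) {
                System.out.println(ti.getThreadName() + " blocked on " + ti.getLockName()
                    + " held by " + ti.getLockOwnerName());
            }
        }
    }
}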

Paul Tyma said...

to @anonymous who said:
"this is what happens when you post a highly subjective topic:
you get people who know more than you chiming in..."

As I mentioned in the original post - that was sort of the entire point.

And these comments have been awesome. I plan on collecting them and trying a bunch of these ideas and report back.

Mattias said...

Posted this on the other benchmark blog post by mistake, should've been here!

a) There is a lot that can be going on behind the scenes in the JVM here to start with. Similar to your other benchmark, the JVM will know when it is running on a single-core CPU and can skip emitting lock prefixes on some instructions. It can even turn memory barriers into no-ops, etc. See:
http://www.infoq.com/articles/memory_barriers_jvm_concurrency

And according to Dave Dice (JVM guy at Sun):
"On MP IA32 and AMD64 systems we have to use locked atomic instructions (e.g., lock:cmpxchg) to acquire locks. On uniprocessors we can do without the lock: prefix. lock:ed instructions have considerable latency on some Intel processors (>500 cycles)."

So running on most x86 one-core machines (also when just allowing the process to use one core?) it will be a lot quicker to do synchronization.

b) Hyperthread cores are paired with a real core and usually provide faster context switching etc but that's all. No additional ALUs etc.

c) The cpu itself, such as:
c1) An i7 shares the rather large and quick L3 cache among cores. Running one core only you are effectively running with ONE very big L3 cache instead of it being split up between cores.

c2) Running on a few cores or just one core, the i7 cpu overclocks the used cores.

c3) The i7 runs synchronisation primitives a lot faster - LOCK CMPXCHG instructions in 60% of the time of the previous generation, for example.

Vincent said...

I believe Mattias has the right answer.

The LOCK prefix will certainly be used when synchronizing on a global data structure (probably a queue) on a multiprocessor machine.

The LOCK prefix ensures a CPU has ownership of the bus, which can flush the cache of *other* CPUs if I'm not mistaken.

Eliminating the LOCK prefix is *huge* for performance, especially something like this.

Christian said...

What happens if you run multiple independent copies of the server as separate Java processes on different CPUs? If your architecture supports it, this would make it much easier to scale and avoid all locks (as these are then handled at the application layer and not at the hardware/CPU layer).

raggi said...

On the order of 100s of MB per second, CPU architecture is so far from the problem it's not even funny. I'm amazed at some of the comments being made here. Discuss, read, and profile the software stack before you start making random wild guesses about hardware.

We can push >100MB/s using reputedly slow languages like ruby, sitting over 60% of their time in the GC scanning heap temporaries and the stack over and over and over, being so inefficient it's laughable. The CPU can do that. Java should be far far more efficient, and indeed in tests it's easily provable to be so.

The PEAK rate described here is in this range and is so much slower than memory bandwidths (at least one order of magnitude, if I remember my DDR2-based FSB speeds correctly).

The cache is irrelevant here.

On my slow-ass-by-comparison macbook here, lets do a quick test running off a volatile volume:


raggi@mbk: ~/volatile % time dd if=/dev/zero of=bigfile bs=1048576 count=500
500+0 records in
500+0 records out
524288000 bytes transferred in 1.551369 secs (337951845 bytes/sec)

real 0m1.556s
user 0m0.003s
sys 0m0.793s


Now, /dev/zero probably isn't all that efficient, who knows, I don't really care, lets see if cp(1) can do any better:


raggi@mbk: ~/volatile % time cp bigfile bigfile2

real 0m2.351s
user 0m0.003s
sys 0m1.136s


Meh. that's a bit faster. Parallel read and write, reading 500mb, writing 500mb, total aggregate data transfer of 1GB in 2.4s, or ~400MB/s.

Also, note the split between real, user and sys.

Have fun with comment speculation round 20 or whatever it is, and good luck with the product!

raggi said...

Oh, and further to my last, I should note that I'm now a couple of hundred megs into swap as a result of doing that. Go figure.

raggi said...

It also just occurred to me that some "bright spark" will chime in that I'm benching a single IO locked process in volatile memory with the prior examples I gave in comments, so to be sure I'm invoking at least two processes and the kernel, here's another 'raw-ish' test:


raggi@mbk: ~/volatile % time sh -c "cat bigfile | cat - > bigfile2"

real 0m2.846s
user 0m0.117s
sys 0m2.185s

Still well over 300MB/s and most of the CPU time is being spent in `diskimages-helper` (OSX SL). ciao.

bramcohen said...

The way Java encourages you to do thread interaction is truly awful, and you're probably suffering from following what the language is suggesting you do. You'd probably be better off removing all uses of the word 'synchronized' from your code and having threads interact with each other strictly through queues, with as few objects which multiple threads are stepping on as possible.
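
(A minimal sketch of that queue-based style - hypothetical names, not the manyfeed code: receiver threads only enqueue, a single dispatcher thread owns the consumers, and no application code uses synchronized at all:)

import java.io.OutputStream;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueHandoff {
    private final BlockingQueue<byte[]> inbound = new LinkedBlockingQueue<byte[]>();

    // Called by receiver threads; the unbounded queue always accepts.
    public void enqueue(byte[] msg) {
        inbound.offer(msg);
    }

    // The single dispatcher thread - the only code that touches the consumers.
    public void runDispatcher(Iterable<OutputStream> consumers) {
        try {
            while (true) {
                byte[] msg = inbound.take();
                for (OutputStream out : consumers) {
                    out.write(msg);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}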

Matt Williamson said...

Actually, Bram is recommending a fantastic idea.

ratchet freak said...

my guess is that the jvm knows that when it only has 1 core available, it doesn't need to skip the cache for volatile memory ops

as there isn't any other processor that needs an up to date value of it

(this is assuming the jvm knows how many cpu's it is scheduled for)

Peter Lawrey said...

If you have a program which should be single-threaded, but you make it multi-threaded, it shouldn't be surprising that forcing it to use fewer CPUs improves performance.

On my system, the "synchronized Static store" test scaled particularly badly. However, even the single threaded case (53K, the best for this test) performed far worse than the single threaded Static store. (1721K)

i.e. it should never have been made multi-threaded, even if it scaled linearly, you would have needed 35 threads to match the performance of a single thread of the simpler non-synchronized test.

My approach is to minimise the amount of interlock between threads; this minimises the dependency on how the hardware performs. IMHO this is not practical in all situations, but this approach can be used more often than most people think.

Peter Lawrey said...

BTW: The code would be simplified if Goer timed how long the test runs. If the test is run long enough, the difference should be small, but would save three lines of boilerplate code from every test. You can also make the test run for a fixed time rather than having a hand chosen looping factor.

In each test you set junk1 to be the counter, and in some tests you set junk2 to be the last value, but in others you appear to have a copy-and-paste error using a variable from a different test.

shenoyjoseph said...

nice trick

Likai Liu said...

I'm going to second Chas Emerick, who mentioned lock contention. However, the state-of-the-art garbage collector is very likely more scalable than your server architecture.

Maybe your server implements some sort of work stealing scheduler that isn't aggressive enough? Spending too much time in the critical section? Ever heard of Amdahl's Law?

Anonymous said...

I've seen many a Java app where I constrain it to one or two CPUs and see an improvement in actual work done. It takes a bit of tuning of JVM threads at times, and sometimes a tradeoff between using more CPUs with fewer JVMs versus constraining every JVM to a single CPU, just to reduce the care, feeding, and memory overhead. It also depends on the exact version of Java you are using. This used to be FAR worse with Sun Java 1.3, but 1.4 and higher really made it much less of a factor.

My theory on why less is more is lock contention, just like several comments have already mentioned. Having lots of threads that can compete for the same resources at the same time makes for fights. When forced to run on a single CPU, only one JVM thread ever runs at any instant, since all threads have to run on the same core, and hence there is never a fight for resources.

With that being said, watch out for singleton areas and avoid mutexes as much as possible. Try turning off logging and see how much of a factor logging overhead is. Try running with all the CPUs enabled but reduce the number of JVM threads to be equal to the number of cores plus some fudge factor to account for housekeeping (2 times the core count is usually about the max). Also, there are different lock managers that can be used that may not be as fair as the standard one but don't have its overhead.

Anonymous said...

I have another idea that might be fun to try and speed up Mailinator

Replace the "aging hash map" you use for your spam filter with a time-based, counting bloom filter.

http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/
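
(A minimal sketch of the time-bucketed idea - the "counting" part and proper hashing are simplified away, and this is obviously not Mailinator's actual code. A ring of small Bloom filters; entries fade out as the oldest slice is cleared:)

import java.util.BitSet;

public class TimeBucketedBloomFilter {
    private final BitSet[] buckets;   // one Bloom bit-array per time slice
    private final int bits;           // bits per bucket
    private final int hashes;         // probes per key
    private final long sliceMillis;   // width of one time slice
    private long currentSlice;        // slice index we last wrote to

    public TimeBucketedBloomFilter(int slices, int bits, int hashes, long windowMillis) {
        this.buckets = new BitSet[slices];
        for (int i = 0; i < slices; i++) buckets[i] = new BitSet(bits);
        this.bits = bits;
        this.hashes = hashes;
        this.sliceMillis = Math.max(1, windowMillis / slices);
        this.currentSlice = System.currentTimeMillis() / sliceMillis;
    }

    public synchronized void add(String key) {
        advance();
        BitSet current = buckets[(int) (currentSlice % buckets.length)];
        for (int i = 0; i < hashes; i++) current.set(indexFor(key, i));
    }

    public synchronized boolean mightContain(String key) {
        advance();
        for (BitSet b : buckets) {
            boolean hit = true;
            for (int i = 0; i < hashes; i++) {
                if (!b.get(indexFor(key, i))) { hit = false; break; }
            }
            if (hit) return true;   // present in at least one live slice
        }
        return false;
    }

    // Clear any slices we've rotated past since the last call (at most one ring).
    private void advance() {
        long now = System.currentTimeMillis() / sliceMillis;
        for (long s = Math.max(currentSlice + 1, now - buckets.length + 1); s <= now; s++) {
            buckets[(int) (s % buckets.length)].clear();
        }
        if (now > currentSlice) currentSlice = now;
    }

    // Double hashing: derive the i-th probe from two hashes of the key.
    private int indexFor(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1;
        return Math.abs((h1 + i * h2) % bits);
    }
}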

paradoxo42 said...

Great.

Emerson Almeida said...

I am from Brasil. Very good.

Rodrigo L. said...

Very practical and simple.

Jonadab said...

Do your threads have to talk to each other much? Fundamentally, any time two threads (or processes, or programs, or computers, or networks) have to communicate with one another, at least one of them can end up sitting around twiddling its thumbs waiting to hear from another one.

Even if the communication is taking place over the loopback interface, it still has to take place, and therefore things can be I/O bound. Unless your threads can consistently keep doing stuff while they wait to hear from one another, you're going to have the potential for I/O bottlenecks to prevent full CPU utilization.

This is interesting and possibly worth being aware of, but you really only need to do anything about it if you're going to be deploying the code in a way that introduces higher latency (e.g., if some of the threads could end up on a different computer from one another, especially if they could end up with slower-than-LAN links to one another).

Matt Kemp said...

From my understanding (and what I've empirically observed), synchronized in Java causes an explicit cache flush at the end of the synchronized block. Depending on the processor, this most likely flushes caches on multiple cores, which would then cause them to re-fetch data from memory, slowing down their processing speed. I'm not 100% sure why the single core is so much faster, but it probably has to do with either JVM optimizations when running on a single core or more efficient OS scheduling of fetching data for threads (since all threads are bound to a single core). As a related point, binding specific Java threads to cores can drastically improve those threads' performance.

Andy "Krazy" Glew said...

" synchronized in Java causes an explicit cache flush at the end of the synchronized block. "

Not true - at least not on Intel and AMD x86es, which I worked on.

I don't think it's true on Sun/Oracle SPARCs either. But I have worked on machine proposals, never built, where it might be true. And it might be true on non-cache-coherent MP systems, which are uncommon.

What synchronized does is that the synchronized block is surrounded by a lock acquisition and release. Depending on the machine, this can be quite expensive. On modern Intel machines, however, it is almost free unless there is lock contention.

Oh, yes: finally, given hardware TM, which is just starting, the moral equivalent of the cache flush might happen. If they use the design I did 7 years ago for Intel, doing too many transactions back to back may incur a delay while the cache is scanned to flip bits. Not actually a flush, but still possibly a scan. However, this is a "too many back to back" thing. If fairly well spaced, or if Intel has provided enough versions, no problem. Or if Intel did a flash clear.