This is a question I've wondered awhile. And I've gone ahead and made a stab at finding out. One question that I had was I had always heard that AMD chips have an on-cpu memory controller and Intel's don't (at least not until they're upcoming Nehalem line). Now I'm first to admit I'm talking above my knowledge (please correct profusely).
But the idea was that some operations (notably I've heard volatile reads which I didnt test here) are very similar regardless of volatility.
I spent a few hours making up a benchmark. I'm publishing the benchmark below for one important reason - to get it right. This is big disclaimer: Don't trust these results - at least with any precision. I do conclude that contended synchronization is expensive - thats a pretty expected result so hopefully we're somewhere on the right track.
However, Java benchmarks are very hard to right. So hard in fact that they do a hardness-overflow and end up looking easy - which is precisely what makes them so hard to get right. Hence - given that only I have seen this benchmark, its pretty likely to be broken some (and hence why its published here for public review to help fix).
With all that...
What I've tested here are storage operations. Specifically:
Storage to a local variable (mostly used as a control)
Storage to an instance variable
Storage to a static variable
Storage to a static volatile variable
Storage to a static AtomicLong variable
Storage to a static variable protected by syncrhonization
I tested this on 3 different CPUs, all with JDK1.6 and variously on Linux and Windows. One very important point is do *not* compare the absolute numbers in the graphs directly. In no case are comparing the same CPU across 2 operating systems. The CPUs are wildly different in capability.
So we can explain more with a graph:
This is the run on a dual-core dual-Opteron machine. Note the blue-bar is local variable storage. As you'd expect it simply destroys everything else. This is very probably because the smart JVM went in and realized it could optimize the assignment out of the loop. Note the blue-bar just gets happier and happier as we add cores. The threads are completely unaffected by each other and we get actual linear increases in power. Even at 5 threads (note we have 4 cores) we get some speedup.
The orange and yellow bars are instance variable and static variable stores respectively. Some level of indirection or inability to optimize jumps in and makes things a bit more sane. Surprising we get little or no improvement as we add threads.
The three thread-visible methods of store are so slow they barely show up. This brings up an important point, let's assume you wrote an application and made it correctly thread safe. Using synchronization or volatiles or whatever. You *cannot* optimize away synchronization in the name of performance. It simply can't be done. (conversely, if you could then it wasn't correctly thread-safe to start). You can however, at times, replace one type of synchronization primitive with another.
That being said, lets take a closer look at the same graph with just the barrier memory store operations.
Volatile and Atomic are neck-and-neck. Synchronization is still a big loser, especially after we contend giving it two threads.
One other anecdote - while running the non-memory-barrier benchmarks, my CPU meter showed 100% user space usage. With these benchmarks however, it went to 30-40% kernel cpu usage.
Ok, jumping to the Core2Quad Extreme CPU. Definitely a faster processor but with a different memory architecture.
Local store again goes flying. Oddly the instance store doesn't do much but the static store increases nicely with the cores. And, once again, we can't even fricken see the barriered writes.. so here they are.
Look how great the JVM does knowing that just one thread is out there. It can really optimize the code to elide or at least marginalize the impact. Surprisingly, synchronization still gets nailed across the board.
Note that according to what I have so far, a synchronized static store is something 50 times slower than a simple static store. And 10 or so times slower than a volatile static store.
One more.. and its a fun one. Windows XP running on a Dothan. (Funny, I realized I hadn't directly known that Dothan was single core, but I just assumed it after I saw the graphs.)
Crazy. Add a few threads and performance doesn't even budge. Of course, with only one core a system can completely avoid all the complex architecture keeping cores in sync. Why its local writes are worse as compared to instance beats me.
Also, although I said don't compare graphs, this little single core Dothan beats the Core2Quad in all barrier-ed writes after 2 threads. Note that on the local writes, the Core2Quad is something like 50 time faster. But even on the simple static volatile store - at 2 threads, the Dothan is now twice as fast.
So. Pardon my informalness here - I'm actually quite expecting feedback to break this code in grand ways and force me to redo all the runs and rewrite all the text (which I'll be happy to do if we get a clean bench out of this). I have no intention of making very specific claims once this benchmark is firmed up - but I would like to have a "sense" of the costs of these different operations.
Source code for the Benchmark.