Saturday, May 5, 2007

Nonblocking-Reader/Blocking-Writer in Java

So it's Saturday. I should be off like chasing women or motorcycle racing or something, but instead I'm at home obsessing over non-blocking threadsafe algorithms.

(yeah. again.)

In particular, I had an email exchange with (my ex-Ph.D.advisor) Doug Lea regarding the java.util.concurrent.lock's ReentrantReadWriteLock (which Doug also happen to be the author of). The beauty of this class is that it provides a solution to the classic Reader/Writer problem. That is, it allows many threads interesting reading something to operate simultaneously (as reading does not conflict with reading). However, it also allows you a writer-lock that when switched on, blocks all readers and other writers. This follows the assumption that writing could corrupt simultaneous readers or writers.

This wasn't very easy in Java prior to JDK 1.5.

Our discussion was over a microbenchmark I wrote to prove to myself how great it was that I could have a flock of readers all banging on a map (or something) simultaneously. I wrote (or borrowed) a standard regularly synchronized map wrapper:

public synchronized V put(K key, V value) {
  return map.put(key, value);
}

public synchronized V get(Object key) {
  return map.get(key);
}


That version will have readers waiting for both writers and other readers. And one that took advantage of the new Reader/Writer lock:


ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
Lock readLock = lock.readLock();
Lock writeLock = lock.writeLock();

public V put(K key, V value) {
  writeLock.lock();
  try {
    return map.put(key, value);
  } finally {
    writeLock.unlock();
  }
}

public V get(Object key) {
  readLock.lock();
  try {
    return map.get(key);
  } finally {
    readLock.unlock();
  }
}


Here, readers will not wait for other readers, however they will wait for writers (and writers wait for everyone).

This example does do better, but maybe not as "better" as I would think. Pounding the code with 50 simultaneous readers runs about 6.5 times faster (i.e., I got 6.5 times as many operations in a 1 second interval) than the fully synchronized version.

Interestingly, with 5 readers and 20 writers the synched version does about 10000 writes and 1200 reads. The reader/writer version does 300 writes and 10000 reads (sort of the opposite). In other words, the writers (probably expectedly) really get hurt in the reader/writer configuration. (If you're complaining microbenchmarks suck, you can stop. Highly contended parallel access to a map is pretty real - and if nothing else, I was really just trying to make threads fight, and this seemed like a reasonable way to do it).

Doug quickly pointed out that the reader/writer lock has a fair bit of overhead as compared to a simple synchronized method and that the map operations are so fast, they simply don't hold the lock long enough to cause much contention.

I probably should have quit there and played some more desktop tower defense, but I tried something else. Of course it seems that (in true double-checked locking form) if a reader can know that no writers are about, then there's no need for it to do any locking at all. Thus I played around awhile with a non-blocking threadsafe reader. This didn't go so well as I found counterexamples to every version of my code right as I typed the final semi-colon.

Then last nite I woke up at 3am (yeah, I know) and wrote this:


AtomicInteger writeCounter = new AtomicInteger();

public V get(Object key) {
  int save = 0;
  V value = null;
  do {
    while (((save = writeCounter.get()) & 1) == 1);
    value = map.get(key);
  } while (save != writeCounter.get());
  return value;
}

Lock lock = new ReentrantLock();

public V put(K key, V value) {
  lock.lock();
  try {
    writeCounter.getAndIncrement();
    map.put(key, value);
    writeCounter.getAndIncrement();
  } finally {
    lock.unlock();
  }
  return value;
}


The write lock used in the put method is pretty standard. Only one writer can be active at any time. The reader code (i.e., the get method) is the tricky part. Effectively, a reader won't start while it knows that a writer is active. And, once its done it checks again to see if a writer is now active or was while it was running. If so, it throws away its result and tries again. In theory, a reader might reread the same value 1000 (or infinite) times before its satisfied that the data is good. In practice, it only happens a very few times.

In threadsafe (blocking) code we're used to coding such that all important code can only happen when its safe. Nonblocking code works more like a "teenager model" - it says "Let's try it and then when we're done, we'll see if what we did was ok. If not, we'll try it again until it is ok."

If the very busy looking do or while loops seem odd to you, they're quite normal in non-blocking threadsafe code. See Brian Goetz's great introduction to nonblocking algorithms. In the right circumstance (as Brian is happy to point out), nonblocking code can smoke blocking code even if it ends up busy looping awhile. And of course, still be 100% correct. Of course, the latter statement is often difficult to make happen.

I love this kind of all hell's breaking loose code. I created some tests to try to expose a race condition and so far it looks good. I ran this idea by Jeremy Manson ("Dr. Memory Model") and he confirmed for me that the use of atomic ints above will cause memory consistency (keep in mind, Jeremy is not a big fan of lock-free algorithms it seems). My empirical tests verified it so far - I can't make this code race (I can easily make races by changing just tiny little bits). I'm going to play with some benchmarks and see what I can drum up.

(Edit: Honestly, if you're looking for a non-blocking hashtable the above is thought-provoking but not terribly clever. You'd be wise to check out Cliff Click's Nonblocking Hashtable implementation. Its more comprehensive, proven, and is very likely faster.)

16 comments:

Markus Kohler said...

Hi,
You might be interested in Cliff Click's
non blocking hashtable .
Source code is now available.

Regards,
Markus Kohler

Alex said...

So how does your version of the put/get compare performance-wise to the simple synchronized() way?

Logan Bowers said...

I recall something about writes on unprotected variables potentially being partial. That is, if a non-volatile variable is concurrently updated, a reader thread may see neither the initial value nor the updated value but instead something intermediate.

Although I can't imagine a hardware implementation that would cause it, it seems plausible that a reference may be partially written and thus invalid. If that occurs, the JVM is probably within its rights to do all sorts of unhappy things (like throw an unexpected Exception or Error).

In order for a try-and-see approach to work the JVM must guarantee consistency of all references at the very least.

Paul Tyma said...

markus:
I've seen Cliff's hashtable - in fact I attended his talk at Google. I've followed Cliff's work for years, it was a treat to finally meet him and his hashtable is excellent - it does things far smarter than my code above.

alex:
I have done only informal performance tests and plan on doing further testing. As expected, the results are fantastic (which is why I want to be very careful about having good results). This isn't really that amazing however - nonblocking algorithms do tend to destroy blocking algorithms. But - this isn't so cleancut - cpu utilization spikes too.

Logan:
I've been very worried about the Java memory model's consistency issues (note this code wouldnt even compile in JDK1.4 but even if it did, it probably then wouldnt work). Thats why I asked Jeremy Manson who is truly an expert in this area. My understanding from my discussion with him is that the operations on the atomicInt provide the memory barriers I require.

martti vh said...

What was that z-variable all about? It disappeared when I refreshed the page. Damn, I really could have used those extra points... :) Did it have something to do with memory consistency?

Paul Tyma said...

martti: Damn, you caught me. My thought was that the assignment to value in the get is loop-invariant. That is, a compiler (or runtime) might realize that its silly to keep doing that calculation within the loop body when it doesn't look like it will change. And, as an optimization, move the statement out of the loop (breaking the whole point of the atomicInteger as a guard)

To counter that I added a spurious check in the while loop to see if value was then equal to some public static variable (which it theoretically never could be which is ok - but the compiler (hopefully) wouldnt know that). Note that some decompilers do something similar.

Anyway, I got the idea from Jeremy that because of the atomicInt usage, the compiler wouldn't do the loop-invariant code motion I feared. My empirical tests verified this.

Anonymous said...

I may be wrong or missing something, but if there is more than one field write as a consequence of calling map.put(), or if the field write is not atomic, then you are not guaranteed a consistent view of the map during the read operation - the map.get() - as the JLS doesn't guarantee atomic application or view of fields that are not read in the scope of the same lock, in other words, your reader threads could see the state of the map in an incorrect state.

While I would probably bet that map is probably not written to suffer from that problem, it seems you have a problem in theory, anyway.

I think this is probably what logan was referring to.

To reiterate - it seems that you have the proper barriers to prevent threads from accessing critical sections of code at the incorrect times, you haven't addressed the memory model and the possibility of a partial update having occurred.

Isn't this why ultimately the double-checked lock idiom failed?

Jeremy Manson said...

I am a big fan of lock-free algorithms. I'm not a big fan of people who don't know how to write lock-free algorithms trying to write them.

I am also not a big fan of using them when you don't have to use them. I have seen a few prematurely optimized attempts at lock-free algorithms that are... well, let's just say "sad".

And if you keep calling me Dr Memory Model, you are going to force me to call you Dr Optimizing Transforms.

Iceman said...

Correct me if I am wrong...

It seems you have thread barriers, but not memory barriers. You could end up reading partial flushes because you are not using any kind of construct that would ensure a coherent read of the map when doing map.get() (according to the JLS, synchronized doesn't just prevent more than one thread from entering critical sections, it also ensures that all mutations made under the same lock are visible by other threads acquiring that lock.)

HashMap itself may not have a problem, (I don't know what it's implementation is internally) but if you tried to do what you are doing at a general level, multiple field writes made under the write lock could be read in non-coherent states by the reader threads.

Sudarshan said...
This comment has been removed by the author.
Cliff Click said...

Re: Partial writes on non-volatile variables: you are guaranteed to never see a broken/partial ref (breaks type-safety, it might be mistaken for something real). In practice only misaligned longs & doubles on X86 get split this way.

Cliff

Iceman said...

I am not referring to a partial ref or write. I mean updating more than one field at a time. You are not guaranteed to see a consistent view of more than one write unless you have a memory barrier. Synchronized provides the memory barrier, but I don't see how this code does.

Iceman said...

After reviewing http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/atomic/package-summary.html
and http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html, I've convinced myself that there are in fact the appropriate memory barriers provided by AtomicInteger.

eyrieowl said...

:) i have done a ReadWriteMap impl as well. Several. One is a wrapper, like synchronized map, and one is an extension with the RWLock built in. Some other cool things you could consider are to enable the collections (don't just stop with Map) to take an optional RWLock argument in the constructor so that you can have a series of data structures inherently locked with the same lock (can be helpful if you need to lock and update multiple structures at once).

i'm surprised you got the results you did for the write lock, though. i get much more efficient writes with a concurrent lock than with a synchronized block in my microbenchmarks. :) at this point, i don't even touch synchronized collections anymore unless absolutely necessary.

the other fun item of note, if you are interested, is the memory consumption. if you have lots of maps, synchronized is, of course, best, RWMap next, and ConcurrentHashMap last. now, one way around that, to have concurrency (if you want that sort of thing) is to do an adaptive concurrent map which resizes its segments if it notices that it's being utilized by many threads (ideally, concurrently). i've also done one of those which i've found a few choice places to use. i'll have to check out cliff's map, though, i'm curious to see what its memory profile is like. (as you may guess, the system i'm working on has many many many maps it maintains, so the size of individual maps is important...hence my using prime sized maps instead of ^2, among other things).

now, question...you might actually know this b/c you seem to know the right people (how come I don't know the right people? :/ i really need to get back to school at the right school.... ) i'm really interested in a non-blocking LRU map...know of one?

Raphfrk said...

A bit late, but is this still considered valid?

If the writeCounter is even, that just guarantees that the the lock has been released at some point, but not that it hasn't been locked again.

There is a relationship from write(odd) to read() == odd that guarantees that the read has happened after the write.

However, if the read() == even, it doesn't guarantee that a write(odd) hasn't happened by another thread.

Paul Tyma said...

It's been awhile since writing this but let me think about it.

>If the writeCounter is even, that
>just guarantees that the the lock >has been released at some point, >but not that it hasn't been locked >again.

Very true. But the read doesn't really care about the lock. It cares about the counter.

Even if we race and a writer grabs the lock but doesn't increment the counter yet, a read can startup, check the counter, do the read, and get out.

>There is a relationship from
>write(odd) to read() == odd that
>guarantees that the read has
>happened after the write.

Yes, more correctly there is a relationship between a changed counter and a read. So if a writer makes it odd - or makes it odd then even again (by adding 2) - the read gets upset.

The read code basically says:

I won't start if the counter is odd. And I won't call any result valid if the counter changed while I was reading.

It's imprecise but in the conservative/safety direction.


>However, if the read() == even, it >doesn't guarantee that a >write(odd) hasn't happened by >another thread.

No. The variable is Atomic (i.e. volatile).

Again, if a write (in another thread) changes the counter at all AFTER a read has started (verifying it's odd) then the read will know about it. And call the result it got invalid. And try again.