Read-copy update, or RCU, is a technique for implementing efficient read-mostly concurrent data structures on cache-coherent multiprocessor machines. RCU was originally developed for the DYNIX/ptx operating system, and then found its way into IBM’s K42 research operating system for cache-coherent multiprocessors. RCU is now most widely deployed in Linux, and one of RCU’s developers, Paul E. McKenney, works on RCU APIs for the Linux kernel. The RCU approach has grown into a larger project called Relativistic Programming.
A concurrent data structure is designed to operate safely under concurrent access. RCU is one of the best concurrent data structure techniques there is, because RCU uses cache lines and modern memory systems so wisely. As a result, RCU read-side code often performs as fast as unsafe, unsynchronized code, a wonderful property that many concurrent data structures do not share. But it’s not exactly easy to understand why RCU achieves this, or why other concurrent data structures can fall short.
RCU distinguishes between read-side code and write-side code. Read-side code does not modify the data structure; write-side code does. (Data structure operations are classified as either readers or writers, and we sometimes use this terminology for short, but it’s important to note that writers can contain read-side code. For example, a writer operation that removes an item from a tree may first execute read-side code to find the item.)
Synchronization is required to avoid conflicts between write-side code and either read-side or write-side code. RCU is especially concerned with efficient management of read–write conflicts. To manage write–write conflicts, most RCU data structures use regular locking.
Read-side concurrent code can run as fast as unsynchronized code only when it uses memory in the same way as unsynchronized code. This means that:
Read-side code cannot use other expensive instructions, such as memory fences.
These requirements basically eliminate read/write locking from consideration. In the simplest read/write locks, which maintain a precise count of the number of concurrent readers, read-side operations must perform atomic writes to shared data structures. Faster read/write locks are possible, for example by allocating a separate “private” cache line per processor for the lock, but are hard to program without wasting memory space and/or time.
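To make the cost of a precise reader count concrete, here is a minimal sketch of such a lock in C++. The class name `counting_rwlock`, its methods, and the `demo_rwlock` helper are all hypothetical and exist only for illustration; a real lock would also need fairness and backoff. The thing to notice is that `read_lock` and `read_unlock` each perform an atomic read-modify-write on a cache line shared by every reader.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical minimal reader/writer spinlock (illustrative names only).
// state_ > 0: number of active readers; state_ == -1: a writer holds the lock.
class counting_rwlock {
    std::atomic<int> state_{0};
  public:
    void read_lock() {
        int s = state_.load(std::memory_order_relaxed);
        for (;;) {
            if (s < 0) {  // a writer is active: reload and retry
                s = state_.load(std::memory_order_relaxed);
                continue;
            }
            // Atomic write to a shared cache line, even though the caller
            // only wants to read the protected data.
            if (state_.compare_exchange_weak(s, s + 1,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed))
                return;
        }
    }
    void read_unlock() {
        state_.fetch_sub(1, std::memory_order_release);  // second shared write
    }
    void write_lock() {
        int expected = 0;
        while (!state_.compare_exchange_weak(expected, -1,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed))
            expected = 0;  // lock was busy: wait for state_ == 0 again
    }
    void write_unlock() {
        state_.store(0, std::memory_order_release);
    }
};

// One writer increments a counter `increments` times while `nreaders`
// readers repeatedly read it; returns the final counter value.
long demo_rwlock(int nreaders, int increments) {
    counting_rwlock lock;
    long counter = 0;
    std::vector<std::thread> threads;
    threads.emplace_back([&] {
        for (int i = 0; i < increments; ++i) {
            lock.write_lock();
            ++counter;
            lock.write_unlock();
        }
    });
    for (int r = 0; r < nreaders; ++r)
        threads.emplace_back([&] {
            for (int i = 0; i < increments; ++i) {
                lock.read_lock();
                long seen = counter;  // read-only critical section
                (void) seen;
                lock.read_unlock();
            }
        });
    for (auto& t : threads)
        t.join();
    return counter;
}
```

Calling `demo_rwlock(2, 1000)` runs two readers against one writer. The lock is correct, but every reader bounces the lock's cache line between cores, which is exactly the behavior read-side code must avoid to match unsynchronized speed.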
It is natural to imagine that multiprocessor machines are serializable (or, equivalently, sequentially consistent). This would mean that any set of threads executing on a multiprocessor would always have the same effect as some interleaving of those threads running on a uniprocessor. For example, threads A and B running concurrently, like so,
A1 B1
A2 B2
would produce the same effect as one of the six interleavings that respect program order:
A1 A2 B1 B2
A1 B1 A2 B2
A1 B1 B2 A2
B1 A1 A2 B2
B1 A1 B2 A2
B1 B2 A1 A2
But modern machines just don’t work this way, because enforcing serializability is incredibly slow. Instead, caches and store buffers generally delay when a processor’s writes become visible to all other processors. This delay cannot be modeled by interleaving: the program above might produce some weird result.
But how does this affect us C and C++ programmers?
Modern high-level languages include memory models that define how different threads’ assignment statements and variable references can interact. Memory models constrain how compilers and interpreters can choose instructions. For instance, if language X has a sequentially consistent memory model, then X’s compiler must insert expensive fences around most shared memory accesses. The goal is to ensure that any execution of the compiled program would be allowed by the memory model.
Well, modern C and C++ define a relaxed memory model that gives compilers maximum flexibility. This means in practice that C and C++ expose all the weirdness of modern machines’ memory systems, plus even more, since C and C++ compilers can introduce weirdnesses of their own!
Many published concurrent data structure algorithms assume a serializable memory model. That is, the algorithms will work as presented only on serializable machines, or languages with sequentially consistent memory models. But such machines don’t really exist any more, and such languages are slow!
Does this matter in practice? Well, consider the following operations, executing on two x86 cores:
A1. mov $1, x B1. mov $2, y
A2. mov x, %eax B2. mov x, %eax
A3. mov y, %ebx B3. mov y, %ebx
There are 20 interleavings of these instructions, but those interleavings can produce only 3 distinct outcomes:
Operations x y A: %eax %ebx B: %eax %ebx
1. A1 A2 A3 B1 B2 B3 1 2 1 0 1 2
2. A1 A2 B1 A3 B2 B3 1 2 1 2 1 2
3. A1 A2 B1 B2 A3 B3 1 2 1 2 1 2
4. A1 A2 B1 B2 B3 A3 1 2 1 2 1 2
5. A1 B1 A2 A3 B2 B3 1 2 1 2 1 2
6. A1 B1 A2 B2 A3 B3 1 2 1 2 1 2
7. A1 B1 A2 B2 B3 A3 1 2 1 2 1 2
8. A1 B1 B2 A2 A3 B3 1 2 1 2 1 2
9. A1 B1 B2 A2 B3 A3 1 2 1 2 1 2
10. A1 B1 B2 B3 A2 A3 1 2 1 2 1 2
11. B1 A1 A2 A3 B2 B3 1 2 1 2 1 2
12. B1 A1 A2 B2 A3 B3 1 2 1 2 1 2
13. B1 A1 A2 B2 B3 A3 1 2 1 2 1 2
14. B1 A1 B2 A2 A3 B3 1 2 1 2 1 2
15. B1 A1 B2 A2 B3 A3 1 2 1 2 1 2
16. B1 A1 B2 B3 A2 A3 1 2 1 2 1 2
17. B1 B2 A1 A2 A3 B3 1 2 1 2 0 2
18. B1 B2 A1 A2 B3 A3 1 2 1 2 0 2
19. B1 B2 A1 B3 A2 A3 1 2 1 2 0 2
20. B1 B2 B3 A1 A2 A3 1 2 1 2 0 2
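The table can be reproduced mechanically. The following sketch enumerates every interleaving of the two three-instruction threads, executes each one against a single shared memory (that is, under sequential consistency), and collects the distinct register outcomes. The names `outcome`, `run_interleaving`, `enumerate`, and `sc_outcomes` are my own, and the code assumes x and y both start at 0.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Final register values: {A.eax, A.ebx, B.eax, B.ebx}.
using outcome = std::array<int, 4>;

// Execute one interleaving against a single shared memory. `order` is a
// string of 'A' and 'B' saying which thread runs its next instruction.
outcome run_interleaving(const std::string& order) {
    int x = 0, y = 0;
    int pc_a = 0, pc_b = 0;
    outcome r{0, 0, 0, 0};
    for (char t : order) {
        if (t == 'A') {
            switch (pc_a++) {
            case 0: x = 1; break;     // A1. mov $1, x
            case 1: r[0] = x; break;  // A2. mov x, %eax
            case 2: r[1] = y; break;  // A3. mov y, %ebx
            }
        } else {
            switch (pc_b++) {
            case 0: y = 2; break;     // B1. mov $2, y
            case 1: r[2] = x; break;  // B2. mov x, %eax
            case 2: r[3] = y; break;  // B3. mov y, %ebx
            }
        }
    }
    return r;
}

// Append to `out` every order with `a_left` more A steps and `b_left` more
// B steps, each extending `prefix`.
void enumerate(std::string prefix, int a_left, int b_left,
               std::vector<std::string>& out) {
    if (a_left == 0 && b_left == 0) {
        out.push_back(prefix);
        return;
    }
    if (a_left > 0)
        enumerate(prefix + 'A', a_left - 1, b_left, out);
    if (b_left > 0)
        enumerate(prefix + 'B', a_left, b_left - 1, out);
}

// Distinct outcomes over all interleavings; optionally reports how many
// interleavings were tried.
std::set<outcome> sc_outcomes(std::size_t* n_interleavings = nullptr) {
    std::vector<std::string> orders;
    enumerate("", 3, 3, orders);
    if (n_interleavings)
        *n_interleavings = orders.size();
    std::set<outcome> result;
    for (const auto& o : orders)
        result.insert(run_interleaving(o));
    return result;
}
```

Running `sc_outcomes` visits 20 interleavings and yields three outcomes, matching rows 1, 2–16, and 17–20 of the table.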
That’s on a sequentially consistent x86. But what about in reality? On x86-class machines, all processors eventually agree on the order that stores commit (a property called “Total Store Order”). But there is one important exception: a processor sees its own stores before those stores become globally visible to other processors. This exception means that x86 allows a different outcome, one that cannot be achieved by any serial order. Specifically:
xx. .. .. .. .. .. .. 1 2 1 0 0 2
So x86 store forwarding allows A to see its assignment to x before B’s assignment to y, while B sees the assignments happening in the other order. But in any sequential order, either A’s assignment happens first or B’s does.
Note that this weird outcome happens on x86, whose memory model is relatively easy to understand. Architectures like Power and ARM have even more relaxed semantics: one core’s stores can become visible in different orders to different cores!
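For C++ programmers, the same two-thread program can be written with `std::atomic`. The sketch below uses the default `memory_order_seq_cst` ordering, under which the C++ memory model forbids the weird outcome (A reads y == 0 while B reads x == 0). On x86, compilers typically pay for this by turning each sequentially consistent store into a fenced or locked instruction. The names `litmus_result`, `run_litmus_seq_cst`, and `count_weird_outcomes` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct litmus_result {
    int a_eax, a_ebx, b_eax, b_ebx;
};

// One run of the two-thread program with sequentially consistent atomics.
litmus_result run_litmus_seq_cst() {
    std::atomic<int> x{0}, y{0};
    litmus_result r{};
    std::thread a([&] {
        x.store(1);          // A1 (seq_cst by default)
        r.a_eax = x.load();  // A2
        r.a_ebx = y.load();  // A3
    });
    std::thread b([&] {
        y.store(2);          // B1
        r.b_eax = x.load();  // B2
        r.b_ebx = y.load();  // B3
    });
    a.join();
    b.join();
    return r;
}

// Count how many of `runs` trials show the non-serializable outcome.
int count_weird_outcomes(int runs) {
    int weird = 0;
    for (int i = 0; i < runs; ++i) {
        litmus_result r = run_litmus_seq_cst();
        if (r.a_ebx == 0 && r.b_eax == 0)
            ++weird;
    }
    return weird;
}
```

With `memory_order_relaxed` on every access, the language would permit the weird outcome, and plain x86 `mov` instructions could produce it. Whether a given run actually exhibits it depends on timing, so a relaxed variant is useful for exploration but not for assertions.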
OK, relaxed memory models can’t be avoided. But serializability still seems much easier to understand. Many researchers and implementers have therefore believed that concurrent data structures must always provide serializable semantics: that non-serializable semantics is somehow “wrong.” Their goal is to implement high-level operations, like hash table put and get, or balanced tree operations, that have serializable effects. The implementations for these operations might require expensive instructions, but hopefully these can be minimized.
But the de facto standard correctness condition for concurrent data structures is actually even stronger than serializability. It’s called linearizability. Linearizability distinguishes overlapping operations and non-overlapping operations: a linearizable implementation always has the same results as some interleaving, but that interleaving can’t reorder non-overlapping operations. (Specifically, if concurrent threads A and B perform operations α and β, respectively, and α returned before β began, then the result must be the same as some interleaving where α precedes β. This is stricter than serializability, which would accept an interleaving with β before α.)
Unfortunately, linearizable algorithms require expensive instructions, such as memory fences, as a recent paper called “Laws of Order” proves [1]. Expensive instructions are not required in all paths, and commutative functions can avoid them (read the paper or McKenney’s gloss for more [1]), but in practice even read-side code can be required to contain expensive instructions in linearizable implementations.
Fine, but how expensive are those instructions?
Consider a hash table that implements get and put operations, where the keys are integers between 0 and 2^20 - 1, and the values are 32-bit integers. We can implement this “hash table” very simply using an array:
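#include <stddef.h>
#include <stdint.h>

class hash {
    static const size_t size = 1 << 20;   // one slot per possible key
    volatile uint32_t _data[size];        // volatile: every access touches memory
  public:
    // Unsynchronized read of slot i.
    uint32_t get(int i) const {
        return _data[i];
    }
    // Unsynchronized write of slot i.
    void put(int i, uint32_t x) {
        _data[i] = x;
    }
};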
I exercised this unsynchronized hash table using a workload with two threads, each running 2^28 operations, 75% get and 25% put. (The machine is a lightly-loaded 4-core 2.66GHz Intel Core i5 with 256KB L2 cache per core; the compiler is gcc -O3.) It took 1.12s to complete, 0.436s of which is test overhead; so the hash table cost is roughly 2.54ns/operation.
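A harness along the following lines could run a similar measurement. It is a sketch, not the harness behind the numbers above: the key generator (a simple linear congruential step), the fixed three-gets-then-one-put pattern, and the names `relaxed_hash`, `bench_result`, and `run_benchmark` are all assumptions. The table is a stand-in that uses relaxed atomics, which avoid a formal C++ data race while still compiling to plain loads and stores on x86.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <thread>
#include <vector>

// Stand-in for the array-backed table, using relaxed atomic accesses.
class relaxed_hash {
    static const std::size_t size = 1 << 20;
    std::atomic<uint32_t> _data[size];
  public:
    uint32_t get(int i) const {
        return _data[i].load(std::memory_order_relaxed);
    }
    void put(int i, uint32_t x) {
        _data[i].store(x, std::memory_order_relaxed);
    }
};

struct bench_result {
    double seconds;     // wall-clock time for the whole run
    uint64_t gets;      // total get operations across threads
    uint64_t puts;      // total put operations across threads
    uint32_t checksum;  // keeps the loaded values live
};

// Each thread performs `ops_per_thread` operations on pseudo-random 20-bit
// keys: three gets, then one put (75% / 25%).
bench_result run_benchmark(int nthreads, uint64_t ops_per_thread) {
    auto table = std::make_unique<relaxed_hash>();  // zero-initialized
    std::vector<uint64_t> gets(nthreads, 0), puts(nthreads, 0);
    std::vector<uint32_t> sums(nthreads, 0);
    std::vector<std::thread> threads;
    auto start = std::chrono::steady_clock::now();
    for (int t = 0; t < nthreads; ++t)
        threads.emplace_back([&, t] {
            uint32_t state = 12345u + t, sum = 0;
            uint64_t g = 0, p = 0;
            for (uint64_t n = 0; n < ops_per_thread; ++n) {
                state = state * 1664525u + 1013904223u;   // LCG step
                int key = static_cast<int>(state >> 12);  // top 20 bits
                if (n % 4 == 3) {
                    table->put(key, state);
                    ++p;
                } else {
                    sum += table->get(key);
                    ++g;
                }
            }
            gets[t] = g;
            puts[t] = p;
            sums[t] = sum;
        });
    for (auto& th : threads)
        th.join();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    bench_result r{elapsed.count(), 0, 0, 0};
    for (int t = 0; t < nthreads; ++t) {
        r.gets += gets[t];
        r.puts += puts[t];
        r.checksum += sums[t];
    }
    return r;
}
```

Dividing `seconds` by `ops_per_thread` gives a per-operation cost comparable in spirit to the figures in the text, though it still includes loop and key-generation overhead that would need to be measured and subtracted separately.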
But this unsynchronized hash table is not linearizable. The compiler is free to reorder operations, and so is the memory model. So let’s try a version where put has a compiler fence, to prevent compiler reordering:
void put(int i, uint32_t x) {
    _data[i] = x;
    asm("" : : : "memory");   // compiler-only barrier: emits no instruction
}
The hash table cost is roughly the same: 2.59ns/operation.
But this isn’t linearizable either! The x86 memory model is unacceptable for a linearizable hash table, since linearizability requires that all processors see the effects of all writer operations in the exact same order. To make the store globally visible, we need a fence instruction:
void put(int i, uint32_t x) {
    _data[i] = x;
    asm("mfence" : : : "memory");   // hardware fence plus compiler barrier
}
The cost per operation is more than 6x higher at 16.97ns/operation.
On x86 this suffices to make the hash table linearizable. (The x86-TSO model, a simple operational semantics for x86 memory based on extensive experiments with many processor models, makes this clear [2]: the put fence will evict all other cores’ versions of the modified cache line, so later gets will return the most up-to-date data.)
On other machines, however, a get operation might return cached data, older than a previous nonoverlapping put, even with the barrier. (For an example of why this can happen, see the section on “invalidate queues” in Paul McKenney’s “Memory Barriers: A Hardware View for Software Hackers” [3].) On such architectures (I believe ARM and POWER among them), truly linearizable code would require a fence in get operations too, something like:
uint32_t get(int i) const {
    asm("mfence" : : : "memory");   // fence before the load (x86 spelling)
    return _data[i];
}
For fun I measured the x86 version of this; the additional fences were even more expensive, at 28.15ns/operation—more than 10x worse than the compiler-fence version.
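The inline assembly above is specific to x86 and gcc. A portable way to place the same fences is `std::atomic_thread_fence`, sketched below; the class name `fenced_hash` is hypothetical. On x86, compilers typically lower a `memory_order_seq_cst` thread fence to a full-barrier instruction such as `mfence`, and `std::atomic_signal_fence` is the closest standard analogue of the compiler-only `asm("" : : : "memory")` barrier. This sketch mirrors where the fences go; it is not a proof that the result is linearizable under the C++ memory model.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical fenced variant of the array-backed table.
class fenced_hash {
    static const std::size_t size = 1 << 20;
    std::atomic<uint32_t> _data[size];
  public:
    uint32_t get(int i) const {
        // Full fence before the load, as in the fenced get above.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        return _data[i].load(std::memory_order_relaxed);
    }
    void put(int i, uint32_t x) {
        _data[i].store(x, std::memory_order_relaxed);
        // Full fence after the store, as in the mfence version of put.
        std::atomic_thread_fence(std::memory_order_seq_cst);
    }
};
```

A table this large (4MB) is best allocated on the heap, for example with `std::make_unique<fenced_hash>()`, which also zero-initializes the slots.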
This is a very aggressive example. Operations more complex than get and put would partially hide the cost of fences, and more complex write-side code than put usually requires expensive instructions to provide writer–writer locking. But the point should be clear: linearizability can be expensive.
The matter would end there if linearizability mattered. But does it really matter? In particular, does it matter in an operating system context?
To make this concrete, consider two threads, A and B, that each call read on a shared file descriptor. Say that A makes its call before B, like so:
(1) A: call read()
=> A returns
B: call read()
=> B returns
Then linearizability requires A’s call to happen before B’s. For instance, if the fd has exactly one byte of data left, then A has to get the byte. On the other hand, if the calls looked like this—overlapped:
(2) A: call read()
B: call read()
=> A returns
=> B returns
then either A or B would be allowed to get the byte.
But wait a minute. In example (1) we, as omniscient observers, can tell that A returned before B began. But can the program tell this? Well, imagine that immediately after the kernel returns to A, processor A took a network interrupt. This interrupt could delay A’s first post-return instruction until later, like so:
(3) A: call read()
=> A set up for return
NETWORK INTERRUPT
B: call read()
NETWORK INTERRUPT DONE
-> A executes post-return instruction
=> B returns
Now, look what we’ve done. By adding a network interrupt, we have made execution (1) look exactly like execution (2). This shows that from the point of view of the process, it may be impossible to distinguish execution (1) from execution (2). And this, in turn, means that the linearizability requirement is often overkill. Why enforce an expensive correctness condition that distinguishes overlapping operations from non-overlapping operations when the program can’t tell whether operations overlapped?
In practice, what we want is something like linearizability, but not as fussy. If operation α happens just a tiny bit before operation β on a different thread, then we probably don’t care about their order (because the program often couldn’t distinguish their order). But if α happened a while before β we definitely shouldn’t reorder them. A similar argument applies to more than two threads: it often doesn’t matter if threads A and B disagree on the order in which two events happened, as long as those events happened very close together. Unfortunately, this fuzzy linearizability notion isn’t (to my knowledge) well defined or studied, and most people just stick with normal linearizability, semi-pointless and expensive though it may be.
Read-copy update steers between these rocks with a well-chosen and efficient design. We study RCU both to understand the concurrency issues and to learn the highest-performance concurrent programming technique I know of.
Here’s one view of RCU’s basic ideas:
There are many RCU APIs; here are the basics.
rcu_assign_pointer(p, x): “publish”. Like p = x, but ensures that any read-side code that observes the assignment (that sees x) will observe all prior assignments.
rcu_dereference(p): Like *p, but coordinates with rcu_assign_pointer. If *p returns a value assigned by rcu_assign_pointer, then later memory accesses (with or without rcu_dereference) will see any changes preceding the rcu_assign_pointer.
rcu_read_lock(), rcu_read_unlock(): delimit blocks of read-side code.
synchronize_rcu(): wait-for-readers barrier. Waits at least until all concurrent readers have completed. Variants of synchronize_rcu exist that batch reclamation operations and avoid blocking write-side code.
mb(): memory barrier (not RCU specific).
What these might compile to on x86 with a non-preemptive kernel:
rcu_assign_pointer(p, x) = p = x
rcu_dereference(p) = *p
rcu_read_lock() = rcu_read_unlock() = NOTHING (on a preemptive kernel, these operations might disable preemption; there are other choices)
synchronize_rcu() = wait for all CPUs but this one to schedule a new task
It is often possible to avoid calling synchronize_rcu in the write-side code and instead delegate the post-synchronize operation to a separate “thread.” This is particularly appropriate if the only thing following synchronize is memory reclamation.
Note that all read-side operations compile to the fastest possible code: it is indistinguishable from unsynchronized code. That’s why RCU wins.
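Outside the kernel, the publish and dereference halves of this API can be approximated with C++ atomics. The sketch below is a user-space analogy, not the Linux implementation: a release store stands in for rcu_assign_pointer, an acquire load stands in for rcu_dereference, and a regular mutex handles write–write conflicts. The names `node`, `rcu_style_list`, and `demo_publish` are hypothetical. Nodes are never removed while readers run, so the sketch sidesteps synchronize_rcu entirely and frees memory only in the destructor.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

struct node {
    int key;
    int value;
    node* next;  // never modified after the node is published
};

class rcu_style_list {
    std::atomic<node*> head_{nullptr};
    std::mutex write_mutex_;  // writers exclude each other with a normal lock
  public:
    // Write side: fully initialize the node, then publish it
    // (release store, standing in for rcu_assign_pointer).
    void push_front(int key, int value) {
        std::lock_guard<std::mutex> guard(write_mutex_);
        node* n = new node{key, value, head_.load(std::memory_order_relaxed)};
        head_.store(n, std::memory_order_release);
    }
    // Read side: no locks and no writes to shared memory
    // (acquire load, standing in for rcu_dereference).
    bool lookup(int key, int& value_out) const {
        for (node* n = head_.load(std::memory_order_acquire); n; n = n->next)
            if (n->key == key) {
                value_out = n->value;
                return true;
            }
        return false;
    }
    // Reclamation happens only here, after all readers are gone.
    ~rcu_style_list() {
        node* n = head_.load(std::memory_order_relaxed);
        while (n) {
            node* next = n->next;
            delete n;
            n = next;
        }
    }
};

// A writer publishes keys 0..nkeys-1 (value = 2 * key) while a reader looks
// them up concurrently. Returns the number of inconsistencies observed.
int demo_publish(int nkeys) {
    rcu_style_list list;
    std::atomic<int> bad{0};
    std::atomic<bool> done{false};
    std::thread reader([&] {
        while (!done.load(std::memory_order_acquire))
            for (int k = 0; k < nkeys; ++k) {
                int v = 0;
                if (list.lookup(k, v) && v != 2 * k)
                    ++bad;  // saw a node before its fields were initialized
            }
    });
    std::thread writer([&] {
        for (int k = 0; k < nkeys; ++k)
            list.push_front(k, 2 * k);
        done.store(true, std::memory_order_release);
    });
    writer.join();
    reader.join();
    for (int k = 0; k < nkeys; ++k) {
        int v = 0;
        if (!list.lookup(k, v) || v != 2 * k)
            ++bad;  // every key must be visible once the writer is done
    }
    return bad.load();
}
```

On x86 both the release store and the acquire load compile to ordinary `mov` instructions, which matches the "= p = x" and "= *p" entries above. Supporting removal would require a wait-for-readers step before freeing a node, which is the job synchronize_rcu does in the kernel.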
Not all data structures are amenable to the constrained access order reasoning that underlies RCU. Linked lists are a perfect match; balanced trees, for example, are harder. I believe part of the real future of parallel programming is developing RCU techniques for advanced data structures. This will require creative use of the RCU API, as well as additions to the API. That’s why the hash table paper [4] is exciting.
RCU-like “garbage collection” techniques have been implemented several times over the years (see [5] for local interest). RCU still feels new, however. My best guess for why: RCU first recognized the design point that combines access-order reasoning with loose memory models.
[1] Hagit Attiya, Rachid Guerraoui, Danny Hendler, Petr Kuznetsov, Maged M. Michael, and Martin Vechev, “Laws of Order: Expensive Synchronization in Concurrent Algorithms Cannot be Eliminated”, in Proc. POPL ’11, Jan. 2011 (via ACM Digital Library; see also Paul McKenney’s gloss on the paper)
[2] Scott Owens, Susmit Sarkar, and Peter Sewell, “A better x86 memory model: x86-TSO (extended version)”, Technical Report UCAM-CL-TR-745, 2010 (versions also published in TPHOLs 2009 and Communications of the ACM 53(7), May 2010) (authors’ tech report version; authors’ project page)
[3] Paul E. McKenney, “Memory Barriers: A Hardware View for Software Hackers”, 2010
[4] Josh Triplett, Paul E. McKenney, and Jonathan Walpole, “Resizable, Scalable, Concurrent Hash Tables via Relativistic Programming”, in Proc. USENIX ATC 2011, June 2011
[5] H. T. Kung and Philip L. Lehman, “Concurrent manipulation of binary search trees”, ACM Transactions on Database Systems 5(3), Sept. 1980 (via ACM Digital Library)