Notes on Read-Copy Update

Read-copy update, or RCU, is a technique for implementing efficient read-mostly concurrent data structures on cache-coherent multiprocessor machines. RCU was originally developed for the DYNIX/ptx operating system, and then found its way into IBM’s K42 research operating system for cache-coherent multiprocessors. RCU is now most widely deployed in Linux, where one of RCU’s developers, Paul E. McKenney, works on the kernel’s RCU APIs. The RCU approach has grown into a larger project called Relativistic Programming.

A concurrent data structure is designed to operate safely under concurrent access. RCU is one of the best concurrent data structure techniques there is, because RCU uses cache lines and modern memory systems so wisely. As a result, RCU read-side code often performs as fast as unsafe, unsynchronized code—a wonderful property that many concurrent data structures do not share. But it’s not exactly easy to understand why, or why other concurrent data structures fall short.

Read-side and write-side

RCU distinguishes between read-side code and write-side code. Read-side code does not modify the data structure; write-side code does. (Data structure operations are classified as either readers or writers, and we sometimes use this terminology for short, but it’s important to note that writers can contain read-side code. For example, a writer operation that removes an item from a tree may first execute read-side code to find the item.)

Synchronization is required to avoid conflicts between write-side code and either read-side or write-side code. RCU is especially concerned with efficient management of read–write conflicts. To manage write–write conflicts, most RCU data structures use regular locking.

Cache makes read/write locking expensive

Read-side concurrent code can run as fast as unsynchronized code only when it uses memory in the same way as unsynchronized code. This means that read-side code must not write to shared cache lines (shared writes force cache lines to bounce from processor to processor), and must not execute expensive synchronization instructions, such as atomic read-modify-writes or memory fences.

These requirements basically eliminate read/write locking from consideration. In the simplest read/write locks, which maintain a precise count of the number of concurrent readers, read-side operations must perform atomic writes to shared data structures. Faster read/write locks are possible, for example by allocating a separate “private” cache line per processor for the lock, but are hard to program without wasting memory space and/or time.
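
To make the tradeoff concrete, here is a rough sketch of such a per-processor read/write lock (my own illustration, not any particular implementation; the name brlock, the NCPU constant, and the 64-byte cache-line assumption are all mine). A reader’s fast path touches only its own processor’s cache line, but the lock costs a cache line per processor and the writer must scan every one of them:

#include <atomic>
#include <thread>

struct brlock {
    static const int NCPU = 64;                // illustrative processor count
    struct alignas(64) slot {                  // one cache line per processor
        std::atomic<int> readers{0};
    };
    slot slots[NCPU];
    std::atomic<bool> writer{false};

    void read_lock(int cpu) {
        while (true) {
            ++slots[cpu].readers;              // touches only this CPU's line
            if (!writer)
                return;                        // fast path: no writer active
            --slots[cpu].readers;              // writer active: back out, wait
            while (writer)
                std::this_thread::yield();
        }
    }
    void read_unlock(int cpu) {
        --slots[cpu].readers;
    }
    void write_lock() {
        bool expected = false;                 // exclude other writers first
        while (!writer.compare_exchange_weak(expected, true))
            expected = false;
        for (int i = 0; i != NCPU; ++i)        // then wait out every reader
            while (slots[i].readers != 0)
                std::this_thread::yield();
    }
    void write_unlock() {
        writer = false;
    }
};

Even this fast read path performs an atomic read-modify-write (with the default sequentially consistent atomics, a fully fenced one); it merely avoids bouncing a single shared counter between processors, at the price of NCPU cache lines of space and an O(NCPU) write_lock. RCU read-side code avoids even that.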

Weak memory models make serializability expensive

It is natural to imagine that multiprocessor machines are serializable (or, equivalently, sequentially consistent). This would mean that any set of threads executing on a multiprocessor would always have the same effect as some interleaving of those threads running on a uniprocessor. For example, threads A and B running concurrently, like so,

A1    B1
A2    B2

would produce the same effect as one of the six interleavings that respect program order:

  1. A1 A2 B1 B2
  2. A1 B1 A2 B2
  3. A1 B1 B2 A2
  4. B1 A1 A2 B2
  5. B1 A1 B2 A2
  6. B1 B2 A1 A2

But modern machines just don’t work this way, because enforcing serializability is incredibly slow. Instead, caches and store buffers generally delay when a processor’s writes become visible to all other processors. This delay cannot be modeled by interleaving: the program above might produce a result that no interleaving of A and B could produce.

But how does this affect us C and C++ programmers?

Modern high level languages include memory models that define how different threads’ assignment statements and variable references can interact. Memory models constrain how compilers and interpreters can choose instructions. For instance, if language X has a sequentially consistent memory model, then X’s compiler must insert expensive fences around most shared memory accesses. The goal is to ensure that any execution of the compiled program would be allowed by the memory model.

Well, modern C and C++ define a relaxed memory model that gives compilers maximum flexibility. This means in practice that C and C++ expose all the weirdness of modern machines’ memory systems, plus even more, since C and C++ compilers can introduce weirdnesses of their own!
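
To see what that flexibility buys, compare what a typical x86-64 compiler emits for a plain store, a relaxed atomic store, and a sequentially consistent atomic store. (The assembly in the comments is representative output, not a quote from any particular compiler.)

#include <atomic>

int plain_x;
std::atomic<int> atomic_x;

void store_plain() {
    plain_x = 1;
    // typically:  movl $1, plain_x(%rip)
}
void store_relaxed() {
    atomic_x.store(1, std::memory_order_relaxed);
    // typically:  movl $1, atomic_x(%rip)      (same as the plain store)
}
void store_seq_cst() {
    atomic_x.store(1, std::memory_order_seq_cst);
    // typically:  movl $1, %eax
    //             xchgl %eax, atomic_x(%rip)   (locked exchange: a full fence)
    // or a plain movl followed by mfence, depending on the compiler
}

Loads go the other way on x86: even a sequentially consistent load compiles to a plain movl, so the fence cost lands on the store side. On Power or ARM, sequentially consistent loads pay for barriers too.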

Many published concurrent data structure algorithms assume a serializable memory model. That is, the algorithms will work as presented only on serializable machines, or languages with sequentially consistent memory models. But such machines don’t really exist any more, and such languages are slow!

Serializability example

Does this matter in practice? Well, consider the following operations, executing on two x86 cores (x and y both start out zero):

A1. mov $1, x           B1. mov $2, y
A2. mov x, %eax         B2. mov x, %eax
A3. mov y, %ebx         B3. mov y, %ebx

There are 20 interleavings of these instructions, but those interleavings can produce only 3 distinct outcomes:

    Operations           x    y   A: %eax %ebx  B: %eax %ebx
 1. A1 A2 A3 B1 B2 B3    1    2        1    0        1    2

 2. A1 A2 B1 A3 B2 B3    1    2        1    2        1    2
 3. A1 A2 B1 B2 A3 B3    1    2        1    2        1    2
 4. A1 A2 B1 B2 B3 A3    1    2        1    2        1    2
 5. A1 B1 A2 A3 B2 B3    1    2        1    2        1    2
 6. A1 B1 A2 B2 A3 B3    1    2        1    2        1    2
 7. A1 B1 A2 B2 B3 A3    1    2        1    2        1    2
 8. A1 B1 B2 A2 A3 B3    1    2        1    2        1    2
 9. A1 B1 B2 A2 B3 A3    1    2        1    2        1    2
10. A1 B1 B2 B3 A2 A3    1    2        1    2        1    2
11. B1 A1 A2 A3 B2 B3    1    2        1    2        1    2
12. B1 A1 A2 B2 A3 B3    1    2        1    2        1    2
13. B1 A1 A2 B2 B3 A3    1    2        1    2        1    2
14. B1 A1 B2 A2 A3 B3    1    2        1    2        1    2
15. B1 A1 B2 A2 B3 A3    1    2        1    2        1    2
16. B1 A1 B2 B3 A2 A3    1    2        1    2        1    2

17. B1 B2 A1 A2 A3 B3    1    2        1    2        0    2
18. B1 B2 A1 A2 B3 A3    1    2        1    2        0    2
19. B1 B2 A1 B3 A2 A3    1    2        1    2        0    2
20. B1 B2 B3 A1 A2 A3    1    2        1    2        0    2

That’s on a sequentially consistent x86. But what about in reality? On x86-class machines, all processors eventually agree on the order that stores commit (a property called “Total Store Order”). But there is one important exception: a processor sees its own stores before those stores become globally visible to other processors. This exception means that x86 allows a different outcome, one that cannot be achieved by any serial order. Specifically:

xx. .. .. .. .. .. ..    1    2        1    0        0    2

So x86 store forwarding allows A to see its assignment to x before B’s assignment to y, while B sees the assignments happening in the other order. But in any sequential order, either A’s assignment happens first or B’s does.

Note that this weird outcome happens on x86, whose memory model is relatively easy to understand. Architectures like Power and ARM have even more relaxed semantics: one core’s stores can become visible in different orders to different cores!
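
If you want to observe the non-serializable outcome yourself, the same store-buffering pattern can be written with C++ atomics. The harness below is my own sketch, not part of these notes: with relaxed operations (which compile to plain movs on x86), the flagged outcome typically appears after enough iterations on a multicore x86 machine; change every memory_order_relaxed to memory_order_seq_cst and it cannot appear.

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int a_eax, a_ebx, b_eax, b_ebx;

void run_a() {                                    // A1, A2, A3
    x.store(1, std::memory_order_relaxed);
    a_eax = x.load(std::memory_order_relaxed);
    a_ebx = y.load(std::memory_order_relaxed);
}

void run_b() {                                    // B1, B2, B3
    y.store(2, std::memory_order_relaxed);
    b_eax = x.load(std::memory_order_relaxed);
    b_ebx = y.load(std::memory_order_relaxed);
}

int main() {
    for (int i = 0; i != 100000; ++i) {
        x = 0;
        y = 0;
        std::thread ta(run_a), tb(run_b);
        ta.join();
        tb.join();
        if (a_eax == 1 && a_ebx == 0 && b_eax == 0 && b_ebx == 2)
            std::printf("non-serializable outcome at iteration %d\n", i);
    }
}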

Linearizability

OK, relaxed memory models can’t be avoided. But serializability still seems much easier to understand. Many researchers and implementers have therefore believed that concurrent data structures must always provide serializable semantics: that non-serializable semantics is somehow “wrong.” Their goal is to implement high-level operations, like hash table put and get, or balanced tree operations, that have serializable effects. The implementations for these operations might require expensive instructions, but hopefully these can be minimized.

But the de facto standard correctness condition for concurrent data structures is actually even stronger than serializability. It’s called linearizability. Linearizability distinguishes overlapping operations and non-overlapping operations: a linearizable implementation always has the same results as some interleaving, but that interleaving can’t reorder non-overlapping operations. (Specifically, if concurrent threads A and B perform operations α and β, respectively, and α returned before β began, then the result must be the same as some interleaving where α precedes β. This is stricter than serializability, which would accept an interleaving with β before α.)

Unfortunately, linearizable algorithms require expensive instructions, such as memory fences, as a recent paper called “Laws of Order” proves [1]. Expensive instructions are not required in all paths, and commutative functions can avoid them (read the paper or McKenney’s gloss for more [1]), but in practice even read-side code can be required to contain expensive instructions in linearizable implementations.

The cost of linearizability

Fine, but how expensive are those instructions?

Consider a hash table that implements get and put operations, where the keys are integers between 0 and 2^20 - 1, and the values are 32-bit integers. We can implement this “hash table” very simply using an array:

#include <cstddef>
#include <cstdint>

class hash {
    static const size_t size = 1 << 20;
    volatile uint32_t _data[size];   // one slot per key, so no collisions
  public:
    uint32_t get(int i) const {
        return _data[i];
    }
    void put(int i, uint32_t x) {
        _data[i] = x;
    }
};

I exercised this unsynchronized hash table using a workload with two threads, each running 2^28 operations, 75% get and 25% put. (The machine is a lightly-loaded 4-core 2.66GHz Intel Core i5 with 256KB L2 cache per core; the compiler gcc -O3.) It took 1.12s to complete, 0.436s of which is test overhead; so the hash table cost is roughly 2.54ns/operation ((1.12 - 0.436)s divided by the 2^28 operations per thread).
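
For reference, the driver looked roughly like the following sketch. This is a reconstruction under assumptions, not the original harness: the real random-number generator, key distribution, and the separately timed overhead loop are not shown, and worker’s inline LCG is just a stand-in.

#include <cstdint>
#include <thread>

hash table;                                      // the hash class defined above

void worker(uint32_t seed) {
    uint32_t r = seed;
    for (uint32_t n = 0; n != (1u << 28); ++n) {
        r = r * 1664525 + 1013904223;            // simple LCG stand-in for the RNG
        int key = (r >> 8) & ((1 << 20) - 1);
        if ((r >> 30) != 0)                      // ~75% of operations are gets
            (void) table.get(key);
        else                                     // ~25% are puts
            table.put(key, n);
    }
}

int main() {
    std::thread a(worker, 1), b(worker, 2);      // two threads, 2^28 ops each
    a.join();
    b.join();
}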

But this unsynchronized hash table is not linearizable. The compiler is free to reorder operations, and so is the hardware. So let’s try a version where put has a compiler fence, to prevent compiler reordering:

    void put(int i, uint32_t x) {
        _data[i] = x;
        asm("" : : : "memory");
    }

The hash table cost is roughly the same: 2.59ns/operation.

But this isn’t linearizable either! The x86 memory model on its own is not enough for a linearizable hash table: linearizability requires that all processors see the effects of all writer operations in the same order, and that a put’s effect be visible to every processor by the time the put returns. To push the store out of the store buffer and make it globally visible, we need a fence instruction:

    void put(int i, uint32_t x) {
        _data[i] = x;
        asm("mfence" : : : "memory");
    }

The cost per operation is more than 6x higher at 16.97ns/operation.

On x86 this suffices to make the hash table linearizable. (The x86-TSO model, a simple operational semantics for x86 memory based on extensive experiments with many processor models, makes this clear [2]: the put fence will evict all other cores’ versions of the modified cache line, so later gets will return the most-up-to-date data.) On other machines, however, a get operation might return cached data, older than a previous nonoverlapping put, even with the barrier. (For an example of why this can happen, see the section on “invalidate queues” in Paul McKenney’s “Memory Barriers: A Hardware View for Software Hackers” [3].) On such architectures (I believe ARM and POWER among them), truly linearizable code would require a fence in get operations too, something like:

    uint32_t get(int i) const {
        asm("mfence" : : : "memory");
        return _data[i];
    }

For fun I measured the x86 version of this; the additional fences were even more expensive, at 28.15ns/operation—more than 10x worse than the compiler-fence version.

This is a very aggressive example. Operations more complex than get and put would partially hide the cost of fences, and more complex write-side code than put usually requires expensive instructions to provide writer–writer locking. But the point should be clear: linearizability can be expensive.
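
As an aside, in portable C++11 you would obtain these guarantees by declaring the slots std::atomic and letting the compiler pick the fences. The sketch below is my rewrite, not code from the experiment; with the default sequentially consistent operations, x86 compilers generate roughly the mfence version of put and the cheap version of get, while ARM and POWER compilers emit barriers on both sides.

#include <atomic>
#include <cstddef>
#include <cstdint>

class atomic_hash {
    static const size_t size = 1 << 20;
    std::atomic<uint32_t> _data[size];    // same 2^20 slots, now atomic
  public:
    uint32_t get(int i) const {
        return _data[i].load();           // memory_order_seq_cst by default
    }
    void put(int i, uint32_t x) {
        _data[i].store(x);                // memory_order_seq_cst by default
    }
};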

Linearizability is not always distinguishable

The matter would end there if linearizability mattered. But does it really matter? In particular, does it matter in an operating system context?

To make this concrete, consider two threads, A and B, that each call read on a shared file descriptor. Say that A makes its call before B, like so:

(1) A: call read()
       => A returns
                      B: call read()
                         => B returns

Then linearizability requires A’s call to happen before B’s. For instance, if the fd has exactly one byte of data left, then A has to get the byte. On the other hand, if the calls looked like this—overlapped:

(2) A: call read()
                      B: call read()
       => A returns
                         => B returns

then either A or B would be allowed to get the byte.

But wait a minute. In example (1) we, as omniscient observers, can tell that A returned before B began. But can the program tell this? Well, imagine that immediately after the kernel sets up A’s return, processor A takes a network interrupt. This interrupt could delay A’s first post-return instruction until later, like so:

(3) A: call read()
       => A set up for return
    NETWORK INTERRUPT
                                    B: call read()
    NETWORK INTERRUPT DONE
       -> A executes post-return
          instruction
                                       => B returns

Now, look what we’ve done. By adding a network interrupt, we have made execution (1) look exactly like execution (2). This shows that from the point of view of the process, it may be impossible to distinguish execution (1) from execution (2). And this, in turn, means that the linearizability requirement is often overkill. Why enforce an expensive correctness condition that distinguishes overlapping operations from non-overlapping operations when the program can’t tell whether operations overlapped?

In practice, what we want is something like linearizability, but not as fussy. If operation α happens just a tiny bit before operation β on a different thread, then we probably don’t care about their order (because the program often couldn’t distinguish their order). But if α happened a while before β we definitely shouldn’t reorder them. A similar argument applies to more than two threads: it often doesn’t matter if threads A and B disagree on the order in which two events happened, as long as those events happened very close together. Unfortunately, this fuzzy linearizability notion isn’t (to my knowledge) well defined or studied, and most people just stick with normal linearizability, semi-pointless and expensive though it may be.

Read-Copy Update

Read-copy update steers between these rocks with a well-chosen and efficient design. We study RCU both to understand these concurrency issues and to learn the highest-performance concurrent programming technique I know of.

Here’s one view of RCU’s basic ideas:

  1. Readers run without synchronization: no locks, no writes to shared memory, no expensive instructions. Read-side critical sections are marked, but the marking costs essentially nothing.

  2. Writers never modify data in place where readers might observe an inconsistent intermediate state. Instead, a writer copies (or freshly allocates) the data it wants to change, initializes the new version, and then publishes it with a single pointer assignment. Readers therefore see either the old version or the new one, never a mixture.

  3. Old versions cannot be freed immediately, since readers may still hold references to them. Instead, a writer waits for a grace period, after which every reader that might have seen the old version is guaranteed to have finished, and only then reclaims the memory.

RCU operations

There are many RCU APIs; here are the basics (the Linux kernel’s names):

  rcu_read_lock()             enter a read-side critical section
  rcu_read_unlock()           leave a read-side critical section
  rcu_dereference(p)          read RCU-protected pointer p within a read-side critical section
  rcu_assign_pointer(p, v)    publish a new version: point p at the fully initialized object v
  synchronize_rcu()           wait for a grace period: block until every read-side critical section that was in progress has completed, after which old versions may be reclaimed
  call_rcu(head, func)        asynchronous variant: invoke func(head) after a grace period
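
For concreteness, here is a hedged sketch of how these calls fit together, written in Linux-kernel style. A reader looks up a field through an RCU-protected pointer; a writer publishes an updated copy and reclaims the old one after a grace period. The struct config, the function names, and the spinlock for writer–writer exclusion are my inventions; the RCU calls themselves are the standard kernel API.

#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct config {
    int threshold;
    int limit;
};

static struct config __rcu *cur_config;      /* RCU-protected pointer */
static DEFINE_SPINLOCK(config_lock);         /* serializes writers only */

/* Read-side: no locks, no shared-memory writes, no fences on x86. */
int read_threshold(void)
{
    struct config *c;
    int t = 0;

    rcu_read_lock();
    c = rcu_dereference(cur_config);
    if (c)
        t = c->threshold;
    rcu_read_unlock();
    return t;
}

/* Write-side: copy, modify the copy, publish, wait, then reclaim. */
void set_threshold(int threshold)
{
    struct config *newc, *oldc;

    newc = kmalloc(sizeof(*newc), GFP_KERNEL);
    if (!newc)
        return;

    spin_lock(&config_lock);
    oldc = rcu_dereference_protected(cur_config,
                                     lockdep_is_held(&config_lock));
    if (oldc)
        *newc = *oldc;                       /* start from the current version */
    else
        newc->limit = 0;
    newc->threshold = threshold;
    rcu_assign_pointer(cur_config, newc);    /* readers see old or new, never a mix */
    spin_unlock(&config_lock);

    synchronize_rcu();                       /* wait for pre-existing readers */
    kfree(oldc);                             /* safe: no reader can still hold oldc */
}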

What these might compile to on x86 with a non-preemptive kernel:

  rcu_read_lock()             nothing at all
  rcu_read_unlock()           nothing at all
  rcu_dereference(p)          an ordinary load, plus a compiler barrier so the compiler cannot reorder around it
  rcu_assign_pointer(p, v)    a compiler barrier followed by an ordinary store (x86 already orders stores)
  synchronize_rcu()           wait until every CPU has passed through a quiescent state, such as a context switch; in a non-preemptive kernel that guarantees every pre-existing read-side critical section has finished

Note that all read-side operations compile to the fastest possible code: they are indistinguishable from unsynchronized code. That’s why RCU wins.

Caveats

Not all data structures are amenable to the constrained access order reasoning that underlies RCU. Linked lists are a perfect match; balanced trees, for example, are harder. I believe part of the real future of parallel programming lies in developing RCU techniques for more advanced data structures. This will require creative use of the RCU API, as well as additions to the API. That’s why the hash table paper [4] is exciting.

History

RCU-like “garbage collection” techniques have been implemented several times over the years (see [5] for local interest). RCU still feels new, however. My best guess for why: RCU’s developers were the first to recognize the design point that combines access-order reasoning with loose memory models.


  1. Hagit Attiya, Rachid Guerraoui, Danny Hendler, Petr Kuznetsov, Maged M. Michael, and Martin Vechev, “Laws of Order: Expensive Synchronization in Concurrent Algorithms Cannot be Eliminated”, in Proc. POPL ’11, Jan. 2011 (via the ACM Digital Library; see also Paul McKenney’s gloss on the paper)

  2. Scott Owens, Susmit Sarkar, and Peter Sewell, “A better x86 memory model: x86-TSO (extended version)”, Technical Report UCAM-CL-TR-745, 2010; versions also published in TPHOLs 2009 and Communications of the ACM 53(7), May 2010 (authors’ tech report version; authors’ project page)

  3. Paul E. McKenney, “Memory Barriers: A Hardware View for Software Hackers”, 2010

  4. Josh Triplett, Paul E. McKenney, and Jonathan Walpole, “Resizable, Scalable, Concurrent Hash Tables via Relativistic Programming”, in Proc. USENIX ATC 2011, June 2011

  5. H. T. Kung and Philip L. Lehman, “Concurrent manipulation of binary search trees”, ACM Transactions on Database Systems 5(3), Sept. 1980 (via ACM Digital Library)