Fairness
- Count the number of times a lock is obtained multiple times consecutively by the same thread
2101989 (3156)
2129185 (4436)
2139257 (4579)
2108552 (5182)
- More data
- >96% of lock acquisitions alternate with another thread
- But >0.1% of lock acquisitions occur in a consecutive sequence of >28 acquisitions by the same thread
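A hypothetical sketch of how such counts might be collected, assuming a test-and-set spinlock instrumented to remember its last owner; counting_spinlock and its fields are illustrative, not the code behind the numbers above:

#include <atomic>
#include <thread>

struct counting_spinlock {
    std::atomic<bool> locked_ = false;
    std::thread::id last_owner_;       // protected by the lock itself
    unsigned long acquisitions_ = 0;   // total acquisitions
    unsigned long repeats_ = 0;        // consecutive same-thread reacquisitions
    void lock() {
        while (locked_.exchange(true)) {   // spin until we grab the lock
        }
        ++acquisitions_;
        if (last_owner_ == std::this_thread::get_id()) {
            ++repeats_;                    // same thread acquired again, back to back
        }
        last_owner_ = std::this_thread::get_id();
    }
    void unlock() {
        locked_.store(false);
    }
};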
Solving fairness
- Want a mechanism to serve lock requests in order
- Must define an order
Ticket lock
struct ticket_lock {
    std::atomic<unsigned> now_ = 0;    // ticket currently being served
    std::atomic<unsigned> next_ = 0;   // next ticket to hand out
    void lock() {
        unsigned me = next_++;         // atomically take a ticket
        while (me != now_) {           // spin until my ticket is called
            pause();
        }
    }
    void unlock() {
        now_++;                        // call the next ticket
    }
};
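The pause() called in these spin loops is assumed to be a spin-wait CPU hint; a minimal sketch for x86 (other architectures would use their own hint instruction, or nothing):

#include <immintrin.h>

// Assumed helper: tell the processor we are in a spin-wait loop.
// _mm_pause() emits the x86 PAUSE instruction, which reduces power use
// and memory-order speculation penalties while spinning.
inline void pause() {
    _mm_pause();
}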
- Sleeping? (A ticket lock only spins; there is no obvious way for a waiter to sleep until its turn.)
MCS (Mellor-Crummey Scott) lock
- Paper: “Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors”, Mellor-Crummey and Scott, 1991 (a shorter earlier version appeared first)
- Early high-performance fair lock, good at high contention
- Data structure: A queue of waiting threads, implemented as a singly-linked list
- Shared lock points at the tail of the list (the thread that will get the lock last, if any)
- Threads spin not on shared state, but on local state, which makes spinning much cheaper (no hammering the cache coherence protocol)
- Used widely in Java; a variant was proposed for Linux (used only narrowly due to its high default overhead)
struct mcs_lock {
    struct qentry {                  // user must declare one of these to lock
        std::atomic<qentry*> next;
        std::atomic<bool> blocked;
    };
    std::atomic<qentry*> lk_ = nullptr;         // points at tail of waiter queue
    void lock(qentry& w) {
        w.next = nullptr;
        qentry* prev_tail = lk_.exchange(&w);   // mark self as tail
        if (prev_tail) {                        // previous tail exists
            w.blocked = true;
            prev_tail->next = &w;               // link self after previous tail
            while (w.blocked) {                 // spin on *local* state
                pause();
            }
        }
    }
    void unlock(qentry& w) {
        qentry* expected_tail = &w;
        // strong CAS: a spurious failure here could strand us in the
        // spin below when no successor ever arrives
        if (!w.next
            && lk_.compare_exchange_strong(expected_tail, nullptr)) {
            return;                             // no one else is waiting
        }
        while (!w.next) {                       // wait for successor to link itself
            pause();
        }
        w.next.load()->blocked = false;         // hand the lock to the successor
    }
};
// some function that uses a lock `l`
void f() {
    ...
    mcs_lock::qentry w;    // queue entry lives on the caller's stack
    l.lock(w);
    ...
    l.unlock(w);
    ...
}
About MCS
- Much more overhead!
- Atomic operations and spins in both lock and unlock
- Higher latency, lower throughput
- Relatively high space overhead
- Advantages to this overhead
- Fairness
- Each waiter is spinning on private state, not shared state!
- Fewer cache line conflicts
- How does it work?
  - If the atomic swap (std::atomic<T>::exchange) in lock returns non-nullptr, w is appended to the linked list and its thread spins until it is notified that it is no longer blocked; unlock waits until there is a next waiter in the list, and unblocks it.
  - If the atomic swap returns nullptr, the lock was free, so lock returns immediately.
- MCS lock is optimized for high contention, but in that case you already have performance issues.
Blocking in locks
- Ideally, we’d like a thread blocked on a lock to sleep
- Lower overhead on high contention
- Supports higher thread counts
- But requires kernel involvement
- Pipes? (one possibility, sketched below)
- Kernel-managed lock abstractions?
- Some new primitive?
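For the pipe option, a lock really can block in the kernel using only pipes: lock() reads a byte (sleeping until one is available) and unlock() writes it back. A minimal sketch; pipe_lock is illustrative, and since every contended operation is a system call this is a thought experiment rather than a practical mutex:

#include <unistd.h>

// The pipe holds one byte when the lock is free, zero bytes when held.
struct pipe_lock {
    int fd_[2];
    pipe_lock() {
        (void) pipe(fd_);
        char c = 0;
        (void) write(fd_[1], &c, 1);   // start unlocked
    }
    void lock() {
        char c;
        (void) read(fd_[0], &c, 1);    // blocks in the kernel until unlock
    }
    void unlock() {
        char c = 0;
        (void) write(fd_[1], &c, 1);   // wake one blocked locker
    }
};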
Mutex goals
- Goal: low overhead at low contention, fairness at high contention
- Goal: blocked threads should sleep (requires kernel communication)
- Should lock be a system call?
  - No! System calls are expensive
Futex
- Two-phase design
- Phase 1
- Assume low contention
- No kernel involvement
- User enforces FIFO if desired
- Phase 2
- After observing contention
- Kernel involvement: user blocks
- Kernel enforces FIFO
- (Except a thread can get lucky, and threads can have different priority, etc.)
- User can enforce lock order
- References
  - The futex(2) manual page
  - “Futexes Are Tricky”, by Ulrich Drepper
  - “A futex overview and update”, by Darren Hart
  - Example usage: “Futex based locks for C11’s generic atomics.” Jens Gustedt. [Research Report] RR-8818, INRIA Nancy, 2015
  - “Locking in WebKit”, by Filip Pizlo (not about futexes, but a very clear explanation of this two-phase structure)
Futex example (Drepper’s “Mutex #3”)
struct futex_lock {
    std::atomic<int> val_;
    // 0 = unlocked; 1 = locked, no futex waiters;
    // 2 = locked, maybe futex waiters
    void lock() {
        // phase 1: spin in user space, hoping for low contention
        for (unsigned i = 0; i < 40; i++) {
            int expected = 0;
            if (val_.compare_exchange_weak(expected, 1)) {
                return;
            }
            sched_yield();
        }
        // phase 2: mark the lock contended and sleep in the kernel
        int previous = val_.load(std::memory_order_relaxed);
        if (previous != 2) {
            previous = val_.exchange(2);
        }
        while (previous != 0) {
            // sleep, but only if `val_` still equals 2
            futex(&val_, FUTEX_WAIT, 2);
            previous = val_.exchange(2);
        }
    }
    void unlock() {
        if (--val_ != 0) {        // atomic decrement; nonzero means maybe waiters
            val_ = 0;
            futex(&val_, FUTEX_WAKE, 1);    // wake at most one waiter
        }
    }
};
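The futex() call used above is shorthand: Linux does not expose a futex() function in libc, so a thin wrapper over the raw system call is assumed, roughly:

#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Assumed wrapper for the futex system call. On Linux, std::atomic<int>
// is lock-free and has the int layout the kernel expects.
long futex(std::atomic<int>* addr, int op, int val) {
    return syscall(SYS_futex, reinterpret_cast<int*>(addr), op, val,
                   nullptr, nullptr, 0);
}

FUTEX_WAIT sleeps only if *addr still equals val (here, 2), which closes the race between observing contention and going to sleep.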
- With low contention, the kernel does not get involved; with high contention, threads block
- About 40 user-level tries seems to be empirically optimal, for reasons that are not fully understood
- Switching to a similar strategy increased performance by 10% in WebKit
Readers/writer locks
These are good when reads are much more common than writes. There can be any number of readers or a single writer.
The lock has three states
- unlocked (val_ == 0)
- read locked (val_ > 0)
- write locked (val_ == -1)
struct rw_lock {
    std::atomic<int> val_ = 0;
    void lock_read() {
        int expected = val_;
        // wait until there is no writer, then add one reader
        while (expected < 0
               || !val_.compare_exchange_weak(expected, expected + 1)) {
            pause();
            expected = val_;
        }
    }
    void unlock_read() {
        --val_;                        // remove one reader
    }
    void lock_write() {
        int expected = 0;
        // wait until completely unlocked, then take exclusive ownership
        while (!val_.compare_exchange_weak(expected, -1)) {
            pause();
            expected = 0;
        }
    }
    void unlock_write() {
        val_ = 0;
    }
};
- This readers/writer lock is not fair: it starves writers. A steady stream of readers keeps val_ > 0, so a writer’s compare-exchange from 0 can fail forever.
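A usage sketch, assuming the rw_lock above protects a shared configuration record; config, current_timeout, and set_timeout are illustrative names:

struct config {
    int timeout_ms;
    int retries;
};
config cfg;
rw_lock cfg_lock;

int current_timeout() {
    cfg_lock.lock_read();      // any number of readers at once
    int t = cfg.timeout_ms;
    cfg_lock.unlock_read();
    return t;
}
void set_timeout(int t) {
    cfg_lock.lock_write();     // excludes all readers and other writers
    cfg.timeout_ms = t;
    cfg_lock.unlock_write();
}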
rwlocks: Reducing memory contention
- Give each CPU its own lock
struct rw_lock_2 {
    spinlock f_[NCPU];    // would really want separate cache lines
    void lock_read() {
        // a reader takes only its own CPU's lock
        // (assumes the thread does not migrate before unlock_read)
        f_[this_cpu()].lock();
    }
    void unlock_read() {
        f_[this_cpu()].unlock();
    }
    void lock_write() {
        // a writer must take every CPU's lock
        for (unsigned i = 0; i != NCPU; ++i) {
            f_[i].lock();
        }
    }
    void unlock_write() {
        for (unsigned i = 0; i != NCPU; ++i) {
            f_[i].unlock();
        }
    }
};
- This makes reads fast and writes fairly slow
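The this_cpu() helper is assumed; on Linux it could be approximated with sched_getcpu() (a glibc extension), keeping in mind that the thread may migrate between lock_read and unlock_read, which this sketch ignores:

#include <sched.h>

// Assumed helper: index of the CPU the calling thread is running on.
inline unsigned this_cpu() {
    int cpu = sched_getcpu();
    return cpu >= 0 ? static_cast<unsigned>(cpu) : 0;   // fall back on error
}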
Write balance in practice
- Many structures in practice have extremely unbalanced workloads
- Many more reads than writes
- For some structures (e.g., OS configuration), millions of times more
- Cost of rwlocks?
- Can we reduce the cost of reads even further?
RCU (Read-Copy Update)
- Goal: zero-op read locks
- Surprisingly, this ends up becoming a garbage collection problem!
Zero-op read locks: the simple case
- Let’s say we wanted to read and write a single integer
Reading and writing integers safely
- Use atomics!
- std::memory_order_relaxed indicates that no ordering relative to other memory operations is required, only atomicity
std::atomic<int> val;

int read() {
    return val.load(std::memory_order_relaxed);
}
void write(int x) {
    val.store(x, std::memory_order_relaxed);
}
- Commonly want more exclusion on writes
std::atomic<int> val;
spinlock val_lock;

int read() {
    return val.load(std::memory_order_relaxed);
}
void modify() {
    val_lock.lock();
    ... compute with current `val`; other writers are excluded ...
    val.store(x, std::memory_order_relaxed);   // `x` is the newly computed value
    val_lock.unlock();
}
- Works for ≤8-byte objects that are aligned (not crossing cache line boundaries)
- What about larger objects?
Idea: Use atomics to name versions
- Read: Obtain a pointer to the version
- That version will never change
- Write: Create a new version, install a pointer to it
struct bigvalue { ... };
std::atomic<bigvalue*> val;
spinlock val_lock;

bigvalue* read_snapshot() {
    // a relaxed load works in practice: dereferences of the returned
    // pointer depend on its value (Linux RCU relies on this)
    return val.load(std::memory_order_relaxed);
}
void modify() {
    val_lock.lock();
    bigvalue* newval = new bigvalue;
    ... compute with `val`; initialize `newval` ...
    // release ordering publishes `newval` only after its initialization
    val.store(newval, std::memory_order_release);
    val_lock.unlock();
}
Memory allocation and RCU
- When can we delete the old version?
- This is a garbage collection problem!
- Don’t want a full garbage collector
- Even highly concurrent garbage collectors can “stop the world” for some fraction of their runtime
- Explicit memory deallocation is more efficient
- Timely memory deallocation can be important to avoid running out of memory
Idea: Epoch-based reclamation
- Track “read-side critical sections”
- Sort of like a read lock
- Sort of like a reference
- But not associated with specific data
- Guarantees the continued existence of all objects reachable at a certain time
- Cheaper and lower-overhead than explicit data-associated locks
Sketch
struct bigvalue { ... };
std::atomic<bigvalue*> val;
spinlock val_lock;
std::deque<std::pair<bigvalue*, time_t>> val_garbage;
time_t current_epoch;
time_t read_epochs[NCPU];    // 0 = no read-side critical section active

void read() {
    // start read-side critical section
    read_epochs[this_cpu()] = current_epoch;
    bigvalue* v = val.load(std::memory_order_relaxed);
    ... use `v` arbitrarily ...
    // mark completion
    read_epochs[this_cpu()] = 0;
    ... MUST NOT refer to `v` (it might be freed!) ...
}
void modify() {
    val_lock.lock();
    bigvalue* oldval = val;
    bigvalue* newval = new bigvalue;
    ... compute, initialize ...
    val.store(newval);
    val_lock.unlock();
    val_garbage.push_back({oldval, current_epoch});
    // NB should lock `val_garbage`
}
void gc() {
    // run periodically
    time_t garbage_epoch = min(read_epochs);   // ignoring zeroes
    ++current_epoch;
    free all `val_garbage` older than `garbage_epoch`;
}
- Must ensure that all snapshots obtained in a read-side critical section exist until the read-side critical section completes
- Garbage rendered unreachable at time T must not be freed until all read-side critical sections active at time T complete
- Solution: Reading threads record the current time (current_epoch) in a per-CPU global variable (read_epochs) when beginning a read-side critical section
- Garbage is stored in a list, val_garbage
  - Each piece of garbage is accompanied by an epoch initialized with current_epoch
- Garbage is freed based on the minimum recorded read epoch of any reading thread
  - Call this garbage_epoch
  - Can free garbage with epoch T when T < garbage_epoch
- Proof sketch
  - Garbage recorded with epoch T was rendered unreachable during epoch T
  - That means that only threads with read epoch <= T could observe the garbage
  - If all threads have read epoch > T, then no thread is observing the garbage now, nor will any in the future
  - So the garbage is safe to free
Other mechanisms
- Epoch-based reclamation is fast; other mechanisms can be even faster
- Quiescent-state-based reclamation (see the sketch below)
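A hypothetical sketch of the quiescent-state idea (names illustrative; NCPU and this_cpu() as in the sketches above): instead of recording an epoch around every read, each thread periodically announces that it is quiescent, i.e., that it currently holds no references to shared objects.

#include <atomic>

std::atomic<unsigned long> reclaim_epoch = 1;     // bumped by the reclaimer
std::atomic<unsigned long> quiescent_epoch[NCPU];

void on_quiescent_state() {    // called when a thread holds no references,
                               // e.g., at the top of an event loop
    quiescent_epoch[this_cpu()] = reclaim_epoch.load();
}

bool safe_to_free(unsigned long unlink_epoch) {
    // Garbage unlinked during `unlink_epoch` may be freed once every
    // thread has announced quiescence in a strictly later epoch.
    for (unsigned i = 0; i != NCPU; ++i) {
        if (quiescent_epoch[i].load() <= unlink_epoch) {
            return false;
        }
    }
    return true;
}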
References
“RCU Usage In the Linux Kernel: One Decade Later.” Paul E. McKenney, Silas Boyd-Wickizer, and Jonathan Walpole. Link
Read-copy update (RCU) is a scalable high-performance synchronization mechanism implemented in the Linux kernel. RCU’s novel properties include support for concurrent reading and writing, and highly optimized inter-CPU synchronization. Since RCU’s introduction into the Linux kernel over a decade ago its usage has continued to expand. Today, most kernel subsystems use RCU. This paper discusses the requirements that drove the development of RCU, the design and API of the Linux RCU implementation, and how kernel developers apply RCU.
“Performance of memory reclamation for lockless synchronization.” Thomas E. Hart, Paul E. McKenney, Angela Demke Brown, and Jonathan Walpole. Link
Achieving high performance for concurrent applications on modern multiprocessors remains challenging. Many programmers avoid locking to improve performance, while others replace locks with non-blocking synchronization to protect against deadlock, priority inversion, and convoying. In both cases, dynamic data structures that avoid locking require a memory reclamation scheme that reclaims elements once they are no longer in use.
The performance of existing memory reclamation schemes has not been thoroughly evaluated. We conduct the first fair and comprehensive comparison of three recent schemes—quiescent-state-based reclamation, epoch-based reclamation, and hazard-pointer-based reclamation—using a flexible microbenchmark. Our results show that there is no globally optimal scheme. When evaluating lockless synchronization, programmers and algorithm designers should thus carefully consider the data structure, the workload, and the execution environment, each of which can dramatically affect the memory reclamation performance.
We discuss the consequences of our results for programmers and algorithm designers. Finally, we describe the use of one scheme, quiescent-state-based reclamation, in the context of an OS kernel—an execution environment which is well suited to this scheme.