Blocking in locks
- Ideally, we’d like a thread blocked on a lock to sleep
- Lower overhead under high contention
- Supports higher thread counts
- But requires kernel involvement
- Only the kernel can block!
- Pipes?
- Kernel-managed lock abstractions?
- Some new primitive?
Mutex goals
- Goal: low overhead at low contention, fairness at high contention
- Goal: block on sleep (requires kernel communication)
- Should `lock()` be a system call?
- No! System calls are expensive
Futex
- Two-phase design
- Phase 1
- Assume low contention
- No kernel involvement
- User enforces FIFO if desired
- Phase 2
- After observing contention
- Kernel involvement: user blocks
- Kernel enforces FIFO
- (Except a thread can get lucky, and threads can have different priority, etc.)
- User can enforce lock order
- References
- The futex(2) manual page
- “Futexes Are Tricky”, by Ulrich Drepper
- A futex overview and update by Darren Hart
- Example usage: “Futex based locks for C11’s generic atomics.” Jens Gustedt. [Research Report] RR-8818, INRIA Nancy, 2015
- “Locking in WebKit” by Filip Pizlo (not about futexes, but a very clear explanation of this two-phase structure)
Futex example (Drepper’s “Mutex #3”)
struct futex_lock {
std::atomic<int> val_;
// 0 = unlocked; 1 = locked, no futex waiters;
// 2 = locked, maybe futex waiters
void lock() {
// phase 1
for (unsigned i = 0; i != 40; ++i) {
int expected = 0;
if (val_.compare_exchange_weak(expected, 1)) {
return;
}
sched_yield();
}
// phase 2
int previous = val_.load(std::memory_order_relaxed);
if (previous != 2) {
previous = val_.exchange(2);
}
while (previous != 0) {
futex(&val_, FUTEX_WAIT, 2);
previous = val_.exchange(2);
}
}
void unlock() {
if (--val_ != 0) { // atomic decrement
val_ = 0;
futex(&val_, FUTEX_WAKE, 1);
}
}
};
- With low contention, the kernel does not get involved; with high contention, threads block
- Around 40 user-level tries seems to be a sweet spot in practice
- Switching to a similar strategy increased performance by 10% in WebKit
Futex system call semantics
- In the example, we saw two calls to `futex`
- In `lock()`: `futex(&val_, FUTEX_WAIT, 2)`
- “Block until `val_ != 2`”
- In `unlock()`: `futex(&val_, FUTEX_WAKE, 1)`
- “Wake up one waiter blocked on `val_`”
- How does the lock avoid unnecessary system calls?
- How does the lock avoid race conditions like lost wakeups (i.e., how does it make all necessary system calls)?
- Can the lock demonstrate any unexpected or unusual behaviors?
Futex treasure hunt
- Describe an execution (a sequence of lock & unlock operations, attributed to specific threads) where `val_` never becomes 2
- Describe an execution where `val_` takes values 0 → 1 → 2 → 1 → 0 → 2 → 1 → 0, and say where the `futex` system calls occur
- Find an optimization in the code that avoids an atomic operation in the common case (i.e., code that could be removed without breaking semantics)
- Describe a missing optimization that could avoid useless atomic operations on a contended lock

If the lock is never acquired, then `val_` is always 0! Or, if the lock is acquired once and then never released, it takes values 0 → 1. Any uncontended execution, where every attempt to acquire the lock happens when the lock is unlocked, will alternate between values 0 and 1.

Here’s an example sequence:
| Step | Thread | Lock holder | Description | val_ | futex calls |
|------|--------|-------------|-------------|------|-------------|
| 0 | – | – | The lock is initially unlocked | 0 | |
| 1 | A | A | Calls lock(); val_.compare_exchange_weak succeeds | 1 | |
| 2 | B | A | Calls lock(); val_.compare_exchange_weak fails 40x | 1 | |
| 3 | B | A | val_.exchange(2) | 2 | |
| 4 | B | A | Blocks | 2 | futex(&val_, FUTEX_WAIT, 2) |
| 5 | A | A | Calls unlock(); --val_ | 1 | |
| 6 | A | – | Sets val_ = 0 | 0 | futex(&val_, FUTEX_WAKE, 1) |
| 7 | B | B | Unblocks, performs val_.exchange(2); acquires lock because previous == 0 | 2 | |
| 8 | B | B | Calls unlock(); --val_ | 1 | |
| 9 | B | – | Sets val_ = 0 | 0 | futex(&val_, FUTEX_WAKE, 1) |
The entire “Phase 1” loop is an optimization! Also an optimization is this code, at the start of Phase 2:

int previous = val_.load(std::memory_order_relaxed);
if (previous != 2) {
    previous = val_.exchange(2);
}

The following would be equivalent. The optimized version, however, avoids an atomic memory write in the expected case at this point in the code: Phase 2 is reached when the lock is contended, and if the lock is contended, `val_` is usually 2.

int previous = val_.exchange(2);

Arguably, in Phase 1, if `val_`’s previous value is observed to be 2, then the locking thread should exit the Phase 1 loop and go straight to Phase 2:

for (unsigned i = 0; i < 40; ++i) {
    int expected = 0;
    if (val_.compare_exchange_weak(expected, 1)) {
        return;
    } else if (expected == 2) {
        break;
    }
    sched_yield(); // or `pause()`
}
Futex code (repeated)
struct futex_lock {
std::atomic<int> val_;
// 0 = unlocked; 1 = locked, no futex waiters;
// 2 = locked, maybe futex waiters
void lock() {
// phase 1
for (unsigned i = 0; i != 40; ++i) {
int expected = 0;
if (val_.compare_exchange_weak(expected, 1)) {
return;
}
sched_yield(); // or `pause()`
}
// phase 2
int previous = val_.load(std::memory_order_relaxed);
if (previous != 2) {
previous = val_.exchange(2);
}
while (previous != 0) {
futex(&val_, FUTEX_WAIT, 2);
previous = val_.exchange(2);
}
}
void unlock() {
if (--val_ != 0) { // atomic decrement
val_ = 0;
futex(&val_, FUTEX_WAKE, 1);
}
}
};
Futex system call implementation
From the futex(2) manual page:

A futex is a 32‐bit value—referred to below as a futex word—whose address is supplied to the futex() system call. (Futexes are 32 bits in size on all platforms, including 64‐bit systems.) All futex operations are governed by this value. … [T]he futex word is used to connect the synchronization in user space with the implementation of blocking by the kernel. Analogously to an atomic compare‐and‐exchange operation that potentially changes shared memory, blocking via a futex is an atomic compare‐and‐block operation.
- The kernel’s implementation of the `futex` system call involves several pieces of state
- The futex word is stored in user-accessible memory
- 32-bit value, 32-bit aligned
- All futex operations access the futex word
- Futex wait queues are stored in kernel memory
- For each futex word, which user threads, if any, are blocked?
- Blocking operations like `FUTEX_WAIT` may add a thread to a futex wait queue
- Waking operations like `FUTEX_WAKE` may remove a thread from a futex wait queue
Futex system call design
- Sketch an implementation of the futex system call (`FUTEX_WAKE` and `FUTEX_WAIT` only!)
- How will you implement a futex word?
- How will you implement futex wait queues?
Write balance in practice
- Many structures in practice have extremely unbalanced workloads
- Many more reads than writes
- For some structures (e.g., OS configuration), millions of times more
- Cost of rwlocks?
- Can we reduce the cost of reads even further?
RCU (Read-Copy Update)
- Goal: zero-op read locks
- Surprisingly, this ends up becoming a garbage collection problem!
Zero-op read locks: the simple case
- Let’s say we wanted to read and write a single integer
Reading and writing integers safely
- Use atomics!
- `std::memory_order_relaxed` indicates that serial orders are not required
std::atomic<int> val;
int read() {
return val.load(std::memory_order_relaxed);
}
void write(int x) {
val.store(x, std::memory_order_relaxed);
}
- Commonly want more exclusion on writes
std::atomic<int> val;
spinlock val_lock;
int read() {
return val.load(std::memory_order_relaxed);
}
void modify() {
val_lock.lock();
... compute new value `x` from current `val`; other writers are excluded ...
val.store(x, std::memory_order_relaxed);
val_lock.unlock();
}
- Works for ≤8-byte objects that are aligned (not crossing cache line boundaries)
- What about larger objects?
Idea: Use atomics to name versions
- Read: Obtain a pointer to the version
- That version will never change
- Write: Create a new version, install a pointer to it
struct bigvalue { ... };
std::atomic<bigvalue*> val;
spinlock val_lock;
bigvalue* read_snapshot() {
return val.load(std::memory_order_relaxed);
}
void modify() {
val_lock.lock();
bigvalue* newval = new bigvalue;
... compute with `val`; initialize `newval` ...
val.store(newval, std::memory_order_relaxed);
val_lock.unlock();
}
Memory allocation and RCU
- When can we delete the old version?
- This is a garbage collection problem!
- Don’t want a full garbage collector
- Even highly concurrent garbage collectors can “stop the world” for some fraction of their runtime
- Explicit memory deallocation is more efficient
- Timely memory deallocation can be important to avoid running out of memory
Idea: Epoch-based reclamation
- Track “read-side critical sections”
- Sort of like a read lock
- Sort of like a reference
- But not associated with specific data
- Locks the existence of all objects reachable at a certain time
- Cheaper, lower-overhead than explicit data-associated locks
Sketch
struct bigvalue { ... };
std::atomic<bigvalue*> val;
spinlock val_lock;
std::deque<std::pair<bigvalue*, time_t>> val_garbage;
time_t current_epoch;
time_t read_epochs[NCPU];
void read() {
// start read-side critical section
read_epochs[this_cpu()] = current_epoch;
bigvalue* v = val.load(std::memory_order_relaxed);
... use `v` arbitrarily ...
// mark completion
read_epochs[this_cpu()] = 0;
... MUST NOT refer to `v` (it might be freed!) ...
}
void modify() {
val_lock.lock();
bigvalue* oldval = val;
bigvalue* newval = new bigvalue;
... compute, initialize ...
val.store(newval);
val_lock.unlock();
val_garbage.push_back({oldval, current_epoch});
// NB should lock `val_garbage`
}
void gc() {
// run periodically
time_t garbage_epoch = min(read_epochs); // ignoring zeroes
++current_epoch;
free all `val_garbage` older than `garbage_epoch`;
}
- Must ensure that all snapshots obtained in a read-side critical section exist until the read-side critical section completes
- Garbage rendered unreachable at time T must not be freed until all read-side critical sections active at time T complete
- Solution: Reading threads record the current time (`current_epoch`) in a global array (`read_epochs`) when beginning a read-side critical section
- Garbage is stored in a list, `val_garbage`
- Each piece of garbage is accompanied by an epoch, initialized with `current_epoch`
- Garbage is freed based on the minimum `read_epochs` entry of any reading thread
- Call this `garbage_epoch`
- Can free garbage with epoch `T` when `T < garbage_epoch`
- Proof sketch
- Garbage recorded with epoch `T` was rendered unreachable during epoch `T`
- That means that only threads with `read_epochs[cpu] <= T` could observe the garbage
- If all threads have `read_epochs[cpu] > T`, then no thread is observing the garbage now or will in the future
- So the garbage is safe to free
- Garbage recorded with epoch
Other mechanisms
- Epoch-based reclamation is fast, but other mechanisms can be even faster
- Quiescent-state based reclamation
- References
- “RCU Usage In the Linux Kernel: One Decade Later.” Paul E. McKenney, Silas Boyd-Wickizer, and Jonathan Walpole.
Read-copy update (RCU) is a scalable high-performance synchronization mechanism implemented in the Linux kernel. RCU’s novel properties include support for concurrent reading and writing, and highly optimized inter-CPU synchronization. Since RCU’s introduction into the Linux kernel over a decade ago its usage has continued to expand. Today, most kernel subsystems use RCU. This paper discusses the requirements that drove the development of RCU, the design and API of the Linux RCU implementation, and how kernel developers apply RCU.
- “Performance of memory reclamation for lockless synchronization.” Thomas E. Hart, Paul E. McKenney, Angela Demke Brown, and Jonathan Walpole.
Achieving high performance for concurrent applications on modern multiprocessors remains challenging. Many programmers avoid locking to improve performance, while others replace locks with non-blocking synchronization to protect against deadlock, priority inversion, and convoying. In both cases, dynamic data structures that avoid locking require a memory reclamation scheme that reclaims elements once they are no longer in use.
The performance of existing memory reclamation schemes has not been thoroughly evaluated. We conduct the first fair and comprehensive comparison of three recent schemes—quiescent-state-based reclamation, epoch-based reclamation, and hazard-pointer-based reclamation—using a flexible microbenchmark. Our results show that there is no globally optimal scheme. When evaluating lockless synchronization, programmers and algorithm designers should thus carefully consider the data structure, the workload, and the execution environment, each of which can dramatically affect the memory reclamation performance.
We discuss the consequences of our results for programmers and algorithm designers. Finally, we describe the use of one scheme, quiescent-state-based reclamation, in the context of an OS kernel—an execution environment which is well suited to this scheme.
- RCU is being standardized in C++26!