Blocking in locks
- Ideally, we’d like a thread blocked on a lock to sleep
- Lower overhead under high contention
- Supports higher thread counts
- But requires kernel involvement
- Only the kernel can block!
- Pipes?
- Kernel-managed lock abstractions?
- Some new primitive?
Mutex goals
- Goal: low overhead at low contention, fairness at high contention
- Goal: block on sleep (requires kernel communication)
- Should `lock()` be a system call?
- No! System calls are expensive
Futex
- Two-phase design
- Phase 1
- Assume low contention
- No kernel involvement
- User enforces FIFO if desired
- Phase 2
- After observing contention
- Kernel involvement: user blocks
- Kernel enforces FIFO
- (Except a thread can get lucky, and threads can have different priority, etc.)
- User can enforce lock order
- References
- The futex(2) manual page
- “Futexes Are Tricky”, by Ulrich Drepper
- A futex overview and update by Darren Hart
- Example usage: “Futex based locks for C11’s generic atomics.” Jens Gustedt. [Research Report] RR-8818, INRIA Nancy, 2015
- “Locking in WebKit” by Filip Pizlo (not about futexes, but a very clear explanation of this two-phase structure)
Futex example (Drepper’s “Mutex #3”)
struct futex_lock {
std::atomic<int> val_;
// 0 = unlocked; 1 = locked, no futex waiters;
// 2 = locked, maybe futex waiters
void lock() {
// phase 1
for (unsigned i = 0; i != 40; ++i) {
int expected = 0;
if (val_.compare_exchange_weak(expected, 1)) {
return;
}
sched_yield();
}
// phase 2
int previous = val_.load(std::memory_order_relaxed);
if (previous != 2) {
previous = val_.exchange(2);
}
while (previous != 0) {
futex(&val_, FUTEX_WAIT, 2);
previous = val_.exchange(2);
}
}
void unlock() {
if (--val_ != 0) { // atomic decrement
val_ = 0;
futex(&val_, FUTEX_WAKE, 1);
}
}
};
- With low contention, the kernel does not get involved; with high contention, threads block
- Around 40 user-level tries seems to be a sweet spot in practice
- Switching to a similar strategy increased performance by 10% in WebKit
Futex system call semantics
- In the example, we saw two calls to `futex`
- In `lock()`: `futex(&val_, FUTEX_WAIT, 2)`
- “Block until `val_ != 2`”
- In `unlock()`: `futex(&val_, FUTEX_WAKE, 1)`
- “Wake up one waiter blocked on `val_`”
- How does the lock avoid unnecessary system calls?
- How does the lock avoid race conditions like lost wakeups (i.e., how does it make all necessary system calls)?
- Can the lock demonstrate any unexpected or unusual behaviors?
Futex treasure hunt
- Describe an execution (a sequence of lock & unlock operations, attributed to specific threads) where `val_` never becomes 2
- Describe an execution where `val_` takes values 0 → 1 → 2 → 1 → 0 → 2 → 1 → 0, and say where the `futex` system calls occur
- Find an optimization in the code that avoids an atomic operation in the common case (i.e., code that could be removed without breaking semantics)
- Describe a missing optimization that could avoid useless atomic operations on a contended lock

If the lock is never acquired, then `val_` is always 0! Or, if the lock is acquired once and then never released, it takes values 0 → 1. Any uncontended execution, where every attempt to acquire the lock happens when the lock is unlocked, will alternate between values 0 and 1.

Here’s an example sequence:
| Step | Thread | Lock holder | Description | val_ | futex calls |
|------|--------|-------------|-------------|------|-------------|
| 0 | – | – | The lock is initially unlocked | 0 | |
| 1 | A | A | Calls lock(); val_.compare_exchange_weak succeeds | 1 | |
| 2 | B | A | Calls lock(); val_.compare_exchange_weak fails 40x | 1 | |
| 3 | B | A | val_.exchange(2) | 2 | |
| 4 | B | A | Blocks | 2 | futex(&val_, FUTEX_WAIT, 2) |
| 5 | A | A | Calls unlock(); --val_ | 1 | |
| 6 | A | – | Sets val_ = 0 | 0 | futex(&val_, FUTEX_WAKE, 1) |
| 7 | B | B | Unblocks, performs val_.exchange(2); acquires lock because previous == 0 | 2 | |
| 8 | B | B | Calls unlock(); --val_ | 1 | |
| 9 | B | – | Sets val_ = 0 | 0 | futex(&val_, FUTEX_WAKE, 1) |
The entire “Phase 1” loop is an optimization! Also an optimization is this code, at the start of Phase 2:

int previous = val_.load(std::memory_order_relaxed);
if (previous != 2) {
    previous = val_.exchange(2);
}

The following would be equivalent. The optimized version, however, avoids an atomic memory write in the expected case at this point in the code: Phase 2 is reached when the lock is contended, and if the lock is contended, `val_` is usually 2.

int previous = val_.exchange(2);

Arguably, in Phase 1, if `val_`’s previous value is observed to be 2, then the locking thread should exit the Phase 1 loop and go straight to Phase 2:

for (unsigned i = 0; i < 40; ++i) {
    int expected = 0;
    if (val_.compare_exchange_weak(expected, 1)) {
        return;
    } else if (expected == 2) {
        break;
    }
    sched_yield(); // or `pause()`
}
Futex code (repeated)
struct futex_lock {
std::atomic<int> val_;
// 0 = unlocked; 1 = locked, no futex waiters;
// 2 = locked, maybe futex waiters
void lock() {
// phase 1
for (unsigned i = 0; i != 40; ++i) {
int expected = 0;
if (val_.compare_exchange_weak(expected, 1)) {
return;
}
sched_yield(); // or `pause()`
}
// phase 2
int previous = val_.load(std::memory_order_relaxed);
if (previous != 2) {
previous = val_.exchange(2);
}
while (previous != 0) {
futex(&val_, FUTEX_WAIT, 2);
previous = val_.exchange(2);
}
}
void unlock() {
if (--val_ != 0) { // atomic decrement
val_ = 0;
futex(&val_, FUTEX_WAKE, 1);
}
}
};
Futex system call implementation
From the futex(2) manual page:

A futex is a 32‐bit value—referred to below as a futex word—whose address is supplied to the futex() system call. (Futexes are 32 bits in size on all platforms, including 64‐bit systems.) All futex operations are governed by this value. … [T]he futex word is used to connect the synchronization in user space with the implementation of blocking by the kernel. Analogously to an atomic compare‐and‐exchange operation that potentially changes shared memory, blocking via a futex is an atomic compare‐and‐block operation.
- The kernel’s implementation of the `futex` system call involves several pieces of state
- The futex word is stored in user-accessible memory
- 32-bit value, 32-bit aligned
- All futex operations access the futex word
- Futex wait queues are stored in kernel memory
- For each futex word, which user threads, if any, are blocked?
- Blocking operations like `FUTEX_WAIT` may add a thread to a futex wait queue
- Waking operations like `FUTEX_WAKE` may remove a thread from a futex wait queue
Futex system call design
- Sketch an implementation of the futex system call (`FUTEX_WAKE` and `FUTEX_WAIT` only!)
- How will you implement a futex word?
- How will you implement futex wait queues?
Write balance in practice
- Many structures in practice have extremely unbalanced workloads
- Many more reads than writes
- For some structures (e.g., OS configuration), millions of times more
- Cost of rwlocks?
- Can we reduce the cost of reads even further?
RCU (Read-Copy Update)
- Goal: zero-op read locks
- Surprisingly, this ends up becoming a garbage collection problem!
Zero-op read locks: the simple case
- Let’s say we wanted to read and write a single integer
Reading and writing integers safely
- Use atomics!
- `std::memory_order_relaxed` indicates that serial orders are not required
std::atomic<int> val;
int read() {
return val.load(std::memory_order_relaxed);
}
void write(int x) {
val.store(x, std::memory_order_relaxed);
}
- Commonly want more exclusion on writes
std::atomic<int> val;
spinlock val_lock;
int read() {
return val.load(std::memory_order_relaxed);
}
void modify() {
val_lock.lock();
... compute new value `x` from current `val`; other writers are excluded ...
val.store(x, std::memory_order_relaxed);
val_lock.unlock();
}
- Works for ≤8-byte objects that are aligned (not crossing cache line boundaries)
- What about larger objects?
Idea: Use atomics to name versions
- Read: Obtain a pointer to the version
- That version will never change
- Write: Create a new version, install a pointer to it
struct bigvalue { ... };
std::atomic<bigvalue*> val;
spinlock val_lock;
bigvalue* read_snapshot() {
return val.load(std::memory_order_relaxed);
}
void modify() {
val_lock.lock();
bigvalue* newval = new bigvalue;
... compute with `val`; initialize `newval` ...
val.store(newval, std::memory_order_relaxed);
val_lock.unlock();
}
Memory allocation and RCU
- When can we delete the old version?
- This is a garbage collection problem!
- Don’t want a full garbage collector
- Even highly concurrent garbage collectors can “stop the world” for some fraction of their runtime
- Explicit memory deallocation is more efficient
- Timely memory deallocation can be important to avoid running out of memory
Idea: Epoch-based reclamation
- Track “read-side critical sections”
- Sort of like a read lock
- Sort of like a reference
- But not associated with specific data
- Locks the existence of all objects reachable at a certain time
- Cheaper, lower-overhead than explicit data-associated locks
Sketch
struct bigvalue { ... };
std::atomic<bigvalue*> val;
spinlock val_lock;
std::deque<std::pair<bigvalue*, time_t>> val_garbage;
time_t current_epoch;
time_t read_epochs[NCPU];
void read() {
// start read-side critical section
read_epochs[this_cpu()] = current_epoch;
bigvalue* v = val.load(std::memory_order_relaxed);
... use `v` arbitrarily ...
// mark completion
read_epochs[this_cpu()] = 0;
... MUST NOT refer to `v` (it might be freed!) ...
}
void modify() {
val_lock.lock();
bigvalue* oldval = val;
bigvalue* newval = new bigvalue;
... compute, initialize ...
val.store(newval);
val_lock.unlock();
val_garbage.push_back({oldval, current_epoch});
// NB should lock `val_garbage`
}
void gc() {
// run periodically
time_t garbage_epoch = min(read_epochs); // ignoring zeroes
++current_epoch;
free all `val_garbage` older than `garbage_epoch`;
}
- Must ensure that all snapshots obtained in a read-side critical section exist until the read-side critical section completes
- Garbage rendered unreachable at time T must not be freed until all read-side critical sections active at time T complete
- Solution: Reading threads record the current time (`current_epoch`) in a global array (`read_epochs`) when beginning a read-side critical section
- Garbage is stored in a list, `val_garbage`
- Each piece of garbage is accompanied by an epoch, initialized with `current_epoch`
- Garbage is freed based on the minimum `read_epochs` entry of any reading thread
- Call this `garbage_epoch`
- Can free garbage with epoch `T` when `T < garbage_epoch`
- Proof sketch
- Garbage recorded with epoch `T` was rendered unreachable during epoch `T`
- That means that only threads with `read_epochs[cpu] <= T` could observe the garbage
- If all threads have `read_epochs[cpu] > T`, then no thread is observing the garbage now or will in the future
- So the garbage is safe to free
- Garbage recorded with epoch
Other mechanisms
- Epoch-based reclamation is fast, but other mechanisms can be even faster
- Quiescent-state based reclamation
- References
- “RCU Usage In the Linux Kernel: One Decade Later.” Paul E. McKenney, Silas Boyd-Wickizer, and Jonathan Walpole.
Read-copy update (RCU) is a scalable high-performance synchronization mechanism implemented in the Linux kernel. RCU’s novel properties include support for concurrent reading and writing, and highly optimized inter-CPU synchronization. Since RCU’s introduction into the Linux kernel over a decade ago its usage has continued to expand. Today, most kernel subsystems use RCU. This paper discusses the requirements that drove the development of RCU, the design and API of the Linux RCU implementation, and how kernel developers apply RCU.
- “Performance of memory reclamation for lockless synchronization.” Thomas E. Hart, Paul E. McKenney, Angela Demke Brown, and Jonathan Walpole.
Achieving high performance for concurrent applications on modern multiprocessors remains challenging. Many programmers avoid locking to improve performance, while others replace locks with non-blocking synchronization to protect against deadlock, priority inversion, and convoying. In both cases, dynamic data structures that avoid locking require a memory reclamation scheme that reclaims elements once they are no longer in use.
The performance of existing memory reclamation schemes has not been thoroughly evaluated. We conduct the first fair and comprehensive comparison of three recent schemes—quiescent-state-based reclamation, epoch-based reclamation, and hazard-pointer-based reclamation—using a flexible microbenchmark. Our results show that there is no globally optimal scheme. When evaluating lockless synchronization, programmers and algorithm designers should thus carefully consider the data structure, the workload, and the execution environment, each of which can dramatically affect the memory reclamation performance.
We discuss the consequences of our results for programmers and algorithm designers. Finally, we describe the use of one scheme, quiescent-state-based reclamation, in the context of an OS kernel—an execution environment which is well suited to this scheme.
- RCU is being standardized in C++26!