Fairness
- Count the number of times a lock is obtained multiple times consecutively by the same thread
2101989 (3156)
2129185 (4436)
2139257 (4579)
2108552 (5182)
- More data
- >96% of lock acquisitions alternate with another thread
- But >0.1% of lock acquisitions occur in a consecutive sequence of >28 acquisitions by the same thread
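A hypothetical sketch of how such counts might be collected, assuming a test-and-set spinlock instrumented to remember its last owner; counting_spinlock and its fields are illustrative, not the code behind the numbers above:

#include <atomic>
#include <thread>

struct counting_spinlock {
    std::atomic<bool> locked_ = false;
    std::thread::id last_owner_;       // protected by the lock itself
    unsigned long acquisitions_ = 0;   // total acquisitions
    unsigned long repeats_ = 0;        // consecutive same-thread reacquisitions
    void lock() {
        while (locked_.exchange(true)) {   // spin until we grab the lock
        }
        ++acquisitions_;
        if (last_owner_ == std::this_thread::get_id()) {
            ++repeats_;                    // same thread acquired again, back to back
        }
        last_owner_ = std::this_thread::get_id();
    }
    void unlock() {
        locked_.store(false);
    }
};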
Solving fairness
- Want a mechanism to serve lock requests in order
- Must define an order
Ticket lock
struct ticket_lock {
    std::atomic<unsigned> now_ = 0;    // ticket currently being served
    std::atomic<unsigned> next_ = 0;   // next ticket to hand out
    void lock() {
        unsigned me = next_++;         // atomically take a ticket
        while (me != now_) {           // spin until my ticket is called
            pause();
        }
    }
    void unlock() {
        now_++;                        // call the next ticket
    }
};
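The pause() called in these spin loops is assumed to be a spin-wait CPU hint; a minimal sketch for x86 (other architectures would use their own hint instruction, or nothing):

#include <immintrin.h>

// Assumed helper: tell the processor we are in a spin-wait loop.
// _mm_pause() emits the x86 PAUSE instruction, which reduces power use
// and memory-order speculation penalties while spinning.
inline void pause() {
    _mm_pause();
}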
- Sleeping? (A ticket lock only spins; there is no obvious way for a waiter to sleep until its turn.)
MCS (Mellor-Crummey Scott) lock
- Paper: “Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors”, Mellor-Crummey and Scott, 1991 (a shorter earlier version appeared first)
- Early high-performance fair lock, good at high contention
- Data structure: A queue of waiting threads, implemented as a singly-linked list
- Shared lock points at the tail of the list (the thread that will get the lock last, if any)
- Threads spin not on shared state, but on local state, which makes spinning much cheaper (no hammering the cache coherence protocol)
- Used widely in Java; a variant was proposed for Linux (used only narrowly due to its high default overhead)
struct mcs_lock {
    struct qentry {                  // user must declare one of these to lock
        std::atomic<qentry*> next;
        std::atomic<bool> blocked;
    };
    std::atomic<qentry*> lk_ = nullptr;         // points at tail of waiter queue
    void lock(qentry& w) {
        w.next = nullptr;
        qentry* prev_tail = lk_.exchange(&w);   // mark self as tail
        if (prev_tail) {                        // previous tail exists
            w.blocked = true;
            prev_tail->next = &w;               // link self after previous tail
            while (w.blocked) {                 // spin on *local* state
                pause();
            }
        }
    }
    void unlock(qentry& w) {
        qentry* expected_tail = &w;
        // strong CAS: a spurious failure here could strand us in the
        // spin below when no successor ever arrives
        if (!w.next
            && lk_.compare_exchange_strong(expected_tail, nullptr)) {
            return;                             // no one else is waiting
        }
        while (!w.next) {                       // wait for successor to link itself
            pause();
        }
        w.next.load()->blocked = false;         // hand the lock to the successor
    }
};
// some function that uses a lock `l`
void f() {
    ...
    mcs_lock::qentry w;    // queue entry lives on the caller's stack
    l.lock(w);
    ...
    l.unlock(w);
    ...
}
About MCS
- Much more overhead!
- Atomic operations and spins in both lock and unlock
- Higher latency, lower throughput
- Relatively high space overhead
- Advantages to this overhead
- Fairness
- Each waiter is spinning on private state, not shared state!
- Fewer cache line conflicts
- How does it work?
  - If the atomic swap (std::atomic<T>::exchange) in lock returns non-nullptr, w is appended to the linked list and its thread spins until it is notified that it is no longer blocked; unlock waits until there is a next waiter in the list, and unblocks it.
  - If the atomic swap returns nullptr, the lock was free, so lock returns immediately.
- MCS lock is optimized for high contention, but in that case you already have performance issues.
Blocking in locks
- Ideally, we’d like a thread blocked on a lock to sleep
- Lower overhead on high contention
- Supports higher thread counts
- But requires kernel involvement
- Pipes? (one possibility, sketched below)
- Kernel-managed lock abstractions?
- Some new primitive?
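For the pipe option, a lock really can block in the kernel using only pipes: lock() reads a byte (sleeping until one is available) and unlock() writes it back. A minimal sketch; pipe_lock is illustrative, and since every contended operation is a system call this is a thought experiment rather than a practical mutex:

#include <unistd.h>

// The pipe holds one byte when the lock is free, zero bytes when held.
struct pipe_lock {
    int fd_[2];
    pipe_lock() {
        (void) pipe(fd_);
        char c = 0;
        (void) write(fd_[1], &c, 1);   // start unlocked
    }
    void lock() {
        char c;
        (void) read(fd_[0], &c, 1);    // blocks in the kernel until unlock
    }
    void unlock() {
        char c = 0;
        (void) write(fd_[1], &c, 1);   // wake one blocked locker
    }
};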
Mutex goals
- Goal: low overhead at low contention, fairness at high contention
- Goal: blocked threads should sleep (requires kernel communication)
- Should lock be a system call?
  - No! System calls are expensive
Futex
- Two-phase design
- Phase 1
- Assume low contention
- No kernel involvement
- User enforces FIFO if desired
- Phase 2
- After observing contention
- Kernel involvement: user blocks
- Kernel enforces FIFO
- (Except a thread can get lucky, and threads can have different priority, etc.)
- User can enforce lock order
- References
  - The futex(2) manual page
  - “Futexes Are Tricky”, by Ulrich Drepper
  - “A futex overview and update”, by Darren Hart
  - Example usage: “Futex based locks for C11’s generic atomics.” Jens Gustedt. [Research Report] RR-8818, INRIA Nancy, 2015
  - “Locking in WebKit”, by Filip Pizlo (not about futexes, but a very clear explanation of this two-phase structure)
Futex example (Drepper’s “Mutex #3”)
struct futex_lock {
    std::atomic<int> val_;
    // 0 = unlocked; 1 = locked, no futex waiters;
    // 2 = locked, maybe futex waiters
    void lock() {
        // phase 1: spin in user space, hoping for low contention
        for (unsigned i = 0; i < 40; i++) {
            int expected = 0;
            if (val_.compare_exchange_weak(expected, 1)) {
                return;
            }
            sched_yield();
        }
        // phase 2: mark the lock contended and sleep in the kernel
        int previous = val_.load(std::memory_order_relaxed);
        if (previous != 2) {
            previous = val_.exchange(2);
        }
        while (previous != 0) {
            // sleep, but only if `val_` still equals 2
            futex(&val_, FUTEX_WAIT, 2);
            previous = val_.exchange(2);
        }
    }
    void unlock() {
        if (--val_ != 0) {        // atomic decrement; nonzero means maybe waiters
            val_ = 0;
            futex(&val_, FUTEX_WAKE, 1);    // wake at most one waiter
        }
    }
};
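The futex() call used above is shorthand: Linux does not expose a futex() function in libc, so a thin wrapper over the raw system call is assumed, roughly:

#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Assumed wrapper for the futex system call. On Linux, std::atomic<int>
// is lock-free and has the int layout the kernel expects.
long futex(std::atomic<int>* addr, int op, int val) {
    return syscall(SYS_futex, reinterpret_cast<int*>(addr), op, val,
                   nullptr, nullptr, 0);
}

FUTEX_WAIT sleeps only if *addr still equals val (here, 2), which closes the race between observing contention and going to sleep.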
- With low contention, the kernel does not get involved; with high contention, threads block
- About 40 user-level tries seems to be empirically optimal, for reasons that are not fully understood
- Switching to a similar strategy increased performance by 10% in WebKit
Readers/writer locks
These are good when reads are much more common than writes. There can be any number of readers or a single writer.
The lock has three states
- unlocked (val_ == 0)
- read locked (val_ > 0)
- write locked (val_ == -1)
struct rw_lock {
    std::atomic<int> val_ = 0;
    void lock_read() {
        int expected = val_;
        // wait until there is no writer, then add one reader
        while (expected < 0
               || !val_.compare_exchange_weak(expected, expected + 1)) {
            pause();
            expected = val_;
        }
    }
    void unlock_read() {
        --val_;                        // remove one reader
    }
    void lock_write() {
        int expected = 0;
        // wait until completely unlocked, then take exclusive ownership
        while (!val_.compare_exchange_weak(expected, -1)) {
            pause();
            expected = 0;
        }
    }
    void unlock_write() {
        val_ = 0;
    }
};
- This readers/writer lock is not fair: it starves writers. A steady stream of readers keeps val_ > 0, so a writer’s compare-exchange from 0 can fail forever.
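A usage sketch, assuming the rw_lock above protects a shared configuration record; config, current_timeout, and set_timeout are illustrative names:

struct config {
    int timeout_ms;
    int retries;
};
config cfg;
rw_lock cfg_lock;

int current_timeout() {
    cfg_lock.lock_read();      // any number of readers at once
    int t = cfg.timeout_ms;
    cfg_lock.unlock_read();
    return t;
}
void set_timeout(int t) {
    cfg_lock.lock_write();     // excludes all readers and other writers
    cfg.timeout_ms = t;
    cfg_lock.unlock_write();
}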
rwlocks: Reducing memory contention
- Give each CPU its own lock
struct rw_lock_2 {
    spinlock f_[NCPU];    // would really want separate cache lines
    void lock_read() {
        // a reader takes only its own CPU's lock
        // (assumes the thread does not migrate before unlock_read)
        f_[this_cpu()].lock();
    }
    void unlock_read() {
        f_[this_cpu()].unlock();
    }
    void lock_write() {
        // a writer must take every CPU's lock
        for (unsigned i = 0; i != NCPU; ++i) {
            f_[i].lock();
        }
    }
    void unlock_write() {
        for (unsigned i = 0; i != NCPU; ++i) {
            f_[i].unlock();
        }
    }
};
- This makes reads fast and writes fairly slow
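The this_cpu() helper is assumed; on Linux it could be approximated with sched_getcpu() (a glibc extension), keeping in mind that the thread may migrate between lock_read and unlock_read, which this sketch ignores:

#include <sched.h>

// Assumed helper: index of the CPU the calling thread is running on.
inline unsigned this_cpu() {
    int cpu = sched_getcpu();
    return cpu >= 0 ? static_cast<unsigned>(cpu) : 0;   // fall back on error
}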
Write balance in practice
- Many structures in practice have extremely unbalanced workloads
- Many more reads than writes
- For some structures (e.g., OS configuration), millions of times more
- Cost of rwlocks?
- Can we reduce the cost of reads even further?
RCU (Read-Copy Update)
- Goal: zero-op read locks
- Surprisingly, this ends up becoming a garbage collection problem!
Zero-op read locks: the simple case
- Let’s say we wanted to read and write a single integer
Reading and writing integers safely
- Use atomics!
- std::memory_order_relaxed indicates that no ordering relative to other memory operations is required, only atomicity
std::atomic<int> val;

int read() {
    return val.load(std::memory_order_relaxed);
}
void write(int x) {
    val.store(x, std::memory_order_relaxed);
}
- Commonly want more exclusion on writes
std::atomic<int> val;
spinlock val_lock;

int read() {
    return val.load(std::memory_order_relaxed);
}
void modify() {
    val_lock.lock();
    ... compute with current `val`; other writers are excluded ...
    val.store(x, std::memory_order_relaxed);   // `x` is the newly computed value
    val_lock.unlock();
}
- Works for ≤8-byte objects that are aligned (not crossing cache line boundaries)
- What about larger objects?
Idea: Use atomics to name versions
- Read: Obtain a pointer to the version
- That version will never change
- Write: Create a new version, install a pointer to it
struct bigvalue { ... };
std::atomic<bigvalue*> val;
spinlock val_lock;

bigvalue* read_snapshot() {
    // a relaxed load works in practice: dereferences of the returned
    // pointer depend on its value (Linux RCU relies on this)
    return val.load(std::memory_order_relaxed);
}
void modify() {
    val_lock.lock();
    bigvalue* newval = new bigvalue;
    ... compute with `val`; initialize `newval` ...
    // release ordering publishes `newval` only after its initialization
    val.store(newval, std::memory_order_release);
    val_lock.unlock();
}
Memory allocation and RCU
- When can we delete the old version?
- This is a garbage collection problem!
- Don’t want a full garbage collector
- Even highly concurrent garbage collectors can “stop the world” for some fraction of their runtime
- Explicit memory deallocation is more efficient
- Timely memory deallocation can be important to avoid running out of memory
Idea: Epoch-based reclamation
- Track “read-side critical sections”
- Sort of like a read lock
- Sort of like a reference
- But not associated with specific data
- Guarantees the continued existence of all objects reachable at a certain time
- Cheaper and lower-overhead than explicit data-associated locks
Sketch
struct bigvalue { ... };
std::atomic<bigvalue*> val;
spinlock val_lock;
std::deque<std::pair<bigvalue*, time_t>> val_garbage;
time_t current_epoch;
time_t read_epochs[NCPU];    // 0 = no read-side critical section active

void read() {
    // start read-side critical section
    read_epochs[this_cpu()] = current_epoch;
    bigvalue* v = val.load(std::memory_order_relaxed);
    ... use `v` arbitrarily ...
    // mark completion
    read_epochs[this_cpu()] = 0;
    ... MUST NOT refer to `v` (it might be freed!) ...
}
void modify() {
    val_lock.lock();
    bigvalue* oldval = val;
    bigvalue* newval = new bigvalue;
    ... compute, initialize ...
    val.store(newval);
    val_lock.unlock();
    val_garbage.push_back({oldval, current_epoch});
    // NB should lock `val_garbage`
}
void gc() {
    // run periodically
    time_t garbage_epoch = min(read_epochs);   // ignoring zeroes
    ++current_epoch;
    free all `val_garbage` older than `garbage_epoch`;
}
- Must ensure that all snapshots obtained in a read-side critical section exist until the read-side critical section completes
- Garbage rendered unreachable at time T must not be freed until all read-side critical sections active at time T complete
- Solution: Reading threads record the current time (current_epoch) in a per-CPU global variable (read_epochs) when beginning a read-side critical section
- Garbage is stored in a list, val_garbage
  - Each piece of garbage is accompanied by an epoch initialized with current_epoch
- Garbage is freed based on the minimum recorded read epoch of any reading thread
  - Call this garbage_epoch
  - Can free garbage with epoch T when T < garbage_epoch
- Proof sketch
  - Garbage recorded with epoch T was rendered unreachable during epoch T
  - That means that only threads with read epoch <= T could observe the garbage
  - If all threads have read epoch > T, then no thread is observing the garbage now, nor will any in the future
  - So the garbage is safe to free
Other mechanisms
- Epoch-based reclamation is fast; other mechanisms can be even faster
- Quiescent-state-based reclamation (see the sketch below)
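A hypothetical sketch of the quiescent-state idea (names illustrative; NCPU and this_cpu() as in the sketches above): instead of recording an epoch around every read, each thread periodically announces that it is quiescent, i.e., that it currently holds no references to shared objects.

#include <atomic>

std::atomic<unsigned long> reclaim_epoch = 1;     // bumped by the reclaimer
std::atomic<unsigned long> quiescent_epoch[NCPU];

void on_quiescent_state() {    // called when a thread holds no references,
                               // e.g., at the top of an event loop
    quiescent_epoch[this_cpu()] = reclaim_epoch.load();
}

bool safe_to_free(unsigned long unlink_epoch) {
    // Garbage unlinked during `unlink_epoch` may be freed once every
    // thread has announced quiescence in a strictly later epoch.
    for (unsigned i = 0; i != NCPU; ++i) {
        if (quiescent_epoch[i].load() <= unlink_epoch) {
            return false;
        }
    }
    return true;
}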
References
“RCU Usage In the Linux Kernel: One Decade Later.” Paul E. McKenney, Silas Boyd-Wickizer, and Jonathan Walpole. Link
Read-copy update (RCU) is a scalable high-performance synchronization mechanism implemented in the Linux kernel. RCU’s novel properties include support for concurrent reading and writing, and highly optimized inter-CPU synchronization. Since RCU’s introduction into the Linux kernel over a decade ago its usage has continued to expand. Today, most kernel subsystems use RCU. This paper discusses the requirements that drove the development of RCU, the design and API of the Linux RCU implementation, and how kernel developers apply RCU.
“Performance of memory reclamation for lockless synchronization.” Thomas E. Hart, Paul E. McKenney, Angela Demke Brown, and Jonathan Walpole. Link
Achieving high performance for concurrent applications on modern multiprocessors remains challenging. Many programmers avoid locking to improve performance, while others replace locks with non-blocking synchronization to protect against deadlock, priority inversion, and convoying. In both cases, dynamic data structures that avoid locking require a memory reclamation scheme that reclaims elements once they are no longer in use.
The performance of existing memory reclamation schemes has not been thoroughly evaluated. We conduct the first fair and comprehensive comparison of three recent schemes—quiescent-state-based reclamation, epoch-based reclamation, and hazard-pointer-based reclamation—using a flexible microbenchmark. Our results show that there is no globally optimal scheme. When evaluating lockless synchronization, programmers and algorithm designers should thus carefully consider the data structure, the workload, and the execution environment, each of which can dramatically affect the memory reclamation performance.
We discuss the consequences of our results for programmers and algorithm designers. Finally, we describe the use of one scheme, quiescent-state-based reclamation, in the context of an OS kernel—an execution environment which is well suited to this scheme.