Notes by Thomas Lively
Schedule
- Locking characteristics and throughput vs latency
- Simple lock
- Ticket lock
- Mellor-Crummey Scott (MCS) lock
- Read-write locking
- Read Copy Update (RCU)
Locking/synchronization in practice
These parameters of the synchronization problem affect what synchronization solutions work best. The best locks work well across a wide range of parameters.
- Contention level
- High contention = many (>1) interested waiters
- high contention is Bad(tm) (i.e. low performance)
- Fairness doesn't matter for low-contention locks
- Fairness does matter for high-contention locks!
- Latency matters for low-contention locks (it's a shame to take a long time to acquire a lock that nobody else holds)
- Coarse-grained vs fine-grained locking
- coarse-grained: few locks each protecting lots of data
- fine-grained: many locks each protecting little data
- coarse-grained: higher contention, less space/complexity overhead
- fine-grained: lower contention, more space/complexity overhead
- Throughput vs latency
- throughput: lock acquisitions / time (higher is better)
- latency: time to acquire a lock (lower is better)
- Some designs have high throughput but also high latency!
- Reads and writes
- counters: writes >> reads (think about sloppy counters from section)
- pipes: all ops are writes ("reads" modify state)
- proc state: reads ~= writes
- config info: reads >> writes
- want to engineer solutions optimized for specific read/write patterns
- Number of threads
- highest raw performance solutions use #threads = #cores
- that is also typical for kernel
- sometimes have #threads > #cores for programming convenience
- example: web browsers
Simple test-and-set lock
struct spinlock {
    std::atomic_flag f_ = ATOMIC_FLAG_INIT;
    void lock() {
        while (f_.test_and_set()) { // spin until we set the flag ourselves
            pause();
        }
    }
    void unlock() {
        f_.clear();
    }
};
Does the pause instruction optimize for latency or throughput?
- throughput, because it reduces the number of threads spinning and reading simultaneously. pause won't necessarily help acquire the lock faster, though.
What if we use sched_yield() instead of the pause instruction?
- this would be better if #threads > #cores, because it allows some other useful work to be scheduled, but if #threads == #cores, this is just extra overhead.
Ticket lock
struct ticket_lock {
    std::atomic<unsigned> now_ = 0;
    std::atomic<unsigned> next_ = 0;
    void lock() {
        unsigned me = next_++;   // take the next ticket
        while (me != now_) {     // wait until our ticket is served
            pause();
        }
    }
    void unlock() {
        now_++;                  // serve the next ticket
    }
};
More scalable because shared data is only written on unlock; the spin in lock only reads.
What if we use msleep(10 * (me - now_)) instead of pause or sched_yield()?
- This would reduce spinning (it is real blocking!), but if 10ms is too long this will hurt both latency and throughput.
MCS (Mellor-Crummey Scott) lock
This is one of the first high-performance fair locks for high-contention scenarios. It’s very clever! Its basic data structure is a queue of waiting threads, implemented as a singly-linked list. The shared lock points at the tail of the list (the thread that will get the lock last, if any). Threads spin not on shared state, but on local state, which makes spinning much cheaper (it doesn't hammer the memory bus).
“Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors.” John M. Mellor-Crummey and Michael L. Scott. ACM Transactions on Computer Systems 9(1). Link; Shorter earlier version
This lock is used in Java
It was proposed for Linux (the proposal write-up has a good series of example diagrams), but not adopted (due to high overhead)
struct mcs_lock {
    struct qentry { // user must declare one of these to lock
        std::atomic<qentry*> next;
        std::atomic<bool> locked;
    };
    std::atomic<qentry*> lk_ = nullptr; // points at tail
    void lock(qentry& w) {
        w.next = nullptr;
        qentry* prev_tail = lk_.exchange(&w); // mark self as tail
        if (prev_tail) { // previous tail exists
            w.locked = true;
            prev_tail->next = &w; // link to previous tail
            while (w.locked) {
                pause();
            }
        }
    }
    void unlock(qentry& w) {
        if (!w.next) { // maybe become unlocked
            qentry* expected = &w;
            // strong CAS: a spurious failure would leave us spinning forever
            if (lk_.compare_exchange_strong(expected, nullptr)) {
                return;
            }
            while (!w.next) { // wait for next waiter to link self
                pause();
            }
        }
        w.next->locked = false;
    }
};
// some function that uses a lock `l`
f() {
...
mcs_lock::qentry w;
l.lock(w);
...
l.unlock(w);
...
}
There is much more overhead here. Notice the atomic ops and spins in both lock and unlock.
How does this lock work?
- if the atomic swap (std::atomic&lt;T&gt;::exchange) in lock returns non-nullptr, w is appended to the linked list and spins until it is notified that it is no longer locked
- unlock waits until there is a next waiter in the list, and unlocks it
Each waiter is spinning on private state, not shared state!
What if we call sched_yield() in the spin?
- This would reduce overhead, but could decrease throughput considerably.
Unfortunately, this lock is optimized for high contention, but in that case you already have performance issues.
Futex
Let's put the waiting part of a lock in kernel space to decrease overhead!
Goals: no overhead in low contention case, fairness in high contention case. No overhead means we can’t go straight to the kernel—system calls are expensive! We must divide locking into two phases. In the first phase, we lock under a low-contention assumption, without calling into the kernel. In the second phase, we have observed contention, so we block, calling into the kernel (and obtaining fairness).
This code isn’t right (the futex interface is complicated!) but it shows the idea. For a very clear description of the reasons for this biphase structure (and a futexless implementation), read the futex references below:
struct spinlock_futex {
    std::atomic&lt;unsigned char&gt; val_;
    // 0 = unlocked; 1 = locked
    // or'ed with bit 2 == futex_mode
    void lock() {
        // phase 1
        for (unsigned i = 0; i &lt; 40; i++) {
            if (val_.compare_exchange_weak(0, 1)) {
                return;
            }
            sched_yield();
        }
        // phase 2: enter futex mode
        do {
            val_ |= 2; // atomic operation
            futex(WAIT, &amp;val_, /* block unless val_ != */ 3);
        } while (!val_.compare_exchange_weak(2, 3));
    }
    void unlock() {
        while (!val_.compare_exchange_weak(1, 0)) {
            if (val_ &amp; 2) {
                // someone is waiting on the futex; wake them up
                futex(WAKE, &amp;val_, /* atomically change val_ to */ 2);
                return;
            }
        }
    }
};
Two locks: user lock and kernel lock. The futex syscall links the two.
With low contention the kernel does not get involved. With high contention the kernel handles fair wakeup. The 40 user-level tries is an empirically chosen constant.
Switching to a similar strategy increased performance by 10% in WebKit.
Read-write locks
These are good when reads are much more common than writes. There can be any number of readers or a single writer.
The lock has three states
- unlocked (val_ == 0)
- read locked (val_ > 0)
- write locked (val_ == -1)
struct rw_lock {
    std::atomic&lt;int&gt; val_ = 0;
    void lock_read() {
        int x = val_;
        while (x &lt; 0 || !val_.compare_exchange_weak(x, x + 1)) {
            pause();
            x = val_;
        }
    }
    void unlock_read() {
        --val_;
    }
    void lock_write() {
        int expected = 0;
        while (!val_.compare_exchange_weak(expected, -1)) {
            pause();
            expected = 0; // a failed CAS overwrites expected
        }
    }
    void unlock_write() {
        val_ = 0;
    }
};
This read-write lock is not fair (it starves writers).
How can we reduce memory contention?
- Give each thread its own val_.
struct rw_lock_2 {
    spinlock f_[NCPU]; // would really want separate cache lines
    void lock_read() {
        f_[this_cpu()].lock();
    }
    void unlock_read() {
        f_[this_cpu()].unlock();
    }
    void lock_write() {
        for (unsigned i = 0; i != NCPU; ++i) {
            f_[i].lock();
        }
    }
    void unlock_write() {
        for (unsigned i = 0; i != NCPU; ++i) {
            f_[i].unlock();
        }
    }
};
This makes reads super duper fast and writes fairly slow.
RCU (Read-Copy-Update)
Goal: zero-op locks.
To be continued!
References
Futexes
A futex overview and update by Darren Hart
Example usage: “Futex based locks for C11’s generic atomics.” Jens Gustedt. [Research Report] RR-8818, INRIA Nancy, 2015. Link
RCU
“RCU Usage In the Linux Kernel: One Decade Later.” Paul E. McKenney, Silas Boyd-Wickizer, and Jonathan Walpole. Link
Read-copy update (RCU) is a scalable high-performance synchronization mechanism implemented in the Linux kernel. RCU’s novel properties include support for concurrent reading and writing, and highly optimized inter-CPU synchronization. Since RCU’s introduction into the Linux kernel over a decade ago its usage has continued to expand. Today, most kernel subsystems use RCU. This paper discusses the requirements that drove the development of RCU, the design and API of the Linux RCU implementation, and how kernel developers apply RCU.