Lecture 20 – CS 161 2018

Notes by Thomas Lively

Schedule

Locking characteristics and throughput vs latency
Simple lock
Ticket lock
Mellor-Crummey Scott (MCS) lock
Read-write locking
Read Copy Update (RCU)

Locking/synchronization in practice

These parameters of the synchronization problem affect what synchronization solutions work best. The best locks work well across a wide range of parameters.

Contention level
- High contention = many (>1) interested waiters
- high contention is Bad(tm) (i.e. low performance)
- Fairness doesn't matter for low contention lock
- Fairness does matter for high contention locks!
- Latency matters for low contention locks (a shame to take a long time to acquire a lock that is not locked by anyone)
Coarse-grained vs fine-grained locking
- coarse-grained: few locks each protecting lots of data
- fine-grained: many locks each protecting little data
- coarse-grained: higher contention, less space/complexity overhead
- fine-grained: lower contention, more space/complexity overhead
Throughput vs latency
- throughput: lock acquisitions / time (higher is better)
- time: time to acquire a lock (lower is better)
- Some designs have high throughput but also high latency!
Reads and writes
- counters: writes >> reads (think about sloppy counters from section)
- pipes: all ops are writes ("reads" modify state)
- proc state: reads ~= writes
- config info: reads >> writes
- want to engineer solutions optimized for specific read/write patterns
Number of threads
- highest raw performance solutions use #threads = #cores
- that is also typical for kernel
- sometimes have #threads > #cores for programming convenience
- example: web browsers

Simple test and set lock

struct spinlock {
    std::atomic_flag f_;

    void lock() {
        while (f_.test_and_set()) {
            pause();
        }
    }
    void unlock() {
        f_.clear();
    }
};

Does the pause instruction optimize for latency or throughput?

throughput, because it reduces the number of threads spinning and reading simultaneously. pause won't necessarily help acquire the thread faster, though.

What if we use sched_yield() instead of the pause instruction?

this would be better if #threads > #cores, because it allows some other useful work to be scheduled, but if #threads == #cores, this is just extra overhead.

Ticket lock

struct ticket_lock {
    std::atomic<unsigned> now_;
    std::atomic<unsigned> next_;

    void lock() {
        unsigned me = next_++;
        while (me != now_) {
            pause();
        }
    }
    void unlock() {
        now_++;
    }
};

More scalable because shared data is only written on unlock and the spin in lock only reads.

What if we use msleep(10 * (me - now_)) instead of pause or sched_yield()?

This would reduce spinning (it is real blocking!), but if 10ms is too long this will hurt both latency and throughput.

MCS (Mellor-Crummey Scott) lock

This is one of the first high-performance fair locks for high-contention scenarios. It’s very clever! Its basic data structure is a queue of waiting threads, implemented as a singly-linked list. The shared lock points at the tail of the list (the thread that will get the lock last, if any). Threads spin not on shared state, but on local state, which makes spinning much cheaper (it doesn't hammer the memory bus).

“Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors.” John M. Mellor-Crummey and Michael L. Scott. ACM Transactions on Computer Systems 9(1). Link; Shorter earlier version

This lock is used in Java

It was proposed for Linux (that article also has a good series of example diagrams), but not adopted (due to high overhead)

struct mcs_lock {
    struct qentry {    // user must declare one of these to lock
        std::atomic<qentry*> next;
        std::atomic<bool> locked;
    };

    std::atomic<qentry*> lk_;   // points at tail

    void lock(qentry& w) {
        w.next = nullptr;
        qentry* prev_tail = lk_.exchange(&w);  // mark self as tail
        if (prev_tail) {                    // previous tail exists
            w.locked = true;
            prev_tail->next = &w;          // link to previous tail
            while (w.locked) {
                pause();
            }
        }
    }
    void unlock(qentry& w) {
        if (!w.next) {                    // maybe become unlocked
            if (lk_.compare_exchange_weak(&w, nullptr)) {
                return;
            }
            while (!w.next) {   // wait for next tail to link self
                pause();
            }
        }
        w.next->locked = false;
    }
}

// some function that uses a lock `l`
f() {
    ...
    mcs_lock::qentry w;
    l.lock(w);
    ...
    l.unlock(w);
    ...
}

There is much more overhead here. Notice the atomic ops and spins in both lock and unlock.

How does this lock work?

if atomic swap (std::atomic<T>::exchange) in lock returns non-nullptr, w is appended to linked list and spins until it is notified that it is no longer locked.
unlock waits until there is a next waiter in the list, and unlocks it

Each waiter is spinning on private state, not shared state!

What if we call sched_yield() in spin?

This would reduce overhead, but could decrease throughput considerably.

Unfortunately, this lock is optimized for high contention, but in that case you already have performance issues.

Futex

Let's put the waiting part of a lock in kernel space to decrease overhead!

Goals: no overhead in low contention case, fairness in high contention case. No overhead means we can’t go straight to the kernel—system calls are expensive! We must divide locking into two phases. In the first phase, we lock under a low-contention assumption, without calling into the kernel. In the second phase, we have observed contention, so we block, calling into the kernel (and obtaining fairness).

This code isn’t right (the futex interface is complicated!) but it shows the idea. For a very clear description of the reasons for this biphase structure (and a futexless implementation), read:

“Locking in WebKit” by Filip Pizlo

struct spinlock_futex {
    std::atomic<unsigned char> val_;
    // 0 = unlocked; 1 = locked
    // or'ed with bit 2 == futex_mode

    void lock() {
        // phase 1
        for (unsigned i = 0; i < 40; i++) {
            if (val_.compare_exchange_weak(0, 1)) {
                return;
            }
            sched_yield();
        }

        // phase 2: enter futex mode
        do {
            val_ |= 2;   // atomic operation
            futex(WAIT, &val_, /* block unless val_ != */ 3);
        } while (!val_.compare_exchange_weak(2, 3))
    }

    void unlock() {
        while (!val_.compare_exchange_weak(1, 0)) {
            if (val_ & 2) {
                // someone is waiting on the futex; wake them up
                futex(WAKE, &val_, /* atomically change val_ to */ 2);
                return;
            }
        }
    }
};

Two locks: user lock and kernel lock. futex syscall links the two.

With low contention kernel does not get involved. With high contention kernel handles fair wakeup. 40 user level tries is optimal for some reason.

Switching to a similar strategy increased performance by 10% in WebKit.

Read-write locks

These are good when reads are much more common than writes. There can be any number of readers or a single writer.

The lock has three states

unlocked (val_ == 0)
read locked (val_ > 0)
write locked (val_ == -1)

struct rw_lock {
    std::atomic<int> val_;

    void lock_read() {
        int x = val_;
        while (x < 0 || !val_.compare_exchange_weak(x, x + 1)) {
            pause();
            x = val_;
        }
    }
    void unlock_read() {
        --val_;
    }

    void lock_write() {
        while (!val_.compare_exchange_weak(0, -1)) {
            pause();
        }
    }
    void unlock_write() {
        val_ = 0;
    }
};

This read-write lock is not fair (it starves writers).

How can we reduce memory contention?

Give each thread its own val_.

struct rw_lock_2 {
    spinlock f_[NCPU];     // would really want separate cache lines

    void lock_read() {
        f_[this_cpu()].lock();
    }
    void unlock_read() {
        f_[this_cpu()].unlock();
    }

    void lock_write() {
        for (unsigned i = 0; i != NCPU; ++i) {
            f_[i].lock();
        }
    }
    void unlock_write() {
        for (unsigned i = 0; i != NCPU; ++i) {
            f_[i].unlock();
        }
    }
};

This makes reads super duper fast and writes fairly slow.

RCU (Read-Copy-Update)

Goal: zero-op locks.

To be continued!

References

Futexes

The futex(2) manual page

A futex overview and update by Darren Hart

Example usage: “Futex based locks for C11’s generic atomics.” Jens Gustedt. [Research Report] RR-8818, INRIA Nancy, 2015. Link

RCU

“RCU Usage In the Linux Kernel: One Decade Later.” Paul E. McKenney, Silas Boyd-Wickizer, and Jonathan Walpole. Link

Read-copy update (RCU) is a scalable high-performance synchronization mechanism implemented in the Linux kernel. RCU’s novel properties include support for concurrent reading and writing, and highly optimized inter-CPU synchronization. Since RCU’s introduction into the Linux kernel over a decade ago its usage has continued to expand. Today, most kernel subsystems use RCU. This paper discusses the requirements that drove the development of RCU, the design and API of the Linux RCU implementation, and how kernel developers apply RCU.

Lecture 20: Scalable and read-write locks