Lecture 23: Synchronization III

Blocking in locks

Mutex goals

Futex

Futex example (Drepper’s “Mutex #3”)

struct futex_lock {
    std::atomic<int> val_;
    // 0 = unlocked; 1 = locked, no futex waiters;
    // 2 = locked, maybe futex waiters

    void lock() {
        // phase 1
        for (unsigned i = 0; i != 40; ++i) {
            int expected = 0;
            if (val_.compare_exchange_weak(expected, 1)) {
                return;
            }
            sched_yield();
        }

        // phase 2
        int previous = val_.load(std::memory_order_relaxed);
        if (previous != 2) {
            previous = val_.exchange(2);
        }
        while (previous != 0) {
            futex(&val_, FUTEX_WAIT, 2);
            previous = val_.exchange(2);
        }
    }

    void unlock() {
        // `--val_` is an atomic decrement that returns the new value.
        // If that value is not 0, the old value was 2: there may be
        // futex waiters, so clear the lock and wake one of them.
        if (--val_ != 0) {
            val_ = 0;
            futex(&val_, FUTEX_WAKE, 1);
        }
    }
};
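
The code above calls futex() as if it were an ordinary library function, but glibc provides no futex() wrapper; the futex(2) manual page says to invoke the system call via syscall(2). Here is a minimal sketch of the helper the lecture code presumes (the name matches the calls above, but the exact signature is this writeup’s assumption):

#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Thin wrapper over the raw futex system call, covering only the
// three-argument form used by futex_lock. The futex word must be a
// 32-bit integer; on Linux, std::atomic<int> qualifies.
long futex(std::atomic<int>* uaddr, int futex_op, int val) {
    return syscall(SYS_futex, reinterpret_cast<int*>(uaddr), futex_op, val,
                   nullptr, nullptr, 0);
}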

Futex system call semantics

Futex treasure hunt

  1. Describe an execution (a sequence of lock & unlock operations, attributed to specific threads) where val_ never becomes 2

  2. Describe an execution where val_ takes values 0 → 1 → 2 → 1 → 0 → 2 → 1 → 0, and say where the futex system calls occur

  3. Find an optimization in the code that avoids an atomic operation in the common case (i.e., code that could be removed without breaking semantics)

  4. Describe a missing optimization that could avoid useless atomic operations on a contended lock

  1. If the lock is never acquired, then val_ is always 0! Or, if the lock is acquired once and then never released, it takes values 0 → 1. Any uncontended execution, one in which every attempt to acquire the lock happens while the lock is unlocked, will alternate between the values 0 and 1.

  2. Here’s an example sequence:

    Step | Thread | Lock holder | Description | val_ | futex calls
    0. | – | – | The lock is initially unlocked | 0 | –
    1. | A | A | Calls lock(); val_.compare_exchange_weak succeeds | 1 | –
    2. | B | A | Calls lock(); val_.compare_exchange_weak fails 40x | – | –
    3. | B | A | val_.exchange(2) | 2 | –
    4. | B | A | Blocks | – | futex(&val_, FUTEX_WAIT, 2)
    5. | A | A | Calls unlock(); --val_ | 1 | –
    6. | A | – | Sets val_ = 0 | 0 | futex(&val_, FUTEX_WAKE, 1)
    7. | B | B | Unblocks, performs val_.exchange(2); acquires lock because previous == 0 | 2 | –
    8. | B | B | Calls unlock(); --val_ | 1 | –
    9. | B | – | Sets val_ = 0 | 0 | futex(&val_, FUTEX_WAKE, 1)

  3. The entire “Phase 1” loop is an optimization! So is this code at the start of Phase 2:

    int previous = val_.load(std::memory_order_relaxed);
    if (previous != 2) {
        previous = val_.exchange(2);
    }
    

    The following would be equivalent. The optimized version, however, avoids an atomic memory write in the expected case: Phase 2 is reached only when the lock is contended, and on a contended lock val_ is usually already 2.

    int previous = val_.exchange(2);
    
  4. Arguably, in Phase 1, if val_’s previous value is observed to be 2, the locking thread should stop spinning and go straight to Phase 2: a value of 2 means the lock is held and other threads may already be waiting, so further compare-exchange attempts are likely wasted atomic traffic.

    for (unsigned i = 0; i < 40; ++i) {
        int expected = 0;
        if (val_.compare_exchange_weak(expected, 1)) {
            return;
        } else if (expected == 2) {
            break;
        }
        sched_yield();  // or `pause()`
    }
    

Futex code (repeated)

struct futex_lock {
    std::atomic<int> val_;
    // 0 = unlocked; 1 = locked, no futex waiters;
    // 2 = locked, maybe futex waiters

    void lock() {
        // phase 1
        for (unsigned i = 0; i != 40; ++i) {
            int expected = 0;
            if (val_.compare_exchange_weak(expected, 1)) {
                return;
            }
            sched_yield();   // or `pause()`
        }

        // phase 2
        int previous = val_.load(std::memory_order_relaxed);
        if (previous != 2) {
            previous = val_.exchange(2);
        }
        while (previous != 0) {
            futex(&val_, FUTEX_WAIT, 2);
            previous = val_.exchange(2);
        }
    }

    void unlock() {
        if (--val_ != 0) {    // atomic decrement
            val_ = 0;
            futex(&val_, FUTEX_WAKE, 1);
        }
    }
};

Futex system call implementation

A futex is a 32‐bit value—referred to below as a futex word—whose address is supplied to the futex() system call. (Futexes are 32 bits in size on all platforms, including 64‐bit systems.) All futex operations are governed by this value. … [T]he futex word is used to connect the synchronization in user space with the implementation of blocking by the kernel. Analogously to an atomic compare‐and‐exchange operation that potentially changes shared memory, blocking via a futex is an atomic compare‐and‐block operation.
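
The “compare-and-block” behavior is observable directly: FUTEX_WAIT returns immediately with EAGAIN, rather than sleeping, when the futex word does not hold the expected value at the time of the call. A standalone, Linux-only sketch (the variable name is illustrative):

#include <atomic>
#include <cassert>
#include <cerrno>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

std::atomic<int> word{1};

int main() {
    // `word` holds 1, not 2, so the kernel's atomic check fails and
    // FUTEX_WAIT returns at once instead of blocking. This is what lets
    // futex_lock::lock() avoid sleeping through a wakeup that lands
    // between its exchange(2) and its futex() call.
    long r = syscall(SYS_futex, reinterpret_cast<int*>(&word),
                     FUTEX_WAIT, 2, nullptr, nullptr, 0);
    assert(r == -1 && errno == EAGAIN);
}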

Futex system call design

Write balance in practice

RCU (Read-Copy Update)

Zero-op read locks: the simple case

The simplest case: a single shared integer.

Reading and writing integers safely

std::atomic<int> val;

// Variant 1: relaxed atomic loads and stores. Every read returns some
// value that was actually written; there is no tearing.
int read() {
    return val.load(std::memory_order_relaxed);
}

void write(int x) {
    val.store(x, std::memory_order_relaxed);
}

std::atomic<int> val;
spinlock val_lock;

// Variant 2: reads remain lock-free; the lock serializes writers so that
// read-modify-write updates do not interleave with each other.
int read() {
    return val.load(std::memory_order_relaxed);
}

void modify() {
    val_lock.lock();
    ... compute a new value `x` from the current `val`; other writers are excluded ...
    val.store(x, std::memory_order_relaxed);
    val_lock.unlock();
}

Idea: Use atomics to name versions

struct bigvalue { ... };
std::atomic<bigvalue*> val;
spinlock val_lock;

bigvalue* read_snapshot() {
    return val.load(std::memory_order_relaxed);
}

void modify() {
    val_lock.lock();
    bigvalue* newval = new bigvalue;
    ... compute with `val`; initialize `newval` ...
    val.store(newval, std::memory_order_relaxed);
    val_lock.unlock();
}
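
Note that modify() never frees the bigvalue that val previously pointed to, and it cannot safely do so here: a concurrent reader may already have loaded that pointer from val and may still be using it. Deciding when the old version can be reclaimed is the problem addressed next.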

Memory allocation and RCU

Idea: Epoch-based reclamation

Sketch

struct bigvalue { ... };
std::atomic<bigvalue*> val;
spinlock val_lock;
std::deque<std::pair<bigvalue*, time_t>> val_garbage;
time_t current_epoch;
time_t read_epochs[NCPU];

void read() {
    // start read-side critical section
    read_epochs[this_cpu()] = current_epoch;

    bigvalue* v = val.load(std::memory_order_relaxed);
    ... use `v` arbitrarily ...

    // mark completion
    read_epochs[this_cpu()] = 0;

    ... MUST NOT refer to `v` (it might be freed!) ...
}

void modify() {
    val_lock.lock();
    bigvalue* oldval = val;
    bigvalue* newval = new bigvalue;
    ... compute, initialize ...
    val.store(newval);
    val_lock.unlock();

    val_garbage.push_back({oldval, current_epoch});
    // NB should lock `val_garbage`
}

void gc() {
    // run periodically
    time_t garbage_epoch = min(read_epochs); // ignoring zeroes
    ++current_epoch;
    free all `val_garbage` older than `garbage_epoch`;
}
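
As a sketch, this omits details a real implementation must handle: read_epochs[] and current_epoch would themselves be atomics, with enough ordering that a reader’s epoch announcement becomes visible before it dereferences the pointer; val_garbage needs its own lock, as the comment notes; and the per-CPU read_epochs slots assume a reader is not preempted or migrated in the middle of its critical section.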

Other mechanisms