Notes by Thomas Lively
Front matter
Section tomorrow on multicore scalability
- Barrelfish: you need to redesign your kernel to be scalable
- Linux: nope
Multiprocessors
Shared-memory multiprocessors
- Low-level caches are not shared. Cache coherence protocols keep the cached copies consistent.
- Goal: If two processors access same memory, get same result
- This is more useful than C's approach of simply calling it undefined behavior (UB)
MESI protocol
- Modified (value in cache is newer than in memory, no other cache has value)
- Exclusive (value in cache is same as in memory, no other cache has value)
- Shared (value in cache is same as in memory, other caches may have value)
- Invalid (not cached)
No state allows a cache to hold out-of-date data
- Shared or Invalid to Modified involves all other caches being invalidated
- Exclusive to Modified is super cheap
Having Modified state allows for batching of writes
Programming consideration
Cache line bouncing is super expensive
Memory access is expensive when a line is contended (many cores writing)
Fairness-efficiency tradeoff
- It's faster for CPUs owning cache lines to keep them rather than let another CPU in
Want frequently read data to stay in the Shared state, so it shouldn't share cache lines with write-heavy data (sharing a line that way is called false sharing)
How to debug false sharing?
- Having heard that this problem exists is good
- Julia Evans has a comic (see Perf section)
Lock implementation
struct spinlock {           // renamed: a member can't share its class's name
    int v_;
    void lock() {
        // atomic_test_and_set stands in for an atomic hardware instruction
        while (atomic_test_and_set(&v_, 1) == 1) {
            pause(); // "relax CPU until something happens"
        }
    }
    void unlock() {
        v_ = 0;
    }
};
Need special atomic instructions, not normal reads and writes.
The Chickadee spinlock uses the std::atomic_flag type, which is specialized for this purpose (although in practice on x86 it's implemented the same way):
struct spinlock {
    std::atomic_flag f_;
    void lock() {
        while (f_.test_and_set()) {
            pause();
        }
    }
    void unlock() {
        f_.clear();
    }
};
Memory Models
initial conditions: x = y = 0

    TA                TB
    (A1) x = 1        (B1) y = 1
    (A2) t1 = y       (B2) t2 = x
t1 = t2 = 1
- given by A1, B1, A2, B2
t1 = 0, t2 = 1
- given by A1, A2, B1, B2
t1 = 1, t2 = 0
- given by B1, B2, A1, A2
But what about t1 = t2 = 0?
- compilers and processors are allowed to do out of order execution!
- given by A2, B2, A1, B1 - This is weird but happens!
How does this work with the MESI protocol?
processor does a store
- store enqueued on store buffer
- in parallel with processor:
- pop store buffer
- Run the MESI/MOESI protocol on that cache line
- Perform store
Need hardware support for atomics because a processor cannot see into other processors' store buffers
TSO - total store order model: All processors agree on order of writes to main memory
- x86-TSO paper formalizes these semantics
Related: Hans Boehm's paper "Threads Cannot Be Implemented as a Library"
Lock implementation continued
Ticket lock
struct tlock {
    std::atomic<unsigned> now_;
    std::atomic<unsigned> next_;
    void lock() {
        unsigned my_ticket = next_++;
        while (my_ticket != now_.load(std::memory_order_relaxed)) {
            // `my_ticket != now_` works fine too
            pause();
        }
    }
    void unlock() {
        ++now_;
    }
};