Notes by Thomas Lively
Front matter
Section tomorrow on multicore scalability
- Barrelfish: you need to redesign your kernel to be scalable
- Linux: nope
Multiprocessors
Shared-memory multiprocessors
- Low-level caches are not shared. Cache coherence protocols keep the cached copies consistent.
- Goal: If two processors access same memory, get same result
- This is more useful than C's approach of simply calling it undefined behavior (UB)
MESI protocol
- Modified (value in cache is newer than in memory, no other cache has value)
- Exclusive (value in cache is same as in memory, no other cache has value)
- Shared (value in cache is same as in memory, other caches may have value)
- Invalid (not cached)
No state allows a cache to hold out-of-date data
- Shared or Invalid to Modified involves all other caches being invalidated
- Exclusive to Modified is super cheap
Having Modified state allows for batching of writes
Programming consideration
Cache line bouncing is super expensive
Memory access is expensive when a line is contended (many cores writing)
Fairness-efficiency tradeoff
- It's faster for CPUs owning cache lines to keep them rather than let another CPU in
Want frequently read data to stay in the Shared state, so it shouldn't share cache lines with write-heavy data (sharing a line that way is called false sharing)
How to debug false sharing?
- Having heard that this problem exists is good
- Julia Evans has a comic (see Perf section)
Lock implementation
struct spinlock {           // renamed: a member can't share its class's name
    int v_;
    void lock() {
        // atomic_test_and_set stands in for an atomic hardware instruction
        while (atomic_test_and_set(&v_, 1) == 1) {
            pause(); // "relax CPU until something happens"
        }
    }
    void unlock() {
        v_ = 0;
    }
};
Need special atomic instructions, not normal reads and writes.
The Chickadee spinlock uses the std::atomic_flag type, which is specialized for this purpose (although in practice on x86 it's implemented the same way):
struct spinlock {
    std::atomic_flag f_;
    void lock() {
        while (f_.test_and_set()) {
            pause();
        }
    }
    void unlock() {
        f_.clear();
    }
};
Memory Models
initial conditions: x = y = 0

    TA                TB
    (A1) x = 1        (B1) y = 1
    (A2) t1 = y       (B2) t2 = x
t1 = t2 = 1
- given by A1, B1, A2, B2
t1 = 0, t2 = 1
- given by A1, A2, B1, B2
t1 = 1, t2 = 0
- given by B1, B2, A1, A2
But what about t1 = t2 = 0?
- compilers and processors are allowed to do out of order execution!
- given by A2, B2, A1, B1 - This is weird but happens!
How does this work with the MESI protocol?
processor does a store
- store enqueued on store buffer
- in parallel with processor:
- pop store buffer
- Run the MESI/MOESI protocol on that cache line
- Perform store
Need hardware support for atomics because a processor cannot see into other processors' store buffers
TSO - total store order model: All processors agree on order of writes to main memory
- x86-TSO paper formalizes these semantics
Related: Hans Boehm's paper "Threads Cannot Be Implemented as a Library"
Lock implementation continued
Ticket lock
struct tlock {
    std::atomic<unsigned> now_;
    std::atomic<unsigned> next_;
    void lock() {
        unsigned my_ticket = next_++;
        while (my_ticket != now_.load(std::memory_order_relaxed)) {
            // `my_ticket != now_` works fine too
            pause();
        }
    }
    void unlock() {
        ++now_;
    }
};