# Lecture 21: Synchronization I

## Synchronization parameters

• Synchronization problems come in many shapes and sizes
• Different kinds of problem have different “best” (lowest-overhead) solutions
• The ideal synchronization mechanism would work well across a wide range of parameters
• Our discussion applies to many synchronization mechanisms (e.g., mutual-exclusion locks, compare-and-swap-based lock-free data structures), but we refer to locks for concreteness

## Simple test and set lock

struct spinlock {
    std::atomic_flag f_;   // clear when unlocked (C++20 default-initializes to clear;
                           // earlier standards need ATOMIC_FLAG_INIT)

    void lock() {
        while (f_.test_and_set()) {   // atomically set the flag; keep spinning while it was already set
        }
    }
    void unlock() {
        f_.clear();                   // release: allow another lock() to succeed
    }
};
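
A usage sketch (the shared counter and function name are illustrative, not from the lecture): because spinlock provides lock() and unlock(), it also works with std::lock_guard.

    #include <mutex>

    spinlock counter_lock;   // the spinlock defined above
    long counter = 0;        // shared state protected by the lock

    void increment() {
        std::lock_guard<spinlock> guard(counter_lock);   // acquires in constructor, releases in destructor
        ++counter;
    }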


## Are atomics necessary?

• The test-and-set lock uses atomic operations like test-and-set (which is implemented as an atomic swap)
• These correspond to special instructions like lock xchg or lock addq
• Some papers have proposed mutual-exclusion locks that work without atomic operations
• But those don’t actually work on most machines!
• Because of weak memory models
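
For example (a sketch, not from the lecture; exact code generation depends on compiler and optimization level), C++ atomic read-modify-write operations are what compile to those locked instructions on x86-64:

    #include <atomic>

    std::atomic<long> counter{0};

    void bump() {
        counter.fetch_add(1);        // typically compiles to: lock addq $1, counter(%rip)
    }

    long take() {
        return counter.exchange(0);  // typically compiles to: xchgq (the lock prefix is implicit)
    }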

## Weak memory model example

• Initial condition: x = y = 0

Core A        Core B
(A1) x = 1    (B1) y = 1
(A2) ta = y   (B2) tb = x

• Assume no optimization

• What outcomes are possible?

• ta == 1 && tb == 1
• ta == 0 && tb == 1
• ta == 1 && tb == 0
• ta == 0 && tb == 0?
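
A runnable sketch (not from the lecture; names and the trial count are illustrative) that looks for the last outcome listed above. It uses std::atomic with memory_order_relaxed, so the program has no undefined behavior, yet the store buffer (or the compiler) may still let each load complete before the other core's store becomes visible:

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int ta, tb;

    int main() {
        for (long trial = 0; trial < 100000; ++trial) {
            x.store(0, std::memory_order_relaxed);
            y.store(0, std::memory_order_relaxed);
            std::thread a([] {
                x.store(1, std::memory_order_relaxed);    // (A1)
                ta = y.load(std::memory_order_relaxed);   // (A2)
            });
            std::thread b([] {
                y.store(1, std::memory_order_relaxed);    // (B1)
                tb = x.load(std::memory_order_relaxed);   // (B2)
            });
            a.join();
            b.join();
            if (ta == 0 && tb == 0) {
                printf("ta == 0 && tb == 0 on trial %ld\n", trial);
            }
        }
    }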

## That weird result

• Assume x and y are on different cache lines, and each core has both lines in cache
• (A1) x = 1: Core A writes x ← 1, but this assignment hangs out in a store buffer
• Sequence of outstanding writes that haven’t made it out to the cache yet
• Allows write coalescing
• (B1) y = 1: Core B writes y ← 1, but this assignment also hangs out in a store buffer
• (A2) ta = y, (B2) tb = x: The assignments use the old, unmodified cache lines
• TSO: Total Store Order
• The order generally implemented by x86 chips; x86-TSO formalizes these semantics
• Other, weaker memory models exist, especially on Arm
• See memory ordering for more
• Thus atomics and related calls: the compiler emits the instructions (fences, locked operations) needed to force the appropriate memory ordering
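
A sketch of that last point (assuming x86; exact code generation varies by compiler): with the default std::memory_order_seq_cst, the compiler emits a locked instruction or fence for each store, the store buffer is drained before the following load, and the ta == 0 && tb == 0 outcome becomes impossible.

    #include <atomic>

    std::atomic<int> x{0}, y{0};
    int ta, tb;

    void core_a() {
        x.store(1);      // (A1): seq_cst store; on x86 typically xchg or mov + mfence
        ta = y.load();   // (A2): cannot read a stale y while (A1) sits in the store buffer
    }

    void core_b() {
        y.store(1);      // (B1)
        tb = x.load();   // (B2)
    }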

## Synchronization goals

• Throughput
• Number of successful lock acquisitions per second
• Latency
• Delay imposed by lock acquisition
• Fairness
• Lack of bias in lock acquisition
• 2 threads in a while (true) { l.lock(); l.unlock(); } loop for the same lock should acquire the lock about the same number of times
• Bytes per lock
• Space overhead of the lock data structure itself
• Ease of use
• Likelihood that simple code using locks is correct
• These goals can be in conflict!
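
A measurement sketch (not from the lecture; the thread count, duration, and use of std::mutex are illustrative) for throughput and fairness as defined above: several threads run the acquire/release loop against one lock, and we count acquisitions per thread.

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <mutex>
    #include <thread>
    #include <vector>

    int main() {
        constexpr int nthreads = 4;
        std::mutex l;                        // stand-in for the lock under test
        std::atomic<bool> stop{false};
        unsigned long counts[nthreads] = {}; // acquisitions per thread

        std::vector<std::thread> threads;
        for (int i = 0; i < nthreads; ++i) {
            threads.emplace_back([&, i] {
                while (!stop.load(std::memory_order_relaxed)) {
                    l.lock();
                    l.unlock();
                    ++counts[i];
                }
            });
        }
        std::this_thread::sleep_for(std::chrono::seconds(1));
        stop = true;
        for (auto& t : threads) {
            t.join();
        }
        for (int i = 0; i < nthreads; ++i) {
            printf("thread %d: %lu acquisitions\n", i, counts[i]);   // total = throughput; spread = fairness
        }
    }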

## Parameter: Contention

• Contention measures the number of different threads that simultaneously (or nearly-simultaneously) need exclusive access to state
• Low contention: Few threads need exclusive access
• Easy case!
• Throughput doesn’t matter
• Fairness doesn’t matter much (why not?)
• Latency matters
• High contention: Many threads need exclusive access
• Inevitably low performance
• Throughput: Avoid contention collapse (starvation)
• Fairness matters

## Parameter: Granularity

• Granularity measures the number of locks in the system and the amount of state protected by each lock
• Coarse granularity: Each lock protects lots of state; there are few locks
• Often has higher contention
• Often easier to program
• Can afford greater space overhead
• Fine granularity: Each lock protects relatively little state; there are many locks
• Often has lower contention
• Often harder to use
• Space overhead per lock becomes important
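
A sketch contrasting the two (an illustration, not from the lecture; the hash table and its size are assumptions): the coarse version uses one lock for the whole table, the fine version one lock per bucket.

    #include <mutex>
    #include <vector>

    // Coarse granularity: one lock protects all 1024 buckets.
    struct coarse_table {
        std::mutex m_;
        std::vector<int> buckets_[1024];

        void insert(unsigned key) {
            std::lock_guard<std::mutex> guard(m_);
            buckets_[key % 1024].push_back(key);
        }
    };

    // Fine granularity: one lock per bucket -- lower contention,
    // but 1024x the space overhead for locks.
    struct fine_table {
        struct bucket {
            std::mutex m_;
            std::vector<int> keys_;
        };
        bucket buckets_[1024];

        void insert(unsigned key) {
            bucket& b = buckets_[key % 1024];
            std::lock_guard<std::mutex> guard(b.m_);
            b.keys_.push_back(key);
        }
    };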

## Parameter: Write balance (write weight)

• Write balance measures the fraction of operations that modify the protected state
• Read-heavy workload: Most operations merely observe the protected state
• Often supports non-exclusive locking: two threads that merely observe shared state can run in parallel (see the sketch below)
• Allowed by the Fundamental Law of Synchronization
• Write-heavy workload: Most operations modify the protected state
• Generally requires exclusive locking
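
A sketch of non-exclusive locking for a read-heavy workload (an illustration, not from the lecture), using std::shared_mutex from C++17:

    #include <mutex>
    #include <shared_mutex>

    struct config {
        std::shared_mutex m_;
        int value_ = 0;

        int read() {                       // readers take the lock in shared (non-exclusive) mode
            std::shared_lock<std::shared_mutex> guard(m_);
            return value_;
        }
        void write(int v) {                // writers still need exclusive mode
            std::lock_guard<std::shared_mutex> guard(m_);
            value_ = v;
        }
    };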

## Write balance examples

• Counters (“number of system calls executed since boot”)
• Pipes (read and write)
• struct proc components
• Configuration data

## Parameter: Threads vs. cores

• Number of threads = number of cores
• Often best raw performance
• Usually the situation in the kernel (the kernel explicitly schedules threads among cores)
• Number of threads > number of cores
• Programming convenience
• Can complicate some locking designs
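
A sketch of the first configuration (an illustration, not from the lecture; the work function is a placeholder): size the thread pool to the core count reported by the standard library.

    #include <thread>
    #include <vector>

    void worker() {
        // placeholder for real per-thread work
    }

    int main() {
        unsigned n = std::thread::hardware_concurrency();   // ~ number of cores (may report 0)
        if (n == 0) {
            n = 1;
        }
        std::vector<std::thread> pool;
        for (unsigned i = 0; i < n; ++i) {
            pool.emplace_back(worker);       // one thread per core
        }
        for (auto& t : pool) {
            t.join();
        }
    }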

## Architecture

• Context: Modern multicore shared-memory machines
• Synchronization is founded on atomic instructions on shared memory
• Those instructions are based on a cache coherence protocol

## MESI

• Each cache line is in one of four states
• Modified
• Exclusive
• Shared
• Invalid
• (Also MOESI, MESIF, …; tons of innovation within this basic idea)
• Caches send messages to each other to load lines and keep them in sync

## MESI

• Modified
• The cache’s data is newer than primary memory
• The cache line must be written back to primary memory before another cache can load it
• Exclusive
• The cache’s data is the same as primary memory
• The cache line is stored in exactly one cache
• Shared
• The cache’s data is the same as primary memory
• The cache line is stored in one or more caches
• Invalid
• The cache’s data may be older than primary memory
• The processor must load the line before using it

## MESI transition examples

• Core 1 I→S: Core 1 loads a line for physical address pa
• Core 1 S→M: Core 1 modifies that line
• Core 1 M→S: Core 2 loads the same line
• Cache at core 1 must write the line back
• Core 1 S→I: Core 2 loads the line in exclusive mode (because it will write it)
• Core 1 I→E: Core 1 loads a line for future write/atomic access

## MESI consequences

• On MESI-like machines, all synchronization is implemented using locks
• Cache lines are locked (M/E states)
• The locks cannot be held forever, and there is no deadlock, but performance properties resemble those of locks
• State is measured in cache line units

## Simple test and set lock

struct spinlock {
    std::atomic_flag f_;

    void lock() {
        while (f_.test_and_set()) {
        }
    }
    void unlock() {
        f_.clear();
    }
};


## pause

pause — spin loop hint

Opcode   Mnemonic   Description
F3 90    PAUSE      Gives hint to processor that improves performance of spin-wait loops.

Description

Improves the performance of spin-wait loops. When executing a "spin-wait loop," a Pentium 4 or Intel Xeon processor suffers a severe performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint to avoid the memory order violation in most situations, which greatly improves processor performance. For this reason, it is recommended that a PAUSE instruction be placed in all spin-wait loops.


## Paused test and set lock

struct spinlock {
    std::atomic_flag f_;

    void lock() {
        while (f_.test_and_set()) {
            pause();  // compiles to pause
        }
    }
    void unlock() {
        f_.clear();
    }
};

• Relevant parameter: contention level (pause trades a little acquisition latency for better throughput under contention)
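
The pause() helper above isn't shown in the lecture; a plausible definition (an assumption) uses the compiler intrinsic that emits this instruction on x86 and does nothing elsewhere:

    inline void pause() {
    #if defined(__x86_64__) || defined(__i386__)
        __builtin_ia32_pause();   // GCC/Clang intrinsic for the F3 90 pause hint
    #else
        // no spin-wait hint assumed for other architectures
    #endif
    }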

## Yielding test and set lock

struct spinlock {
    std::atomic_flag f_;

    void lock() {
        while (f_.test_and_set()) {
            sched_yield();   // give up the CPU to another thread (declared in <sched.h>)
        }
    }
    void unlock() {
        f_.clear();
    }
};


## Fairness

• Example unfairness with the TAS lock: acquisition counts for 4 threads running the lock/unlock loop

2485624
2519544
2486734
2438033

• One thread obtains the lock 3% more than another!

• Not the full story

## Fairness

• Count the number of times the lock is obtained multiple times consecutively by the same thread (shown in parentheses below)

2101989 (3156)
2129185 (4436)
2139257 (4579)
2108552 (5182)

• >96% of lock acquisitions alternate with another thread

• But >0.1% of lock acquisitions occur in a consecutive sequence of >28 acquisitions by the same thread

## Solving fairness

• Want a mechanism to serve lock requests in order
• Must define an order

## Ticket lock

struct ticket_lock {
    std::atomic<unsigned> now_ = 0;    // ticket currently being served
    std::atomic<unsigned> next_ = 0;   // next ticket to hand out

    void lock() {
        unsigned me = next_++;   // atomically take a ticket
        while (me != now_) {     // spin until it's our turn
            pause();
        }
    }
    void unlock() {
        now_++;                  // serve the next ticket: strict FIFO order
    }
};

• Sleeping?