Synchronization parameters
- Synchronization problems come in many shapes and sizes
- Different kinds of problem have different “best” (lowest-overhead) solutions
- The ideal synchronization mechanism would work well across a wide range of parameters
- Our discussion applies to many synchronization mechanisms (e.g., mutual-exclusion locks, compare-and-swap-based lock-free data structures), but we refer to locks for concreteness
Simple test and set lock
#include <atomic>

struct spinlock {
    std::atomic_flag f_ = ATOMIC_FLAG_INIT;
    void lock() {
        // Spin until this thread is the one that changes f_ from clear to set
        while (f_.test_and_set()) {
        }
    }
    void unlock() {
        f_.clear();
    }
};
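As a usage sketch (not from the slides; the counter and iteration count are illustrative): two threads incrementing a shared counter, serialized by the spinlock.

#include <thread>

spinlock l;
long counter = 0;

int main() {
    auto work = [] {
        for (int i = 0; i != 1000000; ++i) {
            l.lock();
            ++counter;      // critical section: one thread at a time
            l.unlock();
        }
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    // counter == 2000000: no increments were lost
}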
Are atomics necessary?
- The test-and-set lock uses atomic operations like test-and-set (which is implemented as an atomic swap)
- These correspond to special instructions like lock xchg or lock addq
- Some papers have proposed mutual-exclusion locks that work without atomic operations
- But those don’t actually work on most machines!
- Because of weak memory models
Weak memory model example
- Initial condition: x = y = 0

      Core A            Core B
      (A1) x = 1        (B1) y = 1
      (A2) ta = y       (B2) tb = x

- Assume no optimization
- What outcomes are possible?
      ta == 1 && tb == 1
      ta == 0 && tb == 1
      ta == 1 && tb == 0
      ta == 0 && tb == 0   ?
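This can be checked empirically. Below is a minimal store-buffering litmus test (a sketch, not from the slides); it uses relaxed atomics so the racy accesses are well-defined while still allowing the hardware to reorder:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int ta, tb;

int main() {
    for (int trial = 0; trial != 1000000; ++trial) {
        x = 0;
        y = 0;
        std::thread a([] {
            x.store(1, std::memory_order_relaxed);   // (A1)
            ta = y.load(std::memory_order_relaxed);  // (A2)
        });
        std::thread b([] {
            y.store(1, std::memory_order_relaxed);   // (B1)
            tb = x.load(std::memory_order_relaxed);  // (B2)
        });
        a.join();
        b.join();
        if (ta == 0 && tb == 0) {
            std::printf("weird result on trial %d\n", trial);
            return 0;
        }
    }
    std::printf("weird result not observed\n");
}

On an x86 machine the ta == 0 && tb == 0 outcome typically does show up, for the reasons explained next.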
That weird result
- Assume x and y are on different cache lines, and each core has both lines in cache
- (A1) x = 1: Core A writes x ← 1, but this assignment hangs out in a store buffer
  - Sequence of outstanding writes that haven’t made it out to the cache yet
  - Allows write coalescing
- (B1) y = 1: Core B writes y ← 1, but this assignment also hangs out in a store buffer
- (A2) ta = y, (B2) tb = x: The assignments use the old, unmodified cache lines
- TSO: Total Store Order
  - The order generally implemented by x86 chips; x86-TSO formalizes these semantics
  - Other, weaker memory orders exist, especially on Arm
  - See memory ordering for more
- Thus, atomics and related calls: the compiler convinces the architecture to force the appropriate memory ordering
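As a sketch of what forcing the ordering can look like (assuming x86; not from the slides):

#include <atomic>

std::atomic<int> flag{0};

void publish() {
    // Sequentially consistent store (the default memory order): on x86,
    // compilers typically emit `xchg` or `mov` + `mfence`, which drains
    // the store buffer before later memory operations
    flag.store(1);
}

Rewriting the litmus test above with the default std::memory_order_seq_cst forbids the ta == 0 && tb == 0 outcome.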
Synchronization goals
- Throughput
- Number of successful lock acquisitions per second
- Latency
- Delay imposed by lock acquisition
- Fairness
- Lack of bias in lock acquisition
- 2 threads in a while (true) { l.lock(); l.unlock(); } loop on the same lock should acquire the lock about the same number of times
- Space overhead
- Bytes per lock
- Ease of use
- Likelihood that simple code using locks is correct
- These goals can be in conflict!
Parameter: Contention
- Contention measures the number of different threads that simultaneously (or nearly-simultaneously) need exclusive access to state
- Low contention: Few threads need exclusive access
- Easy case!
- Throughput doesn’t matter
- Fairness doesn’t matter much (why not?)
- Latency matters
- High contention: Many threads need exclusive access
- Inevitably low performance
- Throughput: Avoid contention collapse (starvation)
- Fairness matters
Parameter: Granularity
- Granularity measures the number of locks in the system and the amount of state protected by each lock
- Coarse granularity: Each lock protects lots of state; there are few locks
- Often has higher contention
- Often easier to program
- Can afford greater space overhead
- Fine granularity: Each lock protects relatively little state; there are many locks (see the sketch after this list)
- Often has lower contention
- Often harder to use
- Space overhead per lock becomes important
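Here is a sketch contrasting the two granularities on a hash table (illustrative, not from the slides; coarse_table and fine_table are made-up names):

#include <array>
#include <mutex>
#include <vector>

// Coarse granularity: one lock protects the entire table
struct coarse_table {
    std::mutex m_;
    std::array<std::vector<int>, 1024> buckets_;
    void insert(int key) {
        std::scoped_lock guard(m_);   // every insert serializes here
        buckets_[unsigned(key) % buckets_.size()].push_back(key);
    }
};

// Fine granularity: one lock per bucket; inserts to different buckets run in parallel
struct fine_table {
    struct bucket {
        std::mutex m_;                // space overhead paid per bucket
        std::vector<int> items_;
    };
    std::array<bucket, 1024> buckets_;
    void insert(int key) {
        bucket& b = buckets_[unsigned(key) % buckets_.size()];
        std::scoped_lock guard(b.m_);
        b.items_.push_back(key);
    }
};

The per-bucket std::mutex makes the space-overhead point concrete: fine_table pays for a mutex per bucket in exchange for lower contention.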
Parameter: Write balance (write weight)
- Write balance measures the fraction of operations that modify the protected state
- Read-heavy workload: Most operations only observe the protected state
- Often supports non-exclusive locking: two threads that merely observe shared state can run in parallel (see the reader/writer sketch after this list)
- Allowed by Fundamental Law of Synchronization
- Write-heavy workload: Most operations modify the protected state
- Requires exclusive locking in many workloads
- Balanced workload (many reads and many writes)
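A minimal reader/writer sketch using C++17's std::shared_mutex (illustrative; config_store is a made-up name):

#include <shared_mutex>
#include <string>

struct config_store {
    mutable std::shared_mutex m_;
    std::string value_;

    std::string get() const {
        std::shared_lock guard(m_);   // shared mode: many readers in parallel
        return value_;
    }
    void set(std::string v) {
        std::unique_lock guard(m_);   // exclusive mode: one writer, no readers
        value_ = std::move(v);
    }
};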
Write balance examples
- Counters (“number of system calls executed since boot”)
- Pipes (read and write)
- struct proc components
- Configuration data
Parameter: Thread count
- Number of threads = number of cores
- Often best raw performance
- Usually situation in kernel (kernel explicitly schedules threads among cores)
- Number of threads > number of cores
- Programming convenience
- Can complicate some locking designs
Architecture
- Context: Modern multicore shared-memory machines
- Synchronization is founded on atomic instructions on shared memory
- Those instructions are based on a cache coherence protocol
MESI
- Each cache line is in one of four states
- Modified
- Exclusive
- Shared
- Invalid
- (Also MOESI, MESIF, …; tons of innovation within this basic idea)
- Caches send messages to load lines and keep them in sync
MESI
- Modified
- The cache’s data is newer than primary memory
- The cache line must be written back to primary memory before another cache can load it
- Exclusive
- The cache’s data is the same as primary memory
- The cache line is stored in exactly one cache
- Shared
- The cache’s data is the same as primary memory
- The cache line is stored in one or more caches
- Invalid
- The cache’s data may be older than primary memory
- The processor must load the line before using it
MESI transition examples
- Core 1 I→S: Core 1 loads a line for physical address pa
- Core 1 S→M: Core 1 modifies that line
- Core 1 M→S: Core 2 loads the same line
- Cache at core 1 must write the line back
- Core 1 S→I: Core 2 loads the line in exclusive mode (because it will write it)
- Core 1 I→E: Core 1 loads a line for future write/atomic access
MESI consequences
- On MESI-like machines, all synchronization is implemented using locks
- Cache lines are locked (M/E states)
- The locks cannot be held forever, and there is no deadlock, but performance properties resemble those of locks
- State is measured in cache line units
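One practical consequence, sketched below (not from the slides; the 64-byte line size is an assumption, common on x86): because coherence locks whole lines, two unrelated counters that share a line ping-pong between caches even though no data is shared ("false sharing"). alignas gives each counter its own line:

#include <atomic>

struct per_core_counters {
    // Each counter on its own (assumed 64-byte) cache line, so writes by
    // different cores don't invalidate each other's copies
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};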
Simple test and set lock
struct spinlock {
    std::atomic_flag f_ = ATOMIC_FLAG_INIT;
    void lock() {
        while (f_.test_and_set()) {
        }
    }
    void unlock() {
        f_.clear();
    }
};
pause — spin loop hint

Opcode    Mnemonic    Description
F3 90     PAUSE       Gives hint to processor that improves performance of spin-wait loops.
Description
Improves the performance of spin-wait loops. When
executing a "spin-wait loop," a Pentium 4 or Intel Xeon
processor suffers a severe performance penalty when
exiting the loop because it detects a possible memory
order violation. The PAUSE instruction provides a hint to
the processor that the code sequence is a spin-wait loop.
The processor uses this hint to avoid the memory order
violation in most situations, which greatly improves
processor performance. For this reason, it is recommended
that a PAUSE instruction be placed in all spin-wait loops.
Paused test and set lock
struct spinlock {
    std::atomic_flag f_ = ATOMIC_FLAG_INIT;
    void lock() {
        while (f_.test_and_set()) {
            pause(); // compiles to `pause`
        }
    }
    void unlock() {
        f_.clear();
    }
};
- Parameter addressed: contention level (latency vs. throughput)
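The pause() helper used above isn't standard C++; a minimal sketch for x86, using the SSE2 intrinsic, might be:

#include <immintrin.h>

inline void pause() {
    _mm_pause();   // emits the x86 `pause` instruction
}

(Beware that POSIX also declares an unrelated pause() in <unistd.h>.)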
Yielding test and set lock
#include <sched.h>

struct spinlock {
    std::atomic_flag f_ = ATOMIC_FLAG_INIT;
    void lock() {
        while (f_.test_and_set()) {
            sched_yield(); // give up the CPU so another thread (maybe the lock holder) can run
        }
    }
    void unlock() {
        f_.clear();
    }
};
- Parameter addressed: thread count
Fairness
- Example unfairness with TAS lock: per-thread acquisition counts
      2485624   2519544   2486734   2438033
- One thread obtains the lock 3% more than another!
- Not the full story
Fairness
- Count number of times a lock is obtained multiple times consecutively by the same thread
      2101989 (3156)   2129185 (4436)   2139257 (4579)   2108552 (5182)
- >96% of lock acquisitions alternate with another thread
- But >0.1% of lock acquisitions occur in a consecutive sequence of >28 acquisitions by the same thread
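A sketch of how per-thread acquisition counts like these might be gathered (illustrative, not the actual measurement code; assumes the spinlock defined earlier):

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    spinlock l;
    std::atomic<bool> stop{false};
    constexpr int nthreads = 4;
    long counts[nthreads] = {};

    std::vector<std::thread> threads;
    for (int i = 0; i != nthreads; ++i) {
        threads.emplace_back([&, i] {
            while (!stop) {
                l.lock();
                ++counts[i];          // per-thread acquisition count
                l.unlock();
            }
        });
    }
    std::this_thread::sleep_for(std::chrono::seconds(1));
    stop = true;
    for (auto& t : threads) {
        t.join();
    }
    for (int i = 0; i != nthreads; ++i) {
        std::printf("%ld ", counts[i]);
    }
    std::printf("\n");
}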
Solving fairness
- Want a mechanism to serve lock requests in order
- Must define an order
Ticket lock
struct ticket_lock {
    std::atomic<unsigned> now_ = 0;   // ticket currently being served
    std::atomic<unsigned> next_ = 0;  // next ticket to hand out
    void lock() {
        unsigned me = next_++;        // atomically take a ticket: this defines the order
        while (me != now_) {          // wait until my ticket is served
            pause();
        }
    }
    void unlock() {
        now_++;                       // serve the next ticket, FIFO
    }
};
- Sleeping?
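One possible direction for the sleeping question (a sketch using C++20's std::atomic wait/notify; not necessarily the intended answer): waiters can block instead of spinning, at the cost of wakeup latency and of notify_all waking every waiter per release:

#include <atomic>

struct sleeping_ticket_lock {
    std::atomic<unsigned> now_ = 0;
    std::atomic<unsigned> next_ = 0;
    void lock() {
        unsigned me = next_++;
        unsigned cur = now_.load();
        while (cur != me) {
            now_.wait(cur);        // block until now_ changes from cur
            cur = now_.load();
        }
    }
    void unlock() {
        now_++;
        now_.notify_all();         // wake all waiters; only the next ticket proceeds
    }
};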