Synchronization lectures
The next lectures concern synchronization. We’ll talk about how synchronization primitives, such as locks, are implemented; advanced schemes for synchronization that minimize overhead; and the properties of machines that make synchronization difficult and interesting. Although you could implement these ideas in your problem sets, you don’t need to.
Synchronization parameters
- Synchronization problems come in many shapes and sizes
- Different kinds of problem have different “best” (lowest-overhead) solutions
- The ideal synchronization mechanism would work well across a wide range of parameters
- Our discussion applies to many synchronization mechanisms (e.g., mutual-exclusion locks, compare-and-swap-based lock-free data structures), but we refer to locks for concreteness
Simple test and set lock
struct spinlock {
    std::atomic_flag f_;
    void lock() {
        while (f_.test_and_set()) {
        }
    }
    void unlock() {
        f_.clear();
    }
};
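As a usage sketch (the thread count, iteration count, and counter are illustrative, not from the lecture), two threads can share this lock to protect a plain counter; without the lock, concurrent increments would lose updates:

```cpp
#include <atomic>
#include <thread>

struct spinlock {
    std::atomic_flag f_ = ATOMIC_FLAG_INIT;
    void lock() {
        while (f_.test_and_set()) {
        }
    }
    void unlock() {
        f_.clear();
    }
};

long run_counter() {
    spinlock l;
    long counter = 0;            // protected by `l`
    auto work = [&] {
        for (int i = 0; i != 100000; ++i) {
            l.lock();
            ++counter;           // exclusive access in the critical section
            l.unlock();
        }
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    return counter;              // 200000: no updates lost
}
```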
Goals
- Throughput
- Latency
- Fairness
- Space overhead
- Ease of use
- These goals can be in conflict!
Goals
- Throughput
- Number of successful lock acquisitions per second
- Latency
- Delay imposed by lock acquisition
- Fairness
- Lack of bias in lock acquisition
- 2 threads running while (1) { l.lock(); l.unlock(); } on the same lock should acquire the lock about the same number of times
- Space overhead
- Bytes per lock
- Ease of use
- Likelihood that simple code using locks is correct
Parameter: Contention
- Contention measures the number of different threads that simultaneously (or nearly simultaneously) need exclusive access to state
- Low contention: Few threads need exclusive access
- High contention: Many threads need exclusive access
Parameter: Contention
- Low contention
- Easy case!
- Throughput doesn’t matter
- Fairness doesn’t matter (why not?)
- Latency matters
- High contention
- Inevitably low performance
- Throughput: Avoid contention collapse (starvation)
- Fairness matters
Parameter: Granularity
- Granularity measures the number of locks in the system and the amount of state protected by each lock
- Coarse granularity: Each lock protects lots of state; there are few locks
- Fine granularity: Each lock protects relatively little state; there are many locks
Parameter: Granularity
- Coarse granularity
- Often has higher contention
- Often easier to program
- Can afford greater space overhead
- Fine granularity
- Often has lower contention
- Often harder to use
- Space overhead per lock becomes important
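As a granularity illustration (the map type and bucket count are invented for this sketch, not from the lecture), a fine-grained hash table gives each bucket its own mutex, so threads touching different buckets do not contend; a coarse-grained version would protect the whole table with one lock, and here the per-bucket space overhead is exactly what the last bullet warns about:

```cpp
#include <cstddef>
#include <list>
#include <mutex>
#include <utility>

struct fine_grained_map {
    static constexpr std::size_t NBUCKETS = 64;   // illustrative size
    struct bucket {
        std::mutex m;                             // one lock's overhead per bucket
        std::list<std::pair<int,int>> items;
    };
    bucket buckets_[NBUCKETS];

    void put(int key, int value) {
        bucket& b = buckets_[std::size_t(key) % NBUCKETS];
        std::lock_guard<std::mutex> guard(b.m);   // lock only this bucket
        for (auto& kv : b.items) {
            if (kv.first == key) { kv.second = value; return; }
        }
        b.items.emplace_back(key, value);
    }

    bool get(int key, int& value) {
        bucket& b = buckets_[std::size_t(key) % NBUCKETS];
        std::lock_guard<std::mutex> guard(b.m);
        for (auto& kv : b.items) {
            if (kv.first == key) { value = kv.second; return true; }
        }
        return false;
    }
};
```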
Parameter: Write balance (write weight)
- Write balance measures the fraction of operations that modify the protected state
- Read-heavy workload: Most operations only observe the protected state
- Write-heavy workload: Most operations modify the protected state
Parameter: Write balance
- Read-heavy workload
- Often supports non-exclusive locking: two threads that merely observe shared state can run in parallel
- Allowed by Fundamental Law of Synchronization
- Write-heavy workload
- Requires exclusive locking in many workloads
- Balanced workload (many reads and many writes)
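Non-exclusive locking for read-heavy workloads can be expressed with C++17's std::shared_mutex, a standard reader-writer lock (the counter struct here is an invented example): readers take a shared lock and may run in parallel, while writers take an exclusive lock that excludes everyone:

```cpp
#include <mutex>
#include <shared_mutex>

struct stats {
    std::shared_mutex m_;
    long count_ = 0;

    long read() {
        // Shared mode: many readers may hold the lock at once.
        std::shared_lock<std::shared_mutex> guard(m_);
        return count_;
    }
    void increment() {
        // Exclusive mode: writers exclude readers and other writers.
        std::unique_lock<std::shared_mutex> guard(m_);
        ++count_;
    }
};
```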
Write balance examples
- Counters (“number of system calls executed since boot”)
- Pipes (read and write)
- struct proc components
- Configuration data
Parameter: Thread count
- Number of threads = number of cores
- Often best raw performance
- Usually the situation in the kernel (the kernel explicitly schedules threads among cores)
- Number of threads > number of cores
- Programming convenience
- Can complicate some locking designs
Architecture
- Context: Modern multicore shared-memory machines
- Synchronization is founded on atomic instructions on shared memory
- Those instructions are based on a cache coherence protocol
MESI
- Each cache line is in one of four states
- Modified
- Exclusive
- Shared
- Invalid
- (Also MOESI, MESIF, …; tons of “innovation”)
- Caches send messages to load lines and keep them in sync
MESI
- Modified
- The cache’s data is newer than primary memory
- The cache line must be written back to primary memory before another cache can load it
- Exclusive
- The cache’s data is the same as primary memory
- The cache line is stored in exactly one cache
- Shared
- The cache’s data is the same as primary memory
- The cache line is stored in one or more caches
- Invalid
- The cache’s data may be older than primary memory
- The processor must load the line before using it
MESI transition examples
- I→S: Cache 1 loads a line (e.g., for physical address pa)
- S→M: Cache 1 modifies a loaded line
- M→S: Cache 2 loads a line currently modified by Cache 1
- Cache 1 must write the line back
- I→E: Cache 1 loads a line for future write/atomic access
MESI consequences
- On MESI-like machines, all synchronization is implemented using locks
- Cache lines are locked (M/E states)
- The locks cannot be held forever, and there is no deadlock, but performance properties resemble those of locks
- State is measured in cache line units
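Because state is measured in cache line units, two unrelated variables that happen to share a line will "ping-pong" between caches when different cores write them (false sharing), even though no lock is logically shared. A common mitigation, sketched here assuming a 64-byte line (std::hardware_destructive_interference_size reports the real value where the standard library supports it), is to align hot per-core data to line boundaries:

```cpp
#include <atomic>
#include <cstddef>

constexpr std::size_t CACHE_LINE = 64;   // assumed line size

struct per_thread_counter {
    // alignas ensures each counter occupies its own cache line, so a
    // write by one core does not invalidate the line caching another
    // core's counter.
    alignas(CACHE_LINE) std::atomic<long> value{0};
};

per_thread_counter counters[4];   // one per core, illustrative
```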
Simple test and set lock
struct spinlock {
    std::atomic_flag f_;
    void lock() {
        while (f_.test_and_set()) {
        }
    }
    void unlock() {
        f_.clear();
    }
};
pause — spin loop hint
Opcode Mnemonic Description
F3 90 PAUSE Gives hint to processor that improves
performance of spin-wait loops.
Description
Improves the performance of spin-wait loops. When
executing a "spin-wait loop," a Pentium 4 or Intel Xeon
processor suffers a severe performance penalty when
exiting the loop because it detects a possible memory
order violation. The PAUSE instruction provides a hint to
the processor that the code sequence is a spin-wait loop.
The processor uses this hint to avoid the memory order
violation in most situations, which greatly improves
processor performance. For this reason, it is recommended
that a PAUSE instruction be placed in all spin-wait loops.
Paused test and set lock
struct spinlock {
    std::atomic_flag f_;
    void lock() {
        while (f_.test_and_set()) {
            pause(); // compiles to `pause`
        }
    }
    void unlock() {
        f_.clear();
    }
};
- Contention level (latency vs. throughput)
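The pause() helper above is not a standard function; one plausible definition (an assumption, named spin_pause here to avoid colliding with the POSIX pause function) issues the x86 pause instruction via a compiler intrinsic, with a no-op fallback on other targets:

```cpp
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
inline void spin_pause() {
    _mm_pause();   // emits the `pause` spin-loop hint
}
#else
inline void spin_pause() {
    // No x86 spin hint on this target; do nothing. (Some
    // architectures have analogous hints, e.g. ARM's yield.)
}
#endif
```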
Yielding test and set lock
struct spinlock {
    std::atomic_flag f_;
    void lock() {
        while (f_.test_and_set()) {
            sched_yield();
        }
    }
    void unlock() {
        f_.clear();
    }
};
- Thread count
Fairness
- Example unfairness with TAS lock (acquisition counts for four threads):
  - 2485624
  - 2519544
  - 2486734
  - 2438033
- One thread obtains the lock 3% more than another!
- Not the full story
Fairness
- Count the number of times a lock is obtained multiple times consecutively by the same thread (consecutive reacquisitions in parentheses):
  - 2101989 (3156)
  - 2129185 (4436)
  - 2139257 (4579)
  - 2108552 (5182)
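Numbers like those above can be gathered with a harness along these lines (the thread count, run time, and the use of a bare test-and-set lock are choices made for this sketch): each thread counts its total acquisitions and how often it reacquired the lock immediately after its own release.

```cpp
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

struct fairness_result {
    long acquisitions = 0;   // total lock acquisitions by this thread
    long consecutive = 0;    // acquisitions immediately after this thread's own release
};

std::vector<fairness_result> measure(int nthreads, int run_ms) {
    std::atomic_flag f = ATOMIC_FLAG_INIT;
    std::atomic<bool> stop{false};
    int last_owner = -1;     // protected by the lock itself
    std::vector<fairness_result> results(nthreads);
    std::vector<std::thread> threads;
    for (int id = 0; id != nthreads; ++id) {
        threads.emplace_back([&, id] {
            while (!stop.load(std::memory_order_relaxed)) {
                while (f.test_and_set()) {   // simple TAS lock
                }
                ++results[id].acquisitions;
                if (last_owner == id)
                    ++results[id].consecutive;
                last_owner = id;
                f.clear();
            }
        });
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(run_ms));
    stop.store(true);
    for (auto& t : threads)
        t.join();
    return results;
}
```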