Synchronization lectures
The next lectures concern synchronization. We’ll talk about how synchronization primitives, such as locks, are implemented; advanced schemes for synchronization that minimize overhead; and the properties of machines that make synchronization difficult and interesting. Although you could implement these ideas in your problem sets, you don’t need to.
Synchronization parameters
- Synchronization problems come in many shapes and sizes
- Different kinds of problem have different “best” (lowest-overhead) solutions
- The ideal synchronization mechanism would work well across a wide range of parameters
- Our discussion applies to many synchronization mechanisms (e.g., mutual-exclusion locks, compare-and-swap-based lock-free data structures), but we refer to locks for concreteness
Simple test and set lock
struct spinlock {
    std::atomic_flag f_;
    void lock() {
        while (f_.test_and_set()) {
        }
    }
    void unlock() {
        f_.clear();
    }
};
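As a usage sketch (the thread count, iteration count, and counter are illustrative, not from the lecture), two threads can share this lock to protect a plain counter; without the lock, concurrent increments would lose updates:

```cpp
#include <atomic>
#include <thread>

struct spinlock {
    std::atomic_flag f_ = ATOMIC_FLAG_INIT;
    void lock() {
        while (f_.test_and_set()) {
        }
    }
    void unlock() {
        f_.clear();
    }
};

long run_counter() {
    spinlock l;
    long counter = 0;            // protected by `l`
    auto work = [&] {
        for (int i = 0; i != 100000; ++i) {
            l.lock();
            ++counter;           // exclusive access in the critical section
            l.unlock();
        }
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    return counter;              // 200000: no updates lost
}
```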
Goals
- Throughput
- Latency
- Fairness
- Space overhead
- Ease of use
- These goals can be in conflict!
Goals
- Throughput
- Number of successful lock acquisitions per second
- Latency
- Delay imposed by lock acquisition
- Fairness
- Lack of bias in lock acquisition
- 2 threads running while (1) { l.lock(); l.unlock(); } on the same lock should acquire the lock about the same number of times
- Space overhead
- Bytes per lock
- Ease of use
- Likelihood that simple code using locks is correct
Parameter: Contention
- Contention measures the number of different threads that simultaneously (or nearly simultaneously) need exclusive access to state
- Low contention: Few threads need exclusive access
- High contention: Many threads need exclusive access
Parameter: Contention
- Low contention
- Easy case!
- Throughput doesn’t matter
- Fairness doesn’t matter (why not?)
- Latency matters
- High contention
- Inevitably low performance
- Throughput: Avoid contention collapse (starvation)
- Fairness matters
Parameter: Granularity
- Granularity measures the number of locks in the system and the amount of state protected by each lock
- Coarse granularity: Each lock protects lots of state; there are few locks
- Fine granularity: Each lock protects relatively little state; there are many locks
Parameter: Granularity
- Coarse granularity
- Often has higher contention
- Often easier to program
- Can afford greater space overhead
- Fine granularity
- Often has lower contention
- Often harder to use
- Space overhead per lock becomes important
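As a granularity illustration (the map type and bucket count are invented for this sketch, not from the lecture), a fine-grained hash table gives each bucket its own mutex, so threads touching different buckets do not contend; a coarse-grained version would protect the whole table with one lock, and here the per-bucket space overhead is exactly what the last bullet warns about:

```cpp
#include <cstddef>
#include <list>
#include <mutex>
#include <utility>

struct fine_grained_map {
    static constexpr std::size_t NBUCKETS = 64;   // illustrative size
    struct bucket {
        std::mutex m;                             // one lock's overhead per bucket
        std::list<std::pair<int,int>> items;
    };
    bucket buckets_[NBUCKETS];

    void put(int key, int value) {
        bucket& b = buckets_[std::size_t(key) % NBUCKETS];
        std::lock_guard<std::mutex> guard(b.m);   // lock only this bucket
        for (auto& kv : b.items) {
            if (kv.first == key) { kv.second = value; return; }
        }
        b.items.emplace_back(key, value);
    }

    bool get(int key, int& value) {
        bucket& b = buckets_[std::size_t(key) % NBUCKETS];
        std::lock_guard<std::mutex> guard(b.m);
        for (auto& kv : b.items) {
            if (kv.first == key) { value = kv.second; return true; }
        }
        return false;
    }
};
```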
Parameter: Write balance (write weight)
- Write balance measures the fraction of operations that modify the protected state
- Read-heavy workload: Most operations only observe the protected state
- Write-heavy workload: Most operations modify the protected state
Parameter: Write balance
- Read-heavy workload
- Often supports non-exclusive locking: two threads that merely observe shared state can run in parallel
- Allowed by Fundamental Law of Synchronization
- Write-heavy workload
- Requires exclusive locking in many workloads
- Balanced workload (many reads and many writes)
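Non-exclusive locking for read-heavy workloads can be expressed with C++17's std::shared_mutex, a standard reader-writer lock (the counter struct here is an invented example): readers take a shared lock and may run in parallel, while writers take an exclusive lock that excludes everyone:

```cpp
#include <mutex>
#include <shared_mutex>

struct stats {
    std::shared_mutex m_;
    long count_ = 0;

    long read() {
        // Shared mode: many readers may hold the lock at once.
        std::shared_lock<std::shared_mutex> guard(m_);
        return count_;
    }
    void increment() {
        // Exclusive mode: writers exclude readers and other writers.
        std::unique_lock<std::shared_mutex> guard(m_);
        ++count_;
    }
};
```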
Write balance examples
- Counters (“number of system calls executed since boot”)
- Pipes (read and write)
- struct proc components
- Configuration data
Parameter: Thread count
- Number of threads = number of cores
- Often best raw performance
- Usually the situation in the kernel (the kernel explicitly schedules threads among cores)
- Number of threads > number of cores
- Programming convenience
- Can complicate some locking designs
Architecture
- Context: Modern multicore shared-memory machines
- Synchronization is founded on atomic instructions on shared memory
- Those instructions are based on a cache coherence protocol
MESI
- Each cache line is in one of four states
- Modified
- Exclusive
- Shared
- Invalid
- (Also MOESI, MESIF, …; tons of “innovation”)
- Caches send messages to load lines and keep them in sync
MESI
- Modified
- The cache’s data is newer than primary memory
- The cache line must be written back to primary memory before another cache can load it
- Exclusive
- The cache’s data is the same as primary memory
- The cache line is stored in exactly one cache
- Shared
- The cache’s data is the same as primary memory
- The cache line is stored in one or more caches
- Invalid
- The cache’s data may be older than primary memory
- The processor must load the line before using it
MESI transition examples
- I→S: Cache 1 loads a line (e.g., for physical address pa)
- S→M: Cache 1 modifies a loaded line
- M→S: Cache 2 loads a line currently modified by Cache 1
- Cache 1 must write the line back
- I→E: Cache 1 loads a line for future write/atomic access
MESI consequences
- On MESI-like machines, all synchronization is implemented using locks
- Cache lines are locked (M/E states)
- The locks cannot be held forever, and there is no deadlock, but performance properties resemble those of locks
- State is measured in cache line units
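Because state is measured in cache line units, two unrelated variables that happen to share a line will "ping-pong" between caches when different cores write them (false sharing), even though no lock is logically shared. A common mitigation, sketched here assuming a 64-byte line (std::hardware_destructive_interference_size reports the real value where the standard library supports it), is to align hot per-core data to line boundaries:

```cpp
#include <atomic>
#include <cstddef>

constexpr std::size_t CACHE_LINE = 64;   // assumed line size

struct per_thread_counter {
    // alignas ensures each counter occupies its own cache line, so a
    // write by one core does not invalidate the line caching another
    // core's counter.
    alignas(CACHE_LINE) std::atomic<long> value{0};
};

per_thread_counter counters[4];   // one per core, illustrative
```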
Simple test and set lock
struct spinlock {
    std::atomic_flag f_;
    void lock() {
        while (f_.test_and_set()) {
        }
    }
    void unlock() {
        f_.clear();
    }
};
pause — spin loop hint
Opcode Mnemonic Description
F3 90 PAUSE Gives hint to processor that improves
performance of spin-wait loops.
Description
Improves the performance of spin-wait loops. When
executing a "spin-wait loop," a Pentium 4 or Intel Xeon
processor suffers a severe performance penalty when
exiting the loop because it detects a possible memory
order violation. The PAUSE instruction provides a hint to
the processor that the code sequence is a spin-wait loop.
The processor uses this hint to avoid the memory order
violation in most situations, which greatly improves
processor performance. For this reason, it is recommended
that a PAUSE instruction be placed in all spin-wait loops.
Paused test and set lock
struct spinlock {
    std::atomic_flag f_;
    void lock() {
        while (f_.test_and_set()) {
            pause(); // compiles to `pause`
        }
    }
    void unlock() {
        f_.clear();
    }
};
- Contention level (latency vs. throughput)
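The pause() helper above is not a standard function; one plausible definition (an assumption, named spin_pause here to avoid colliding with the POSIX pause function) issues the x86 pause instruction via a compiler intrinsic, with a no-op fallback on other targets:

```cpp
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
inline void spin_pause() {
    _mm_pause();   // emits the `pause` spin-loop hint
}
#else
inline void spin_pause() {
    // No x86 spin hint on this target; do nothing. (Some
    // architectures have analogous hints, e.g. ARM's yield.)
}
#endif
```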
Yielding test and set lock
struct spinlock {
    std::atomic_flag f_;
    void lock() {
        while (f_.test_and_set()) {
            sched_yield();
        }
    }
    void unlock() {
        f_.clear();
    }
};
- Thread count
Fairness
- Example unfairness with TAS lock (acquisition counts for four threads):
  - 2485624
  - 2519544
  - 2486734
  - 2438033
- One thread obtains the lock 3% more than another!
- Not the full story
Fairness
- Count the number of times a lock is obtained multiple times consecutively by the same thread (consecutive reacquisitions in parentheses):
  - 2101989 (3156)
  - 2129185 (4436)
  - 2139257 (4579)
  - 2108552 (5182)
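Numbers like those above can be gathered with a harness along these lines (the thread count, run time, and the use of a bare test-and-set lock are choices made for this sketch): each thread counts its total acquisitions and how often it reacquired the lock immediately after its own release.

```cpp
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

struct fairness_result {
    long acquisitions = 0;   // total lock acquisitions by this thread
    long consecutive = 0;    // acquisitions immediately after this thread's own release
};

std::vector<fairness_result> measure(int nthreads, int run_ms) {
    std::atomic_flag f = ATOMIC_FLAG_INIT;
    std::atomic<bool> stop{false};
    int last_owner = -1;     // protected by the lock itself
    std::vector<fairness_result> results(nthreads);
    std::vector<std::thread> threads;
    for (int id = 0; id != nthreads; ++id) {
        threads.emplace_back([&, id] {
            while (!stop.load(std::memory_order_relaxed)) {
                while (f.test_and_set()) {   // simple TAS lock
                }
                ++results[id].acquisitions;
                if (last_owner == id)
                    ++results[id].consecutive;
                last_owner = id;
                f.clear();
            }
        });
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(run_ms));
    stop.store(true);
    for (auto& t : threads)
        t.join();
    return results;
}
```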