Synchronization parameters
- Synchronization problems come in many shapes and sizes
- Different kinds of problem have different “best” (lowest-overhead) solutions
- The ideal synchronization mechanism would work well across a wide range of parameters
- Our discussion applies to many synchronization mechanisms (e.g., mutual-exclusion locks, compare-and-swap-based lock-free data structures), but we refer to locks for concreteness
Simple test and set lock
#include <atomic>

struct spinlock {
    std::atomic_flag f_ = ATOMIC_FLAG_INIT;
    void lock() {
        // Spin until this thread is the one that changes f_ from clear to set
        while (f_.test_and_set()) {
        }
    }
    void unlock() {
        f_.clear();
    }
};
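As a usage sketch (not from the slides; the counter and iteration count are illustrative): two threads incrementing a shared counter, serialized by the spinlock.

#include <thread>

spinlock l;
long counter = 0;

int main() {
    auto work = [] {
        for (int i = 0; i != 1000000; ++i) {
            l.lock();
            ++counter;      // critical section: one thread at a time
            l.unlock();
        }
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    // counter == 2000000: no increments were lost
}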
Are atomics necessary?
- The test-and-set lock uses atomic operations like test-and-set (which is implemented as an atomic swap)
- These correspond to special instructions like lock xchg or lock addq
- Some papers have proposed mutual-exclusion locks that work without atomic operations
- But those don’t actually work on most machines!
- Because of weak memory models
Weak memory model example
- Initial condition: x = y = 0

      Core A            Core B
      (A1) x = 1        (B1) y = 1
      (A2) ta = y       (B2) tb = x

- Assume no optimization
- What outcomes are possible?
      ta == 1 && tb == 1
      ta == 0 && tb == 1
      ta == 1 && tb == 0
      ta == 0 && tb == 0   ?
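This can be checked empirically. Below is a minimal store-buffering litmus test (a sketch, not from the slides); it uses relaxed atomics so the racy accesses are well-defined while still allowing the hardware to reorder:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int ta, tb;

int main() {
    for (int trial = 0; trial != 1000000; ++trial) {
        x = 0;
        y = 0;
        std::thread a([] {
            x.store(1, std::memory_order_relaxed);   // (A1)
            ta = y.load(std::memory_order_relaxed);  // (A2)
        });
        std::thread b([] {
            y.store(1, std::memory_order_relaxed);   // (B1)
            tb = x.load(std::memory_order_relaxed);  // (B2)
        });
        a.join();
        b.join();
        if (ta == 0 && tb == 0) {
            std::printf("weird result on trial %d\n", trial);
            return 0;
        }
    }
    std::printf("weird result not observed\n");
}

On an x86 machine the ta == 0 && tb == 0 outcome typically does show up, for the reasons explained next.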
That weird result
- Assume x and y are on different cache lines, and each core has both lines in cache
- (A1) x = 1: Core A writes x ← 1, but this assignment hangs out in a store buffer
  - Sequence of outstanding writes that haven’t made it out to the cache yet
  - Allows write coalescing
- (B1) y = 1: Core B writes y ← 1, but this assignment also hangs out in a store buffer
- (A2) ta = y, (B2) tb = x: The assignments use the old, unmodified cache lines
- TSO: Total Store Order
  - The order generally implemented by x86 chips; x86-TSO formalizes these semantics
  - Other, weaker memory orders exist, especially on Arm
  - See memory ordering for more
- Thus, atomics and related calls: the compiler convinces the architecture to force the appropriate memory ordering
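As a sketch of what forcing the ordering can look like (assuming x86; not from the slides):

#include <atomic>

std::atomic<int> flag{0};

void publish() {
    // Sequentially consistent store (the default memory order): on x86,
    // compilers typically emit `xchg` or `mov` + `mfence`, which drains
    // the store buffer before later memory operations
    flag.store(1);
}

Rewriting the litmus test above with the default std::memory_order_seq_cst forbids the ta == 0 && tb == 0 outcome.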
Synchronization goals
- Throughput
- Number of successful lock acquisitions per second
- Latency
- Delay imposed by lock acquisition
- Fairness
- Lack of bias in lock acquisition
- 2 threads in a while (true) { l.lock(); l.unlock(); } loop on the same lock should acquire the lock about the same number of times
- Space overhead
- Bytes per lock
- Ease of use
- Likelihood that simple code using locks is correct
- These goals can be in conflict!
Parameter: Contention
- Contention measures the number of different threads that simultaneously (or nearly-simultaneously) need exclusive access to state
- Low contention: Few threads need exclusive access
- Easy case!
- Throughput doesn’t matter
- Fairness doesn’t matter much (why not?)
- Latency matters
- High contention: Many threads need exclusive access
- Inevitably low performance
- Throughput: Avoid contention collapse (starvation)
- Fairness matters
Parameter: Granularity
- Granularity measures the number of locks in the system and the amount of state protected by each lock
- Coarse granularity: Each lock protects lots of state; there are few locks
- Often has higher contention
- Often easier to program
- Can afford greater space overhead
- Fine granularity: Each lock protects relatively little state; there are many locks (see the sketch after this list)
- Often has lower contention
- Often harder to use
- Space overhead per lock becomes important
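Here is a sketch contrasting the two granularities on a hash table (illustrative, not from the slides; coarse_table and fine_table are made-up names):

#include <array>
#include <mutex>
#include <vector>

// Coarse granularity: one lock protects the entire table
struct coarse_table {
    std::mutex m_;
    std::array<std::vector<int>, 1024> buckets_;
    void insert(int key) {
        std::scoped_lock guard(m_);   // every insert serializes here
        buckets_[unsigned(key) % buckets_.size()].push_back(key);
    }
};

// Fine granularity: one lock per bucket; inserts to different buckets run in parallel
struct fine_table {
    struct bucket {
        std::mutex m_;                // space overhead paid per bucket
        std::vector<int> items_;
    };
    std::array<bucket, 1024> buckets_;
    void insert(int key) {
        bucket& b = buckets_[unsigned(key) % buckets_.size()];
        std::scoped_lock guard(b.m_);
        b.items_.push_back(key);
    }
};

The per-bucket std::mutex makes the space-overhead point concrete: fine_table pays for a mutex per bucket in exchange for lower contention.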
Parameter: Write balance (write weight)
- Write balance measures the fraction of operations that modify the protected state
- Read-heavy workload: Most operations only observe the protected state
- Often supports non-exclusive locking: two threads that merely observe shared state can run in parallel (see the reader/writer sketch after this list)
- Allowed by Fundamental Law of Synchronization
- Write-heavy workload: Most operations modify the protected state
- Requires exclusive locking in many workloads
- Balanced workload (many reads and many writes)
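A minimal reader/writer sketch using C++17's std::shared_mutex (illustrative; config_store is a made-up name):

#include <shared_mutex>
#include <string>

struct config_store {
    mutable std::shared_mutex m_;
    std::string value_;

    std::string get() const {
        std::shared_lock guard(m_);   // shared mode: many readers in parallel
        return value_;
    }
    void set(std::string v) {
        std::unique_lock guard(m_);   // exclusive mode: one writer, no readers
        value_ = std::move(v);
    }
};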
Write balance examples
- Counters (“number of system calls executed since boot”)
- Pipes (read and write)
- struct proc components
- Configuration data
Parameter: Thread count
- Number of threads = number of cores
- Often best raw performance
- Usually situation in kernel (kernel explicitly schedules threads among cores)
- Number of threads > number of cores
- Programming convenience
- Can complicate some locking designs
Architecture
- Context: Modern multicore shared-memory machines
- Synchronization is founded on atomic instructions on shared memory
- Those instructions are based on a cache coherence protocol
MESI
- Each cache line is in one of four states
- Modified
- Exclusive
- Shared
- Invalid
- (Also MOESI, MESIF, …; tons of innovation within this basic idea)
- Caches send messages to load lines and keep them in sync
MESI
- Modified
- The cache’s data is newer than primary memory
- The cache line must be written back to primary memory before another cache can load it
- Exclusive
- The cache’s data is the same as primary memory
- The cache line is stored in exactly one cache
- Shared
- The cache’s data is the same as primary memory
- The cache line is stored in one or more caches
- Invalid
- The cache’s data may be older than primary memory
- The processor must load the line before using it
MESI transition examples
- Core 1 I→S: Core 1 loads a line for physical address pa
- Core 1 S→M: Core 1 modifies that line
- Core 1 M→S: Core 2 loads the same line
- Cache at core 1 must write the line back
- Core 1 S→I: Core 2 loads the line in exclusive mode (because it will write it)
- Core 1 I→E: Core 1 loads a line for future write/atomic access
MESI consequences
- On MESI-like machines, all synchronization is implemented using locks
- Cache lines are locked (M/E states)
- The locks cannot be held forever, and there is no deadlock, but performance properties resemble those of locks
- State is measured in cache line units
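One practical consequence, sketched below (not from the slides; the 64-byte line size is an assumption, common on x86): because coherence locks whole lines, two unrelated counters that share a line ping-pong between caches even though no data is shared ("false sharing"). alignas gives each counter its own line:

#include <atomic>

struct per_core_counters {
    // Each counter on its own (assumed 64-byte) cache line, so writes by
    // different cores don't invalidate each other's copies
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};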
Simple test and set lock
struct spinlock {
    std::atomic_flag f_ = ATOMIC_FLAG_INIT;
    void lock() {
        while (f_.test_and_set()) {
        }
    }
    void unlock() {
        f_.clear();
    }
};
pause — spin loop hint

Opcode    Mnemonic    Description
F3 90     PAUSE       Gives hint to processor that improves performance of spin-wait loops.
Description
Improves the performance of spin-wait loops. When
executing a "spin-wait loop," a Pentium 4 or Intel Xeon
processor suffers a severe performance penalty when
exiting the loop because it detects a possible memory
order violation. The PAUSE instruction provides a hint to
the processor that the code sequence is a spin-wait loop.
The processor uses this hint to avoid the memory order
violation in most situations, which greatly improves
processor performance. For this reason, it is recommended
that a PAUSE instruction be placed in all spin-wait loops.
Paused test and set lock
struct spinlock {
    std::atomic_flag f_ = ATOMIC_FLAG_INIT;
    void lock() {
        while (f_.test_and_set()) {
            pause(); // compiles to `pause`
        }
    }
    void unlock() {
        f_.clear();
    }
};
- Parameter addressed: contention level (latency vs. throughput)
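The pause() helper used above isn't standard C++; a minimal sketch for x86, using the SSE2 intrinsic, might be:

#include <immintrin.h>

inline void pause() {
    _mm_pause();   // emits the x86 `pause` instruction
}

(Beware that POSIX also declares an unrelated pause() in <unistd.h>.)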
Yielding test and set lock
#include <sched.h>

struct spinlock {
    std::atomic_flag f_ = ATOMIC_FLAG_INIT;
    void lock() {
        while (f_.test_and_set()) {
            sched_yield(); // give up the CPU so another thread (maybe the lock holder) can run
        }
    }
    void unlock() {
        f_.clear();
    }
};
- Parameter addressed: thread count
Fairness
- Example unfairness with TAS lock: per-thread acquisition counts
      2485624   2519544   2486734   2438033
- One thread obtains the lock 3% more than another!
- Not the full story
Fairness
- Count number of times a lock is obtained multiple times consecutively by the same thread
      2101989 (3156)   2129185 (4436)   2139257 (4579)   2108552 (5182)
- >96% of lock acquisitions alternate with another thread
- But >0.1% of lock acquisitions occur in a consecutive sequence of >28 acquisitions by the same thread
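A sketch of how per-thread acquisition counts like these might be gathered (illustrative, not the actual measurement code; assumes the spinlock defined earlier):

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    spinlock l;
    std::atomic<bool> stop{false};
    constexpr int nthreads = 4;
    long counts[nthreads] = {};

    std::vector<std::thread> threads;
    for (int i = 0; i != nthreads; ++i) {
        threads.emplace_back([&, i] {
            while (!stop) {
                l.lock();
                ++counts[i];          // per-thread acquisition count
                l.unlock();
            }
        });
    }
    std::this_thread::sleep_for(std::chrono::seconds(1));
    stop = true;
    for (auto& t : threads) {
        t.join();
    }
    for (int i = 0; i != nthreads; ++i) {
        std::printf("%ld ", counts[i]);
    }
    std::printf("\n");
}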
Solving fairness
- Want a mechanism to serve lock requests in order
- Must define an order
Ticket lock
struct ticket_lock {
    std::atomic<unsigned> now_ = 0;   // ticket currently being served
    std::atomic<unsigned> next_ = 0;  // next ticket to hand out
    void lock() {
        unsigned me = next_++;        // atomically take a ticket: this defines the order
        while (me != now_) {          // wait until my ticket is served
            pause();
        }
    }
    void unlock() {
        now_++;                       // serve the next ticket, FIFO
    }
};
- Sleeping?
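One possible direction for the sleeping question (a sketch using C++20's std::atomic wait/notify; not necessarily the intended answer): waiters can block instead of spinning, at the cost of wakeup latency and of notify_all waking every waiter per release:

#include <atomic>

struct sleeping_ticket_lock {
    std::atomic<unsigned> now_ = 0;
    std::atomic<unsigned> next_ = 0;
    void lock() {
        unsigned me = next_++;
        unsigned cur = now_.load();
        while (cur != me) {
            now_.wait(cur);        // block until now_ changes from cur
            cur = now_.load();
        }
    }
    void unlock() {
        now_++;
        now_.notify_all();         // wake all waiters; only the next ticket proceeds
    }
};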