** SYNCHRONIZATION PRIMITIVES **

- Here's some code.

    struct List {
        int data;
        struct List *next;
    };

    List *list = 0;

    insert(int data) {
        List *l = new List;
        l->data = data;
        l->next = list;               // A
        list = l;                     // B
    }

    fn1() { insert(1); }
    fn2() { insert(2); }

    main() {
        thread_create(..., fn1);      // Thread 1
        thread_create(..., fn2);      // Thread 2
        thread_schedule();
    }

- Assuming preemptive scheduling, what might the list contain once the
  threads complete?
-- Answer: any of "1, 2", "2, 1", "1", or "2".
-- Example: how can "1" happen?  Run thread 1 just past A (so its
   l->next is still null) and preempt it; run thread 2 to completion
   (result: list contains "2"); then run thread 1's line B (result:
   list points only to "1", and "2" is lost).

- How can we avoid this?
-- Use atomic sections.
-- For example:

    insert(int data) {
        List *l = new List;
        l->data = data;
        atomic_begin();               // acquire a lock, for example
        l->next = list;
        list = l;
        atomic_end();                 // release the lock
    }

-- Note that malloc() (called by "new List") must be thread safe too!

- How can we build atomic sections and other synchronization primitives?
-- At low cost?
-- On a multiprocessor?
-- That's the point of the paper!

- All atomic operations are built on hardware primitives
-- For example, fetch_and_Φ:
*** store (like IA-32 xchgl)
*** increment, decrement, add, subtract, and, or
*** compare_and_swap
-- What does the x86 provide?

** IA-32 Guaranteed Atomic Operations **

- Reading or writing
-- Reading OR writing a byte
-- Reading OR writing 16 bits aligned on a 16-bit boundary
-- Reading OR writing 32 bits aligned on a 32-bit boundary
-- Reading OR writing 64 bits aligned on a 64-bit boundary (Pentium+)
-- Unaligned 16-bit accesses to uncached memory (Pentium+)
-- Unaligned accesses to cached memory that fit within a cache line (P6+)
-- Note that an instruction like "incl MEM" both reads AND writes, so it
   is NOT covered here.
-- These rules mean that if one processor does "movl $1, MEM" and another
   does "movl $0x10000, MEM", AND MEM IS ALIGNED, then MEM will end up
   containing either $1 or $0x10000.
*** But if MEM is not aligned, it might contain, say, $0x10001!

- Locking the bus ensures that a read-modify-write operation completes
  atomically
-- Atomic sections are marked with {* *} below
-- xchgl %eax, MEM               (implicitly locks the bus)
-- lock; xaddl %1, %2     ::= {* tmp = %1; %1 = %2; %2 += tmp; *}
-- lock; cmpxchgl %1, %2  ::= {* if (%2 == %eax) { ZF = 1; %2 = %1; }
                                 else { ZF = 0; %eax = %2; } *}

    compare_and_swap(addr, old, new) ::=
        {* if (*addr != old) return false;
           else { *addr = new; return true; } *}

    compare_and_swap(addr, old, new) ::=
        movl old, %eax; lock; cmpxchgl new, addr    (ZF holds the result)

*** cmpxchgl is not supported on the 386 (486 and later only)
-- lock; {inc,dec,not,neg,add,adc,sub,sbb,and,or,xor}
-- lock; {bts,btr,btc}    (bit test-and-SET, test-and-RESET, test-and-COMPLEMENT)

- More in the IA-32 Volume 3 manual.

- So how might we build an atomic section using these primitives?
-- uint32_t lock;         /* 0 = not locked, 1 = locked */
-- acquire_lock() {
           movl  $1, %eax
   again:
           xchgl %eax, lock
           cmpl  $0, %eax
           jne   again
   }
*** What does this do?
    xchgl swaps %eax and the lock variable.
    Afterwards, the lock variable is always 1 -- thus, "locked".
    This makes sense: if the lock wasn't held before, then we hold it
    now; if someone else held the lock, then they still do.
    But %eax is 0 iff the lock wasn't held before, so we succeed iff
    %eax is 0.
    This idiom is called TEST-AND-SET.
    (Pseudocode:
        acquire_lock() {
            while (atomic_test_and_set(&lock) != 0)
                /* nada */;
        }
    )
    So far, so good?
-- release_lock() {
           movl  $0, lock
   }
*** What does this do?
    Just sets lock = 0.
    Executes atomically with respect to acquire_lock().
    Done!
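- For concreteness, here is one possible rendering of this test-and-set
  lock in plain C (a sketch added for illustration, not taken from the
  paper or the IA-32 manual).  It uses the GCC/Clang __atomic builtins
  in place of the hand-written xchgl loop; the names lock_word,
  acquire_lock, and release_lock are just illustrative.

    #include <stdint.h>

    static volatile uint32_t lock_word;    /* 0 = free, 1 = held */

    static void acquire_lock(void) {
        /* Atomically swap 1 into the lock word; the returned old value
           tells us whether anyone held the lock before us (same idea
           as the xchgl loop above). */
        while (__atomic_exchange_n(&lock_word, 1, __ATOMIC_ACQUIRE) != 0)
            /* nada */;
    }

    static void release_lock(void) {
        /* An aligned 32-bit store is already atomic on IA-32; release
           ordering keeps the critical section's writes visible before
           the unlock. */
        __atomic_store_n(&lock_word, 0, __ATOMIC_RELEASE);
    }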
- What problems does this have, particularly on a multiprocessor?
-- Multiple writers to the same word ("lock") on every lock attempt.
   Each write can force the cache line to move to another processor
   = hundreds of cycles.
   Much cheaper to *read* the same cache line: the line doesn't have
   to move unless there's been a change.
   Cheaper still to read and write PROCESSOR-LOCAL memory.
-- Not fair.
   A process can starve: other processes can always win just by luck.

- How to fix?
-- Test-and-test-and-set:

    acquire_lock() {
        while (lock || atomic_test_and_set(&lock) != 0)
            /* nada */;
    }

*** The first test of "lock" is a plain read, so it eliminates a write
    to the shared variable in many cases
-- But still not fair
-- And still a lot of read traffic
    Exponential backoff can fix this

- Algorithm 2: ticket lock
  + Just two memory locations for arbitrarily many threads
  + Fair (FIFO)
  + Probes with read operations only (unlike test-and-set, which issues writes)
  - Still requires a lot of bandwidth:
      everyone reads now_serving all the time, and it changes on every handoff
  + Can maybe estimate how long the wait will be to reduce reads, but
    that's hard: a wrong guess carries a high penalty, because handoff
    is strictly FIFO
  - Requires fetch_and_add; if that is emulated with a test_and_set
    spin lock, it adds even more memory traffic

    struct lock {
        uint32_t next_ticket;
        uint32_t now_serving;
    };

    acquire_lock(struct lock *l) {
        uint32_t my_ticket = atomic_fetch_and_inc(&l->next_ticket);
            /* returns the value before the increment */
        while (l->now_serving != my_ticket)
            pause(my_ticket - l->now_serving);
            /* pause proportional to the number of waiters ahead of us */
    }

    release_lock(struct lock *l) {
        l->now_serving++;
    }

- How does the following lock differ?

    struct lock {
        int locked;
        int nwaiting;
    };

    void acquire(struct lock *l) {
        atomic_inc(&l->nwaiting);
        while (atomic_test_and_set(&l->locked))
            pause(l->nwaiting - 1);
    }

    void release(struct lock *l) {
        atomic_dec(&l->nwaiting);
        l->locked = 0;
    }

-- Doesn't provide FIFO: nwaiting only counts the waiters, it doesn't
   order them, so any waiter can win the test_and_set

- Algorithm 5: MCS list-based queuing lock
  + Fair (almost FIFO; truly FIFO if you have compare_and_swap)
  + Each thread spins only on its own qnode, which can be in local memory
  + Small amount of network traffic
  + Requires constant space per lock (plus a qnode per thread, for each
    lock it holds or is waiting for)

    struct qnode {
        struct qnode *next;
        bool locked;              /* should really be called "blocked" */
    };
    typedef struct qnode *lock;   /* the lock is a pointer to the tail of the queue */

    acquire(lock *L, qnode *me) {
        me->next = NULL;
        qnode *predecessor = me;
        ATOMIC_SWAP(*L, predecessor);
            /* now *L == me (we are the new tail), predecessor == old tail */
        if (predecessor) {        /* queue was non-empty */
            me->locked = true;
            predecessor->next = me;
            while (me->locked)    /* spin on our OWN qnode */
                /* nada */;
        }
    }

    release(lock *L, qnode *me) {
        if (!me->next) {
            if (atomic_compare_and_swap(*L, me, 0))
                return;           /* no one got in there */
            while (!me->next)     /* a successor is arriving; wait for its link */
                /* come on */;
        }
        /* know me->next != 0, because everyone else sets it to non-0 */
        me->next->locked = false;
        /* now me is useless */
    }

-- Alternative release for machines with swap but no compare_and_swap
   (this is why the lock is only *almost* FIFO without compare_and_swap):

    release(lock *L, qnode *me) {
        if (me->next)
            me->next->locked = false;
        else {
            qnode *old_tail = NULL;
            ATOMIC_SWAP(*L, old_tail);
                /* old_tail == the real tail, *L == NULL */
            if (old_tail == me)
                return;           /* no one else was waiting */
            qnode *usurper = old_tail;
            ATOMIC_SWAP(*L, usurper);
                /* tail restored to old_tail; usurper == whoever enqueued
                   while *L was NULL (or NULL if no one did) */
            while (me->next == NULL)   /* get in there */;
            if (usurper)
                usurper->next = me->next;   /* splice our successors after the usurpers */
            else
                me->next->locked = false;
        }
    }

- Barrier (Lecture 15)

- Ideas
-- "Sense-switching" Boolean variables avoid having to reinitialize the
   data structure
-- Removing writes from the common path
-- Writes to local data structures
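- To make the "sense-switching" idea concrete, here is a minimal C sketch
  of a centralized sense-reversing barrier (an added illustration using
  the GCC/Clang __atomic builtins; barrier_wait, nthreads, and local_sense
  are made-up names, and this simple version still hammers one shared
  counter, unlike the paper's scalable barriers).  The last thread to
  arrive flips a shared sense flag that everyone else spins on with reads
  only, so the barrier never has to be reinitialized between episodes.

    struct barrier {
        unsigned count;        /* threads still to arrive this episode */
        unsigned nthreads;     /* total number of participating threads */
        int sense;             /* flips each time the barrier opens */
    };
    /* Initialize with count = nthreads and sense = 0;
       each thread starts with its own local_sense = 0. */

    void barrier_wait(struct barrier *b, int *local_sense) {
        int my_sense = !*local_sense;      /* switch sense for this episode */
        *local_sense = my_sense;
        if (__atomic_fetch_sub(&b->count, 1, __ATOMIC_ACQ_REL) == 1) {
            /* Last arrival: reset the count for the next episode, then
               flip the shared sense to release everyone spinning below. */
            __atomic_store_n(&b->count, b->nthreads, __ATOMIC_RELAXED);
            __atomic_store_n(&b->sense, my_sense, __ATOMIC_RELEASE);
        } else {
            while (__atomic_load_n(&b->sense, __ATOMIC_ACQUIRE) != my_sense)
                /* spin with reads only until the barrier opens */;
        }
    }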