Motivation
Scheduling is a basic kernel responsibility. Kernels strive to divide all computer resources, including CPU, efficiently and fairly among jobs according to OS policy. This takes a ton of effort, all of which eventually boils down to efficiently and correctly putting tasks to sleep and waking them up again.
Think of it this way. Processes often wait for external events. Your editor waits for key presses; your browser waits for interactions (and/or animations); web servers wait for network connections. Efficiency demands that processes engaged only in waiting should use as little CPU as possible—ideally zero.
A blocking system call informs the kernel that a process is waiting, since the system call returns to its caller only when a specific external event (or interruption) occurs. But in Chickadee so far, waiting processes take far more than 0% CPU. Here’s a natural implementation of a sleep system call:
while (long(wakeup_time - ticks) > 0) {
    this->yield();
}
This implementation polls by repeatedly checking whether the desired time has arrived. The kernel task corresponding to the process remains runnable the whole time! With an implementation like this, the task corresponding to a `p-testmsleep` process runs more than a hundred thousand times. That wasted time would be better spent on actual work.
Sleep–wakeup
We seek a coordination mechanism that allows a kernel task to block—to become non-runnable—until a condition becomes true. Generalizing to abstract “threads” rather than kernel tasks:
Let \(x\) be some subset of program state. \(x\) can be as much as the whole program or as little as a single integer.
A sleeping thread has blocked until a condition \(f(x)\) becomes true. It is an error for the sleeping thread to remain blocked after \(f(x)\) becomes true.
A waking thread is a thread that modifies \(x\) in a way that might cause a sleeping thread to unblock and become runnable.
For instance, in the sleep system call above, \(x\) is the `ticks` counter, \(f(x)\) is “`long(wakeup_time - ticks) <= 0`”, and the timer interrupt handler, which advances `ticks`, plays the role of the waking thread.
The inherent difficulty of sleep–wakeup is that sleeping and waking threads must synchronize. If they do not, a sleeping thread might remain blocked forever! We’re worried about this interleaving of events, which is a classic lost wakeup or sleep–wakeup race:
| sleeping thread | waking thread |
|---|---|
| computes `f(x) == false`; | |
| decides to block; | |
| | modifies `x` so `f(x)` becomes true; |
| | checks if any other thread is blocked (there are none); |
| | does something else; |
| blocks forever; | |
How to fix this? A lock, such as a spinlock, can prevent bad interleavings, but blocking makes this particular interleaving tricky to prevent! Normally we’d associate `x` (the shared state) with a lock, say `x_lock`, and place `lock` calls on entry to the critical section and `unlock` calls after it:
| sleeping thread | waking thread |
|---|---|
| locks `x_lock`; | |
| computes `f(x) == false`; | |
| decides to block; | |
| | tries to lock `x_lock` (blocks…) |
| blocks forever, holding `x_lock`; | |
This won’t work at all. It’s immoral to block while holding a lock. Doing so can deadlock the system, as it will here: the waking thread never wakes up the sleeping thread because the waking thread can’t acquire `x_lock`. But moving “unlock `x_lock`” before the block point just recreates the original synchronization problem.
| sleeping thread | waking thread |
|---|---|
| locks `x_lock`; | |
| computes `f(x) == false`; | |
| decides to block; | |
| unlocks `x_lock`; | |
| | locks `x_lock`; |
| | modifies `x` so `f(x)` becomes true; |
| | checks if any other thread is blocked (there are none); |
| blocks (forever); | |
| | unlocks `x_lock`; |
| | does something else; |
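Written as code, the race window is the gap between the unlock and the block. This is an illustration only; `x_lock` here is a placeholder lock, and `f` and `block_current_thread` are hypothetical helpers, not Chickadee functions.

```cpp
#include <mutex>

extern int x;                    // shared state
extern std::mutex x_lock;        // placeholder lock protecting x
bool f(int);                     // the condition, defined elsewhere
void block_current_thread();     // hypothetical "mark me blocked" primitive

void broken_sleeper() {
    x_lock.lock();
    while (!f(x)) {
        x_lock.unlock();
        // <-- the waking thread can run entirely inside this window:
        //     it locks x_lock, makes f(x) true, sees no blocked thread,
        //     and unlocks. This thread then blocks with no wakeup pending.
        block_current_thread();
        x_lock.lock();
    }
    x_lock.unlock();
}
```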
So we need a lock for synchronization; we must not hold the lock while blocking; and we must not release the lock before blocking! What is to be done?
Condition variables
In user-level programming, this problem is typically solved with synchronization objects called condition variables. The magic of condition variables is that they offer an atomic operation, `cond_wait`, that both releases a lock and blocks the calling thread. Here’s how the running example would work with `cond_wait`:
| sleeping thread | waking thread |
|---|---|
| locks `x_lock`; | |
| computes `f(x) == false`; | |
| decides to block; | |
| calls `cond_wait(&x_cond, &x_lock)`; | |
| | locks `x_lock`; |
| | modifies `x` so `f(x)` becomes true; |
| | calls `cond_signal(&x_cond)`, waking up the sleeping thread; |
Since `cond_wait` releases the lock and blocks in one atomic step, there is no way for the waking thread to sneak in at the wrong place. Either the waking thread runs entirely before the sleeping thread acquires the lock, or the waking thread runs entirely after `cond_wait`. And either run order has the right result. If the waking thread runs first, the sleeping thread won’t block; if the sleeping thread runs first, the waking thread will observe the sleeping thread and wake it up.
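For concreteness, here is roughly how the running example looks with POSIX condition variables in user-level code. The names `x`, `x_lock`, `x_cond`, and `f` are illustrative; `f` stands for whatever condition the sleeper waits on.

```cpp
#include <pthread.h>

int x;                                               // shared state
pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;  // protects x
pthread_cond_t x_cond = PTHREAD_COND_INITIALIZER;    // signals changes to x
bool f(int x);                                       // the condition, defined elsewhere

// sleeping thread: block until f(x) becomes true
void sleeper() {
    pthread_mutex_lock(&x_lock);
    while (!f(x)) {
        // atomically unlock x_lock and block; x_lock is relocked on return
        pthread_cond_wait(&x_cond, &x_lock);
    }
    /* ... f(x) is true here, with x_lock held ... */
    pthread_mutex_unlock(&x_lock);
}

// waking thread: change x and wake a sleeper
void waker() {
    pthread_mutex_lock(&x_lock);
    /* ... modify x so that f(x) becomes true ... */
    pthread_cond_signal(&x_cond);
    pthread_mutex_unlock(&x_lock);
}
```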
But we’re programming a kernel. If we want a condition variable, we will have to program it ourselves—and perhaps, in kernel context, a different abstraction is preferable.
How other kernels do it
OS/161, the previous CS 161 teaching kernel, uses this function:
void wchan_sleep(struct wchan *wc, struct spinlock *lk);
which atomically blocks the current thread (a later call to `wchan_wake…(wc)` will wake it) and releases the spinlock `lk`. Sound familiar? This is a condition-variable wait operation! In implementation, `wchan_sleep` calls a more basic function, `thread_switch`, which handles all voluntary kernel context switches. The need for atomic unlock-and-block infects that foundational function too:
/*
 * If NEWSTATE is S_SLEEP, the thread is queued on the wait channel
 * WC, protected by the spinlock LK. Otherwise WC and LK should be
 * NULL.
 */
static void thread_switch(threadstate_t newstate, struct wchan *wc, struct spinlock *lk);
Since the foundational thread-switching function uses `wchan`s, any blocking primitive you think up must, in OS/161, be expressed using `wchan`s.
The xv6 teaching operating system has a similar design, except that “wait channels” are left undefined (they are opaque pointers).
// Atomically release lock and sleep on chan.
// Reacquires lock when awakened.
void sleep(void *chan, struct spinlock *lk);
This resembles the FreeBSD operating system’s general sleep function, `_sleep`. Like xv6 `sleep`, FreeBSD `_sleep` takes a `void* ident` (an opaque identifier for a wait channel) and a `struct lock_object* lock`. But years of use and feature development have added more stuff:
/*
* General sleep call. Suspends the current thread until a wakeup is
* performed on the specified identifier. The thread will then be made
* runnable with the specified priority. Sleeps at most sbt units of time
* (0 means no timeout). If pri includes the PCATCH flag, let signals
* interrupt the sleep, otherwise ignore them while sleeping. Returns 0 if
* awakened, EWOULDBLOCK if the timeout expires. If PCATCH is set and a
* signal becomes pending, ERESTART is returned if the current system
* call should be restarted if possible, and EINTR is returned if the system
* call should be interrupted by the signal (return EINTR).
*
* The lock argument is unlocked before the caller is suspended, and
* re-locked before _sleep() returns. If priority includes the PDROP
* flag the lock is not re-locked before returning.
*/
int _sleep(void *ident, struct lock_object *lock, int priority,
    const char *wmesg, sbintime_t sbt, sbintime_t pr, int flags);
All these operating systems implement sleep-wakeup by pushing the atomic unlock-and-block step to the core of the operating system. But that’s not the only way. Rather than joining unlocking and blocking into a single atomic step, we can separate blocking into two steps, enqueuing and scheduling. With blocking unpacked in this way, sleep–wakeup can be implemented with less core scheduler involvement. Linux, and Chickadee, adopt this design, using synchronization objects called wait queues.
Wait queues
Chickadee wait queues are used like this:

wait_queue waitq; ...

function_that_might_sleep() { ...
    waiter w;
    while (true) {
        w.prepare(waitq);
        if (this process should wake up) {
            break;
        }
        w.block();
    }
    w.clear();
}
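For example, the polling `sys_msleep` loop from the introduction could be restructured around this pattern. The sketch below assumes a global `wait_queue timer_queue` that the timer interrupt handler wakes whenever `ticks` advances; both the `timer_queue` name and the `sleep_until` helper are illustrative, not part of Chickadee.

```cpp
// Hypothetical sketch: sleep until `wakeup_time` without polling.
// Assumes the timer interrupt handler wakes every waiter on `timer_queue`
// when it increments `ticks` (name and placement are assumptions).
wait_queue timer_queue;

void sleep_until(unsigned long wakeup_time) {
    waiter w;
    while (true) {
        w.prepare(timer_queue);
        if (long(wakeup_time - ticks) <= 0) {   // the desired time has arrived
            break;
        }
        w.block();
    }
    w.clear();
}
```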
Here’s what these functions do.
- `waiter::prepare(waitq)`—This function prepares the waiter to sleep on the named wait queue, and implements the “enqueue” part of blocking. Specifically, the function:
    - Marks the waiter as associated with the current process.
    - Locks the `waitq` data structure.
    - Sets `current()->pstate_` to `proc::ps_blocked`. This means the state is blocked even though the associated kernel task is running!
    - Adds the `waiter` to a linked list of waiters associated with the `waitq`.
    - Unlocks the `waitq` data structure.
- `waiter::clear()`—This function cleans up the waiter after the process has woken up, by undoing the effect of any preceding `waiter::prepare`. Specifically, the function:
    - Locks the `waitq` data structure.
    - Calls `proc::wake()` on the associated process, which changes the process’s `pstate_` from `proc::ps_blocked` to `proc::ps_runnable`.
    - Removes the `waiter` from the linked list, if it is currently linked.
    - Unlocks the `waitq` data structure.
- `waiter::block()`—This function implements the “block” part of blocking. Specifically:
    - If the process has `pstate_ == proc::ps_blocked`, call `proc::yield()`.
    - Regardless, the end of the function calls `waiter::clear()` to clean up the waiter.
It’s very important that interrupts be disabled throughout this whole procedure, or an ill-timed timer interrupt could cause the kernel task to block forever.
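As a concrete (but unofficial) illustration, here is a minimal sketch of how `waiter::prepare` and `waiter::block` might follow the steps above. The member names `p_` and `wq_`, the `push_back` list operation, and the exact signatures are assumptions, and the sketch glosses over exactly how interrupts stay disabled across the steps.

```cpp
// Sketch only: one way to implement the steps described above.
// p_, wq_, and push_back are assumed names, not necessarily Chickadee's.
inline void waiter::prepare(wait_queue& wq) {
    p_ = current();                    // associate waiter with current process
    wq_ = &wq;
    auto irqs = wq.lock_.lock();       // lock the waitq data structure
    p_->pstate_ = proc::ps_blocked;    // "blocked" even though still running!
    wq.q_.push_back(this);             // add to the waitq's list of waiters
    wq.lock_.unlock(irqs);             // unlock the waitq data structure
}

inline void waiter::block() {
    assert(p_ == current());
    if (p_->pstate_ == proc::ps_blocked) {   // no wakeup has intervened?
        p_->yield();                         // then actually give up the CPU
    }
    clear();                                 // always undo prepare's effects
}
```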
The `wait_queue::wake_all()` function wakes up all waiters blocked on the queue. It works like this:
wait_queue waitq; ...

auto irqs = waitq.lock_.lock();
while (auto w = waitq.q_.pop_front()) {
    w->wake();
}
waitq.lock_.unlock(irqs);
This code uses the `waiter::wake()` function, which you must implement. Our version checks if the waiter’s `proc` is blocked, and if it is, sets its state to `ps_runnable` and enqueues it on the relevant `cpustate`’s run queue. It calls a function like this on `proc`:
inline void proc::wake(...) {
    int s = ps_blocked;
    if (pstate_.compare_exchange_strong(s, ps_runnable)) {
        cpus[cpu_].enqueue(this);
    }
}
`proc::wake()` is a fine design, but you may want to experiment with a different one.
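A minimal `waiter::wake()` can then simply delegate to it. The sketch below assumes the waiter stores its process in a member named `p_` (an assumed name, set in `waiter::prepare`):

```cpp
// Sketch: wake the waiter's process. Assumes a `proc* p_` member;
// Chickadee's actual member names may differ.
inline void waiter::wake() {
    // Safe even if the process is already awake: proc::wake's
    // compare-exchange only acts on blocked processes.
    p_->wake();
}
```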
Synchronization
The separation of enqueuing and blocking is key to synchronization here. Let’s enumerate how sleeping and waking up can interleave in this design; we’ll see that every interleaving works OK.
- The waking thread can run before `waiter::prepare`. No problem: the sleeping thread will not block (it will call `break` to exit the while loop).
- The waking thread can run after `waiter::prepare`, but before `waiter::block`. No problem: the waking thread resets the sleeping thread’s `proc::pstate_` to `ps_runnable`, so the sleeping thread will not in fact block (its `yield` call will keep it on the run queue).
- The waking thread can run after `waiter::block`. No problem: the waking thread will wake up the sleeping thread as intended.
The wait queue design implements its own synchronization (the `wait_queue` has an embedded spinlock), and works correctly even if the sleeping and waking threads don’t otherwise synchronize. But of course sleeping and waking threads will often have other synchronization requirements and manipulate other spinlocks, such as `ptable_lock`.
Convenience methods
Our `waiter` class offers some convenience methods for working with wait queues. For instance, this code waits until `x == 3`; it assumes `x` is protected by `x_lock`, and changes to `x` are signaled on a wait queue `wq`.
void wait_for_x() { ...
    spinlock_guard guard(x_lock); // protects access to `x`
    waiter w;
    while (true) {
        w.prepare(wq);
        if (x == 3) {
            break;
        }
        guard.unlock(); // release `x_lock` while blocking
        w.block();
        guard.lock();
    }
    w.clear();
    /* ... do something to `x` ... */
}
This is more simply expressed using `waiter::block_until()` and a C++ lambda:
void wait_for_x() { ...
    spinlock_guard guard(x_lock);
    waiter().block_until(wq, [&] () {
        // this is called with `x_lock` held
        return x == 3;
    }, guard);
    /* ... do something to `x` ... */
}
The `block_until` implementation takes care of all the `prepare`, `lock`, `unlock`, `block`, and `clear` calls.
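Under the hood, the guard-taking `block_until` can be built directly from the manual pattern above. Here is a rough sketch; the exact signature is an assumption, and Chickadee’s real version may differ.

```cpp
// Sketch: block_until as a thin wrapper around prepare/block/clear,
// releasing the caller's guard while blocked. Signature is assumed.
template <typename F>
void waiter::block_until(wait_queue& wq, F predicate, spinlock_guard& guard) {
    while (true) {
        prepare(wq);
        if (predicate()) {
            break;
        }
        guard.unlock();    // drop the caller's lock while blocking
        block();
        guard.lock();      // reacquire before re-testing the predicate
    }
    clear();
}
```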
Tradeoffs
Wait queues are not as integrated with core OS scheduling primitives as blocking mechanisms in other kernels. That means we can imagine using other mechanisms to block, such as a fancy prioritized wait tree. I value the insight that correct synchronization can be obtained by splitting functionality into pieces, as well as by joining functionality into an atomic step.
But the other kernels’ mechanisms are simpler to specify. (They are just as hard to implement.)
Linux’s implementation of wait queues doesn’t lock as often as Chickadee’s does. Linux developers have reasoned carefully about where locks are truly necessary and where other synchronization primitives suffice.
True blocking is important, but we shouldn’t exaggerate its importance. Many sensible sleep–wakeup mechanisms involve a mix of true blocking and kernel polling (which is much better than user-level polling). This is because it can be expensive and complex to create a wait queue or channel for every specific condition a thread might block upon. Blocking can have high cost, in complexity as well as in queue management and context switches.