Motivation
Scheduling is a basic kernel responsibility. Kernels strive to divide all computer resources, including CPU, efficiently and fairly among jobs according to OS policy. This takes a ton of effort, all of which eventually boils down to efficiently and correctly putting tasks to sleep and waking them up again.
Think of it this way. Processes often wait for external events. Your editor waits for key presses; your browser waits for interactions (and/or animations); web servers wait for network connections. Efficiency demands that processes engaged only in waiting should use as little CPU as possible—ideally zero.
A blocking system call informs the kernel that a process is waiting, since the system call returns to its caller only when a specific external event (or interruption) occurs. But in Chickadee so far, waiting tasks take far more than 0% CPU. Here’s a natural implementation of a sleep system call:
while (long(wakeup_time - ticks) > 0) {
this->yield();
}
This implementation polls by repeatedly checking whether the desired time has arrived. The kernel task corresponding to the process remains runnable the whole time! With an implementation like this, the task corresponding to a `p-testmsleep` process runs more than a hundred thousand times. That wasted time would be better spent on actual work.
Sleep-wakeup
We seek a coordination mechanism that allows a kernel task to block—to become non-runnable—until a condition becomes true. Generalizing to abstract “threads” rather than kernel tasks:
- Let \(x\) be some subset of program state. \(x\) can be as much as the whole program or as little as a single integer.
- A sleeping thread has blocked until a condition \(f(x)\) becomes true. It is an error for the sleeping thread to remain blocked after \(f(x)\) becomes true.
- A waking thread is a thread that modifies \(x\) in a way that might cause a sleeping thread to unblock and become runnable.
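To make this concrete, here is one possible \(x\) and \(f(x)\), mirroring the sleep system call from the Motivation section. The declaration of `ticks` is an assumption for illustration (Chickadee's real declaration may differ):

extern unsigned long ticks;                        // x: the kernel's tick counter

bool wakeup_time_passed(unsigned long wakeup_time) {
    return long(wakeup_time - ticks) <= 0;         // f(x): the wakeup time has arrived
}

Here the sleeping thread is the task executing the sleep system call, and the waking thread is whatever code advances `ticks` (typically the timer interrupt handler).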
The inherent difficulty of sleep–wakeup is that sleeping and waking threads must synchronize. If they do not, a sleeping thread might remain blocked forever! We’re worried about this interleaving of events, which is a classic lost wakeup or sleep–wakeup race:
| sleeping thread | waking thread |
|---|---|
| computes `f(x) == false`; | |
| decides to block, but doesn’t block yet; | |
| | modifies `x` so `f(x)` becomes true; |
| | checks if any other thread is blocked (there are none); |
| | does something else; |
| blocks forever; | |
How to fix this? A lock, such as a spinlock, can prevent bad interleavings, but blocking makes this particular interleaving tricky to prevent! Normally we’d associate `x` (the shared state) with a lock, say `x_lock`, and place `lock` calls on entry to the critical section and `unlock` calls after it:
| sleeping thread | waking thread |
|---|---|
| locks `x_lock`; | |
| computes `f(x) == false`; | |
| decides to block; | |
| | tries to lock `x_lock` (blocks…) |
| blocks forever, holding `x_lock`; | |
This won’t work at all. It’s immoral to block while holding a lock. Doing so can deadlock the system, as it will here: the waking thread never wakes up the sleeping thread because the waking thread can’t acquire `x_lock`.

But moving “unlock `x_lock`” before the block point just recreates the original synchronization problem.
| sleeping thread | waking thread |
|---|---|
| locks `x_lock`; | |
| computes `f(x) == false`; | |
| decides to block; | |
| unlocks `x_lock`; | |
| | locks `x_lock`; |
| | modifies `x` so `f(x)` becomes true; |
| | checks if any other thread is blocked (there are none); |
| blocks (forever); | |
| | unlocks `x_lock`; |
| | does something else; |
So we need a lock for synchronization; we must not hold the lock while blocking; and we must not release the lock before blocking! What is to be done?
Condition variables
In user-level programming, this problem is typically solved with synchronization objects called condition variables. The magic of condition variables is that they offer an atomic operation, `cond_wait`, that both releases a lock and blocks the calling thread. Here’s how the running example would work with `cond_wait`:
| sleeping thread | waking thread |
|---|---|
| locks `x_lock`; | |
| computes `f(x) == false`; | |
| decides to block; | |
| calls `cond_wait(&x_cond, &x_lock)`; | |
| | locks `x_lock`; |
| | modifies `x` so `f(x)` becomes true; |
| | calls `cond_signal(&x_cond)`, waking up the sleeping thread; |
Since `cond_wait` releases the lock and blocks in one atomic step, there is no way for the waking thread to sneak in at the wrong place. Either the waking thread runs entirely before the sleeping thread acquires the lock, or the waking thread runs entirely after `cond_wait`. And either run order has the right result. If the waking thread runs first, the sleeping thread won’t block; if the sleeping thread runs first, the waking thread will observe the sleeping thread and wake it up.
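For reference, here is what the pattern looks like in user-level C with POSIX threads. The names mirror the tables above; the predicate `f` and the shared state are made up for illustration:

#include <pthread.h>

int x;                                               // shared state
pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t x_cond = PTHREAD_COND_INITIALIZER;

int f(int v) {                                       // the condition of interest
    return v == 3;
}

void* sleeping_thread(void* arg) {
    pthread_mutex_lock(&x_lock);
    while (!f(x)) {
        // Atomically releases x_lock and blocks; reacquires x_lock on wakeup.
        pthread_cond_wait(&x_cond, &x_lock);
    }
    // Here f(x) is true and x_lock is held.
    pthread_mutex_unlock(&x_lock);
    return NULL;
}

void* waking_thread(void* arg) {
    pthread_mutex_lock(&x_lock);
    x = 3;                                           // modify x so f(x) becomes true
    pthread_cond_signal(&x_cond);
    pthread_mutex_unlock(&x_lock);
    return NULL;
}

(The `while` loop around `pthread_cond_wait` matters: condition variables permit spurious wakeups, so the predicate must be re-checked after each wait.)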
But we’re programming a kernel. If we want a condition variable, we will have to program it ourselves—and in kernel context a different abstraction may be preferable.
How other kernels do it
OS/161, an older CS 161 teaching kernel, uses this function:
void wchan_sleep(struct wchan *wc, struct spinlock *lk);
which atomically blocks the current thread (a later call to `wchan_wake…(wc)` will wake it) and releases the spinlock `lk`. This is a condition-variable wait operation! `wchan_sleep`’s implementation calls a more basic function, `thread_switch`, which handles all voluntary kernel context switches. The need for atomic unlock-and-block infects that foundational function too:
/*
* If NEWSTATE is S_SLEEP, the thread is queued on the wait channel
 * WC, protected by the spinlock LK. Otherwise WC and LK should be
* NULL.
*/
static void thread_switch(threadstate_t newstate, struct wchan *wc, struct spinlock *lk);
Since the foundational thread-switching function uses `wchan`s, any blocking primitive you think up must, in OS/161, be expressed using `wchan`s.
The xv6 teaching operating system has a similar design, except that “wait channels” are left undefined (they are opaque pointers).
// Atomically release lock and sleep on chan.
// Reacquires lock when awakened.
void sleep(void *chan, struct spinlock *lk);
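A hypothetical call site (not drawn from the xv6 sources) shows the usage pattern: wait under spinlock `lk` until the flag `done` becomes nonzero, with the waking side calling `wakeup()` on the same channel.

// sleeping side
acquire(&lk);
while (!done) {
    sleep(&done, &lk);    // atomically releases lk and blocks on channel &done;
}                         // lk has been reacquired when sleep() returns
release(&lk);

// waking side
acquire(&lk);
done = 1;                 // make the condition true...
wakeup(&done);            // ...then wake any thread sleeping on channel &done
release(&lk);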
This resembles the FreeBSD operating system’s general sleep function, `_sleep`. Like xv6 `sleep`, FreeBSD `_sleep` takes a `void* ident` (an opaque identifier for a wait channel) and a `struct lock_object* lock`. But years of use and feature development have added more stuff:
/*
* General sleep call. Suspends the current thread until a wakeup is
* performed on the specified identifier. The thread will then be made
* runnable with the specified priority. Sleeps at most sbt units of time
* (0 means no timeout). If pri includes the PCATCH flag, let signals
* interrupt the sleep, otherwise ignore them while sleeping. Returns 0 if
* awakened, EWOULDBLOCK if the timeout expires. If PCATCH is set and a
* signal becomes pending, ERESTART is returned if the current system
* call should be restarted if possible, and EINTR is returned if the system
* call should be interrupted by the signal (return EINTR).
*
* The lock argument is unlocked before the caller is suspended, and
* re-locked before _sleep() returns. If priority includes the PDROP
* flag the lock is not re-locked before returning.
*/
int _sleep(void *ident, struct lock_object *lock, int priority,
const char *wmesg, sbintime_t sbt, sbintime_t pr, int flags);
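A typical call site uses one of the convenience wrappers around `_sleep`; for example, `msleep()` sleeps on a channel while temporarily dropping a mutex. The names below (`sc_mtx`, `sc_ready`) are made up for illustration:

struct mtx sc_mtx;
int sc_ready;

mtx_lock(&sc_mtx);
while (!sc_ready) {
    // msleep() drops sc_mtx while asleep and reacquires it before returning;
    // the final 0 means no timeout.
    msleep(&sc_ready, &sc_mtx, PRIBIO, "scready", 0);
}
mtx_unlock(&sc_mtx);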
All these operating systems implement sleep-wakeup by pushing the atomic unlock-and-block step to the core of the operating system. But that’s not the only way. Rather than joining unlocking and blocking into a single atomic step, we can separate blocking into two steps, enqueuing and scheduling. With blocking unpacked in this way, sleep–wakeup can be implemented with less core scheduler involvement. Linux, and Chickadee, adopt this design, using synchronization objects called wait queues.
Wait queues
Chickadee wait queues are used as follows. To block the current task until the expression `should_wake_up()` is true, write the following:
wait_queue wq; ...
function_that_might_sleep() { ...
waiter w;
spinlock_guard guard(/* some appropriate lock */);
w.block_until(wq, [&] () {
return should_wake_up();
}, guard);
...
}
This code assumes that `guard` protects the state necessary for `should_wake_up()` to run safely.
To wake up any waiting tasks (for instance, because actions by a different task have caused `should_wake_up()` to become true), write the following:
function_that_wakes_up_wq() {
wq.wake_all();
}
The `waiter::block_until` function is shorthand for something a little more explicit, and the more explicit version shows exactly how the synchronization works:
wait_queue wq; ...
function_that_might_sleep() { ...
waiter w;
spinlock_guard guard(/* some appropriate lock */);
// begin waiter::block_until()
while (true) {
w.prepare(wq);
if (should_wake_up()) {
break;
}
guard.unlock();
w.maybe_block();
guard.lock();
}
w.clear();
// end waiter::block_until()
...
}
Note that `should_wake_up()` runs while the `guard` is locked, and if `should_wake_up()` returns true, the `guard` remains locked for the rest of the function. However, if `should_wake_up()` returns false, the code unlocks the `guard` before blocking, then acquires it again upon resuming.
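For reference, `block_until` itself can be written as a thin wrapper around exactly this loop. The sketch below assumes a template parameter for the predicate; the exact Chickadee signature may differ:

template <typename F>
inline void waiter::block_until(wait_queue& wq, F predicate, spinlock_guard& guard) {
    while (true) {
        prepare(wq);
        if (predicate()) {
            break;                // condition already true: stay runnable
        }
        guard.unlock();           // never hold the guard while blocked
        maybe_block();
        guard.lock();             // reacquire before re-testing the predicate
    }
    clear();
}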
Functions like `w.prepare(wq)`, `w.maybe_block()`, `w.clear()`, and `wq.wake_all()` might run concurrently on different CPUs and access shared state (namely, the wait queue), so these functions all have internal synchronization too: they lock the `wait_queue`’s associated spinlock before examining or modifying wait queue state.
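To fix ideas, here is a rough sketch of the declarations these functions operate on. The member names match the usage shown in this section, but the real Chickadee declarations may differ in detail, so treat this as an assumption rather than a reference:

struct wait_queue;

struct waiter {
    proc* p_ = nullptr;            // the task that may block
    wait_queue* wq_ = nullptr;     // the queue this waiter is (to be) enqueued on
    list_links links_;             // linkage for the wait queue's list

    void prepare(wait_queue& wq);
    void maybe_block();
    void clear();
    void wake();
    template <typename F>
    void block_until(wait_queue& wq, F predicate, spinlock_guard& guard);
};

struct wait_queue {
    list<waiter, &waiter::links_> q_;   // waiters currently enqueued
    spinlock lock_;                     // protects q_
    void wake_all();
};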
You are expected to implement three functions: `waiter::prepare()`, `waiter::maybe_block()`, and `waiter::clear()`. A sketch of one possible implementation follows their descriptions below.
- `waiter::prepare(wq)`—This function prepares the waiter to sleep on a wait queue, and implements the “enqueue” part of blocking. Specifically, the function:
  - Marks the waiter as associated with the current task and with `wq`.
  - Locks `wq`.
  - Sets `current()->pstate_` to `proc::ps_blocked`. This means the state is blocked even though the associated kernel task is running. (That’s OK according to the synchronization invariants.)
  - Adds the `waiter` to a linked list of waiters associated with the `wq`.
  - Unlocks `wq`.
- `waiter::maybe_block()`—This function implements the “block” part of blocking. Specifically:
  - If the associated task has `p_->pstate_ == proc::ps_blocked`, it calls `proc::yield()`.
  - Whether or not `proc::yield()` was called, it then locks the associated `wq_`; checks whether the `waiter` is on the associated linked list; removes it from that list if it is linked; and unlocks `wq_`.
- `waiter::clear()`—This function cleans up the waiter. It might be called after a wakeup, or after the blocking loop exits without blocking because the condition became true. It undoes the effect of any preceding `waiter::prepare`. Specifically, the function:
  - Locks the associated `wait_queue` data structure `wq_`.
  - If the `waiter` was on the associated linked list, removes it.
  - If the associated task has `p_->pstate_ == proc::ps_blocked`, sets its `pstate_` to `proc::ps_runnable`.
  - Unlocks `wq_`.
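The sketch below shows one way the three functions could fit together. It follows the descriptions above, but the member names (`p_`, `wq_`, `links_`) and the list operations are assumptions, so your implementation may differ:

inline void waiter::prepare(wait_queue& wq) {
    p_ = current();                       // associate with the current task
    wq_ = &wq;
    auto irqs = wq.lock_.lock();
    p_->pstate_ = proc::ps_blocked;       // blocked, though the task is still running
    wq.q_.push_back(this);                // enqueue on the wait queue
    wq.lock_.unlock(irqs);
}

inline void waiter::maybe_block() {
    if (p_->pstate_ == proc::ps_blocked) {
        p_->yield();                      // truly block: the scheduler skips
    }                                     //   ps_blocked tasks until they are woken
    auto irqs = wq_->lock_.lock();
    if (links_.is_linked()) {
        wq_->q_.erase(this);              // dequeue if wake_all() hasn't already
    }
    wq_->lock_.unlock(irqs);
}

inline void waiter::clear() {
    auto irqs = wq_->lock_.lock();
    if (links_.is_linked()) {
        wq_->q_.erase(this);
    }
    if (p_->pstate_ == proc::ps_blocked) {
        p_->pstate_ = proc::ps_runnable;  // undo prepare()'s state change
    }
    wq_->lock_.unlock(irqs);
}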
It’s important that interrupts be disabled while a potentially-runnable process has state `proc::ps_blocked` (e.g., in between `waiter::prepare()`, `should_wake_up()`, and `waiter::clear()`). If interrupts were enabled, then an ill-timed timer interrupt could cause the kernel task to block forever. Normally interrupts are disabled by a `spinlock_guard` associated with the predicate state; regardless, this is not the wait queue’s responsibility.
The `wait_queue::wake_all()` function wakes up all waiters blocked on the queue. It works like this:
wait_queue wq; ...
auto irqs = wq.lock_.lock();
while (auto w = wq.q_.pop_front()) {
w->wake();
}
wq.lock_.unlock(irqs);
This code uses the `waiter::wake()` function, which, in our code, just calls `proc::unblock()`:
inline void proc::unblock() {
int s = ps_blocked;
if (pstate_.compare_exchange_strong(s, ps_runnable)) {
cpus[runq_cpu_].enqueue(this);
}
}
You can change `waiter::wake()` if you want.
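With the default design, `waiter::wake()` is essentially a one-liner (again assuming the waiter stores its task in `p_`):

inline void waiter::wake() {
    p_->unblock();    // mark the task runnable and enqueue it on its home CPU
}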
Synchronization
The separation of enqueuing and blocking is key to synchronization here. Let’s enumerate how sleeping and waking up can interleave in this design; we’ll see that every interleaving works OK.
- The waking thread might run before `waiter::prepare`. No problem: the sleeping thread will not block (it will call `break` to exit the while loop).
- The waking thread might run after `waiter::prepare`, but before `waiter::maybe_block`. No problem: the waking thread resets the sleeping thread’s `proc::pstate_` to `ps_runnable`, so the sleeping thread will not in fact block (either it will not call `yield()`, or its `yield()` call will keep it on the run queue).
- The waking thread might run after `waiter::maybe_block`. No problem: the waking thread will wake up the sleeping thread as intended.
Convenience methods
Our `waiter` class offers some convenience methods for working with wait queues. For instance, this code waits until `x == 3`; it assumes `x` is protected by `x_lock`, and changes to `x` are signaled on a wait queue `wq`.
void wait_for_x() { ...
spinlock_guard guard(x_lock); // protects access to `x`
waiter w;
while (true) {
w.prepare(wq);
if (x == 3) {
break;
}
guard.unlock(); // release `x_lock` while blocking
w.maybe_block();
guard.lock();
}
w.clear();
/* ... do something to `x` ... */
}
This is more simply expressed using `waiter::block_until()` and a C++ lambda:
void wait_for_x() { ...
    spinlock_guard guard(x_lock);
    waiter().block_until(wq, [&] () {
        // this is called with `x_lock` held
        return x == 3;
    }, guard);
    /* ... do something to `x` ... */
}
The `block_until` implementation takes care of all the `prepare`, `lock`, `unlock`, `maybe_block`, and `clear` calls.
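Putting the pieces together, the polling sleep loop from the Motivation section could block instead. The sketch below uses made-up names (`timer_queue`, `timer_lock`, `syscall_msleep`) and assumes the timer interrupt handler calls `timer_queue.wake_all()` after incrementing `ticks`:

wait_queue timer_queue;    // woken from the timer interrupt handler
spinlock timer_lock;       // protects the state read by the predicate

int proc::syscall_msleep(unsigned long wakeup_time) {
    waiter w;
    spinlock_guard guard(timer_lock);
    w.block_until(timer_queue, [&] () {
        return long(wakeup_time - ticks) <= 0;
    }, guard);
    return 0;
}

Waking every sleeper on every tick still re-runs each predicate once per tick, which is a mild form of the kernel polling mentioned in the Tradeoffs section below.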
Tradeoffs
Wait queues are not as integrated with core OS scheduling primitives as blocking mechanisms in other kernels. That means we can imagine using other mechanisms to block, such as a fancy prioritized wait tree. It’s an interesting insight that correct synchronization can be obtained by splitting functionality into pieces, as well as by joining functionality into an atomic step. However, the other kernels’ mechanisms are simpler to specify. (They are just as hard to implement.)
Linux’s implementation of wait queues doesn’t lock as often as Chickadee’s does. Linux developers have reasoned carefully about where locks are truly necessary and where other synchronization primitives suffice.
True blocking is important, but we shouldn’t exaggerate its importance. Many sensible sleep–wakeup mechanisms involve a mix of true blocking and kernel polling (which is much better than user-level polling). This is because it can be expensive and complex to create a wait queue or channel for every specific condition a thread might block upon. Blocking can have high cost, in complexity as well as in queue management and context switches.