Wait queues

Motivation

Scheduling is a basic kernel responsibility. Kernels strive to divide all computer resources, including CPU, efficiently and fairly among jobs according to OS policy. This takes a ton of effort, all of which eventually boils down to efficiently and correctly putting tasks to sleep and waking them up again.

Think of it this way. Processes often wait for external events. Your editor waits for key presses; your browser waits for interactions (and/or animations); web servers wait for network connections. Efficiency demands that processes engaged only in waiting should use as little CPU as possible—ideally zero.

A blocking system call informs the kernel that a process is waiting, since the system call returns to its caller only when a specific external event (or interruption) occurs. But in Chickadee so far, waiting processes take far more than 0% CPU. Here’s a natural implementation of a sleep system call:

while (long(wakeup_time - ticks) > 0) {
    this->yield();
}

This implementation polls by repeatedly checking whether the desired time has arrived. The kernel task corresponding to the process remains runnable the whole time! With an implementation like this, the task corresponding to a p-testmsleep process runs more than a hundred thousand times. That wasted time would be better spent on actual work.

Sleep-wakeup

We seek a coordination mechanism that allows a kernel task to block—to become non-runnable—until a condition becomes true. Generalizing to abstract “threads” rather than kernel tasks:

Let \(x\) be some subset of program state. \(x\) can be as much as the whole program or as little as a single integer.
A sleeping thread has blocked until a condition \(f(x)\) becomes true. It is an error for the sleeping thread to remain blocked after \(f(x)\) becomes true.
A waking thread is a thread that modifies \(x\) in a way that might cause a sleeping thread to unblock and become runnable.

The inherent difficulty of sleep–wakeup is that sleeping and waking threads must synchronize. If they do not, a sleeping thread might remain blocked forever! We’re worried about this interleaving of events, which is a classic lost wakeup or sleep–wakeup race:

sleeping thread	waking thread
computes f(x) == false;
decides to block;
	modifies x so f(x) becomes true;
	checks if any other thread is blocked (there are none);
	does something else;
blocks forever;

How to fix this? A lock, such as a spinlock, can prevent bad interleavings, but blocking makes this bad interleaving tricky! Normally we’d associate x (the shared state) with a lock, say x_lock, and place lock calls on entry to the critical section and unlock calls after it:

sleeping thread	waking thread
locks `x_lock`;
computes f(x) == false;
decides to block;
	tries to lock `x_lock` (blocks…)
blocks forever, holding `x_lock`;

This won’t work at all. It’s immoral to block while holding a lock. Doing so can deadlock the system, as it will here: the waking thread never wakes up the sleeping thread because the waking thread can’t acquire x_lock.

But moving “unlock x_lock” before the block point just recreates the original synchronization problem.

sleeping thread	waking thread
locks `x_lock`;
computes f(x) == false;
decides to block;
unlocks `x_lock`;
	locks `x_lock`;
	modifies x so f(x) becomes true;
	checks if any other thread is blocked (there are none);
blocks (forever);
	unlocks `x_lock`;
	does something else;

So we need a lock for synchronization; we must not hold the lock while blocking; and we must not release the lock before blocking! What is to be done?

Condition variables

In user-level programming, this problem is typically solved with synchronization objects called condition variables. The magic of condition variables is that they offer an atomic operation, cond_wait, that both releases a lock and blocks the calling thread. Here’s how the running example would work with cond_wait:

sleeping thread	waking thread
locks `x_lock`;
computes f(x) == false;
decides to block;
calls `cond_wait(&x_cond, &x_lock)`;
	locks `x_lock`;
	modifies x so f(x) becomes true;
	calls `cond_signal(&x_cond)`, waking up the sleeping thread;

Since cond_wait releases the lock and blocks in one atomic step, there is no way for the waking thread to sneak in at the wrong place. Either the waking thread runs entirely before the sleeping thread acquires the lock, or the the waking thread runs entirely after cond_wait. And either run order has the right result. If the waking thread runs first, the sleeping thread won’t block; if the sleeping thread runs first, the waking thread will observe the sleeping thread and wake it up.

But we’re programming a kernel. If we want a condition variable, we will have to program it ourselves—and perhaps in kernel context a different abstraction will be preferred.

How other kernels do it

OS/161, the previous CS 161 teaching kernel, uses this function:

void wchan_sleep(struct wchan *wc, struct spinlock *lk);

which atomically blocks the current thread (a later call to wchan_wake…(wc) will wake it) and releases the spinlock lk. Sound familar? This is a condition-variable wait operation! In implementation, wchan_sleep calls a more basic function, thread_switch, which handles all voluntary kernel context switches. The need for atomic unlock-and-block infects that foundational function too:

/*
 * If NEWSTATE is S_SLEEP, the thread is queued on the wait channel
 * WC, protected by the spinlock LK. Otherwise WC and Lk should be
 * NULL.
 */
static void thread_switch(threadstate_t newstate, struct wchan *wc, struct spinlock *lk);

Since the foundational thread switching function uses wchans, any blocking primitive you think up must in OS/161 be expressed using wchans.

The xv6 teaching operating system has a similar design, except that “wait channels” are left undefined (they are opaque pointers).

// Atomically release lock and sleep on chan.
// Reacquires lock when awakened.
void sleep(void *chan, struct spinlock *lk);

This resembles the FreeBSD operating system’s general sleep function, _sleep. Like xv6 sleep, FreeBSD _sleep takes a void* ident (an opaque identifier for a wait channel) and a struct lock_object* lock. But years of use and feature development have added more stuff:

/*
 * General sleep call.  Suspends the current thread until a wakeup is
 * performed on the specified identifier.  The thread will then be made
 * runnable with the specified priority.  Sleeps at most sbt units of time
 * (0 means no timeout).  If pri includes the PCATCH flag, let signals
 * interrupt the sleep, otherwise ignore them while sleeping.  Returns 0 if
 * awakened, EWOULDBLOCK if the timeout expires.  If PCATCH is set and a
 * signal becomes pending, ERESTART is returned if the current system
 * call should be restarted if possible, and EINTR is returned if the system
 * call should be interrupted by the signal (return EINTR).
 *
 * The lock argument is unlocked before the caller is suspended, and
 * re-locked before _sleep() returns.  If priority includes the PDROP
 * flag the lock is not re-locked before returning.
 */
int _sleep(void *ident, struct lock_object *lock, int priority,
    const char *wmesg, sbintime_t sbt, sbintime_t pr, int flags);

All these operating systems implement sleep-wakeup by pushing the atomic unlock-and-block step to the core of the operating system. But that’s not the only way. Rather than joining unlocking and blocking into a single atomic step, we can separate blocking into two steps, enqueuing and scheduling. With blocking unpacked in this way, sleep–wakeup can be implemented with less core scheduler involvement. Linux, and Chickadee, adopt this design, using synchronization objects called wait queues.

Chickadee wait queues are used like this.

wait_queue waitq; ...
function_that_might_sleep() { ...
    waiter w;
    while (true) {
        w.prepare(waitq);
        if (this process should wake up) {
            break;
        }
        w.block();
    }
    w.clear();
}

Here’s what these functions do.

waiter::prepare(waitq)—This function prepares the waiter to sleep on the named wait queue, and implements the “enqueue” part of blocking. Specifically, the function:
- Marks the waiter as associated with the current process.
- Locks the waitq data structure.
- Sets current()->pstate_ to proc::ps_blocked. This means the state is blocked even though the associated kernel task is running!
- Adds the waiter to a linked list of waiters associated with the waitq.
- Unlocks the waitq data structure.
waiter::clear()—This function cleans up the waiter after the process has woken up, by undoing the effect of any preceding waiter::prepare. Specifically, the function:
- Locks the waitq data structure.
- Calls proc::wake() on the associated process, which changes the process's pstate_ from proc::ps_blocked to proc::ps_runnable.
- Removes the waiter from the linked list, if it is currently linked.
- Unlocks the waitq data structure.
waiter::block()—This function implements the “block” part of blocking. Specifically:
- If the process has pstate_ == blocked, call proc::yield().
- Regardless, the end of the function calls waiter::clear() to clean up the waiter.

It’s very important that interrupts be disabled throughout this whole procedure, or an ill-timed timer interrupt could cause the kernel task to block forever.

The wait_queue::wake_all() function wakes up all waiters blocked on a the queue. It works like this:

wait_queue waitq; ...
auto irqs = waitq.lock_.lock();
while (auto w = waitq.q_.pop_front()) {
    w->wake();
}
waitq.lock_.unlock(irqs);

This code uses the waiter::wake() function, which you must implement. Our version checks if the waiter’s proc is blocked, and if it is, sets its state to ps_runnable and enqueues it on the relevant cpustate’s run queue. It calls a function like this on proc:

inline void proc::wake(...) {
    int s = ps_blocked;
    if (pstate_.compare_exchange_strong(s, ps_runnable)) {
        cpus[cpu_].enqueue(this);
    }
}

proc::wake() is a fine design, but you may want to experiment with a different one.

Synchronization

The separation of enqueuing and blocking is key to synchronization here. Let’s enumerate how sleeping and waking up can interleave in this design; we’ll see that every interleaving works OK.

The waking thread can run before waiter::prepare. No problem: the sleeping thread will not block (it will call break to exit the while loop).
The waking thread can run after waiter::prepare, but before waiter::block. No problem: the waking thread resets the sleeping thread’s proc::pstate_ to ps_runnable, so the sleeping thread will not in fact block (its yield call will keep it on the run queue).
The waking thread can run after waiter::block. No problem: the waking thread will wake up the sleeping thread as intended.

The wait queue design implements its own synchronization (the wait_queue has an embedded spinlock), and works correctly even if the sleeping and waking threads don’t otherwise synchronize. But of course sleeping and waking threads will often have other synchronization requirements and manipulate other spinlocks, such as ptable_lock.

Convenience methods

Our waiter class offers some convenience methods for working with wait queues. For instance, this code waits until x == 3; it assumes x is protected by x_lock, and changes to x are signaled on a wait queue wq.

void wait_for_x() { ...
    spinlock_guard guard(x_lock);     // protects access to `x`
    waiter w;
    while (true) {
        w.prepare(wq);
        if (x == 3) {
            break;
        }
        guard.unlock();               // release `x_lock` while blocking
        w.block();
        guard.lock();
    }
    w.clear();
    /* ... do something to `x` ... */
}

This is more simply expressed using waiter::block_until() and a C++ lambda:

void wait_for_x(proc* p) { ...
    spinlock_guard guard(x_lock);
    auto irqs = waiter().block_until(wq, [&] () {
        // this is called with `x_lock` held
        return x == 3;
    }, guard);
    /* ... do something to `x` ... */
}

The block_until implementation takes care of all the prepare, lock, unlock, block, and clear calls.

Tradeoffs

Wait queues are not as integrated with core OS scheduling primitives as blocking mechanisms in other kernels. That means we can imagine using other mechanisms to block, such as a fancy prioritized wait tree. I value the insight that correct synchronization can be obtained by splitting functionality into pieces, as well as by joining functionality into an atomic step.

But the other kernels’ mechanisms are simpler to specify. (They are just as hard to implement.)

Linux’s implementation of wait queues doesn’t lock as often as Chickadee’s does. Linux developers have reasoned carefully about where locks are truly necessary and where other synchronization primitives suffice.

True blocking is important, but we shouldn’t exaggerate its importance. Many sensible sleep–wakeup mechanisms involve a mix of true blocking and kernel polling (which is much better than user-level polling). This is because it can be expensive and complex to create a wait queue or channel for every specific condition a thread might block upon. Blocking can have high cost, in complexity as well as in queue management and context switches.