Wait queues

Motivation

Scheduling is a basic kernel responsibility. Kernels strive to divide all computer resources, including CPU, efficiently and fairly among jobs according to OS policy. This takes a ton of effort, all of which eventually boils down to efficiently and correctly putting tasks to sleep and waking them up again.

Think of it this way. Processes often wait for external events. Your editor waits for key presses; your browser waits for interactions (and/or animations); web servers wait for network connections. Efficiency demands that processes engaged only in waiting should use as little CPU as possible—ideally zero.

A blocking system call informs the kernel that a process is waiting, since the system call returns to its caller only when a specific external event (or interruption) occurs. But in Chickadee so far, waiting tasks take far more than 0% CPU. Here’s a natural implementation of a sleep system call:

while (long(wakeup_time - ticks) > 0) {
    this->yield();
}

This implementation polls by repeatedly checking whether the desired time has arrived. The kernel task corresponding to the process remains runnable the whole time! With an implementation like this, the task corresponding to a p-testmsleep process runs more than a hundred thousand times. That wasted time would be better spent on actual work.

Sleep-wakeup

We seek a coordination mechanism that allows a kernel task to block—to become non-runnable—until a condition becomes true. Generalizing to abstract “threads” rather than kernel tasks:

The inherent difficulty of sleep–wakeup is that sleeping and waking threads must synchronize. If they do not, a sleeping thread might remain blocked forever! We’re worried about this interleaving of events, which is a classic lost wakeup or sleep–wakeup race:

sleeping thread waking thread
computes f(x) == false;
decides to block,
but doesn’t block yet;
modifies x so f(x) becomes true;
checks if any other thread is blocked (there are none);
does something else;
blocks forever;

How to fix this? A lock, such as a spinlock, can prevent bad interleavings, but blocking makes this bad interleaving tricky! Normally we’d associate x (the shared state) with a lock, say x_lock, and place lock calls on entry to the critical section and unlock calls after it:

sleeping thread waking thread
locks x_lock;
computes f(x) == false;
decides to block;
tries to lock x_lock (blocks…)
blocks forever, holding x_lock;

This won’t work at all. It’s immoral to block while holding a lock. Doing so can deadlock the system, as it will here: the waking thread never wakes up the sleeping thread because the waking thread can’t acquire x_lock.

But moving “unlock x_lock” before the block point just recreates the original synchronization problem.

sleeping thread waking thread
locks x_lock;
computes f(x) == false;
decides to block;
unlocks x_lock;
locks x_lock;
modifies x so f(x) becomes true;
checks if any other thread is blocked (there are none);
blocks (forever);
unlocks x_lock;
does something else;

So we need a lock for synchronization; we must not hold the lock while blocking; and we must not release the lock before blocking! What is to be done?

Condition variables

In user-level programming, this problem is typically solved with synchronization objects called condition variables. The magic of condition variables is that they offer an atomic operation, cond_wait, that both releases a lock and blocks the calling thread. Here’s how the running example would work with cond_wait:

sleeping thread waking thread
locks x_lock;
computes f(x) == false;
decides to block;
calls cond_wait(&x_cond, &x_lock);
locks x_lock;
modifies x so f(x) becomes true;
calls cond_signal(&x_cond), waking up the sleeping thread;

Since cond_wait releases the lock and blocks in one atomic step, there is no way for the waking thread to sneak in at the wrong place. Either the waking thread runs entirely before the sleeping thread acquires the lock, or the the waking thread runs entirely after cond_wait. And either run order has the right result. If the waking thread runs first, the sleeping thread won’t block; if the sleeping thread runs first, the waking thread will observe the sleeping thread and wake it up.

But we’re programming a kernel. If we want a condition variable, we will have to program it ourselves—and perhaps in kernel context a different abstraction will be preferred.

How other kernels do it

OS/161, an older CS 161 teaching kernel, uses this function:

void wchan_sleep(struct wchan *wc, struct spinlock *lk);

which atomically blocks the current thread (a later call to wchan_wake…(wc) will wake it) and releases the spinlock lk. This is a condition-variable wait operation! wchan_sleep’s implementation calls a more basic function, thread_switch, which handles all voluntary kernel context switches. The need for atomic unlock-and-block infects that foundational function too:

/*
 * If NEWSTATE is S_SLEEP, the thread is queued on the wait channel
 * WC, protected by the spinlock LK. Otherwise WC and Lk should be
 * NULL.
 */
static void thread_switch(threadstate_t newstate, struct wchan *wc, struct spinlock *lk);

Since the foundational thread switching function uses wchans, any blocking primitive you think up must in OS/161 be expressed using wchans.

The xv6 teaching operating system has a similar design, except that “wait channels” are left undefined (they are opaque pointers).

// Atomically release lock and sleep on chan.
// Reacquires lock when awakened.
void sleep(void *chan, struct spinlock *lk);

This resembles the FreeBSD operating system’s general sleep function, _sleep. Like xv6 sleep, FreeBSD _sleep takes a void* ident (an opaque identifier for a wait channel) and a struct lock_object* lock. But years of use and feature development have added more stuff:

/*
 * General sleep call.  Suspends the current thread until a wakeup is
 * performed on the specified identifier.  The thread will then be made
 * runnable with the specified priority.  Sleeps at most sbt units of time
 * (0 means no timeout).  If pri includes the PCATCH flag, let signals
 * interrupt the sleep, otherwise ignore them while sleeping.  Returns 0 if
 * awakened, EWOULDBLOCK if the timeout expires.  If PCATCH is set and a
 * signal becomes pending, ERESTART is returned if the current system
 * call should be restarted if possible, and EINTR is returned if the system
 * call should be interrupted by the signal (return EINTR).
 *
 * The lock argument is unlocked before the caller is suspended, and
 * re-locked before _sleep() returns.  If priority includes the PDROP
 * flag the lock is not re-locked before returning.
 */
int _sleep(void *ident, struct lock_object *lock, int priority,
    const char *wmesg, sbintime_t sbt, sbintime_t pr, int flags);

All these operating systems implement sleep-wakeup by pushing the atomic unlock-and-block step to the core of the operating system. But that’s not the only way. Rather than joining unlocking and blocking into a single atomic step, we can separate blocking into two steps, enqueuing and scheduling. With blocking unpacked in this way, sleep–wakeup can be implemented with less core scheduler involvement. Linux, and Chickadee, adopt this design, using synchronization objects called wait queues.

Wait queues

Chickadee wait queues are used as follows. To block the current task until the expression should_wake_up() is true, write the following:

wait_queue wq; ...
function_that_might_sleep() { ...
    waiter w;
    spinlock_guard guard(/* some appropriate lock */);
    w.block_until(wq, [&] () {
        return should_wake_up();
    }, guard);
    ...
}

This code assumes that guard protects the state necessary for should_wake_up() to run safely.

To wake up any waiting tasks (for instance, because actions by a different task have caused should_wake_up() to become true), write the following:

function_that_wakes_up_wq() {
    wq.wake_all();
}

The waiter::block_until function is shorthand for something a little more explicit, and the more explicit version shows exactly how the synchronization works:

wait_queue wq; ...
function_that_might_sleep() { ...
    waiter w;
    spinlock_guard guard(/* some appropriate lock */);
    // begin waiter::block_until()
    while (true) {
        w.prepare(wq);
        if (should_wake_up()) {
            break;
        }
        guard.unlock();
        w.maybe_block();
        guard.lock();
    }
    w.clear();
    // end waiter::block_until()
    ...
}

Note that should_wake_up() runs while the guard is locked, and if should_wake_up() returns true, the guard remains locked for the rest of the function. However, if should_wake_up() returns false, the code unlocks the guard before blocking, then acquires it again upon resume.

Functions like w.prepare(wq), w.maybe_block(), w.clear(), and wq.wake_all() might run concurrently on different CPUs, and access shared state (namely, the wait queue), so these functions all have internal synchronization too: they lock the wait_queue’s associated spinlock before examining or modifying wait queue state.

You are expected to implement three functions, waiter::prepare(), waiter::maybe_block(), and waiter::clear().

It’s important that interrupts be disabled while a potentially-runnable process has state proc::ps_blocked (e.g., in between waiter::prepare(), should_wake_up(), and waiter::clear()). If interrupts were enabled, then an ill-timed timer interrupt could cause the kernel task to block forever. Normally interrupts are disabled by a spinlock_guard associated with the predicate state; regardless, this is not the wait queue’s responsibility.

The wait_queue::wake_all() function wakes up all waiters blocked on a the queue. It works like this:

wait_queue wq; ...
auto irqs = wq.lock_.lock();
while (auto w = wq.q_.pop_front()) {
    w->wake();
}
wq.lock_.unlock(irqs);

This code uses the waiter::wake() function, which, in our code, just calls proc::unblock():

inline void proc::unblock() {
    int s = ps_blocked;
    if (pstate_.compare_exchange_strong(s, ps_runnable)) {
        cpus[runq_cpu_].enqueue(this);
    }
}

You can change waiter::wake() if you want.

Synchronization

The separation of enqueuing and blocking is key to synchronization here. Let’s enumerate how sleeping and waking up can interleave in this design; we’ll see that every interleaving works OK.

  1. The waking thread might run before waiter::prepare. No problem: the sleeping thread will not block (it will call break to exit the while loop).
  2. The waking thread might run after waiter::prepare, but before waiter::maybe_block. No problem: the waking thread resets the sleeping thread’s proc::pstate_ to ps_runnable, so the sleeping thread will not in fact block (either it will not call yield(), or its yield() call will keep it on the run queue).
  3. The waking thread might run after waiter::maybe_block. No problem: the waking thread will wake up the sleeping thread as intended.

Convenience methods

Our waiter class offers some convenience methods for working with wait queues. For instance, this code waits until x == 3; it assumes x is protected by x_lock, and changes to x are signaled on a wait queue wq.

void wait_for_x() { ...
    spinlock_guard guard(x_lock);     // protects access to `x`
    waiter w;
    while (true) {
        w.prepare(wq);
        if (x == 3) {
            break;
        }
        guard.unlock();               // release `x_lock` while blocking
        w.maybe_block();
        guard.lock();
    }
    w.clear();
    /* ... do something to `x` ... */
}

This is more simply expressed using waiter::block_until() and a C++ lambda:

void wait_for_x(proc* p) { ...
    spinlock_guard guard(x_lock);
    auto irqs = waiter().block_until(wq, [&] () {
        // this is called with `x_lock` held
        return x == 3;
    }, guard);
    /* ... do something to `x` ... */
}

The block_until implementation takes care of all the prepare, lock, unlock, maybe_block, and clear calls.

Tradeoffs

Wait queues are not as integrated with core OS scheduling primitives as blocking mechanisms in other kernels. That means we can imagine using other mechanisms to block, such as a fancy prioritized wait tree. It’s an interesting insight that correct synchronization can be obtained by splitting functionality into pieces, as well as by joining functionality into an atomic step. However, the other kernels’ mechanisms are simpler to specify. (They are just as hard to implement.)

Linux’s implementation of wait queues doesn’t lock as often as Chickadee’s does. Linux developers have reasoned carefully about where locks are truly necessary and where other synchronization primitives suffice.

True blocking is important, but we shouldn’t exaggerate its importance. Many sensible sleep–wakeup mechanisms involve a mix of true blocking and kernel polling (which is much better than user-level polling). This is because it can be expensive and complex to create a wait queue or channel for every specific condition a thread might block upon. Blocking can have high cost, in complexity as well as in queue management and context switches.