Contexts

The primary task for operating system kernels is to multiplex hardware resources among interested processes. This means kernels must switch contexts, from process to process and from process to kernel and back. Some context switches are voluntary; for example, a system call intentionally and synchronously transfers control to the kernel. In other context switches, a task switches contexts unexpectedly, because of an interrupt or other exception.

Context switches are difficult because they require changing internal machine state. Safely switching the CPU away from a task requires saving that task’s state so it can resume later. Saving the state involves computation, which in turn requires modifying registers—but in many cases, these registers can’t be modified until they’re saved! Context switching must not clobber important register state, so it requires careful coding and special hardware support.

Kinds of context

Chickadee supports three kinds of context: user context, kernel task context, and CPU context. User context is unprivileged; kernel task context and CPU context have full machine privilege.

A user context corresponds to unprivileged process code. When the processor is running in user context, the processor is unprivileged ((%cs & 3) == 3) and interrupts are always enabled.
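The privilege test mentioned above can be written as a tiny helper. This is only a sketch: the function name and the two example selector values are hypothetical and are not taken from Chickadee's source.

```cpp
#include <cstdint>

// The low two bits of the %cs selector hold the current privilege level.
// 3 means unprivileged (user); 0 means fully privileged (kernel).
constexpr bool selector_is_user(uint16_t cs) {
    return (cs & 3) == 3;
}

// Hypothetical selector values, chosen only for illustration.
constexpr uint16_t EXAMPLE_KERNEL_CS = 0x08;
constexpr uint16_t EXAMPLE_USER_CS = 0x18 + 3;

static_assert(!selector_is_user(EXAMPLE_KERNEL_CS), "kernel selector has CPL 0");
static_assert(selector_is_user(EXAMPLE_USER_CS), "user selector has CPL 3");
```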

Most kernel code runs in kernel task context. A kernel task is a schedulable entity that runs with kernel privilege. Chickadee can suspend kernel tasks; for instance, a Chickadee system call implementation can voluntarily or involuntarily give up control to another task, picking up later right where it left off. To make this work, each kernel task context has its own kernel task stack. When the processor is running a kernel task, the %rsp register points into the corresponding stack. Just like any stack, a kernel task stack holds local variables. It also holds a snapshot of the task’s registers when the task is suspended.

Not all operating system kernels support kernel task suspension. For instance, WeensyOS, the CS61 operating system, does not, and microkernels often do not. In non-suspendable operating systems, a user thread can still block, but all kernel state associated with a blocked thread must be managed explicitly by the kernel programmer. Kernel task suspension makes some kinds of kernel programming easier, but it also requires more memory.

CPU contexts are used when switching from one kernel task to another. Each CPU has its own CPU context; while kernel tasks can switch among CPUs, a CPU context is pinned to a single CPU. CPU contexts are not suspendable, and each CPU context has its own stack. The only kernel functions that run in CPU context are entry points from user context (exception_entry, alt_exception_entry) and cpustate::schedule.

Kernel stacks are collocated on memory pages that contain other data structures. The page of memory containing a kernel task stack also holds a struct proc, and the page of memory containing a CPU context stack also holds a struct cpustate. These structures are located in the beginnings of their pages (starting at the first address of the page), whereas the kernel stacks are located at the ends of the pages and grow down. Kernel functions shouldn’t use too many local variables or recursively call other functions too deeply, or stack data is liable to crash into and destroy the cpustate and/or proc.
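To make the page layout concrete, here is a minimal sketch of the address arithmetic. It assumes a 4096-byte page per kernel task and uses a stand-in descriptor type; the value of PROCSTACK_SIZE, the fake_proc type, and the helper names are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed for illustration: each kernel task owns one 4096-byte page.
constexpr size_t PROCSTACK_SIZE = 4096;

// Stand-in for struct proc; the real structure has different fields.
struct fake_proc {
    uint64_t fields[8];
};

// The descriptor starts at the first address of the page, and the stack
// starts at the end of the page and grows down toward the descriptor.
inline uintptr_t stack_top(uintptr_t page) {
    return page + PROCSTACK_SIZE;
}
inline uintptr_t stack_limit(uintptr_t page) {
    return page + sizeof(fake_proc);
}

// Bytes the stack can still grow before it overwrites the descriptor.
inline size_t stack_bytes_free(uintptr_t page, uintptr_t rsp) {
    assert(rsp <= stack_top(page) && rsp >= stack_limit(page));
    return rsp - stack_limit(page);
}
```

A check like stack_bytes_free could, for example, back a debugging assertion that fires before deep recursion destroys the descriptor.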

`struct regstate`

Context switches require saving a set of registers and restoring them later. In Chickadee, this process centers on struct regstate structures. This structure, which is defined in x86-64.h, has space for all the registers Chickadee programs may use.

x86-64 register sets are large—struct regstate takes 192 bytes—and saving and restoring that much state is expensive. Optimized OSes avoid full state save and restore when possible.
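The 192-byte figure can be checked with a sketch of the layout. The structure below is an assumption, not the definition from x86-64.h: only the names reg_swapgs, reg_intno, and reg_errcode appear elsewhere in this document, and the remaining field names, the field order, and the packing of reg_intno and reg_swapgs into one quadword are guesses that happen to be consistent with the offsets used in the assembly excerpts.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout: 24 quadwords in total.
struct regstate_sketch {
    uint64_t reg_gpr[15];   // general-purpose registers other than %rsp
    uint64_t reg_fs;
    uint64_t reg_gs;
    uint32_t reg_intno;     // assumed to share one quadword with reg_swapgs
    uint32_t reg_swapgs;
    uint64_t reg_errcode;
    uint64_t reg_rip;       // the five registers pushed by the hardware
    uint64_t reg_cs;
    uint64_t reg_rflags;
    uint64_t reg_rsp;
    uint64_t reg_ss;
};

static_assert(sizeof(regstate_sketch) == 192, "24 quadwords");
static_assert(offsetof(regstate_sketch, reg_rip) == 152,
              "15 GPRs + fs + gs + intno/swapgs + errcode precede %rip");
```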

Chickadee doesn’t provide support for saving and restoring floating-point registers or SIMD registers (MMX, SSE).

User exceptions: involuntary user context switch

We first consider what happens when the CPU takes an exception while executing user code. These are generally interrupts and faults, and thus involve involuntary context switches initiated from hardware, but traps (intentional exceptions caused by an int or int3 instruction) use the same mechanism.

The x86-64 exception mechanism involves several tables set up by software and interpreted by microcode (processor hardware). In particular, when an exception occurs:

The hardware looks up the exception number in the interrupt descriptor table (IDT). This table holds an entry for each supported exception.
The IDT entry contains an entry point, which is the start address of the function that will handle the interrupt, and a task segment selector.
The task segment selector is an index into another table, the global descriptor table (GDT). This weird table holds multiple kinds of entries, but in the end, through several layers of indirection, it defines the initial stack pointer for the interrupt handler and the privilege mode for the interrupt handler.
The hardware now knows the entry point to the interrupt handler, the privilege mode in which that handler should run, and the initial stack pointer for that handler. It pushes five critical registers onto the handler stack, in the order %ss, %rsp, %rflags, %cs, and %rip. Then it changes those critical registers to new values. %rip is set to the handler entry point, %rsp is set to the handler stack (minus 40 bytes, to account for the five pushed 8-byte registers), and %cs, %ss, and %rflags are updated to account for the handler’s new privilege.

Then the handler software takes over.
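The delivery steps can be modeled in a few lines of C++. This is a simulation sketch, not kernel code: the types and function are hypothetical, the stack pointer is modeled as a byte offset into a vector rather than a real address, and clearing the interrupt flag assumes the IDT entry is an interrupt gate.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// The five critical registers.
struct cpu_regs {
    uint64_t rip, cs, rflags, rsp, ss;
};

// Simulated downward-growing stack; mem[i] is the quadword at byte offset 8*i.
struct sim_stack {
    std::vector<uint64_t> mem;
    uint64_t rsp;  // byte offset of the current top of stack
    explicit sim_stack(size_t bytes) : mem(bytes / 8), rsp(bytes) {}
    void push(uint64_t v) {
        rsp -= 8;
        mem[rsp / 8] = v;
    }
};

constexpr uint64_t FLAG_IF = 1ULL << 9;  // interrupt-enable bit in %rflags

// Model of exception delivery from user mode onto an empty handler stack.
inline void deliver_exception(cpu_regs& r, sim_stack& handler_stack,
                              uint64_t entry_point, uint64_t handler_cs,
                              uint64_t handler_ss) {
    // Save the interrupted context; %ss is pushed first, %rip last.
    handler_stack.push(r.ss);
    handler_stack.push(r.rsp);
    handler_stack.push(r.rflags);
    handler_stack.push(r.cs);
    handler_stack.push(r.rip);
    // Install the handler's context.
    r.rip = entry_point;
    r.rsp = handler_stack.rsp;  // initial stack pointer minus 40 bytes
    r.cs = handler_cs;
    r.ss = handler_ss;
    r.rflags &= ~FLAG_IF;       // assumes an interrupt gate
}
```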

Chickadee initializes the x86-64 exception mechanism as follows.

The OS has a single global interrupt descriptor table, interrupt_descriptors, that’s shared by all cores.
Each CPU has its own GDT and task state. Different CPUs’ GDTs are very similar except that each CPU defines its own initial stack pointer for interrupts, ensuring that interrupts on different processors can happen simultaneously without colliding. A CPU’s initial stack pointer is the top of the corresponding CPU context stack. The GDT and task state are stored in struct cpustate.

Chickadee’s exception handler starts execution on the CPU-context stack. But the user context’s state, including the five critical registers pushed by the processor’s exception delivery mechanism, belongs in space owned by the user context. For this reason, the exception handler immediately switches to the current thread’s kernel task stack and moves the saved registers there. It must do this without corrupting the other general-purpose registers, which still contain user-level values.

These lines in exception_entry transfer the state.

// change %rsp to the top of the kernel task stack
swapgs
movq %gs:(8), %rsp
addq $PROCSTACK_SIZE, %rsp

// copy data from CPU stack to kernel task stack
pushq %gs:(CPUSTACK_SIZE - 8)  // %ss
...
pushq %gs:(CPUSTACK_SIZE - 56) // interrupt number

This code uses a special feature in x86-64 processors called the kernel GS base, which is a hidden register in which the kernel can stash a per-CPU pointer value. In Chickadee, each CPU sets this register to point to its cpustate, which sits at the bottom (the lowest address) of the memory page containing its CPU stack. These instructions thus act as follows:

swapgs: Swaps in the kernel GS base for the current CPU. This step will be undone by another swapgs instruction when control is returned to the user.
movq %gs:(8), %rsp: Once the kernel GS base is installed, a memory reference like %gs:(N) refers N bytes into the current CPU’s cpustate structure. Here, the value %gs:(8) corresponds to this_cpu()->current_, which is the address of the struct proc for the currently-running user process on this CPU. So this instruction changes %rsp to current_.
addq $PROCSTACK_SIZE, %rsp: This moves %rsp to the top of the current process’s kernel task stack.
pushq %gs:(CPUSTACK_SIZE - N): These lines transfer data from the CPU stack to the kernel task stack, one quadword at a time. The address %gs:(CPUSTACK_SIZE - 8) refers to the highest-addressed quadword of the CPU stack, which holds the first value the hardware pushed (%ss).

Because the offsets are hard-coded (addq $PROCSTACK_SIZE, %rsp and %gs:(CPUSTACK_SIZE - N)), this code will work only if both the CPU stack and the currently-executing process’s kernel task stack were empty when the exception occurred. But that’s always true for user exceptions! The CPU stack is empty unless the scheduler is running, and a process’s kernel task stack is empty when the process is executing in user mode.
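The effect of the hard-coded offsets can be seen in a small simulation. This sketch assumes both stacks live in 4096-byte pages and models each page as an array of quadwords; the page type, the constant values, and the function name are assumptions for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Assumed sizes: one 4096-byte page per stack.
constexpr size_t CPUSTACK_SIZE = 4096;
constexpr size_t PROCSTACK_SIZE = 4096;

// A page modeled as 512 quadwords; index i is byte offset 8*i.
using stack_page = std::array<uint64_t, 512>;

// Copies the seven saved quadwords (%ss down to the interrupt number) from
// the top of the CPU stack page to the top of the kernel task stack page.
// Returns the resulting stack pointer as a byte offset into task_page.
inline size_t transfer_frame(const stack_page& cpu_page, stack_page& task_page) {
    size_t rsp = PROCSTACK_SIZE;            // addq $PROCSTACK_SIZE, %rsp
    for (size_t n = 8; n <= 56; n += 8) {   // pushq %gs:(CPUSTACK_SIZE - n)
        rsp -= 8;
        task_page[rsp / 8] = cpu_page[(CPUSTACK_SIZE - n) / 8];
    }
    return rsp;
}
```

Because both stacks are assumed empty beforehand, each quadword ends up at the same offset in the task page that it occupied in the CPU page.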

The rest of exception_entry is more straightforward: the handler pushes the rest of the general-purpose registers onto the stack and then calls proc::exception(regstate*).

Within proc::exception, the current proc is accessible as this, and the saved regstate is accessible as the function argument.

When proc::exception returns, the exception handler must resume user code where it left off. It does this by popping general-purpose registers, calling swapgs again, and finally ending with iretq, a special instruction that pops %rip, %cs, %rflags, %rsp, and %ss from the stack and lowers CPU privilege accordingly.

Fake stackframes to facilitate backtraces

exception_entry and syscall_entry have some code setting up a “fake stackframe to facilitate backtraces.” Here’s the exception_entry version:

// add a fake stackframe to facilitate backtraces
pushq 152(%rsp)                // exception-time %rip
pushq %rbp                     // exception-time %rbp
movq %rsp, %rbp
...
// remove fake stackframe
addq $16, %rsp

What’s this about? Much Chickadee debugging, including assertion failures, involves looking at backtraces: human-readable lists of instructions, starting where a problem was discovered and then going backwards through the caller function, the caller’s caller, and so forth. Debuggers (and Chickadee) compute backtraces using the x86-64 calling convention. In debug-friendly x86-64 code, the %rbp register marks the highest address of the current function’s stack frame, and the caller’s %rbp and return address are stored at 0(%rbp) and 8(%rbp), respectively. This allows a debugger to trace backwards through stack frames. But the trace fails as soon as a function disobeys the convention! The fake stack frames introduced here allow Chickadee’s backtracer to trace back from kernel code, through exception handlers and system calls written in assembly, to the relevant user code.
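The frame-pointer walk described above can be sketched over simulated memory. This is not Chickadee's backtracer: the memory model (a map from address to quadword), the function name, and the stopping rules are assumptions for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Simulated memory: address -> quadword stored at that address.
using memory = std::map<uint64_t, uint64_t>;

// Follows the chain of saved %rbp values. At each frame, 0(%rbp) holds the
// caller's %rbp and 8(%rbp) holds the return address into the caller.
// Stops at a null %rbp, at unmapped memory, or after max_frames frames.
inline std::vector<uint64_t> walk_frames(const memory& mem, uint64_t rbp,
                                         size_t max_frames = 32) {
    std::vector<uint64_t> return_addresses;
    while (rbp != 0 && return_addresses.size() < max_frames) {
        auto saved_rbp = mem.find(rbp);
        auto ret = mem.find(rbp + 8);
        if (saved_rbp == mem.end() || ret == mem.end()) {
            break;
        }
        return_addresses.push_back(ret->second);
        rbp = saved_rbp->second;
    }
    return return_addresses;
}
```

A fake stack frame is simply one more (saved %rbp, return address) pair placed where this walk expects it, so the chain continues across the assembly entry code.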

System calls: voluntary user context switch

Voluntary context switches, such as system calls, don’t need to save state in the same comprehensive way as involuntary context switches. A calling convention can be established that lets the kernel cut corners. Chickadee system calls are treated like function calls: the kernel only guarantees that it will save and restore the callee-saved registers, which are %rbp, %rbx, and %r12-%r15.

Chickadee system calls use the x86-64 syscall instruction, which was designed specifically for system calls on modern operating systems. syscall is lighter-weight than the exception mechanism: where exceptions save several critical registers on a predefined kernel stack, syscall just juggles some registers around. In particular, syscall:

Saves the old %rflags in %r11.
Saves the old %rip in %rcx.
Changes %rip to the system call entry point, which was predeclared by writing to a special hidden register, MSR_IA32_LSTAR.
Changes the processor’s privilege levels and flags according to other special hidden registers, MSR_IA32_STAR and MSR_IA32_FMASK.

syscall does not, for example, modify %rsp, which is unchanged from the user value. This means the syscall entry code must jump through some hoops to obtain a meaningful stack pointer.
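The register juggling performed by syscall can be modeled as follows. This is a simplified sketch with hypothetical type and function names; it omits the segment changes driven by MSR_IA32_STAR and treats rip as already pointing at the instruction that follows syscall.

```cpp
#include <cstdint>

// The registers the model cares about.
struct syscall_regs {
    uint64_t rip, rflags, rcx, r11, rsp;
};

// The two configuration registers used by this model.
struct syscall_msrs {
    uint64_t lstar;  // system call entry point
    uint64_t fmask;  // flag bits to clear on entry
};

inline void model_syscall(syscall_regs& r, const syscall_msrs& m) {
    r.r11 = r.rflags;       // old flags are stashed in %r11
    r.rcx = r.rip;          // return address is stashed in %rcx
    r.rip = m.lstar;        // jump to the predeclared entry point
    r.rflags &= ~m.fmask;   // clear the flags selected by the mask
    // %rsp is deliberately left alone: it still holds the user value.
}
```

If the mask includes the interrupt-enable bit (bit 9), the kernel starts executing with interrupts disabled.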

For simplicity, the Chickadee system call entry point sets up state analogous to the usual exception entry point. Here’s how it works. syscall_entry starts this way:

swapgs
movq %rsp, %gs:(16)
movq %gs:(8), %rsp
addq $PROCSTACK_SIZE, %rsp

These lines, like the corresponding lines in exception_entry, switch the stack pointer to the currently-running process’s kernel task stack. There is one wrinkle: before switching the stack pointer, the code saves the current stack pointer in a special scratch area in the cpustate, %gs:(16).

The next step is to push the five critical registers (%ss, %rsp, %rflags, %cs, and %rip) in the same way that the x86-64 exception mechanism would have:

pushq $(SEGSEL_APP_DATA + 3)   /* %ss */
pushq %gs:(16)                 /* %rsp */
pushq %r11                     /* %rflags */
pushq $(SEGSEL_APP_CODE + 3)   /* %cs */
pushq %rcx                     /* %rip */

Breaking this down:

All user code runs with the same %ss value, SEGSEL_APP_DATA. The + 3 sets the selector’s low two bits, its requested privilege level (“RPL”), to 3, which means user privilege.
We stashed the old %rsp value in the cpustate at %gs:(16).
The syscall instruction put the old %rflags value in %r11.
All user code runs with the same %cs value, SEGSEL_APP_CODE. Again, the + 3 represents user privilege mode (“CPL”).
The syscall instruction put the old %rip value in %rcx.

syscall_entry then follows basically the same steps as exception_entry, except it calls proc::syscall(regstate*) rather than proc::exception(regstate*).

When proc::syscall(regstate*) returns, its return value should be passed back to the user process and the user process should be resumed. But the syscall_entry return path is simpler than the exception return path. Here it is:

addq $(8 * 15), %rsp           // skip general-purpose registers
pop %fs                        // restore `%fs`
cli                            // prevent interrupts after `swapgs`
swapgs
pop %gs                        // restore `%gs`
addq $(8 * 2), %rsp            // skip reg_swapgs, reg_intno, reg_errcode
iretq

Why is it (relatively) safe to skip restoring general-purpose registers? The reason is that both the Chickadee kernel and the Chickadee system call interface follow the compiler’s x86-64 calling convention. Chickadee processes expect callee-saved registers to be preserved by syscall. But these are exactly the registers that the C++ compiler preserves across function calls! Since every kernel function preserves the callee-saved registers (by saving them on entry and restoring them on exit), the syscall_entry assembly code need not restore them. The caller-saved registers may have changed, but that’s OK—the process code expects them to change.

Kernel yields: voluntary kernel context switch

A kernel task can voluntarily surrender its control of the CPU by calling proc::yield(). This function stores just enough state that the kernel task can be resumed later, then switches to CPU context and runs the scheduler to find another task (assuming another task exists).

Just as with system calls, the natural calling convention for proc::yield() is that callee-saved registers will be preserved. Thus, the proc::yield() implementation, which is in k-exception.S, saves callee-saved registers and the flags register to the stack, in the order determined by struct yieldstate.

proc::yield() then stores a pointer to this saved state in the current proc and switches to executing cpustate::schedule() on the current CPU stack, using the magic kernel GS base to find the current CPU.

// clear interrupts and store yieldstate pointer
cli
movq %rsp, 16(%rdi)
// switch to cpustack
movq %gs:(0), %rdi
leaq CPUSTACK_SIZE(%rdi), %rsp
// jump to scheduler
jmp _ZN8cpustate8scheduleEv

It is safe to set the stack pointer unconditionally to the top of the CPU stack (that is, to CPUSTACK_SIZE(%rdi)) because the CPU stack is unused unless the scheduler is actively running. Note that the proc::yield() function does not return—it has no ret instruction. Instead, it changes its stack pointer and jumps directly to the cpustate::schedule() function. However, the return address of proc::yield() is preserved; it’s stored on the kernel task stack, and will be used later.

A task that has yielded voluntarily will resume only when the scheduler decides it should run again. The following code, in cpustate::schedule(), performs the resume:

set_pagetable(current_->pagetable_);
current_->resume();

After loading the current process’s pagetable, this code calls proc::resume() on that process. The code for that function is also defined in k-exception.S; in this case, it simply changes the stack pointer to point at the stored struct yieldstate, restores the callee-saved registers and flags saved there, and returns. The return instruction in proc::resume() uses the return address that was pushed onto the stack when proc::yield() was called.

Kernel exceptions: involuntary kernel context switch

Chickadee user contexts always run with interrupts enabled. Chickadee kernel tasks, however, start with interrupts disabled. By default, then, interrupts such as timer interrupts that arrive while a kernel task is running are held pending and are delivered only once a user context starts running.

Context switch mechanisms that switch to kernel mode, namely exception delivery and the syscall instruction, automatically disable interrupts as part of the switch. (This is controlled by the “gate type” for each interrupt descriptor, and by the special MSR_IA32_FMASK register used to configure syscall.) The iretq instruction that returns to user context automatically re-enables interrupts.

However, interrupts can indicate important, latency-sensitive hardware events, so disabling interrupts for a long time can cause performance problems. Chickadee therefore allows kernel tasks to re-enable interrupts. For an example, see the implementation of the SYSCALL_PAUSE system call.

A kernel task with interrupts enabled might be interrupted, causing an involuntary context switch in kernel mode. This engages the usual hardware exception mechanism, but since in this case there is no privilege change, the hardware behaves a little differently. Rather than switching to a new stack, it pushes the five critical registers onto the currently active stack, which is always a kernel task stack (CPU contexts never enable interrupts).

The exception_entry entry point must check whether the exception was received in kernel mode:

testb $3, 24(%rsp)             // saved %cs: above interrupt number, error code, and %rip
jz 1f

The first instruction checks the lower two bits of the pushed %cs selector, which hold the exception-time privilege level. If the exception happened in kernel mode, those bits are zero, and the jz 1f instruction will skip over the steps that modify the kernel GS base and copy the saved registers to the kernel task stack. A similar privilege-level check is necessary in the restore path.

Starting a process

You may have noticed that in all these prior context switches, a user context resumes only when the relevant kernel task context returns, via ret instructions, to the assembly code from the initial entry point (exception_entry or syscall_entry). So how can a process run for the first time, with no corresponding entry code?

Newly started processes—whether started at initialization time, or later via fork—are given a constructed regstate, stored on the corresponding kernel task stack. The kernel initializes this regstate and stores a pointer to it in the corresponding struct proc. The proc::resume() code looks not only for the yield state pushed in case of voluntary kernel context switch, but also for a regstate. If one is found, proc::resume() jumps directly to the assembly that pops general-purpose registers and restores sensitive registers from that regstate.
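A constructed initial register state might look like the sketch below. The structure, the function name, and the numeric selector values are assumptions for illustration; only the names SEGSEL_APP_CODE and SEGSEL_APP_DATA and the + 3 privilege bits come from the text above.

```cpp
#include <cstdint>

// Assumed selector values; the real ones are defined by the kernel's GDT.
constexpr uint64_t SEGSEL_APP_CODE = 0x18;
constexpr uint64_t SEGSEL_APP_DATA = 0x20;
constexpr uint64_t FLAG_IF = 1ULL << 9;  // interrupt-enable bit in %rflags

// Simplified stand-in for the saved register state of a new process.
struct initial_regstate {
    uint64_t gpr[15];                    // general-purpose registers
    uint64_t rip, cs, rflags, rsp, ss;   // restored by iretq
};

// Builds the state that the first "return" to user mode will load.
inline initial_regstate make_initial_regstate(uint64_t entry_rip,
                                              uint64_t user_stack_top) {
    initial_regstate r{};                // general-purpose registers start at 0
    r.rip = entry_rip;                   // first user instruction to run
    r.cs = SEGSEL_APP_CODE + 3;          // user privilege
    r.rflags = FLAG_IF;                  // user contexts run with interrupts on
    r.rsp = user_stack_top;
    r.ss = SEGSEL_APP_DATA + 3;          // user privilege
    return r;
}
```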
