Contexts – CS 161 2018

Operating systems multiplex hardware resources, which requires switching between contexts. Some context switches are voluntary; for example, a system call intentionally and synchronously transfers control to the kernel. Other times, a task switches contexts unexpectedly, because of an interrupt or other exception.

Context switches are difficult because of state. Running tasks use many machine resources, including registers. Safely switching the CPU away from a task requires saving that task’s state so it can resume later. But saving state requires registers—registers the old task was using! It’s easy for a context switch to inadvertently clobber register state on which a task depends. Context switches work only given careful coding and special hardware support.

Kinds of context

Chickadee supports three kinds of context: user context, kernel task context, and kernel CPU context. User context is unprivileged; the two kinds of kernel context have full machine privilege.

Chickadee can suspend kernel tasks. For instance, a Chickadee system call implementation can voluntarily or involuntarily give up control to another task, picking up later right where it left off. To make this work, each kernel task context has its own kernel stack. When the processor is running a kernel task, the %rsp register points into the corresponding stack. Just like any stack, a kernel task stack holds local variables. It also holds a snapshot of the task’s registers when the task is suspended.

Not all operating system kernels support kernel task suspension. For instance, WeensyOS, the CS61 operating system, does not, and microkernels often do not. In non-suspendable operating systems, a user process can still block, but all kernel state associated with a blocked process must be managed explicitly by the kernel programmer. Kernel task suspension makes some kinds of kernel programming easier, but it also requires more memory.

Kernel CPU contexts are used when switching from one kernel task to another. Each CPU has its own kernel CPU context; while kernel tasks can switch among CPUs, a CPU context is pinned to a single CPU. CPU contexts are not suspendable, and each CPU context has its own stack.

Kernel stacks share memory pages with corresponding data structures. The bottom of the page of memory containing a CPU context stack holds a struct cpustate, and the bottom of the page of memory containing a kernel task stack holds a struct proc. Kernel functions shouldn’t use too many local variables or recursively call other functions too deeply, or stack data is liable to crash into and destroy the cpustate and/or proc.
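The overflow risk is easier to see with numbers. The following is a minimal sketch, not Chickadee code: the constants `kStackPageSize` and `kDescriptorSize` and both helper functions are hypothetical, chosen only to show how little room remains between a downward-growing stack and the structure at the bottom of its page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sizes, for illustration only.
constexpr std::size_t kStackPageSize = 4096;   // memory holding one kernel stack
constexpr std::size_t kDescriptorSize = 256;   // stand-in for sizeof(struct proc)

// The stack starts at the top of the page and grows toward lower addresses.
constexpr std::uintptr_t initial_rsp(std::uintptr_t page_base) {
    return page_base + kStackPageSize;
}

// Bytes the stack may still grow before it reaches the descriptor stored
// at the bottom of the same page. Returns 0 once the two overlap.
constexpr std::size_t stack_headroom(std::uintptr_t page_base, std::uintptr_t rsp) {
    std::uintptr_t lowest_safe = page_base + kDescriptorSize;
    return rsp > lowest_safe ? rsp - lowest_safe : 0;
}
```

With these assumed sizes an empty stack has 3840 bytes of headroom, so a handful of large local arrays or a deep recursion is enough to reach the descriptor.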

`struct regstate`

Context switches require saving a set of registers and restoring them later. In Chickadee, this process centers on struct regstate structures. This structure, which is defined in x86-64.h, has space for all the registers Chickadee programs may use.

x86-64 register sets are large—struct regstate takes 192 bytes—and saving and restoring that much state is expensive. Optimized OSes avoid full state save and restore when possible.

Chickadee doesn’t provide support for saving and restoring floating-point registers or SIMD registers (MMX, SSE).
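To make the 192-byte figure concrete, here is a sketch of a structure with 24 quadword slots. The field names and their order are assumptions made for illustration; the authoritative definition is the one in x86-64.h. The only facts taken from these notes are the total size and that 19 quadwords precede the five critical registers (the system call return path skips `8 * 19` bytes).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layout; consult x86-64.h for the real struct regstate.
struct regstate_sketch {
    // 15 general-purpose registers (everything except %rsp)
    std::uint64_t reg_rax, reg_rcx, reg_rdx, reg_rbx, reg_rbp, reg_rsi, reg_rdi;
    std::uint64_t reg_r8, reg_r9, reg_r10, reg_r11;
    std::uint64_t reg_r12, reg_r13, reg_r14, reg_r15;
    // assumed bookkeeping slots
    std::uint64_t reg_fs, reg_gs;
    std::uint64_t reg_intno, reg_errcode;
    // the five critical registers, lowest address first (the order iretq pops)
    std::uint64_t reg_rip, reg_cs, reg_rflags, reg_rsp, reg_ss;
};

static_assert(sizeof(regstate_sketch) == 192, "24 quadwords");
static_assert(offsetof(regstate_sketch, reg_rip) == 8 * 19,
              "19 quadwords precede the critical registers");
```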

User exceptions: involuntary user context switch

We first consider what happens when the CPU takes an exception while executing user code. These are generally interrupts and faults, and thus involve involuntary context switches initiated from hardware, but traps (intentional exceptions caused by an int instruction) use the same mechanism.

The x86-64 exception mechanism involves several tables set up by software and interpreted by microcode (processor hardware). In particular, when an exception occurs:

The hardware looks up the exception number in the interrupt descriptor table (IDT). This table holds an entry for each supported exception.
The IDT entry contains an entry point, which is the start address of the function that will handle the interrupt, and a task segment selector.
The task segment selector is an index into another table, the global descriptor table (GDT). This weird table holds multiple kinds of entries, but in the end, through several layers of indirection, it defines the initial stack pointer for the interrupt handler and the privilege mode for the interrupt handler.
The hardware now knows the entry point to the interrupt handler, the privilege mode in which that handler should run, and the initial stack pointer for that handler. It pushes five critical registers onto the handler stack, in this order: %ss, %rsp, %rflags, %cs, and %rip. Then it changes those critical registers to new values. %rip is set to the handler entry point, %rsp is set to the handler stack (minus 40 bytes, to account for the five pushed 8-byte registers), and %cs, %ss, and %rflags are updated to account for the handler’s new privilege.

Then the handler software takes over.
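The last hardware step can be modeled in ordinary C++. This is a toy simulation, not kernel code: `toy_stack`, `deliver_exception`, and `toy_iretq` are hypothetical names, the new %ss is simply modeled as 0, and clearing the interrupt-enable flag assumes an interrupt-gate descriptor. It shows the push order (%ss first, %rip last), the resulting 40-byte drop in the stack pointer, and how the matching `iretq` pops the same values back.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// The five registers the hardware saves on exception delivery.
struct critical_regs {
    std::uint64_t rip, cs, rflags, rsp, ss;
};

// A toy downward-growing stack; `rsp` is a byte offset into `mem`.
struct toy_stack {
    std::vector<std::uint64_t> mem;
    std::uint64_t rsp;
    explicit toy_stack(std::size_t bytes) : mem(bytes / 8), rsp(bytes) {}
    void push(std::uint64_t v) { rsp -= 8; mem[rsp / 8] = v; }
    std::uint64_t pop() { std::uint64_t v = mem[rsp / 8]; rsp += 8; return v; }
};

constexpr std::uint64_t kFlagIF = 0x200;  // interrupt-enable bit in %rflags

// Push the old context onto the handler stack and return the new one.
critical_regs deliver_exception(toy_stack& handler, const critical_regs& old,
                                std::uint64_t entry, std::uint64_t kernel_cs) {
    handler.push(old.ss);
    handler.push(old.rsp);
    handler.push(old.rflags);
    handler.push(old.cs);
    handler.push(old.rip);
    // New %ss modeled as 0; interrupts disabled (interrupt-gate assumption).
    return critical_regs{entry, kernel_cs, old.rflags & ~kFlagIF, handler.rsp, 0};
}

// Pop the saved context in the reverse of the push order.
critical_regs toy_iretq(toy_stack& handler) {
    critical_regs r{};
    r.rip = handler.pop();
    r.cs = handler.pop();
    r.rflags = handler.pop();
    r.rsp = handler.pop();
    r.ss = handler.pop();
    return r;
}
```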

Chickadee initializes the x86-64 exception mechanism as follows.

The OS has a single global interrupt descriptor table, interrupt_descriptors, that’s shared by all cores.
Each CPU has its own GDT and task descriptor. Different CPUs’ GDTs are very similar except that each CPU defines its own initial stack pointer for interrupts, ensuring that interrupts on different processors can happen simultaneously without colliding. A CPU’s initial stack pointer is the top of the corresponding CPU context stack. The GDT and task descriptor are stored in struct cpustate.

When Chickadee’s exception handler gains control from a user-mode exception, it is executing on the kernel CPU-context stack. Its first job is to switch to the corresponding kernel task stack. This is important because all state associated with a process, including saved registers, should be stored in memory regions associated with that process, as opposed to the shared CPU-context stack.

These lines in exception_entry transfer the state.

/* change %rsp to the top of the pkstack */
swapgs
movq %gs:(8), %rsp
addq $KTASKSTACK_SIZE, %rsp

/* copy data from cpustack to pkstack */
pushq %gs:(CPUSTACK_SIZE - 8)  /* %ss */
...
pushq %gs:(CPUSTACK_SIZE - 64) /* %gs */

This code uses a special feature in x86-64 processors called the kernel GS base, which is a special hidden register in which the kernel can stash a per-CPU pointer value. In Chickadee, each CPU sets this register to point to its cpustate, which is the bottom of the memory page containing its CPU stack. These instructions thus act as follows:

swapgs: This instruction swaps the user’s GS value with the kernel GS base for the current CPU. This step will be undone by another swapgs instruction when control is returned to the user.
movq %gs:(8), %rsp: Once the kernel GS base is installed, a memory reference like %gs:(N) refers N bytes into the current CPU’s cpustate structure. Here, the value %gs:(8) corresponds to this_cpustate->current_, which is the address of the struct proc for the currently-running user process on this CPU. So this instruction changes %rsp to current_.
addq $KTASKSTACK_SIZE, %rsp: This moves %rsp to the top of the current process’s kernel task stack.
pushq %gs:(CPUSTACK_SIZE - 8): These lines transfer data from the CPU stack to the kernel task stack, one quadword at a time. The address %gs:(CPUSTACK_SIZE - 8) points to the top quadword on the CPU stack.

Because the offsets are hard-coded (addq $KTASKSTACK_SIZE, %rsp and %gs:(CPUSTACK_SIZE - 8)), this code will work only if both the CPU stack and the currently-executing process’s kernel task stack were empty when the exception occurred. But that invariant is always true for user exceptions! The CPU stack is empty unless the scheduler is running, and a process’s kernel task stack is empty when the process is executing in user mode.
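The copy loop can be sketched on the host as follows. This is an illustrative model under stated assumptions: both stacks are represented as vectors of quadwords, `transfer_frame` is a hypothetical helper, and both stacks are taken to be empty on entry, which is exactly the invariant described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Copy the top `count` quadwords of the CPU stack onto an empty task stack,
// preserving their order. Returns the new task-stack pointer as a quadword
// index (taskstack.size() would mean "still empty").
std::size_t transfer_frame(const std::vector<std::uint64_t>& cpustack,
                           std::vector<std::uint64_t>& taskstack,
                           std::size_t count) {
    std::size_t sp = taskstack.size();  // like %rsp after the addq
    for (std::size_t i = 1; i <= count; ++i) {
        // mirrors `pushq %gs:(CPUSTACK_SIZE - 8 * i)`
        taskstack[--sp] = cpustack[cpustack.size() - i];
    }
    return sp;
}
```

Because the source offsets are counted from the top of the CPU stack and the destination starts at the top of the task stack, the copy is only correct when nothing else is on either stack.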

The rest of exception_entry is more straightforward: the handler pushes the rest of the general-purpose registers onto the stack and then calls proc::exception(regstate*).

Within proc::exception, the current proc is accessible as this, and the saved regstate is accessible as the function argument.

When proc::exception returns, the exception handler must resume user code where it left off. It does this by popping general-purpose registers, calling swapgs again, and finally ending with iretq, a special instruction that pops %rip, %cs, %rflags, %rsp, and %ss from the stack and lowers CPU privilege accordingly.

System calls: voluntary user context switch

Voluntary context switches, such as system calls, don’t need to save state in the same comprehensive way as involuntary context switches. A calling convention can be established that lets the kernel cut corners. Chickadee system calls are treated like function calls: the kernel only guarantees that it will save and restore the callee-saved registers, which are %rbp, %rbx, and %r12-%r15.

Chickadee system calls use the x86-64 syscall instruction, which was designed specifically for system calls on modern operating systems. syscall is lighter-weight than the exception mechanism: where exceptions save several critical registers on a predefined kernel stack, syscall just juggles some registers around. In particular, syscall:

Saves the old %rflags in %r11.
Saves the old %rip in %rcx.
Changes %rip to the system call entry point, which was predeclared by writing to a special hidden register, MSR_IA32_LSTAR.
Changes the processor’s privilege levels and flags according to other special hidden registers, MSR_IA32_STAR and MSR_IA32_FMASK.

syscall does not, for example, modify %rsp, which is unchanged from the user value. This means the syscall entry code must jump through some hoops to obtain a meaningful stack pointer.
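The register juggling is small enough to model directly. The sketch below is a simplified simulation with hypothetical type and function names; it ignores the segment-selector changes driven by MSR_IA32_STAR and treats `rip` as already pointing at the instruction after `syscall`.

```cpp
#include <cassert>
#include <cstdint>

struct cpu_regs {
    std::uint64_t rip, rflags, rsp, rcx, r11;
};

struct syscall_msrs {
    std::uint64_t lstar;  // MSR_IA32_LSTAR: system call entry point
    std::uint64_t fmask;  // MSR_IA32_FMASK: flag bits to clear on entry
};

// Model of the register effects of the syscall instruction.
void do_syscall(cpu_regs& r, const syscall_msrs& m) {
    r.rcx = r.rip;         // old %rip saved in %rcx
    r.r11 = r.rflags;      // old %rflags saved in %r11
    r.rip = m.lstar;       // jump to the predeclared entry point
    r.rflags &= ~m.fmask;  // clear the masked flags
    // %rsp is deliberately left untouched: it still holds the user value.
}
```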

For simplicity, the Chickadee system call entry point sets up state analogous to the usual exception entry point. Here’s how it works. syscall_entry starts this way:

swapgs
movq %rsp, %gs:(16)
movq %gs:(8), %rsp
addq $KTASKSTACK_SIZE, %rsp

These lines, like the corresponding lines in exception_entry, switch the stack pointer to the currently-running process’s kernel task stack. There is one wrinkle: before switching the stack pointer, the code saves the current stack pointer in a special scratch area in the cpustate, %gs:(16).

The next step is to push the five critical registers (%ss, %rsp, %rflags, %cs, and %rip) in the same way that the x86-64 exception mechanism would have:

pushq $(SEGSEL_APP_DATA + 3)   /* %ss */
pushq %gs:(16)                 /* %rsp */
pushq %r11                     /* %rflags */
pushq $(SEGSEL_APP_CODE + 3)   /* %cs */
pushq %rcx                     /* %rip */

Breaking this down:

All user code runs with the same %ss value, SEGSEL_APP_DATA. The + 3 sets the selector’s low two bits to 3, which represents user privilege mode (the requested privilege level, or “RPL”).
We stashed the old %rsp value in the cpustate at %gs:(16).
The syscall instruction put the old %rflags value in %r11.
All user code runs with the same %cs value, SEGSEL_APP_CODE. Again, the + 3 represents user privilege mode (“CPL”).
The syscall instruction put the old %rip value in %rcx.
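Put together, the five pushes build the same frame that exception delivery would have produced. The following sketch assembles that frame as an array in ascending address order. The selector values `kSegselAppCode` and `kSegselAppData` are placeholders invented for this example, not Chickadee's real constants, and `build_syscall_frame` is a hypothetical helper.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Placeholder selector values; the real ones come from Chickadee's headers.
constexpr std::uint64_t kSegselAppCode = 0x10;
constexpr std::uint64_t kSegselAppData = 0x18;

// Returns {rip, cs, rflags, rsp, ss}: the order in memory after the five
// pushes, lowest address first, which is also the order iretq pops.
std::array<std::uint64_t, 5> build_syscall_frame(std::uint64_t saved_user_rsp,
                                                 std::uint64_t r11,
                                                 std::uint64_t rcx) {
    return std::array<std::uint64_t, 5>{{
        rcx,                 // %rip, saved by syscall in %rcx
        kSegselAppCode + 3,  // %cs, user privilege
        r11,                 // %rflags, saved by syscall in %r11
        saved_user_rsp,      // %rsp, stashed at %gs:(16)
        kSegselAppData + 3   // %ss, user privilege
    }};
}
```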

syscall_entry then follows basically the same steps as exception_entry, except it calls proc::syscall(regstate*) rather than proc::exception(regstate*).

When proc::syscall(regstate*) returns, its return value should be passed back to the user process and the user process should be resumed. But the syscall_entry return path is much simpler than the exception return path. Here it is:

addq $(8 * 19), %rsp
swapgs
iretq

We can get away with this because the C++ compiler naturally preserved the callee-saved registers, the %rax register already contains the desired return value, and the syscall calling convention says all other caller-saved registers have garbage values when the system call returns. It thus suffices to skip over all the saved general-purpose registers, restore the hidden kernel GS base register, and execute iretq to restore the user context.

Kernel yields: voluntary kernel context switch

A kernel task can voluntarily surrender its control of the CPU by calling proc::yield(). This function stores just enough state that the kernel task can be resumed later, then switches to CPU context and runs the scheduler to find another task (assuming another task exists).

Just as with system calls, the natural calling convention for proc::yield() is that callee-saved registers will be preserved. Thus, the proc::yield() implementation, which is in k-exception.S, saves callee-saved registers and the flags register to the stack, in the order determined by struct yieldstate.

proc::yield() then stores a pointer to this saved state in the current proc and switches to executing cpustate::schedule() on the current CPU stack, using the magic kernel GS base to find the current CPU.

/* store yieldstate pointer */
movq %rsp, 16(%rdi)

/* disable interrupts, switch to cpustack */
cli
movq %rdi, %rsi
movq %gs:(0), %rdi
leaq CPUSTACK_SIZE(%rdi), %rsp

/* call scheduler */
jmp _ZN8cpustate8scheduleEP4proc

It is safe to set the stack pointer unconditionally to the top of the CPU stack (that is, to CPUSTACK_SIZE(%rdi)) because the CPU stack is unused unless the scheduler is actively running. Note that the proc::yield() function *does not return*—it has no ret instruction. Instead, it changes its stack pointer and jumps directly to the cpustate::schedule() function. However, the return address of proc::yield() is preserved; it’s stored on the kernel task stack, and will be used later.

A task that has yielded voluntarily will resume only when the scheduler decides it should run again. The following code, in cpustate::schedule(), performs the resume:

set_pagetable(current_->pagetable_);
current_->resume();

After loading the current process’s pagetable, this code calls proc::resume() on that process. The code for that function is also defined in k-exception.S; in this case, it simply changes the stack pointer to point at the stored struct yieldstate, restores the callee-saved registers and flags that proc::yield() saved, and returns. The return instruction in proc::resume() uses the return address that was pushed onto the stack when proc::yield() was called.
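The pairing of yield and resume can be modeled with a toy stack. Everything here is a sketch under assumptions: the type and function names are hypothetical, the order of the saved registers is invented (the real order is set by struct yieldstate), and `yieldstate_sp` stands in for the pointer stored at `16(%rdi)`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct callee_saved {
    std::uint64_t rbp, rbx, r12, r13, r14, r15, rflags;
};

struct toy_task {
    std::vector<std::uint64_t> stack;
    std::size_t sp;             // index of the top quadword; size() when empty
    std::size_t yieldstate_sp;  // saved stack pointer, set by toy_yield
    explicit toy_task(std::size_t quads)
        : stack(quads), sp(quads), yieldstate_sp(quads) {}
    void push(std::uint64_t v) { stack[--sp] = v; }
    std::uint64_t pop() { return stack[sp++]; }
};

// Save just enough state to resume later, then remember where it is.
void toy_yield(toy_task& t, const callee_saved& regs, std::uint64_t return_address) {
    t.push(return_address);  // pushed by the caller's call instruction
    t.push(regs.rflags);
    t.push(regs.r15);
    t.push(regs.r14);
    t.push(regs.r13);
    t.push(regs.r12);
    t.push(regs.rbx);
    t.push(regs.rbp);
    t.yieldstate_sp = t.sp;
}

// Restore the saved state and return the address the final ret would use.
std::uint64_t toy_resume(toy_task& t, callee_saved& regs) {
    t.sp = t.yieldstate_sp;
    regs.rbp = t.pop();
    regs.rbx = t.pop();
    regs.r12 = t.pop();
    regs.r13 = t.pop();
    regs.r14 = t.pop();
    regs.r15 = t.pop();
    regs.rflags = t.pop();
    return t.pop();  // the return address left by the call to yield
}
```

The value returned by `toy_resume` is the address that yield's caller originally pushed, which is why resuming looks, from the task's point of view, like an ordinary return from `proc::yield()`.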

Kernel exceptions: involuntary kernel context switch

Chickadee user contexts always run with interrupts enabled. Chickadee kernel tasks, however, start with interrupts disabled. This means that interrupts, such as timer interrupts, are not delivered while a kernel task runs; by default, they stay pending until a user context starts running.

Context switch mechanisms that switch to kernel mode, namely exception delivery and the syscall instruction, automatically disable interrupts as part of the switch. (This is controlled by the “gate type” for each interrupt descriptor, and by the special MSR_IA32_FMASK register used to configure syscall.) The iretq instruction that returns to user context automatically re-enables interrupts.

However, interrupts can indicate important, latency-sensitive hardware events, so disabling interrupts for a long time can cause performance problems. Chickadee therefore allows kernel tasks to re-enable interrupts. For an example, see the implementation of the SYSCALL_PAUSE system call.

A kernel task with interrupts enabled might be interrupted, causing an involuntary context switch in kernel mode. This engages the usual hardware exception mechanism, but since in this case there is no privilege change, the hardware behaves a little differently. Rather than switching to a new stack, it pushes the five critical registers onto the currently active stack, which is always a kernel task stack (kernel CPU contexts never enable interrupts).

The exception_entry entry point must check whether the exception was received in kernel mode:

testb $3, 32(%rsp)
jz 1f

The first instruction checks the lower two bits of the pushed %cs selector, which hold the exception-time privilege level. If the exception happened in kernel mode, those bits are zero, and the jz 1f instruction will skip over the steps that modify the kernel GS base and copy the saved registers to the kernel task stack. A similar privilege-level check is necessary in the restore path.
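In C++ terms the test amounts to a two-bit mask. The helper name below is hypothetical, and the selector values used in the usage note are placeholders.

```cpp
#include <cassert>
#include <cstdint>

// True if the saved %cs selector carries privilege level 0 (kernel mode).
// Equivalent to the `testb $3, ...; jz` pair: the zero flag is set when
// the low two bits are both clear.
constexpr bool exception_from_kernel(std::uint64_t saved_cs) {
    return (saved_cs & 3) == 0;
}
```

For example, a kernel selector such as 0x08 yields true, while a user selector with its low bits set to 3, such as 0x13, yields false.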

Starting a process

You may have noticed that in all these prior context switches, a user context resumes only when the relevant kernel task context returns, via ret instructions, to the assembly code from the initial entry point (exception_entry or syscall_entry). So how can a process run for the first time, with no corresponding entry code?

Newly started processes—whether started at initialization time, or later via fork—are given a constructed regstate, stored on the corresponding kernel task stack. The kernel initializes this regstate and stores a pointer to it in the corresponding struct proc. The proc::resume() code looks not only for the yield state pushed in case of voluntary kernel context switch, but also for a regstate. If one is found, proc::resume() jumps directly to the assembly that pops general-purpose registers and restores sensitive registers from that regstate.
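A sketch of both halves of this idea follows. It is a simplified model, not Chickadee's implementation: the selector values are placeholders, `user_frame` holds only the five critical registers rather than a full regstate, and the precedence in `choose_resume_path` (regstate checked before yield state) is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder values for illustration.
constexpr std::uint64_t kSegselAppCode = 0x10;
constexpr std::uint64_t kSegselAppData = 0x18;
constexpr std::uint64_t kFlagIF = 0x200;  // interrupt-enable bit in %rflags

struct user_frame {
    std::uint64_t rip, cs, rflags, rsp, ss;
};

// Construct the frame a brand-new process will be "returned" into.
user_frame make_initial_frame(std::uint64_t entry_point, std::uint64_t stack_top) {
    user_frame f{};
    f.rip = entry_point;
    f.cs = kSegselAppCode + 3;  // user privilege
    f.rflags = kFlagIF;         // user contexts run with interrupts enabled
    f.rsp = stack_top;
    f.ss = kSegselAppData + 3;  // user privilege
    return f;
}

enum class resume_path { regstate, yieldstate, none };

struct toy_proc {
    const user_frame* regs = nullptr;  // constructed or saved regstate
    const void* yields = nullptr;      // saved yield state
};

// Decide which saved state a resume should restore from.
resume_path choose_resume_path(const toy_proc& p) {
    if (p.regs) {
        return resume_path::regstate;
    }
    if (p.yields) {
        return resume_path::yieldstate;
    }
    return resume_path::none;
}
```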