Operating systems multiplex hardware resources, which requires switching between contexts. Some context switches are voluntary; for example, a system call intentionally and synchronously transfers control to the kernel. Other times, a task switches contexts unexpectedly, because of an interrupt or other exception.
Context switches are difficult because of state. Running tasks use many machine resources, including registers. Safely switching the CPU away from a task requires saving that task’s state so it can resume later. But saving state requires registers—registers the old task was using! It’s easy for a context switch to inadvertently clobber register state on which a task depends. Context switches work only given careful coding and special hardware support.
Kinds of context
Chickadee supports three kinds of context: user context, kernel task context, and kernel CPU context. User context is unprivileged; the two kinds of kernel context have full machine privilege.
Chickadee can suspend kernel tasks. For instance, a Chickadee system call
implementation can voluntarily or involuntarily give up control to another
task, picking up later right where it left off. To make this work, each kernel
task context has its own kernel stack. When the processor is running a
kernel task, the
%rsp register points into the corresponding stack. Just
like any stack, a kernel task stack holds local variables. It also holds a
snapshot of the task’s registers when the task is suspended.
Not all operating system kernels support kernel task suspension. For instance, WeensyOS, the CS61 operating system, does not, and microkernels often do not. In non-suspendable operating systems, a user thread can still block, but all kernel state associated with a blocked thread must be managed explicitly by the kernel programmer. Kernel task suspension makes some kinds of kernel programming easier, but it also requires more memory.
Kernel CPU contexts are used when switching from one kernel task to another. Each CPU has its own kernel CPU context; while kernel tasks can switch among CPUs, a CPU context is pinned to a single CPU. CPU contexts are not suspendable, and each CPU context has its own stack.
Kernel stacks share memory pages with corresponding data structures. The
bottom of the page of memory containing a CPU context stack holds a
cpustate, and the bottom of the page of memory containing a kernel task stack
holds a struct proc. Kernel functions shouldn’t use too many local variables
or recursively call other functions too deeply, or stack data is liable to
crash into and destroy the corresponding structure.
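The shared-page layout can be sketched with a little address arithmetic. This is a minimal illustration rather than Chickadee's code: the 4096-byte page size, the `descriptor_sketch` type (standing in for `struct proc` or `cpustate`) and its 64-byte size, and all helper names are assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed page size; each stack is taken to occupy exactly one page.
constexpr std::size_t PAGE_SIZE_SKETCH = 4096;

// Hypothetical stand-in for `struct proc` or `cpustate`; the size is made up.
struct descriptor_sketch {
    std::uint64_t fields[8];
};

// The stack starts at the highest address of the page and grows downward.
constexpr std::uintptr_t stack_top(std::uintptr_t page) {
    return page + PAGE_SIZE_SKETCH;
}

// Bytes the stack may use before it reaches the descriptor at the bottom.
constexpr std::size_t stack_capacity() {
    return PAGE_SIZE_SKETCH - sizeof(descriptor_sketch);
}

// True if a stack pointer value has run into the descriptor.
constexpr bool stack_overflowed(std::uintptr_t page, std::uintptr_t rsp) {
    return rsp < page + sizeof(descriptor_sketch);
}
```

Under these assumptions a stack has a little under one page to work with, which is why deep recursion or large local arrays are risky in kernel code.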
Context switches require saving a set of registers and restoring them later.
In Chickadee, this process centers on
struct regstate structures. This
structure, which is defined in
x86-64.h, has space for all the registers
Chickadee programs may use.
x86-64 register sets are large—
struct regstate takes 192 bytes—and saving
and restoring that much state is expensive. Optimized OSes avoid full state
save and restore when possible.
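To see where 192 bytes comes from, here is a hedged sketch of a register-state structure with 24 quadword slots. Only the total size comes from the text; the type name `regstate_sketch`, the field names, and the field order are assumptions, and the real definition in x86-64.h may differ.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout, not Chickadee's definition.
struct regstate_sketch {
    // 15 general-purpose registers (everything except %rsp)
    std::uint64_t rax, rcx, rdx, rbx, rbp, rsi, rdi;
    std::uint64_t r8, r9, r10, r11, r12, r13, r14, r15;
    // segment registers, each padded to a quadword
    std::uint64_t fs, gs;
    // exception bookkeeping
    std::uint64_t intno, errcode;
    // the five critical registers
    std::uint64_t rip, cs, rflags, rsp, ss;
};

// 24 quadwords of 8 bytes each.
static_assert(sizeof(regstate_sketch) == 192, "expected 24 quadwords");
// In this sketch, 19 quadwords precede the critical registers.
static_assert(offsetof(regstate_sketch, rip) == 8 * 19, "19 quadwords first");
```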
Chickadee doesn’t provide support for saving and restoring floating-point registers or SIMD registers (MMX, SSE).
User exceptions: involuntary user context switch
We first consider what happens when the CPU takes an exception while executing
user code. These are generally interrupts and faults, and thus involve
involuntary context switches initiated from hardware, but traps (intentional
exceptions caused by an
int instruction) use the same mechanism.
The x86-64 exception mechanism involves several tables set up by software and interpreted by microcode (processor hardware). In particular, when an exception occurs:
The hardware looks up the exception number in the interrupt descriptor table (IDT). This table holds an entry for each supported exception.
The IDT entry contains an entry point, which is the start address of the function that will handle the interrupt, and a task segment selector.
The task segment selector is an index into another table, the global descriptor table (GDT). This weird table holds multiple kinds of entries, but in the end, through several layers of indirection, it defines the initial stack pointer for the interrupt handler and the privilege mode for the interrupt handler.
The hardware now knows the entry point to the interrupt handler, the privilege mode in which that handler should run, and the initial stack pointer for that handler. It pushes five critical registers onto the handler stack: %ss, %rsp, %rflags, %cs, and %rip. Then it changes those critical registers to new values. %rip is set to the handler entry point, %rsp is set to the handler stack (minus 40 bytes, to account for the pushed registers), and the remaining registers, including %rflags, are updated to account for the handler’s new privilege.
Then the handler software takes over.
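The push step can be modeled in a few lines of C++. This is a toy model, not hardware-accurate: `critical_regs`, `toy_stack`, `handler_start`, and `deliver_exception` are hypothetical names, and the sketch tracks only the new %rip and %rsp, leaving out the privilege-related updates.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical model of the interrupted context's critical registers.
struct critical_regs {
    std::uint64_t ss, rsp, rflags, cs, rip;
};

// A toy downward-growing stack. `top` is the address just past the highest
// slot; slots[0] holds the quadword at the highest address.
struct toy_stack {
    std::uint64_t top;
    std::vector<std::uint64_t> slots;
    std::uint64_t rsp() const { return top - 8 * slots.size(); }
    void push(std::uint64_t v) { slots.push_back(v); }
};

// The two values this sketch tracks for the handler's starting context.
struct handler_start {
    std::uint64_t rip, rsp;
};

// Push the five critical registers, then report where the handler starts.
inline handler_start deliver_exception(const critical_regs& old,
                                       toy_stack& handler_stack,
                                       std::uint64_t entry_point) {
    handler_stack.push(old.ss);
    handler_stack.push(old.rsp);
    handler_stack.push(old.rflags);
    handler_stack.push(old.cs);
    handler_stack.push(old.rip);
    return handler_start{entry_point, handler_stack.rsp()};
}
```

Starting from an empty handler stack, the returned stack pointer is 40 bytes below the stack top, matching the five pushed quadwords.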
Chickadee initializes the x86-64 exception mechanism as follows.
The OS has a single global interrupt descriptor table,
interrupt_descriptors, that’s shared by all cores.
Each CPU has its own GDT and task state. Different CPUs’ GDTs are very similar except that each CPU defines its own initial stack pointer for interrupts, ensuring that interrupts on different processors can happen simultaneously without colliding. A CPU’s initial stack pointer is the top of the corresponding CPU context stack. The GDT and task state are stored in
When Chickadee’s exception handler gains control from a user-mode exception, it is executing on the kernel CPU-context stack. But that stack is shared by all code running on the CPU; thread state, such as the five critical registers pushed by the exception delivery mechanism, belongs instead on kernel task stacks. So the exception handler’s first job is to switch to the current thread’s kernel task stack and move the saved registers there. It must do this without corrupting the other general-purpose registers, which still contain user-level values.
These lines in
exception_entry transfer the state.
    // change %rsp to the top of the kernel task stack
    swapgs
    movq %gs:(8), %rsp
    addq $PROCSTACK_SIZE, %rsp
    // copy data from CPU stack to kernel task stack
    pushq %gs:(CPUSTACK_SIZE - 8)  // %ss
    ...
    pushq %gs:(CPUSTACK_SIZE - 56) // interrupt number
This code uses a special feature in x86-64 processors called the kernel GS
base, which is a hidden register in which the kernel can stash a
per-CPU pointer value. In Chickadee, each CPU sets this register to point to
its cpustate, which sits at the bottom of the memory page containing its CPU
stack. These instructions thus act as follows:
- swapgs: Swaps in the kernel GS base for the current CPU. This step will be undone by another swapgs instruction when control is returned to the user.
- movq %gs:(8), %rsp: Once the kernel GS base is installed, a memory reference like %gs:(N) refers to the memory N bytes into the current CPU’s cpustate structure. Here, the value at offset 8 is this_cpu()->current_, which is the address of the struct proc for the currently-running user process on this CPU. So this instruction changes %rsp to point at that struct proc, which sits at the bottom of the page holding the process’s kernel task stack.
- addq $PROCSTACK_SIZE, %rsp: This moves %rsp to the top of the current process’s kernel task stack.
- pushq %gs:(CPUSTACK_SIZE - N): These lines transfer data from the CPU stack to the kernel task stack, one quadword at a time. The address %gs:(CPUSTACK_SIZE - 8) points to the top quadword on the CPU stack.
Because the offsets are hard-coded (
addq $PROCSTACK_SIZE, %rsp and
%gs:(CPUSTACK_SIZE - N)), this code will work only
if both the CPU stack and the currently-executing process’s kernel task stack
were empty when the exception occurred. But that’s always true
for user exceptions! The
CPU stack is empty unless the scheduler is running, and a process’s kernel
task stack is empty when the process is executing in user mode.
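The transfer can be simulated on two byte arrays standing in for the stack pages. This sketch assumes both pages are 4096 bytes and that the elided instructions copy every quadword between offsets 8 and 56 from the top; the names `STACK_PAGE_SKETCH`, `load_quad`, `store_quad`, and `transfer_exception_frame` are invented for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t STACK_PAGE_SKETCH = 4096;  // assumed size of both pages

// Read the quadword at byte offset `off` within a page.
inline std::uint64_t load_quad(const unsigned char* page, std::size_t off) {
    std::uint64_t v;
    std::memcpy(&v, page + off, sizeof v);
    return v;
}

// Write a quadword at byte offset `off` within a page.
inline void store_quad(unsigned char* page, std::size_t off, std::uint64_t v) {
    std::memcpy(page + off, &v, sizeof v);
}

// Copy the seven quadwords at the top of the CPU stack page to the top of the
// task stack page. Returns the resulting stack pointer as an offset into the
// task page.
inline std::size_t transfer_exception_frame(const unsigned char* cpu_page,
                                            unsigned char* task_page) {
    std::size_t rsp = STACK_PAGE_SKETCH;   // like addq $PROCSTACK_SIZE, %rsp
    for (std::size_t n = 8; n <= 56; n += 8) {
        rsp -= 8;                          // a push decrements %rsp first
        store_quad(task_page, rsp, load_quad(cpu_page, STACK_PAGE_SKETCH - n));
    }
    return rsp;
}
```

Because both the source and destination offsets are fixed, the copy is correct only when both stacks started out empty, which is the condition the text describes.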
The rest of
exception_entry is more straightforward: the handler
pushes the rest of the general-purpose registers onto the stack and then calls
proc::exception. Inside proc::exception, the current proc is accessible as
this, and the regstate is accessible as the function argument.
When proc::exception returns, the exception handler must resume
user code where it left off. It does this by popping the general-purpose
registers, executing swapgs again, and finally ending with iretq, a special
instruction that pops the five critical registers
from the stack and lowers CPU privilege accordingly.
Saving and restoring %gs
User-mode programs can modify %gs using instructions like movw %ax, %gs and
popw %gs. So why doesn’t Chickadee reset %gs to a known-good value on kernel
entry? And why doesn’t Chickadee restore %gs in restore_and_iret when resuming
a kernel-mode task?
The reason is that loading %gs clears the GS base. When resuming a process, that’s no problem: current Chickadee processes never use the GS base. But in kernel mode the GS base (which was installed by swapgs) is critically important and the %gs numeric value doesn’t matter at all. Changing %gs to a known-good value would restore something useless while clearing something important. So kernel tasks don’t bother.
If processes did use the GS base—for instance, to access per-thread state—Chickadee would need to be a bit more careful.
The alternate exception entry point
A few uncommon exceptions, namely #DB (debug exceptions), NMI (non-maskable interrupts), and #MC (machine-check exceptions), use a different entry point called alt_exception_entry. These exceptions are special because they can happen at any time, including while other interrupts are disabled. For instance, a #DB exception can trigger at the moment the normal exception handler gets control! This introduces problems difficult to handle without an alternate entry point, including an alternate exception stack (which in Chickadee starts 1024 bytes down from the top of the CPU stack).
You’ll never encounter these exceptions in handout Chickadee on QEMU. NMI and #MC happen when hardware has serious problems. #DB, though, implements useful debugging features such as breakpoints and efficient memory watchpoints, so you may want to add support for it. It’s a little weird that #DB requires an alternate entry point, but x86-64 is weird.
Potential gotcha: Handlers for alternate-entry-point exceptions must not call
cpustate::schedule(). Instead, they must use
regs_ = regs; resume() to resume the interrupted task. If an alternate-stack exception happens during a normal exception handler, the CPU stack can contain important information that running the scheduler would overwrite.
System calls: voluntary user context switch
Voluntary context switches, such as system calls, don’t need to save state in
the same comprehensive way as involuntary context switches. A calling
convention can be established that lets the kernel cut corners. Chickadee
system calls are treated like function calls: the kernel only guarantees that
it will save and restore the callee-saved registers, which are %rbx, %rbp,
%rsp, and %r12 through %r15.
Chickadee system calls use the x86-64 syscall instruction, which was
designed specifically for system calls on modern operating systems. syscall
is lighter-weight than the exception mechanism: where exceptions save several
critical registers on a predefined kernel stack, syscall just juggles some
registers around. In particular, syscall:
- Saves the old %rip in %rcx,
- Saves the old %rflags in %r11,
- Sets %rip to the system call entry point, which was predeclared by writing to a special hidden register, and
- Changes the processor’s privilege levels and flags according to other special hidden registers.
syscall does not, for example, modify
%rsp, which is unchanged from the
user value. This means the
syscall entry code must jump through some hoops
to obtain a meaningful stack pointer.
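The register juggling performed by syscall can be modeled as a small function. This is a simplified sketch: `cpu_sketch` and `do_syscall` are hypothetical names, `entry` and `flag_mask` stand for values the kernel stored in hidden registers ahead of time, and the privilege-level change is left out.

```cpp
#include <cstdint>

// Hypothetical subset of the CPU state that `syscall` touches or ignores.
struct cpu_sketch {
    std::uint64_t rip, rsp, rflags, rcx, r11;
};

// Model of the `syscall` instruction's effect on these registers.
inline void do_syscall(cpu_sketch& cpu, std::uint64_t entry,
                       std::uint64_t flag_mask) {
    cpu.rcx = cpu.rip;         // old %rip is saved in %rcx
    cpu.r11 = cpu.rflags;      // old %rflags is saved in %r11
    cpu.rip = entry;           // jump to the predeclared entry point
    cpu.rflags &= ~flag_mask;  // flags named in the mask are cleared
    // cpu.rsp is deliberately left alone: it still holds the user value.
}
```

Passing a mask that contains the interrupt-enable bit (0x200) shows how the same instruction can also disable interrupts on entry.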
For simplicity, the Chickadee system call entry point sets up state analogous
to the usual exception entry point. Here’s how it works. syscall_entry
starts this way:
    swapgs
    movq %rsp, %gs:(16)
    movq %gs:(8), %rsp
    addq $PROCSTACK_SIZE, %rsp
These lines, like the corresponding lines in exception_entry, switch
the stack pointer to the currently-running process’s kernel task stack. There
is one wrinkle: before switching the stack pointer, the code saves the current
stack pointer in a special scratch area in the cpustate.
The next step is to push the five critical registers (%ss, %rsp, %rflags,
%cs, and %rip) in the same way that the x86-64 exception
mechanism would have:
    pushq $(SEGSEL_APP_DATA + 3)   /* %ss */
    pushq %gs:(16)                 /* %rsp */
    pushq %r11                     /* %rflags */
    pushq $(SEGSEL_APP_CODE + 3)   /* %cs */
    pushq %rcx                     /* %rip */
Breaking this down:
- All user code runs with the same %ss, SEGSEL_APP_DATA. The + 3 represents user privilege mode (“DPL”).
- We stashed the old %rsp value in the cpustate scratch area at %gs:(16).
- The syscall instruction put the old %rflags in %r11.
- All user code runs with the same %cs, SEGSEL_APP_CODE. Again, the + 3 represents user privilege mode (“CPL”).
- The syscall instruction put the old %rip in %rcx.
syscall_entry then follows basically the same steps as
exception_entry, except it calls proc::syscall rather than proc::exception.
When proc::syscall(regstate*) returns, its return value should be passed
back to the user process and the user process should be resumed. But the
syscall_entry return path is much simpler than the exception return path.
Here it is:
    addq $(8 * 19), %rsp
    swapgs
    iretq
We can get away with this because the C++ compiler naturally preserved the
callee-saved registers, the
%rax register already contains the desired
return value, and the
syscall calling convention says all other caller-saved
registers have garbage values when the system call returns. It thus suffices
to skip over all the saved general-purpose registers, restore the hidden
kernel GS base register, and execute
iretq to restore the user context.
Kernel yields: voluntary kernel context switch
A kernel task can voluntarily surrender its control of the CPU by calling
proc::yield(). This function stores just enough state that the kernel task
can be resumed later, then switches to CPU context and runs the scheduler to
find another task (assuming another task exists).
Just as with system calls, the natural calling convention for proc::yield()
is that callee-saved registers will be preserved. Thus, the proc::yield()
implementation, which is in k-exception.S, saves callee-saved registers
and the flags register to the stack, in the order determined by
struct yieldstate. proc::yield() then stores a pointer to this saved state
in the current struct proc
and switches to executing cpustate::schedule() on the current CPU stack,
using the magic kernel GS base to find the current CPU.
    /* store yieldstate pointer */
    movq %rsp, 16(%rdi)
    /* disable interrupts, switch to cpustack */
    cli
    movq %rdi, %rsi
    movq %gs:(0), %rdi
    leaq CPUSTACK_SIZE(%rdi), %rsp
    /* call scheduler */
    jmp _ZN8cpustate8scheduleEP4proc
It is safe to set the stack pointer unconditionally to the top of the CPU
stack (that is, to
CPUSTACK_SIZE(%rdi)) because the CPU stack is unused
unless the scheduler is actively running. Note that the proc::yield()
function does not return; it has no ret instruction. Instead, it changes
its stack pointer and jumps directly to cpustate::schedule().
However, the return address of
proc::yield() is preserved; it’s stored on
the kernel task stack, and will be used later.
A task that has yielded voluntarily will resume only when the scheduler
decides it should run again. Code in the scheduler performs the resume:
after loading the current process’s pagetable, it calls
proc::resume() on that process. The code for that function is also defined
in k-exception.S; in this case, it simply changes the stack pointer to
point at the stored struct yieldstate, restores the callee-saved registers,
and returns. The return instruction in proc::resume() uses the return
address that was pushed onto the stack when proc::yield() was called.
Kernel exceptions: involuntary kernel context switch
Chickadee user contexts always run with interrupts enabled. Chickadee kernel tasks, however, start with interrupts disabled. This means that pending interrupts, such as timer interrupts, are not delivered until a user context starts running.
Context switch mechanisms that switch to kernel mode, namely exception
delivery and the syscall instruction, automatically disable interrupts as
part of the switch. (This is controlled by the “gate type” for each interrupt
descriptor, and by the special MSR_IA32_MFLAGS register used to configure
syscall.) The iretq instruction that returns to user context automatically
re-enables interrupts.
However, interrupts can indicate important, latency-sensitive hardware events,
so disabling interrupts for a long time can cause performance problems.
Chickadee therefore allows kernel tasks to re-enable interrupts. For an
example, see the implementation of the
SYSCALL_PAUSE system call.
A kernel task with interrupts enabled might be interrupted, causing an involuntary context switch in kernel mode. This engages the usual hardware exception mechanism, but since in this case there is no privilege change, the hardware behaves a little differently. Rather than switching to a new stack, it pushes the five critical registers onto the currently active stack, which is always a kernel task stack (kernel CPU contexts never enable interrupts).
The exception_entry entry point must therefore check whether the exception was
received in kernel mode:
    testb $3, 32(%rsp)
    jz 1f
The first instruction checks the lower two bits of the pushed %cs value,
which hold the exception-time privilege level. If the exception happened in
kernel mode, those bits are zero, and the jz 1f instruction will skip over
the steps that modify the kernel GS base and copy the saved registers to the
kernel task stack. A similar privilege-level check is necessary in the restore
path.
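The same test is easy to express in C++. In this sketch the function names are hypothetical, and the selector values in the assertions (0x08 for kernel code, 0x18 + 3 for user code) are assumed for illustration rather than taken from Chickadee.

```cpp
#include <cstdint>

// The low two bits of a saved %cs selector give the privilege level at the
// time of the exception: 0 for kernel mode, 3 for user mode.
constexpr unsigned saved_privilege_level(std::uint64_t saved_cs) {
    return static_cast<unsigned>(saved_cs & 3);
}

// Mirrors `testb $3, ...; jz`: true when the low two bits are both zero.
constexpr bool exception_from_kernel(std::uint64_t saved_cs) {
    return saved_privilege_level(saved_cs) == 0;
}

// Assumed example selectors, checked at compile time.
static_assert(exception_from_kernel(0x08), "kernel selector has low bits 0");
static_assert(!exception_from_kernel(0x18 + 3), "user selector has low bits 3");
```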
Starting a process
You may have noticed that in all these prior context switches, a user context
resumes only when the relevant kernel task context returns, via return
instructions, to the assembly code from the initial entry point
(exception_entry or syscall_entry). So how can a process run for
the first time, with no corresponding entry code?
Newly started processes—whether started at initialization time, or later via
fork—are given a constructed
regstate, stored on the corresponding kernel
task stack. The kernel initializes this
regstate and stores a pointer to it
in the corresponding
struct proc. The
proc::resume() code looks not only
for the yield state pushed in case of voluntary kernel context switch, but
also for a
regstate. If one is found,
proc::resume() jumps directly to the
assembly that pops general-purpose registers and restores sensitive registers.