The primary task for operating system kernels is to multiplex hardware resources among interested processes. This means kernels must switch contexts, from process to process and from process to kernel and back. Some context switches are voluntary; for example, a system call intentionally and synchronously transfers control to the kernel. In other context switches, a task switches contexts unexpectedly, because of an interrupt or other exception.
Context switches are difficult because they require changing internal machine state. Safely switching the CPU away from a task requires saving that task’s state so it can resume later. Saving the state involves computation, which in turn requires modifying registers—but in many cases, these registers can’t be modified until they’re saved! Context switching must not clobber important register state, so it requires careful coding and special hardware support.
Kinds of context
Chickadee supports three kinds of context: user context, kernel task context, and CPU context. User context is unprivileged; kernel task context and CPU context have full machine privilege.
A user context corresponds to unprivileged process code. When the processor is running in user context, the processor is unprivileged ((%cs & 3) == 3) and interrupts are always enabled.
Most kernel code runs in kernel task context. A kernel task is a
schedulable entity that runs with kernel privilege. Chickadee can suspend
kernel tasks; for instance, a Chickadee system call implementation can
voluntarily or involuntarily give up control to another task, picking up later
right where it left off. To make this work, each kernel task context has its
own kernel task stack. When the processor is running a kernel task, the %rsp
register points into the corresponding stack. Just like any stack, a kernel
task stack holds local variables. It also holds a snapshot of the task’s
registers when the task is suspended.
Not all operating system kernels support kernel task suspension. For instance, WeensyOS, the CS61 operating system, does not, and microkernels often do not. In non-suspendable operating systems, a user thread can still block, but all kernel state associated with a blocked thread must be managed explicitly by the kernel programmer. Kernel task suspension makes some kinds of kernel programming easier, but it also requires more memory.
CPU contexts are used when switching from one kernel task to another. Each CPU has its own CPU context; while kernel tasks can switch among CPUs, a CPU context is pinned to a single CPU. CPU contexts are not suspendable, and each CPU context has its own stack. The only kernel functions that run in CPU context are entry points from user context (exception_entry, alt_exception_entry) and cpustate::schedule.
Kernel stacks are collocated on memory pages that contain other data structures. The page of memory containing a kernel task stack also holds a struct proc, and the page of memory containing a CPU context stack also holds a struct cpustate. These structures are located in the beginnings of their pages (starting at the first address of the page), whereas the kernel stacks are located at the ends of the pages and grow down. Kernel functions shouldn’t use too many local variables or recursively call other functions too deeply, or stack data is liable to crash into and destroy the cpustate and/or proc.
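To make the page layout concrete, the sketch below computes where such a stack starts and how much room it has before it would overwrite the structure at the start of its page. The 4096-byte page size, the 256-byte structure, and every name here are assumptions for illustration; none of them come from Chickadee.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed page size; this document does not state the stack-page size.
constexpr std::size_t kPageSize = 4096;

// Hypothetical stand-in for a structure (struct proc or struct cpustate)
// that occupies the first bytes of the page.
struct descriptor_sketch {
    std::uint64_t fields[32];   // 256 bytes, an arbitrary illustrative size
};

// Initial stack pointer: one past the last byte of the page. The stack
// grows down from here toward the structure.
constexpr std::uintptr_t stack_top(std::uintptr_t page_addr) {
    return page_addr + kPageSize;
}

// Bytes the stack can use before it reaches the structure.
constexpr std::size_t stack_capacity(std::size_t descriptor_size) {
    return kPageSize - descriptor_size;
}

// True if a stack pointer has run into the structure at the start of the page.
constexpr bool stack_overflowed(std::uintptr_t page_addr, std::uintptr_t rsp,
                                std::size_t descriptor_size) {
    return rsp < page_addr + descriptor_size;
}
```

Under these assumed sizes the stack has 3840 usable bytes, which is why deep recursion or large local arrays are risky in kernel code.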
struct regstate
Context switches require saving a set of registers and restoring them later. In Chickadee, this process centers on struct regstate structures. This structure, which is defined in x86-64.h, has space for all the registers Chickadee programs may use.
x86-64 register sets are large (struct regstate takes 192 bytes), and saving and restoring that much state is expensive. Optimized OSes avoid full state save and restore when possible.
Chickadee doesn’t provide support for saving and restoring floating-point registers or SIMD registers (MMX, SSE).
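The 192-byte figure can be checked against a layout sketch. The structure below is a hypothetical reconstruction, not the definition from x86-64.h: it assumes 15 general-purpose registers, two segment registers, an interrupt number and a swapgs marker sharing one quadword, an error code, and the five registers saved by the exception mechanism. Field order within each group is an assumption.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout consistent with the sizes quoted in this document.
// The real definition may name or order fields differently.
struct regstate_sketch {
    std::uint64_t gpr[15];      // general-purpose registers other than %rsp
    std::uint64_t reg_fs;
    std::uint64_t reg_gs;
    std::uint32_t reg_intno;    // interrupt number
    std::uint32_t reg_swapgs;   // shares one quadword with reg_intno
    std::uint64_t reg_errcode;  // error code
    // The five registers saved by the exception mechanism, lowest address first.
    std::uint64_t reg_rip;
    std::uint64_t reg_cs;
    std::uint64_t reg_rflags;
    std::uint64_t reg_rsp;
    std::uint64_t reg_ss;
};

// 24 quadwords in total.
static_assert(sizeof(regstate_sketch) == 192, "regstate sketch is 192 bytes");
// The general-purpose registers occupy the first 8 * 15 bytes.
static_assert(offsetof(regstate_sketch, reg_fs) == 8 * 15, "fs follows GPRs");
// The hardware-saved registers occupy the last 40 bytes.
static_assert(offsetof(regstate_sketch, reg_rip) == 192 - 40, "rip offset");
```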
User exceptions: involuntary user context switch
We first consider what happens when the CPU takes an exception while executing user code. These are generally interrupts and faults, and thus involve involuntary context switches initiated from hardware, but traps (intentional exceptions caused by an int or int3 instruction) use the same mechanism.
The x86-64 exception mechanism involves several tables set up by software and interpreted by microcode (processor hardware). In particular, when an exception occurs:
- The hardware looks up the exception number in the interrupt descriptor table (IDT). This table holds an entry for each supported exception.
- The IDT entry contains an entry point, which is the start address of the function that will handle the interrupt, and a task segment selector.
- The task segment selector is an index into another table, the global descriptor table (GDT). This weird table holds multiple kinds of entries, but in the end, through several layers of indirection, it defines the initial stack pointer for the interrupt handler and the privilege mode for the interrupt handler.
- The hardware now knows the entry point to the interrupt handler, the privilege mode in which that handler should run, and the initial stack pointer for that handler. It pushes five critical registers onto the handler stack, in this order: %ss, %rsp, %rflags, %cs, and %rip. Then it changes those critical registers to new values. %rip is set to the handler entry point, %rsp is set to the handler stack (minus 40 bytes, to account for the pushed registers), and %cs, %ss, and %rflags are updated to account for the handler’s new privilege.
Then the handler software takes over.
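The register-saving step can be modeled in ordinary C++ on a simulated stack. This is a sketch of the behavior described above, not of real hardware: the structure names, the selector values passed in, and the choice to clear the interrupt-enable flag (bit 9 of %rflags, as an interrupt gate would) are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simulated machine state: only the five registers the hardware saves.
struct cpu_sketch {
    std::uint64_t rip, cs, rflags, rsp, ss;
};

// Simulated memory for the handler stack. `base` is the address of words[0].
struct stack_sketch {
    std::uint64_t base;
    std::vector<std::uint64_t> words;
    void store(std::uint64_t addr, std::uint64_t value) {
        words.at((addr - base) / 8) = value;
    }
    std::uint64_t load(std::uint64_t addr) const {
        return words.at((addr - base) / 8);
    }
};

// Model of exception delivery with a privilege change: push the interrupted
// %ss, %rsp, %rflags, %cs, and %rip onto the handler stack, then load the
// handler's values.
void deliver_exception(cpu_sketch& cpu, stack_sketch& stack,
                       std::uint64_t handler_stack_top,
                       std::uint64_t entry_point,
                       std::uint64_t kernel_cs, std::uint64_t kernel_ss) {
    const std::uint64_t saved[5] = {cpu.ss, cpu.rsp, cpu.rflags, cpu.cs, cpu.rip};
    std::uint64_t sp = handler_stack_top;
    for (std::uint64_t value : saved) {
        sp -= 8;                 // a push decrements the stack pointer first
        stack.store(sp, value);
    }
    cpu.rip = entry_point;
    cpu.rsp = sp;                // handler_stack_top - 40
    cpu.cs = kernel_cs;
    cpu.ss = kernel_ss;
    cpu.rflags &= ~std::uint64_t{0x200};   // assumed interrupt gate: clear IF
}
```

After the call, the saved %rip sits at the new stack pointer and the saved %ss sits 32 bytes above it, which is the layout that iretq later consumes.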
Chickadee initializes the x86-64 exception mechanism as follows.
- The OS has a single global interrupt descriptor table, interrupt_descriptors, that’s shared by all cores.
- Each CPU has its own GDT and task state. Different CPUs’ GDTs are very similar except that each CPU defines its own initial stack pointer for interrupts, ensuring that interrupts on different processors can happen simultaneously without colliding. A CPU’s initial stack pointer is the top of the corresponding CPU context stack. The GDT and task state are stored in struct cpustate.
Chickadee’s exception handler starts execution on the CPU-context stack. But the user context’s state, including the five critical registers pushed by the processor’s exception delivery mechanism, belongs in space owned by the user context. For this reason, the exception handler immediately switches to the current thread’s kernel task stack and moves the saved registers there. It must do this without corrupting the other general-purpose registers, which still contain user-level values.
These lines in exception_entry transfer the state.
// change %rsp to the top of the kernel task stack
swapgs
movq %gs:(8), %rsp
addq $PROCSTACK_SIZE, %rsp
// copy data from CPU stack to kernel task stack
pushq %gs:(CPUSTACK_SIZE - 8) // %ss
...
pushq %gs:(CPUSTACK_SIZE - 56) // interrupt number
This code uses a special feature in x86-64 processors called the kernel GS base, which is a hidden register in which the kernel can stash a per-CPU pointer value. In Chickadee, each CPU sets this register to point to its cpustate, which is the bottom of the memory page containing its CPU stack. These instructions thus act as follows:
- swapgs: Swaps in the kernel GS base for the current CPU. This step will be undone by another swapgs instruction when control is returned to the user.
- movq %gs:(8), %rsp: Once the kernel GS base is installed, a memory reference like %gs:(N) refers N bytes into the current CPU’s cpustate structure. Here, the value %gs:(8) corresponds to this_cpu()->current_, which is the address of the struct proc for the currently-running user process on this CPU. So this instruction changes %rsp to current_.
- addq $PROCSTACK_SIZE, %rsp: This moves %rsp to the top of the current process’s kernel task stack.
- pushq %gs:(CPUSTACK_SIZE - N): These lines transfer data from the CPU stack to the kernel task stack, one quadword at a time. The address %gs:(CPUSTACK_SIZE - 8) points to the top quadword on the CPU stack.
Because the offsets are hard-coded (addq $PROCSTACK_SIZE, %rsp and %gs:(CPUSTACK_SIZE - N)), this code will work only if both the CPU stack and the currently-executing process’s kernel task stack were empty when the exception occurred. But that’s always true for user exceptions! The CPU stack is empty unless the scheduler is running, and a process’s kernel task stack is empty when the process is executing in user mode.
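The copy can be modeled with two simulated stack pages. The sketch below assumes both pages are 4096 bytes (512 quadwords) and that both stacks start empty, which is the precondition discussed above; the function and constant names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed stack-page size in quadwords (4096 bytes); illustrative only.
constexpr std::size_t kStackWords = 512;

// Copies the `count` quadwords at the top of `cpu_page` onto an empty task
// stack in `task_page`, mirroring a sequence of
// pushq %gs:(CPUSTACK_SIZE - 8 * k) instructions for k = 1..count.
// Returns the new stack pointer as a word index into `task_page`.
std::size_t transfer_frame(const std::vector<std::uint64_t>& cpu_page,
                           std::vector<std::uint64_t>& task_page,
                           std::size_t count) {
    std::size_t sp = kStackWords;            // empty stack: one past the end
    for (std::size_t k = 1; k <= count; ++k) {
        --sp;                                 // pushq decrements %rsp first
        task_page[sp] = cpu_page[kStackWords - k];
    }
    return sp;
}
```

Because each push reads from a fixed distance below the top of the CPU page and writes to the matching distance below the top of the task page, the frame ends up at identical offsets in both pages. If either stack already held data, the fixed offsets would read or overwrite the wrong quadwords.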
The rest of exception_entry is more straightforward: the handler pushes the rest of the general-purpose registers onto the stack and then calls proc::exception(regstate*). Within proc::exception, the current proc is accessible as this, and the saved regstate is accessible as the function argument.
When proc::exception returns, the exception handler must resume user code where it left off. It does this by popping general-purpose registers, calling swapgs again, and finally ending with iretq, a special instruction that pops %rip, %cs, %rflags, %rsp, and %ss from the stack and lowers CPU privilege accordingly.
System calls: voluntary user context switch
Voluntary context switches, such as system calls, don’t need to save state in the same comprehensive way as involuntary context switches. A calling convention can be established that lets the kernel cut corners. Chickadee system calls are treated like function calls: the kernel only guarantees that it will save and restore the callee-saved registers, which are %rbp, %rbx, and %r12-%r15.
Chickadee system calls use the x86-64 syscall instruction, which was designed specifically for system calls on modern operating systems. syscall is lighter-weight than the exception mechanism: where exceptions save several critical registers on a predefined kernel stack, syscall just juggles some registers around. In particular, syscall:
- Saves the old %rflags in %r11.
- Saves the old %rip in %rcx.
- Changes %rip to the system call entry point, which was predeclared by writing to a special hidden register, MSR_IA32_LSTAR.
- Changes the processor’s privilege levels and flags according to other special hidden registers, MSR_IA32_STAR and MSR_IA32_FMASK.
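The register juggling in the list above can be written as a small model. This is a sketch on a simulated register file, not real machine code; the structure and function names are hypothetical, and the segment changes driven by MSR_IA32_STAR are left out. The flag update follows the documented behavior that bits set in MSR_IA32_FMASK are cleared in %rflags.

```cpp
#include <cassert>
#include <cstdint>

// Simulated registers touched by syscall, plus the MSRs that configure it.
struct syscall_cpu_sketch {
    std::uint64_t rip, rflags, rcx, r11, rsp;
    std::uint64_t lstar;   // MSR_IA32_LSTAR: system call entry point
    std::uint64_t fmask;   // MSR_IA32_FMASK: flag bits to clear on entry
};

// Model of the register effects of syscall. `next_rip` is the address of
// the instruction that follows syscall in user code.
void execute_syscall(syscall_cpu_sketch& cpu, std::uint64_t next_rip) {
    cpu.r11 = cpu.rflags;        // old flags saved in %r11
    cpu.rcx = next_rip;          // return address saved in %rcx
    cpu.rip = cpu.lstar;         // jump to the predeclared entry point
    cpu.rflags &= ~cpu.fmask;    // clear the masked flag bits
    // %rsp is deliberately left untouched.
}
```

Note that the user values of %rcx and %r11 are destroyed, which is acceptable only because the system call convention treats them as caller-saved.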
syscall does not, for example, modify %rsp, which is unchanged from the user value. This means the syscall entry code must jump through some hoops to obtain a meaningful stack pointer.
For simplicity, the Chickadee system call entry point sets up state analogous to the usual exception entry point. Here’s how it works. syscall_entry starts this way:
swapgs
movq %rsp, %gs:(16)
movq %gs:(8), %rsp
addq $PROCSTACK_SIZE, %rsp
These lines, like the corresponding lines in exception_entry, switch the stack pointer to the currently-running process’s kernel task stack. There is one wrinkle: before switching the stack pointer, the code saves the current stack pointer in a special scratch area in the cpustate, %gs:(16).
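The %gs-relative offsets used in the entry code imply a particular layout for the first fields of struct cpustate. The sketch below records that layout, assuming 64-bit pointers. Only the offsets come from the assembly in this document (the yield code later reads %gs:(0) to obtain the cpustate address); the names self_ and syscall_scratch_ are assumptions.

```cpp
#include <cstddef>
#include <cstdint>

struct proc_sketch;   // opaque stand-in for struct proc

// Hypothetical leading fields of struct cpustate.
struct cpustate_sketch {
    cpustate_sketch* self_;          // %gs:(0): address of this structure
    proc_sketch* current_;           // %gs:(8): currently running process
    std::uint64_t syscall_scratch_;  // %gs:(16): holds the user %rsp briefly
};

// The assembly hard-codes these offsets, so any reordering of the fields
// would have to be mirrored in the entry code.
static_assert(offsetof(cpustate_sketch, self_) == 0, "self_ at offset 0");
static_assert(offsetof(cpustate_sketch, current_) == 8, "current_ at offset 8");
static_assert(offsetof(cpustate_sketch, syscall_scratch_) == 16, "scratch at offset 16");
```

Compile-time checks like these are one way to keep hand-written assembly and a C++ structure definition from drifting apart.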
The next step is to push the five critical registers (%ss, %rsp, %rflags, %cs, and %rip) in the same way that the x86-64 exception mechanism would have:
pushq $(SEGSEL_APP_DATA + 3) /* %ss */
pushq %gs:(16) /* %rsp */
pushq %r11 /* %rflags */
pushq $(SEGSEL_APP_CODE + 3) /* %cs */
pushq %rcx /* %rip */
Breaking this down:
- All user code runs with the same %ss value, SEGSEL_APP_DATA. The + 3 represents user privilege mode (the selector’s requested privilege level, “RPL”).
- We stashed the old %rsp value in the cpustate at %gs:(16).
- The syscall instruction put the old %rflags value in %r11.
- All user code runs with the same %cs value, SEGSEL_APP_CODE. Again, the + 3 represents user privilege mode (“CPL”).
- The syscall instruction put the old %rip value in %rcx.
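The five pushes can be mirrored in C++ to show the frame they leave in memory. The selector values below are hypothetical placeholders for SEGSEL_APP_CODE and SEGSEL_APP_DATA, whose real values this document does not give, and the function name is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical selector values; the real SEGSEL_APP_* constants may differ.
constexpr std::uint64_t kSegselAppCode = 0x18;
constexpr std::uint64_t kSegselAppData = 0x20;

// Builds the five-quadword frame that syscall_entry pushes, returned in
// ascending address order: %rip, %cs, %rflags, %rsp, %ss.
std::vector<std::uint64_t> build_syscall_frame(
        std::uint64_t saved_user_rsp,   // value stashed at %gs:(16)
        std::uint64_t r11,              // old %rflags, saved by syscall
        std::uint64_t rcx) {            // old %rip, saved by syscall
    std::vector<std::uint64_t> pushed;          // in push order
    pushed.push_back(kSegselAppData + 3);       // %ss
    pushed.push_back(saved_user_rsp);           // %rsp
    pushed.push_back(r11);                      // %rflags
    pushed.push_back(kSegselAppCode + 3);       // %cs
    pushed.push_back(rcx);                      // %rip
    // Later pushes land at lower addresses, so reverse to get memory order.
    return std::vector<std::uint64_t>(pushed.rbegin(), pushed.rend());
}
```

The result has the same shape as a hardware exception frame, which is what lets the system call path finish with iretq.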
syscall_entry then follows basically the same steps as exception_entry, except it calls proc::syscall(regstate*) rather than proc::exception(regstate*).
When proc::syscall(regstate*) returns, its return value should be passed back to the user process and the user process should be resumed. But the syscall_entry return path is simpler than the exception return path. Here it is:
addq $(8 * 15), %rsp // skip general-purpose registers
pop %fs // restore `%fs`
cli // prevent interrupts after `swapgs`
swapgs
pop %gs // restore `%gs`
addq $(8 * 2), %rsp // skip reg_swapgs, reg_intno, reg_errcode
iretq
Why is it (relatively) safe to skip restoring general-purpose registers? The reason is that both the Chickadee kernel and the Chickadee system call interface follow the compiler’s x86-64 calling convention. Chickadee processes expect callee-saved registers to be preserved by syscall. But these are exactly the registers that the C++ compiler preserves across function calls! Since every kernel function preserves the callee-saved registers (by saving them on entry and restoring them on exit), the syscall_entry assembly code need not restore them. The caller-saved registers may have changed, but that’s OK, because the process code expects them to change.
Kernel yields: voluntary kernel context switch
A kernel task can voluntarily surrender its control of the CPU by calling proc::yield(). This function stores just enough state that the kernel task can be resumed later, then switches to CPU context and runs the scheduler to find another task (assuming another task exists).
Just as with system calls, the natural calling convention for proc::yield() is that callee-saved registers will be preserved. Thus, the proc::yield() implementation, which is in k-exception.S, saves callee-saved registers and the flags register to the stack, in the order determined by struct yieldstate.
proc::yield() then stores a pointer to this saved state in the current proc and switches to executing cpustate::schedule() on the current CPU stack, using the magic kernel GS base to find the current CPU.
// clear interrupts and store yieldstate pointer
cli
movq %rsp, 16(%rdi)
// switch to cpustack
movq %gs:(0), %rdi
leaq CPUSTACK_SIZE(%rdi), %rsp
// jump to scheduler
jmp _ZN8cpustate8scheduleEv
It is safe to set the stack pointer unconditionally to the top of the CPU stack (that is, to CPUSTACK_SIZE(%rdi)) because the CPU stack is unused unless the scheduler is actively running. Note that the proc::yield() function does not return: it has no ret instruction. Instead, it changes its stack pointer and jumps directly to the cpustate::schedule() function. However, the return address of proc::yield() is preserved; it’s stored on the kernel task stack, and will be used later.
A task that has yielded voluntarily will resume only when the scheduler decides it should run again. The following code, in cpustate::schedule(), performs the resume:
set_pagetable(current_->pagetable_);
current_->resume();
After loading the current process’s pagetable, this code calls proc::resume() on that process. The code for that function is also defined in k-exception.S; in this case, it simply changes the stack pointer to point at the stored struct yieldstate, restores the callee-saved registers, and returns. The return instruction in proc::resume() uses the return address that was pushed onto the stack when proc::yield() was called.
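The pairing of proc::yield() and proc::resume() can be modeled on a simulated task stack. This is a sketch only: the register set, the push order, and all names are assumptions, since the real order is fixed by struct yieldstate, which this document does not show. The simulated stack grows with push_back rather than downward in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <vector>

// Hypothetical saved state: callee-saved registers plus %rflags.
struct yield_regs_sketch {
    std::uint64_t rbp, rbx, r12, r13, r14, r15, rflags;
};

// Simulated kernel task: a stack of quadwords and a saved position standing
// in for the yieldstate pointer stored in struct proc.
struct task_sketch {
    std::vector<std::uint64_t> stack;
    std::size_t yields = 0;   // 0 means no saved yield state
};

// Model of yielding: the call pushes a return address, then the saved
// registers are pushed and their position is recorded in the task.
void yield_sketch(task_sketch& t, const yield_regs_sketch& r,
                  std::uint64_t return_address) {
    t.stack.push_back(return_address);
    for (std::uint64_t v : {r.rflags, r.r15, r.r14, r.r13, r.r12, r.rbx, r.rbp}) {
        t.stack.push_back(v);
    }
    t.yields = t.stack.size();
}

// Model of resuming: pop the registers in reverse order, then "return" by
// popping the address that the original call to yield pushed.
std::uint64_t resume_sketch(task_sketch& t, yield_regs_sketch& r) {
    auto pop = [&t] {
        std::uint64_t v = t.stack.back();
        t.stack.pop_back();
        return v;
    };
    r.rbp = pop(); r.rbx = pop(); r.r12 = pop(); r.r13 = pop();
    r.r14 = pop(); r.r15 = pop(); r.rflags = pop();
    t.yields = 0;
    return pop();   // the return address of the yield call
}
```

The key point the model captures is that the yield side never executes a return of its own; the return address it leaves on the task stack is consumed by the resume side.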
Kernel exceptions: involuntary kernel context switch
Chickadee user contexts always run with interrupts enabled. Chickadee kernel tasks, however, start with interrupts disabled. This means that interrupts, such as timer interrupts, are delayed from delivery until a user context starts running.
Context switch mechanisms that switch to kernel mode, namely exception delivery and the syscall instruction, automatically disable interrupts as part of the switch. (This is controlled by the “gate type” for each interrupt descriptor, and by the special MSR_IA32_FMASK register used to configure syscall.) The iretq instruction that returns to user context automatically re-enables interrupts.
However, interrupts can indicate important, latency-sensitive hardware events, so disabling interrupts for a long time can cause performance problems. Chickadee therefore allows kernel tasks to re-enable interrupts. For an example, see the implementation of the SYSCALL_PAUSE system call.
A kernel task with interrupts enabled might be interrupted, causing an involuntary context switch in kernel mode. This engages the usual hardware exception mechanism, but since in this case there is no privilege change, the hardware behaves a little differently. Rather than switching to a new stack, it pushes the five critical registers onto the currently active stack, which is always a kernel task stack (CPU contexts never enable interrupts).
The exception_entry entry point must check whether the exception was received in kernel mode:
testb $3, 24(%rsp)
jz 1f
The first instruction checks the lower two bits of the pushed %cs selector, which hold the exception-time privilege level. If the exception happened in kernel mode, those bits are zero, and the jz 1f instruction will skip over the steps that modify the kernel GS base and copy the saved registers to the kernel task stack. A similar privilege-level check is necessary in the restore path.
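The same decision can be expressed in C++ on a saved selector value. The selector constants in the checks are hypothetical examples; only the rule that the low two bits encode the privilege level (0 for kernel, 3 for user) comes from the text.

```cpp
#include <cstdint>

// The low two bits of a code-segment selector give the privilege level:
// 0 for kernel mode, 3 for user mode.
constexpr unsigned privilege_level(std::uint16_t cs) {
    return cs & 3;
}

// Mirrors the test at the start of exception_entry: true when the saved %cs
// shows that the exception interrupted user code, so the entry code must
// swap the GS base and move the frame to the kernel task stack.
constexpr bool needs_stack_switch(std::uint16_t saved_cs) {
    return privilege_level(saved_cs) != 0;
}

// Example selectors (assumed values): a kernel code selector and a user
// code selector with privilege level 3.
static_assert(!needs_stack_switch(0x08), "kernel selector: skip the switch");
static_assert(needs_stack_switch(0x18 + 3), "user selector: switch stacks");
```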
Starting a process
You may have noticed that in all these prior context switches, a user context resumes only when the relevant kernel task context returns, via ret instructions, to the assembly code from the initial entry point (exception_entry or syscall_entry). So how can a process run for the first time, with no corresponding entry code?
Newly started processes, whether started at initialization time or later via fork, are given a constructed regstate, stored on the corresponding kernel task stack. The kernel initializes this regstate and stores a pointer to it in the corresponding struct proc. The proc::resume() code looks not only for the yield state pushed in case of voluntary kernel context switch, but also for a regstate. If one is found, proc::resume() jumps directly to the assembly that pops general-purpose registers and restores sensitive registers from that regstate.
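A constructed register snapshot for a new process might be filled in along these lines. This is a sketch under stated assumptions: the selector values, the reduced structure, and the function name are hypothetical, and the only flag set is the interrupt-enable bit (bit 9 of %rflags), matching the rule that user contexts always run with interrupts enabled.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical selector values; the real constants may differ.
constexpr std::uint64_t kAppCodeSel = 0x18;
constexpr std::uint64_t kAppDataSel = 0x20;
constexpr std::uint64_t kFlagsIF = 0x200;   // interrupt-enable flag

// Reduced snapshot holding only the registers that iretq restores.
struct initial_regs_sketch {
    std::uint64_t rip, cs, rflags, rsp, ss;
};

// Builds the sensitive registers for a brand-new user process.
initial_regs_sketch make_initial_regs(std::uint64_t entry_point,
                                      std::uint64_t user_stack_top) {
    initial_regs_sketch r{};
    r.rip = entry_point;          // first instruction to execute
    r.cs = kAppCodeSel + 3;       // user code segment, privilege level 3
    r.rflags = kFlagsIF;          // interrupts enabled in user context
    r.rsp = user_stack_top;       // initial user stack pointer
    r.ss = kAppDataSel + 3;       // user data segment, privilege level 3
    return r;
}
```

Once such a snapshot is in place, the first run of the process looks to the restore path exactly like a return from an exception that never happened.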