Operating systems multiplex hardware resources, which requires switching between contexts. Some context switches are voluntary; for example, a system call intentionally and synchronously transfers control to the kernel. Other times, a task switches contexts unexpectedly, because of an interrupt or other exception.
Context switches are difficult because of state. Running tasks use many machine resources, including registers. Safely switching the CPU away from a task requires saving that task’s state so it can resume later. But saving state requires registers—registers the old task was using! It’s easy for a context switch to inadvertently clobber register state on which a task depends. Context switches work only given careful coding and special hardware support.
Kinds of context
Chickadee supports three kinds of context: user context, kernel task context, and kernel CPU context. User context is unprivileged; the two kinds of kernel context have full machine privilege.
Chickadee can suspend kernel tasks. For instance, a Chickadee system call
implementation can voluntarily or involuntarily give up control to another
task, picking up later right where it left off. To make this work, each kernel
task context has its own kernel stack. When the processor is running a
kernel task, the %rsp register points into the corresponding stack. Just
like any stack, a kernel task stack holds local variables. It also holds a
snapshot of the task’s registers when the task is suspended.
Not all operating system kernels support kernel task suspension. For instance, WeensyOS, the CS61 operating system, does not, and microkernels often do not. In non-suspendable operating systems, a user thread can still block, but all kernel state associated with a blocked thread must be managed explicitly by the kernel programmer. Kernel task suspension makes some kinds of kernel programming easier, but it also requires more memory.
Kernel CPU contexts are used when switching from one kernel task to another. Each CPU has its own kernel CPU context; while kernel tasks can switch among CPUs, a CPU context is pinned to a single CPU. CPU contexts are not suspendable, and each CPU context has its own stack.
Kernel stacks share memory pages with corresponding data structures. The
bottom of the page of memory containing a CPU context stack holds a struct
cpustate, and the bottom of the page of memory containing a kernel task stack
holds a struct proc. Kernel functions shouldn’t use too many local variables
or recursively call other functions too deeply, or stack data is liable to
crash into and destroy the cpustate and/or proc.
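The shared-page layout can be sketched in C++. This is a minimal illustration rather than Chickadee's code: the 4096-byte page size, the placeholder proc_sketch structure, and the helpers stack_top and stack_headroom are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed size of the page that holds one kernel task stack (hypothetical).
constexpr std::size_t STACK_PAGE_SIZE = 4096;

// Placeholder for the descriptor stored at the bottom of the page.
// The real struct proc has different members and a different size.
struct proc_sketch {
    std::uint64_t fields[32];   // 256 bytes of stand-in state
};

// The stack starts at the top of the page and grows downward.
inline std::uintptr_t stack_top(std::uintptr_t page_addr) {
    return page_addr + STACK_PAGE_SIZE;
}

// Bytes the stack can still grow before it overwrites the descriptor.
// Returns 0 if the stack pointer has already reached the descriptor.
inline std::size_t stack_headroom(std::uintptr_t page_addr, std::uintptr_t rsp) {
    assert(rsp <= stack_top(page_addr));
    std::uintptr_t descriptor_end = page_addr + sizeof(proc_sketch);
    return rsp > descriptor_end ? rsp - descriptor_end : 0;
}
```

With these assumed sizes, an empty stack has 4096 - 256 = 3840 bytes of headroom; deep recursion or large local arrays eat into that budget.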
struct regstate
Context switches require saving a set of registers and restoring them later.
In Chickadee, this process centers on struct regstate structures. This
structure, which is defined in x86-64.h, has space for all the registers
Chickadee programs may use.
x86-64 register sets are large (struct regstate takes 192 bytes), and saving
and restoring that much state is expensive. Optimized OSes avoid full state
save and restore when possible.
Chickadee doesn’t provide support for saving and restoring floating-point registers or SIMD registers (MMX, SSE).
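To make the 192-byte figure concrete, here is a hedged sketch of a structure with that size. The field names, their order, and the choice of which values fill the 24 quadwords are assumptions for illustration; the authoritative definition is the one in x86-64.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative layout only: 24 quadwords = 192 bytes.
struct regstate_sketch {
    // 15 general-purpose registers (all of them except %rsp)
    std::uint64_t reg_rax, reg_rcx, reg_rdx, reg_rbx, reg_rbp, reg_rsi, reg_rdi;
    std::uint64_t reg_r8, reg_r9, reg_r10, reg_r11;
    std::uint64_t reg_r12, reg_r13, reg_r14, reg_r15;
    // segment registers (assumed to be saved here)
    std::uint64_t reg_fs, reg_gs;
    // interrupt number and error code
    std::uint64_t reg_intno, reg_errcode;
    // the five critical registers, lowest address first
    std::uint64_t reg_rip, reg_cs, reg_rflags, reg_rsp, reg_ss;
};

static_assert(sizeof(regstate_sketch) == 192, "24 quadwords");
static_assert(offsetof(regstate_sketch, reg_rip) == 8 * 19,
              "19 quadwords precede the critical registers");
```

Under this assumed layout, the five critical registers occupy the highest addresses of the structure, so a hardware-pushed frame can serve as its tail.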
User exceptions: involuntary user context switch
We first consider what happens when the CPU takes an exception while executing
user code. These are generally interrupts and faults, and thus involve
involuntary context switches initiated from hardware, but traps (intentional
exceptions caused by an int instruction) use the same mechanism.
The x86-64 exception mechanism involves several tables set up by software and interpreted by microcode (processor hardware). In particular, when an exception occurs:
The hardware looks up the exception number in the interrupt descriptor table (IDT). This table holds an entry for each supported exception.
The IDT entry contains an entry point, which is the start address of the function that will handle the interrupt, and a task segment selector.
The task segment selector is an index into another table, the global descriptor table (GDT). This weird table holds multiple kinds of entries, but in the end, through several layers of indirection, it defines the initial stack pointer for the interrupt handler and the privilege mode for the interrupt handler.
The hardware now knows the entry point to the interrupt handler, the privilege mode in which that handler should run, and the initial stack pointer for that handler. It pushes five critical registers onto the handler stack, in this order: %ss, %rsp, %rflags, %cs, and %rip. Then it changes those critical registers to new values. %rip is set to the handler entry point, %rsp is set to the handler stack (minus 40 bytes, to account for the pushed registers), and %cs, %ss, and %rflags are updated to account for the handler’s new privilege.
Then the handler software takes over.
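The delivery steps can be modeled in a few lines of C++. This is a toy simulation, not hardware-accurate code: critical_regs, stack_model, and deliver_exception are hypothetical names, and the model ignores the table lookups and any error code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <vector>

// Toy model of the five critical registers.
struct critical_regs {
    std::uint64_t rip, cs, rflags, rsp, ss;
};

// Toy model of a downward-growing stack of quadwords. `top` is the address
// just past the highest slot; mem[0] is the lowest address.
struct stack_model {
    std::uint64_t top;
    std::vector<std::uint64_t> mem;

    stack_model(std::uint64_t top_addr, std::size_t nslots)
        : top(top_addr), mem(nslots, 0) {}

    std::uint64_t& at(std::uint64_t addr) {
        std::uint64_t base = top - 8 * mem.size();
        assert(addr % 8 == 0 && addr >= base && addr < top);
        return mem[(addr - base) / 8];
    }
};

// Push the old critical registers on the handler stack, then load the new
// values. `handler` carries the entry point plus the selectors and flags
// for the handler's privilege level; its rsp field is filled in here.
inline critical_regs deliver_exception(const critical_regs& old,
                                       stack_model& stk,
                                       critical_regs handler) {
    std::uint64_t sp = stk.top;
    for (std::uint64_t v : {old.ss, old.rsp, old.rflags, old.cs, old.rip}) {
        sp -= 8;            // a push decrements the stack pointer...
        stk.at(sp) = v;     // ...then stores the value
    }
    handler.rsp = sp;       // handler stack top minus 40 bytes
    return handler;
}
```

After the call, the slot at the returned rsp holds the interrupted %rip and the highest slot holds the interrupted %ss.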
Chickadee initializes the x86-64 exception mechanism as follows.
The OS has a single global interrupt descriptor table, interrupt_descriptors, that’s shared by all cores.
Each CPU has its own GDT and task state. Different CPUs’ GDTs are very similar except that each CPU defines its own initial stack pointer for interrupts, ensuring that interrupts on different processors can happen simultaneously without colliding. A CPU’s initial stack pointer is the top of the corresponding CPU context stack. The GDT and task state are stored in struct cpustate.
When Chickadee’s exception handler gains control from a user-mode exception, it is executing on the kernel CPU-context stack. But that stack is shared by all code running on the CPU; thread state, such as the five critical registers pushed by the exception delivery mechanism, belongs instead on kernel task stacks. So the exception handler’s first job is to switch to the current thread’s kernel task stack and move the saved registers there. It must do this without corrupting the other general-purpose registers, which still contain user-level values.
These lines in exception_entry transfer the state.
// change %rsp to the top of the kernel task stack
swapgs
movq %gs:(8), %rsp
addq $PROCSTACK_SIZE, %rsp
// copy data from CPU stack to kernel task stack
pushq %gs:(CPUSTACK_SIZE - 8) // %ss
...
pushq %gs:(CPUSTACK_SIZE - 56) // interrupt number
This code uses a special feature in x86-64 processors called the kernel GS
base, which is a hidden register in which the kernel can stash a
per-CPU pointer value. In Chickadee, each CPU sets this register to point to
its cpustate, which is the bottom of the memory page containing its CPU
stack. These instructions thus act as follows:
- swapgs: Swaps in the kernel GS base for the current CPU. This step will be undone by another swapgs instruction when control is returned to the user.
- movq %gs:(8), %rsp: Once the kernel GS base is installed, a memory reference like %gs:(N) refers to the location N bytes into the current CPU’s cpustate structure. Here, the value %gs:(8) corresponds to this_cpu()->current_, which is the address of the struct proc for the currently-running user process on this CPU. So this instruction changes %rsp to current_.
- addq $PROCSTACK_SIZE, %rsp: This moves %rsp to the top of the current process’s kernel task stack.
- pushq %gs:(CPUSTACK_SIZE - N): These lines transfer data from the CPU stack to the kernel task stack, one quadword at a time. The address %gs:(CPUSTACK_SIZE - 8) points to the top quadword on the CPU stack.
Because the offsets are hard-coded (addq $PROCSTACK_SIZE, %rsp and
%gs:(CPUSTACK_SIZE - N)), this code will work only
if both the CPU stack and the currently-executing process’s kernel task stack
were empty when the exception occurred. But that’s always true
for user exceptions! The
CPU stack is empty unless the scheduler is running, and a process’s kernel
task stack is empty when the process is executing in user mode.
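The quadword-by-quadword transfer can be sketched as a small C++ model. It assumes both stacks were empty before the exception, as the text requires; the 64-quadword stack size and the names stack_words and transfer_frame are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Quadwords transferred: %ss, %rsp, %rflags, %cs, %rip, the error code, and
// the interrupt number (7 * 8 = 56 bytes, matching CPUSTACK_SIZE - 56).
constexpr std::size_t FRAME_QUADS = 7;

// Hypothetical stack size in quadwords, chosen only for this model.
constexpr std::size_t STACK_QUADS = 64;
using stack_words = std::array<std::uint64_t, STACK_QUADS>;

// Copy the frame from the top of the CPU stack to the top of an empty kernel
// task stack, highest address first, as the sequence of pushq instructions
// does. Returns the index of the new stack pointer within `task_stack`.
inline std::size_t transfer_frame(const stack_words& cpu_stack,
                                  stack_words& task_stack) {
    std::size_t sp = STACK_QUADS;        // empty stack: pointer at the top
    for (std::size_t n = 1; n <= FRAME_QUADS; ++n) {
        --sp;                                          // pushq moves %rsp down
        task_stack[sp] = cpu_stack[STACK_QUADS - n];   // %gs:(CPUSTACK_SIZE - 8n)
    }
    return sp;
}
```

Because source and destination are both indexed from the top of an empty stack, the frame lands at the same relative offsets on the task stack. If either stack held other data, the fixed offsets would read or overwrite the wrong slots.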
The rest of exception_entry is more straightforward: the handler
pushes the rest of the general-purpose registers onto the stack and then calls
proc::exception(regstate*).
Within proc::exception, the current proc is accessible through the this
pointer, and the saved regstate is accessible as the function argument.
When proc::exception returns, the exception handler must resume
user code where it left off. It does this by popping general-purpose
registers, executing swapgs again, and finally ending with iretq, a special
instruction that pops %rip, %cs, %rflags, %rsp, and %ss from the stack and
lowers CPU privilege accordingly.
Saving and restoring %gs
User-mode programs can modify %gs using instructions like movw %ax, %gs and popw %gs. So why doesn’t Chickadee reset %gs to a known-good value on kernel entry? And why doesn’t Chickadee popq %gs in restore_and_iret when resuming a kernel-mode task?
The reason is that loading %gs clears the GS base. When resuming a process, that’s no problem: current Chickadee processes never use the GS base. But in kernel mode the GS base (which was installed by swapgs) is critically important and the %gs numeric value doesn’t matter at all. Changing %gs to a known-good value would restore something useless while clearing something important. So kernel tasks don’t bother.
If processes did use the GS base (for instance, to access per-thread state), Chickadee would need to be a bit more careful.
The alternate exception entry point
A few uncommon exceptions use a different entry point, alt_exception_entry: #DB (debug exceptions), NMI (non-maskable interrupts), and #MC (machine-check exceptions). These exceptions are special because they can happen at any time, including while other interrupts are disabled. For instance, a #DB exception can trigger at the moment the normal exception handler gets control! This introduces problems difficult to handle without an alternate entry point, including an alternate exception stack (which in Chickadee starts 1024 bytes down from the top of the CPU stack).
You’ll never encounter these exceptions in handout Chickadee on QEMU. NMI and #MC happen when hardware has serious problems. #DB, though, implements useful debugging features such as breakpoints and efficient memory watchpoints, so you may want to add support for it. It’s a little weird that #DB requires an alternate entry point, but x86-64 is weird.
Potential gotcha: Handlers for alternate-entry-point exceptions must not call cpustate::schedule(). Instead, they must use regs_ = regs; resume() to resume the interrupted task. If an alternate-stack exception happens during a normal exception handler, the CPU stack can contain important information that cpustate::schedule() would destroy.
System calls: voluntary user context switch
Voluntary context switches, such as system calls, don’t need to save state in
the same comprehensive way as involuntary context switches. A calling
convention can be established that lets the kernel cut corners. Chickadee
system calls are treated like function calls: the kernel only guarantees that
it will save and restore the callee-saved registers, which are %rbp, %rbx,
and %r12-%r15.
Chickadee system calls use the x86-64 syscall instruction, which was
designed specifically for system calls on modern operating systems. syscall
is lighter-weight than the exception mechanism: where exceptions save several
critical registers on a predefined kernel stack, syscall just juggles some
registers around. In particular, syscall:
- Saves the old %rflags in %r11.
- Saves the old %rip in %rcx.
- Changes %rip to the system call entry point, which was predeclared by writing to a special hidden register, MSR_IA32_LSTAR.
- Changes the processor’s privilege levels and flags according to other special hidden registers, MSR_IA32_STAR and MSR_IA32_FMASK.
syscall does not, for example, modify %rsp, which is unchanged from the
user value. This means the syscall entry code must jump through some hoops
to obtain a meaningful stack pointer.
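The register juggling performed by syscall can be modeled as follows. This is a toy sketch: syscall_cpu_model and do_syscall are hypothetical names, the lstar and fmask fields merely stand in for the MSRs, and the segment and privilege changes driven by MSR_IA32_STAR are left out.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of the registers the syscall instruction touches.
struct syscall_cpu_model {
    std::uint64_t rip, rflags, rcx, r11, rsp;
    std::uint64_t lstar;   // stands in for MSR_IA32_LSTAR (entry point)
    std::uint64_t fmask;   // stands in for MSR_IA32_FMASK (flags to clear)
};

// Apply the register effects. `next_rip` is the address of the instruction
// after syscall, which is where user code later resumes.
inline void do_syscall(syscall_cpu_model& cpu, std::uint64_t next_rip) {
    cpu.r11 = cpu.rflags;        // old flags saved in %r11
    cpu.rcx = next_rip;          // return address saved in %rcx
    cpu.rip = cpu.lstar;         // jump to the predeclared entry point
    cpu.rflags &= ~cpu.fmask;    // clear every flag selected by the mask
    // %rsp is deliberately left alone: it still holds the user value
}
```

With the interrupt-enable flag (bit 9, value 0x200) in the mask, a user %rflags of 0x202 becomes 0x2 in the kernel, while %r11 keeps the original 0x202 for the return path.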
For simplicity, the Chickadee system call entry point sets up state analogous
to the usual exception entry point. Here’s how it works. syscall_entry
starts this way:
swapgs
movq %rsp, %gs:(16)
movq %gs:(8), %rsp
addq $PROCSTACK_SIZE, %rsp
These lines, like the corresponding lines in exception_entry, switch
the stack pointer to the currently-running process’s kernel task stack. There
is one wrinkle: before switching the stack pointer, the code saves the current
(user) stack pointer in a special scratch area in the cpustate, %gs:(16).
The next step is to push the five critical registers (%ss, %rsp,
%rflags, %cs, and %rip) in the same way that the x86-64 exception
mechanism would have:
pushq $(SEGSEL_APP_DATA + 3) /* %ss */
pushq %gs:(16) /* %rsp */
pushq %r11 /* %rflags */
pushq $(SEGSEL_APP_CODE + 3) /* %cs */
pushq %rcx /* %rip */
Breaking this down:
- All user code runs with the same %ss value, SEGSEL_APP_DATA. The + 3 represents user privilege mode (“DPL”).
- We stashed the old %rsp value in the cpustate at %gs:(16).
- The syscall instruction put the old %rflags value in %r11.
- All user code runs with the same %cs value, SEGSEL_APP_CODE. Again, the + 3 represents user privilege mode (“CPL”).
- The syscall instruction put the old %rip value in %rcx.
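The five pushes amount to assembling the frame that iretq will later consume. A hedged C++ sketch of that assembly follows; the selector values 0x18 and 0x20 are placeholders rather than Chickadee's actual constants, and iret_frame and frame_from_syscall are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder selector values; the real SEGSEL_APP_CODE and SEGSEL_APP_DATA
// are defined by Chickadee and may differ.
constexpr std::uint64_t SKETCH_APP_CODE = 0x18;
constexpr std::uint64_t SKETCH_APP_DATA = 0x20;

// The five quadwords in ascending address order, as iretq expects them.
struct iret_frame {
    std::uint64_t rip, cs, rflags, rsp, ss;
};

// Build the frame from the values available at system call entry:
// %rcx and %r11 as left by syscall, plus the stashed user stack pointer.
inline iret_frame frame_from_syscall(std::uint64_t rcx, std::uint64_t r11,
                                     std::uint64_t saved_user_rsp) {
    iret_frame f{};
    f.ss = SKETCH_APP_DATA + 3;     // fixed user data selector, privilege 3
    f.rsp = saved_user_rsp;         // stashed before the stack switch
    f.rflags = r11;                 // syscall saved the old flags here
    f.cs = SKETCH_APP_CODE + 3;     // fixed user code selector, privilege 3
    f.rip = rcx;                    // syscall saved the return address here
    return f;
}
```

The assignments appear in push order (%ss first, %rip last), which leaves %rip at the lowest address, where iretq looks first.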
syscall_entry then follows basically the same steps as
exception_entry, except it calls proc::syscall(regstate*) rather
than proc::exception(regstate*).
When proc::syscall(regstate*) returns, its return value should be passed
back to the user process and the user process should be resumed. But the
syscall_entry return path is much simpler than the exception return path.
Here it is:
addq $(8 * 19), %rsp
swapgs
iretq
We can get away with this because the C++ compiler naturally preserved the
callee-saved registers, the %rax register already contains the desired
return value, and the syscall calling convention says all other caller-saved
registers have garbage values when the system call returns. It thus suffices
to skip over the 19 saved quadwords that sit below the five critical registers
(the general-purpose registers and the other saved values), restore the hidden
kernel GS base register, and execute iretq to restore the user context.
Kernel yields: voluntary kernel context switch
A kernel task can voluntarily surrender its control of the CPU by calling
proc::yield(). This function stores just enough state that the kernel task
can be resumed later, then switches to CPU context and runs the scheduler to
find another task (assuming another task exists).
Just as with system calls, the natural calling convention for proc::yield()
is that callee-saved registers will be preserved. Thus, the proc::yield()
implementation, which is in k-exception.S, saves callee-saved registers
and the flags register to the stack, in the order determined by struct
yieldstate.
proc::yield() then stores a pointer to this saved state in the current proc
and switches to executing cpustate::schedule() on the current CPU stack,
using the magic kernel GS base to find the current CPU.
/* store yieldstate pointer */
movq %rsp, 16(%rdi)
/* disable interrupts, switch to cpustack */
cli
movq %rdi, %rsi
movq %gs:(0), %rdi
leaq CPUSTACK_SIZE(%rdi), %rsp
/* call scheduler */
jmp _ZN8cpustate8scheduleEP4proc
It is safe to set the stack pointer unconditionally to the top of the CPU
stack (that is, to CPUSTACK_SIZE(%rdi)) because the CPU stack is unused
unless the scheduler is actively running. Note that the proc::yield()
function does not return: it has no ret instruction. Instead, it changes
its stack pointer and jumps directly to the cpustate::schedule() function.
However, the return address of proc::yield() is preserved; it’s stored on
the kernel task stack, and will be used later.
A task that has yielded voluntarily will resume only when the scheduler
decides it should run again. The following code, in cpustate::schedule(),
performs the resume:
set_pagetable(current_->pagetable_);
current_->resume();
After loading the current process’s pagetable, this code calls
proc::resume() on that process. The code for that function is also defined
in k-exception.S; in this case, it simply changes the stack pointer to
point at the stored struct yieldstate, restores the callee-saved registers
and flags, and returns. The return instruction in proc::resume() uses the
return address that was pushed onto the stack when proc::yield() was called.
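The way a yield and a later resume share one return address can be modeled with a toy stack. This is a hedged sketch, not the assembly in k-exception.S: task_model, model_yield, and model_resume are hypothetical names, and the seven saved slots (%rbp, %rbx, %r12-%r15, and the flags) are stored in an assumed order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Six callee-saved registers plus the flags register.
constexpr std::size_t YIELD_SLOTS = 7;

struct task_model {
    std::vector<std::uint64_t> stack;    // back() is the current top of stack
    std::size_t yieldstate_index = 0;    // stand-in for the pointer kept in the proc
    bool yielded = false;
};

// The call instruction pushes the return address; yield then pushes its
// saved registers and records where the saved state begins.
inline void model_yield(task_model& t, std::uint64_t return_address,
                        const std::uint64_t (&saved)[YIELD_SLOTS]) {
    t.stack.push_back(return_address);
    for (std::uint64_t v : saved) {
        t.stack.push_back(v);
    }
    t.yieldstate_index = t.stack.size() - YIELD_SLOTS;
    t.yielded = true;
}

// Resume pops the saved registers (a real kernel reloads them; this model
// just discards them) and then "returns" through the address that the
// original call to yield pushed.
inline std::uint64_t model_resume(task_model& t) {
    assert(t.yielded);
    t.stack.resize(t.yieldstate_index);
    std::uint64_t return_address = t.stack.back();
    t.stack.pop_back();
    t.yielded = false;
    return return_address;
}
```

After a resume, the model's stack is exactly as it was before the yield, which is what lets the caller of the yield continue as if an ordinary function call had returned.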
Kernel exceptions: involuntary kernel context switch
Chickadee user contexts always run with interrupts enabled. Chickadee kernel tasks, however, start with interrupts disabled. This means that interrupts, such as timer interrupts, are not delivered until a user context starts running.
Context switch mechanisms that switch to kernel mode, namely exception
delivery and the syscall instruction, automatically disable interrupts as
part of the switch. (This is controlled by the “gate type” for each interrupt
descriptor, and by the special MSR_IA32_FMASK register used to configure
syscall.) The iretq instruction that returns to user context automatically
re-enables interrupts.
However, interrupts can indicate important, latency-sensitive hardware events,
so disabling interrupts for a long time can cause performance problems.
Chickadee therefore allows kernel tasks to re-enable interrupts. For an
example, see the implementation of the SYSCALL_PAUSE system call.
A kernel task with interrupts enabled might be interrupted, causing an involuntary context switch in kernel mode. This engages the usual hardware exception mechanism, but since in this case there is no privilege change, the hardware behaves a little differently. Rather than switching to a new stack, it pushes the five critical registers onto the currently active stack, which is always a kernel task stack (kernel CPU contexts never enable interrupts).
The exception_entry entry point must check whether the exception was
received in kernel mode:
testb $3, 24(%rsp)
jz 1f
The first instruction checks the lower two bits of the pushed %cs selector,
which hold the exception-time privilege level. If the exception happened in
kernel mode, those bits are zero, and the jz 1f instruction will skip over
the steps that modify the kernel GS base and copy the saved registers to the
kernel task stack. A similar privilege-level check is necessary in the restore
path.
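The same privilege test is easy to express in C++. The helper name exception_from_user_mode is hypothetical; the bit layout (privilege level in the low two bits of the selector) is the one described above.

```cpp
#include <cassert>
#include <cstdint>

// The low two bits of a saved %cs selector hold the privilege level at the
// time of the exception: 0 for kernel mode, 3 for user mode.
inline bool exception_from_user_mode(std::uint64_t saved_cs) {
    return (saved_cs & 3) != 0;
}
```

A selector such as 0x08 (privilege bits 0) reports kernel mode, while any selector with + 3 added reports user mode; those example values are illustrative, not Chickadee's constants.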
Starting a process
You may have noticed that in all these prior context switches, a user context
resumes only when the relevant kernel task context returns, via ret
instructions, to the assembly code from the initial entry point
(exception_entry or syscall_entry). So how can a process run for
the first time, with no corresponding entry code?
Newly started processes, whether started at initialization time or later via
fork, are given a constructed regstate, stored on the corresponding kernel
task stack. The kernel initializes this regstate and stores a pointer to it
in the corresponding struct proc. The proc::resume() code looks not only
for the yield state pushed in case of voluntary kernel context switch, but
also for a regstate. If one is found, proc::resume() jumps directly to the
assembly that pops general-purpose registers and restores sensitive registers
from that regstate.
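Constructing a first register snapshot might look like the following sketch. It is an illustration under stated assumptions, not Chickadee's initialization code: initial_regs and make_initial_regs are hypothetical names, the selector values are placeholders, and only the interrupt-enable flag is set in the flags word.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder selector values; Chickadee's real constants may differ.
constexpr std::uint64_t DEMO_USER_CODE_SEL = 0x18;
constexpr std::uint64_t DEMO_USER_DATA_SEL = 0x20;

// Interrupt-enable flag: bit 9 of %rflags.
constexpr std::uint64_t DEMO_FLAG_IF = std::uint64_t{1} << 9;

// Simplified snapshot: general-purpose registers plus the critical five.
struct initial_regs {
    std::uint64_t gpr[15];
    std::uint64_t rip, cs, rflags, rsp, ss;
};

// Build the snapshot a brand-new user process would be "resumed" from.
inline initial_regs make_initial_regs(std::uint64_t entry_point,
                                      std::uint64_t user_stack_top) {
    initial_regs r{};                    // every register starts at zero
    r.rip = entry_point;                 // first user instruction to run
    r.rsp = user_stack_top;              // empty user stack
    r.cs = DEMO_USER_CODE_SEL + 3;       // user privilege
    r.ss = DEMO_USER_DATA_SEL + 3;
    r.rflags = DEMO_FLAG_IF;             // user code runs with interrupts enabled
    return r;
}
```

Because the snapshot has the same shape as one saved by a real exception, the ordinary restore path can launch the process without any special case for first runs.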