Memory and isolation
- We need to talk about memory, registers*, and process isolation!
*Registers are a kind of memory
Registers
- The fastest storage medium available to software
- Kilobytes; less than a nanosecond to access; SRAM technology
- Named, not numbered! On x86-64, %rax, %rbx, %rsp, %r9, %cr3, …
- One set of registers per CPU core (or hyperthread)
- Registers on one core are inaccessible to others
- The registers accessed by an instruction are encoded into the instruction
- Example:
0: 89 d8       movl %ebx, %eax
2: 89 ca       movl %ecx, %edx
4: 48 89 d8    movq %rbx, %rax
7: 4c 89 c8    movq %r9, %rax
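To make "encoded into the instruction" concrete, here is a minimal sketch that pulls the register numbers back out of the bytes of a register-to-register mov (opcode 0x89). The names mov_regs and decode_mov_rr are made up for this illustration, and it handles only this one instruction form. It assumes the standard x86-64 register numbering (%rax = 0, %rcx = 1, %rdx = 2, %rbx = 3, ..., %r9 = 9).

```c
#include <assert.h>
#include <stdint.h>

// Register numbers decoded from a register-to-register `mov` (opcode 0x89).
// (Hypothetical type for this sketch.)
typedef struct {
    int src;   // source register number, from the ModRM `reg` field
    int dst;   // destination register number, from the ModRM `r/m` field
    int is64;  // 1 if a REX.W prefix selects 64-bit operands
} mov_regs;

// Decode `[REX] 89 /r` where the ModRM byte is register-direct (top bits 11).
// `code` must point at 2 or 3 bytes of exactly that form.
mov_regs decode_mov_rr(const uint8_t* code) {
    uint8_t rex = 0;
    if ((code[0] & 0xF0) == 0x40) {    // REX prefixes are 0x40 through 0x4F
        rex = *code++;
    }
    assert(code[0] == 0x89);           // MOV r/m, r
    uint8_t modrm = code[1];
    assert((modrm >> 6) == 3);         // register-direct operands only
    mov_regs r;
    r.src = ((modrm >> 3) & 7) | ((rex & 0x4) ? 8 : 0);  // REX.R extends `reg`
    r.dst = (modrm & 7) | ((rex & 0x1) ? 8 : 0);         // REX.B extends `r/m`
    r.is64 = (rex & 0x8) != 0;                           // REX.W
    return r;
}
```

For the bytes 89 d8 this yields source 3 (%ebx) and destination 0 (%eax); for 4c 89 c8 the REX prefix turns the source into register 9 (%r9) and marks the operands as 64-bit.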
Memory (aka primary memory)
- The next fastest storage medium available to software
- Gigabytes; tens of nanoseconds to access; DRAM technology
- CPU caches speed this up
- Numbered addresses; on x86-64, physical addresses range over \([0, 2^{52})\) (or less, depending on CPU model)
- Unified view across all CPU cores: any core can access any physical address
- Address accessed by an instruction can be encoded into the instruction or indirect, i.e., the address depends on machine state
- Example:
# uint64_t* ptr = (uint64_t*) 0x1000; *ptr = %rax
0: 48 89 04 25 00 10 00 00    movq %rax, 0x1000
# uint64_t* ptr = (uint64_t*) %rsp; *ptr = %rax
8: 48 89 04 24                movq %rax, (%rsp)
- CPU instructions are read from primary memory
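As a rough illustration of an address "encoded into the instruction", the sketch below recovers the address from the bytes of the first example store. The function name absolute_store_address is invented here, and it recognizes only this exact encoding (REX.W, opcode 0x89, ModRM 0x04, SIB 0x25, then a 4-byte little-endian displacement).

```c
#include <assert.h>
#include <stdint.h>

// Return the absolute address stored in `48 89 04 25 <disp32>`
// (that is, `movq %rax, ADDR`). Hypothetical helper for this sketch.
uint64_t absolute_store_address(const uint8_t* code) {
    assert(code[0] == 0x48 && code[1] == 0x89);  // REX.W, MOV r/m64, r64
    assert(code[2] == 0x04 && code[3] == 0x25);  // ModRM + SIB: disp32 only, no base or index
    // The four displacement bytes are stored least-significant first.
    uint32_t disp = (uint32_t) code[4] | ((uint32_t) code[5] << 8)
        | ((uint32_t) code[6] << 16) | ((uint32_t) code[7] << 24);
    // In 64-bit mode the processor sign-extends the 32-bit displacement.
    return (uint64_t) (int64_t) (int32_t) disp;
}
```

The second example, movq %rax, (%rsp), has no address bytes at all: its 4-byte encoding names only the register, so the address depends on machine state at run time.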
Process isolation
- The kernel is the software that runs with full machine privilege over all computer resources (modulo virtual machine monitors…)
- The kernel safely shares machine resources with unprivileged processes, according to OS policy
- Process isolation means processes cannot trick the kernel into violating
these policies
- Processes can’t examine other processes’ internal state unless allowed
- Processes can’t lock out other code from running, maliciously or accidentally
- Kernel isolation is a necessary part of process isolation: unprivileged processes mustn’t be able to trick the kernel into running arbitrary code in privileged mode
- Which means the kernel’s instructions must be in primary memory processes cannot modify
- If OS policy involves any secrets at all (almost always it does!), then kernel instructions and data must be in primary memory processes cannot read
Safe sharing
- So process software runs without full access to machine resources
- Kernel ensures this
- But how?
- Virtualization
- Each process runs in a kind of virtual computer, where the process appears to have access to machine resources, but is prevented from doing too much damage
- Virtualization can be implemented entirely in software (interpretation), or with hardware support
- Way faster to use hardware support!
Hardware virtualization
- Software accesses registers, memory, and the CPU
- Sure, other devices too, sometimes
- These are the most important resources for hardware virtualization support
Dangerous registers, dangerous instructions
- A CPU register defines the current privilege level
- On x86-64, lower two bits of %cs
- CPL 0 is full machine privilege
- CPL 1–3 is unprivileged (most OSes use only CPL 3)
- Registers that would violate process isolation can only be accessed at CPL 0
- Instructions that could violate process isolation can only be executed at CPL 0
- Modern processors have support for an even more privileged level, like “CPL -1”, that supports virtual machine monitors; there are instructions that can only be executed at that privilege level
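A minimal sketch of reading the privilege level out of a %cs value. The helper names are made up for this example, and the selector values in the usage note are illustrative assumptions rather than values taken from any particular OS.

```c
#include <stdint.h>

// The low two bits of a %cs value give the current privilege level (CPL).
int cpl_of(uint16_t cs) {
    return cs & 3;
}

// CPL 0 is full machine privilege; CPL 1 through 3 are unprivileged.
int is_privileged(uint16_t cs) {
    return cpl_of(cs) == 0;
}
```

For instance, a %cs value of 0x08 has CPL 0, while 0x1b (0x18 with the low two bits set) has CPL 3.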
Wait a minute!
- Process isolation is an operating system policy!
- How can hardware know what isolation policy an OS will enforce?
- It doesn’t
- But most modern operating systems have similar policies
- An unprivileged process shouldn’t be able to monopolize any resource or deny it to other processes
- If an operating system has a different policy, it can implement it by adding more virtualization techniques
Memory protection
- Sections of memory can only be accessed at CPL 0
Time protection
- Software running at CPL 0 can run indefinitely
- Software running at other CPLs cannot run indefinitely
Where are the cops?
- Computer software runs in a low-trust environment
- Whenever you see a statement like “cannot run indefinitely” or “can only be
accessed at CPL 0”, ask “or what?”
- What will the hardware do if a process runs an infinite loop, or tries to execute a dangerous instruction?
Exceptions
- Exceptional control flow (traps, faults, interrupts)
- The hardware validates dangerous operations
- Processors validate dangerous instructions
- Timer keeps track of time
- If unprivileged software does something illegal, hardware stops executing the unprivileged software and runs the kernel instead
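The checks above can be sketched as a toy model of one processor step. This is only an illustration: toy_cpu, toy_step, and the tick-counting timer are invented here, and real hardware is far more involved. The idea it shows is that a dangerous instruction at CPL other than 0, or a timer running out, stops the current software and switches the CPU to privileged mode.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { EXC_NONE, EXC_FAULT, EXC_TIMER } toy_exception;

// Hypothetical simulated CPU state.
typedef struct {
    int cpl;              // current privilege level (0 = full privilege)
    unsigned ticks_left;  // steps remaining until the timer fires (must be > 0)
} toy_cpu;

// Simulate one instruction. Returns the exception it caused, if any.
// On any exception the CPU switches to CPL 0, i.e. the kernel runs next.
toy_exception toy_step(toy_cpu* cpu, bool instr_is_dangerous) {
    assert(cpu->ticks_left > 0);
    toy_exception e = EXC_NONE;
    if (instr_is_dangerous && cpu->cpl != 0) {
        e = EXC_FAULT;                    // the instruction is not executed
    } else if (--cpu->ticks_left == 0) {
        e = EXC_TIMER;                    // time is up, even for a legal instruction
    }
    if (e != EXC_NONE) {
        cpu->cpl = 0;                     // hardware enters privileged mode
    }
    return e;
}
```

In this model an infinite loop of harmless instructions at CPL 3 still ends in EXC_TIMER after ticks_left steps, which is the answer to "or what?" for "cannot run indefinitely".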
Exceptions and virtual computers
- But the OS must determine how exceptions are handled
- OS policy decides how the virtual computer that runs processes should behave
- Maybe the hardware and the OS disagree on how illegal something is
- So what happens to the processor’s state on an exception?
- Especially registers, which the kernel will definitely need!
- CPL is set to privileged mode
- Instruction pointer %rip is set to a kernel instruction
- How are the new mode and kernel instruction configured?
- What happens to the old CPL and %rip?
Processor configuration
- Privileged system registers configure exceptions
- The global descriptor table (GDT) and interrupt descriptor table (IDT)
- IDT defines entry points for every possible exception
- What kernel instruction will start handling the exception? (The entry point)
- Can the exception be invoked in software, like the debug breakpoint int3?
- What CPL will be used for handling the exception? (by reference to GDT)
- …
- GDT defines available privilege modes
- GDT also defines a task state segment
- Preconfigured location at which exception-time registers are saved (e.g., old CPL, old %rip)
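To show how one IDT entry can answer the questions above, here is a sketch that packs and unpacks a 64-bit interrupt gate using the x86-64 gate layout. The names toy_gate, make_interrupt_gate, gate_handler, and gate_dpl are invented for this example and are not the course code's set_gate; the interrupt stack table field is left at zero for simplicity.

```c
#include <stdint.h>

// One 16-byte IDT entry, viewed as two 64-bit words (hypothetical type).
typedef struct {
    uint64_t low;
    uint64_t high;
} toy_gate;

// Pack an interrupt gate.
// - handler:  address of the kernel entry point for this exception
// - selector: code segment selector (an index into the GDT) used by the handler
// - dpl:      least-privileged CPL allowed to invoke the gate from software
toy_gate make_interrupt_gate(uint64_t handler, uint16_t selector, unsigned dpl) {
    toy_gate g;
    g.low = (handler & 0xFFFF)                 // handler bits 15:0
        | ((uint64_t) selector << 16)          // code segment selector
        | ((uint64_t) 0xE << 40)               // type 0xE: 64-bit interrupt gate
        | ((uint64_t) (dpl & 3) << 45)         // descriptor privilege level
        | ((uint64_t) 1 << 47)                 // present bit
        | ((handler & 0xFFFF0000) << 32);      // handler bits 31:16, stored in bits 63:48
    g.high = handler >> 32;                    // handler bits 63:32
    return g;
}

// Recover the entry point address from a gate.
uint64_t gate_handler(toy_gate g) {
    return (g.low & 0xFFFF) | ((g.low >> 32) & 0xFFFF0000) | (g.high << 32);
}

// Recover the privilege level required to invoke the gate from software.
unsigned gate_dpl(toy_gate g) {
    return (unsigned) ((g.low >> 45) & 3);
}
```

A gate built with dpl 3 can be invoked by unprivileged software (as with the breakpoint exception), while a gate with dpl 0 cannot; the selector field is the "reference to GDT" that determines the handler's CPL.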
Registers saved by hardware during an exception
// end of struct regstate:
uint64_t reg_rip; // instruction pointer
uint64_t reg_cs; // CPL
uint64_t reg_rflags; // flags (including privilege flags)
uint64_t reg_rsp; // stack pointer
uint64_t reg_ss;
- Hardware aims to save as few registers as possible (for efficiency)
- Kernel software is responsible for saving everything else
// k-exception.S:
exception_entry_3:
pushq $0
pushq $3
jmp exception_entry
exception_entry:
...
push %gs
push %fs
pushq %r15
pushq %r14
...
pushq %rdx
pushq %rcx
pushq %rax
- There’s more than one way to do it: in ARM, each exception level has its own special registers used only for saving exception-time state (e.g., ELR_EL1, SPSR_EL1)
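The order of the fields at the end of struct regstate follows from the order in which x86-64 hardware pushes them. The sketch below simulates that push sequence on an ordinary array; toy_hw_frame and toy_push_frame are invented names, and the register values in any usage are placeholders.

```c
#include <stdint.h>
#include <string.h>

// The five values pushed by hardware, lowest address first.
typedef struct {
    uint64_t reg_rip;
    uint64_t reg_cs;
    uint64_t reg_rflags;
    uint64_t reg_rsp;
    uint64_t reg_ss;
} toy_hw_frame;

// Simulate the hardware pushes onto a stack that grows down from `stack_top`.
// The caller must provide at least 5 writable slots below `stack_top`.
// Returns a copy of the frame as it then sits in memory.
toy_hw_frame toy_push_frame(uint64_t* stack_top, uint64_t ss, uint64_t rsp,
                            uint64_t rflags, uint64_t cs, uint64_t rip) {
    uint64_t* sp = stack_top;
    *--sp = ss;       // pushed first, so it ends up at the highest address
    *--sp = rsp;
    *--sp = rflags;
    *--sp = cs;
    *--sp = rip;      // pushed last, so it ends up at the lowest address
    toy_hw_frame f;
    memcpy(&f, sp, sizeof(f));
    return f;
}
```

Because the stack grows downward, the last value pushed (the old %rip) has the lowest address, which is why reg_rip comes first among these fields. The kernel's own pushq instructions then continue below it to fill in the rest of the structure.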
Example from WeensyOS
// Top of the kernel stack
#define KERNEL_STACK_TOP 0x80000
static uint64_t gdt_segments[7];
static x86_64_taskstate taskstate;
...
// IDT
for (int i = 0; i < 256; ++i) {
uintptr_t handler_function = interrupt_descriptors[i].gd_low;
set_gate(&interrupt_descriptors[i], handler_function,
X86GATE_INTERRUPT, i == INT_BP ? 3 : 0, 0);
}
x86_64_pseudodescriptor idt;
idt.limit = sizeof(interrupt_descriptors) - 1;
idt.base = (uint64_t) interrupt_descriptors;
// GDT, TSS
memset(&taskstate, 0, sizeof(taskstate));
taskstate.ts_rsp[0] = KERNEL_STACK_TOP; // address to store exception-time registers
set_app_segment(&gdt_segments[0], X86SEG_X | X86SEG_L, 0); ...
set_sys_segment(&gdt_segments[0x28 >> 3], (uintptr_t) &taskstate, sizeof(taskstate), X86SEG_TSS, 0);
x86_64_pseudodescriptor gdt;
gdt.limit = sizeof(gdt_segments) - 1;
gdt.base = (uint64_t) gdt_segments;
// install
asm volatile("lgdt [&gdt.limit]; ltr $0x28; lidt [&idt.limit]");
- But WeensyOS is a uniprocessor operating system…
Multiprocessor exceptions
- In a multicore/multiprocessor computer, software is running on multiple cores, simultaneously and independently
- What happens if exceptions occur on different cores at the same time?
- Must configure per-CPU locations to store exception-time state!
struct cpustate
- Chickadee defines one struct cpustate per supported CPU, up to 16
- Each cpustate has its own GDT, TSS, and stack area for saving exception-time state
- cpustates are stored in a global array, cpus
struct __attribute__((aligned(4096))) cpustate {
...
uint64_t gdt_segments_[7];
x86_64_taskstate taskstate_;
};
// GDT, TSS
memset(&taskstate_, 0, sizeof(taskstate_)); // was global
taskstate_.ts_rsp[0] = (uintptr_t) this + CPUSTACK_SIZE; // was KERNEL_STACK_TOP
set_app_segment(&gdt_segments_[0], X86SEG_X | X86SEG_L, 0); ...
set_sys_segment(&gdt_segments_[0x28 >> 3], (uintptr_t) &taskstate_, sizeof(taskstate_), X86SEG_TSS, 0);
x86_64_pseudodescriptor gdt;
gdt.limit = sizeof(gdt_segments_) - 1;
gdt.base = (uint64_t) gdt_segments_;
// install
asm volatile("lgdt [&gdt.limit]; ltr $0x28; lidt [&idt.limit]");
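A simplified C sketch of why per-CPU structures keep exception-time state separate. It is not Chickadee's definition: toy_cpustate has only placeholder fields, NCPU and the 4096-byte CPUSTACK_SIZE are assumptions for this example, and the aligned attribute is a GCC/Clang extension as in the original snippet.

```c
#include <assert.h>
#include <stdint.h>

#define NCPU 16              // assumed maximum number of CPUs
#define CPUSTACK_SIZE 4096   // assumed size of each per-CPU block

// Each structure is padded and aligned to 4096 bytes; the space after the
// fields stands in for that CPU's private exception stack.
struct __attribute__((aligned(4096))) toy_cpustate {
    int cpuindex;
    uint64_t gdt_segments[7];
};

static struct toy_cpustate cpus[NCPU];

// Address at which CPU `i` begins saving exception-time registers:
// the top of its own block (the stack grows down from there).
uintptr_t exception_stack_top(int i) {
    assert(i >= 0 && i < NCPU);
    return (uintptr_t) &cpus[i] + CPUSTACK_SIZE;
}
```

Since each CPU's stack top lies inside (at the end of) its own block, two CPUs taking exceptions at the same moment write to disjoint memory.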
How can code tell which CPU took the exception??
- When the kernel starts running because of an exception, how can it tell
which CPU it’s running on?
- CPU architectures might make different choices!
- Different interrupt handler instructions per CPU? (E.g., CPU \(n\)’s interrupt handler is at \(\texttt{0x...8010202d} + \texttt{0x100000}\times n\))
- Instruction that returns the CPU index? (There is such an instruction, but it’s slow)
- x86-64 offers a privileged register, KERNEL_GSBASE, that only the kernel can change
- Chickadee sets each CPU’s KERNEL_GSBASE to point at the corresponding cpustate
- In assembly, an address like %gs:(0) is interpreted relative to KERNEL_GSBASE, and therefore relative to cpustate
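The effect of a per-CPU base register can be modeled in ordinary C. In this sketch the gs_base argument stands in for the register, and toy_percpu and gs_read64 are invented for illustration; real kernel code would use %gs-prefixed instructions rather than a function call.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical per-CPU structure; the first field records which CPU it is.
struct toy_percpu {
    uint64_t cpuindex;
    uint64_t nexceptions;
};

// Model of `movq %gs:(offset), %reg`: read 8 bytes at gs_base + offset.
uint64_t gs_read64(const void* gs_base, size_t offset) {
    uint64_t v;
    memcpy(&v, (const char*) gs_base + offset, sizeof(v));
    return v;
}
```

The same code, run with a different base, reads a different CPU's structure: if every core's base points at its own structure, reading offset 0 tells the exception handler which CPU it is running on without any per-CPU copies of the handler.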