Lecture 2: Isolation and exceptions

Memory and isolation

We need to talk about memory, registers*, and process isolation!
*Registers are a kind of memory

Registers

The fastest storage medium available to software
Kilobytes; less than a nanosecond to access; SRAM technology
Named, not numbered! On x86-64, %rax, %rbx, %rsp, %r9, %cr3…
One set of registers per CPU core (or hyperthread)
Registers on one core are inaccessible to others
The registers accessed by an instruction are encoded into the instruction

Example:

0:  89 d8         movl %ebx, %eax
2:  89 d1         movl %edx, %ecx
4:  48 89 d8      movq %rbx, %rax
7:  4c 89 c8      movq %r9, %rax
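
To make "encoded into the instruction" concrete, here is a minimal sketch that pulls the register numbers out of a register-to-register `mov` with opcode `89`. The names `decode_mov_89`, `mov_operands`, and `reg64_name` are invented for this illustration; the sketch handles only this one opcode and is not a general decoder.

```c
#include <assert.h>
#include <stdint.h>

// 64-bit register names indexed by 4-bit register number.
// The same numbers select the 32-bit halves (%eax, %ecx, ...) when
// the instruction has no REX.W prefix bit.
static const char* const reg64_names[16] = {
    "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
    "r8",  "r9",  "r10", "r11", "r12", "r13", "r14", "r15"
};

struct mov_operands {
    int src;   // register number from the ModRM `reg` field
    int dst;   // register number from the ModRM `rm` field
};

// Decode the operands of opcode 0x89 (store register into register/memory).
// `rex` is the REX prefix byte, or 0 if the instruction has none.
// Returns src = dst = -1 when the ModRM byte names a memory operand.
struct mov_operands decode_mov_89(uint8_t rex, uint8_t modrm) {
    struct mov_operands ops = { -1, -1 };
    if ((modrm >> 6) != 3) {       // mod != 0b11 means `rm` is memory
        return ops;
    }
    ops.src = (modrm >> 3) & 7;
    ops.dst = modrm & 7;
    if (rex & 0x04) {              // REX.R extends the `reg` field
        ops.src += 8;
    }
    if (rex & 0x01) {              // REX.B extends the `rm` field
        ops.dst += 8;
    }
    return ops;
}

const char* reg64_name(int regno) {
    assert(regno >= 0 && regno < 16);
    return reg64_names[regno];
}
```

For the bytes `89 d8`, the ModRM byte `d8` gives source register 3 and destination register 0. For `4c 89 c8`, the REX prefix `4c` sets REX.R, so the source becomes register 9 (`r9`).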

Memory (aka primary memory)

The next fastest storage medium available to software
Gigabytes; tens of nanoseconds to access; DRAM technology
CPU caches speed this up
Numbered addresses; on x86-64, physical addresses range over \([0, 2^{52})\) (or less, depending on CPU model)
Unified view across all CPU cores: any core can access any physical address
Address accessed by an instruction can be encoded into the instruction or indirect, i.e., the address depends on machine state

Example:

# uint64_t* ptr = (uint64_t*) 0x1000; *ptr = %rax
0: 48 89 04 25 00 10 00 00    movq %rax, 0x1000
# uint64_t* ptr = (uint64_t*) %rsp; *ptr = %rax
8: 48 89 04 24                movq %rax, (%rsp)

CPU instructions are read from primary memory

Process isolation

The kernel is the software that runs with full machine privilege over all computer resources (modulo virtual machine monitors…)
The kernel safely shares machine resources with unprivileged processes, according to OS policy
Process isolation means processes cannot trick the kernel into violating these policies
- Processes can’t examine other processes’ internal state unless allowed
- Processes can’t lock out other code from running, maliciously or accidentally
Kernel isolation is a necessary part of process isolation: unprivileged processes mustn’t be able to trick the kernel into running arbitrary code in privileged mode
Which means the kernel’s instructions must be in primary memory processes cannot modify
If OS policy involves any secrets at all (almost always it does!), then kernel instructions and data must be in primary memory processes cannot read

Safe sharing

So process software runs without full access to machine resources
Kernel ensures this
But how?
- Virtualization
- Each process runs in a kind of virtual computer, where the process appears to have access to machine resources, but is prevented from messing things up too badly
- Virtualization can be implemented entirely in software (interpretation), or with hardware support
- Way faster to use hardware support!

Hardware virtualization

Software accesses registers, memory, and the CPU
- Sure, other devices too, sometimes
These are the most important resources for hardware virtualization support

Dangerous registers, dangerous instructions

A CPU register defines the current privilege level
- On x86-64, lower two bits of %cs
- CPL 0 is full machine privilege
- CPL 1–3 is unprivileged (most OSes use only CPL 3)
Registers that would violate process isolation can only be accessed at CPL 0
Instructions that could violate process isolation can only be executed at CPL 0
Modern processors also support an even more privileged level, sometimes described as “CPL -1”, for virtual machine monitors; some instructions can only be executed at that privilege level
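
A minimal sketch of the privilege check described above, modeled in C. The helper names are made up for illustration, and the selector values used in the note after the code (`0x08`, `0x1b`) are just example values, not taken from any particular OS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Current privilege level: the low two bits of the %cs selector value.
static inline unsigned cpl_of(uint16_t cs) {
    return cs & 3;
}

// Model of the hardware rule: dangerous registers and instructions
// are available only at CPL 0.
static inline bool privileged_access_allowed(uint16_t cs) {
    return cpl_of(cs) == 0;
}
```

With these definitions, a `%cs` value of `0x08` has CPL 0 and passes the check, while `0x1b` (that is, `0x18 | 3`) has CPL 3 and fails it.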

Wait a minute!

Process isolation is an operating system policy!
How can hardware know what isolation policy an OS will enforce?
It doesn’t
But most modern operating systems have similar policies
- An unprivileged process shouldn’t be able to monopolize any resource or deny it to other processes
If an operating system has a different policy, it can implement it by adding more virtualization techniques

Memory protection

Some regions of memory can be accessed only at CPL 0

Time protection

Software running at CPL 0 can run indefinitely
Software running at other CPLs cannot run indefinitely

Where are the cops?

Computer software runs in a low-trust environment
Whenever you see a statement like “cannot run indefinitely” or “can only be accessed at CPL 0”, ask “or what?”
- What will the hardware do if a process runs an infinite loop, or tries to execute a dangerous instruction?

Exceptions

Exceptional control flow (traps, faults, interrupts)
The hardware validates dangerous operations
- Processors validate dangerous instructions
- Timer keeps track of time
If unprivileged software does something illegal, hardware stops executing the unprivileged software and runs the kernel instead
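
One way to picture this is a toy "CPU" written as an interpreter. The sketch below is a software model, not how real hardware is built: the instruction set (`TOY_NOP`, `TOY_PRIVILEGED`, ...) and the function `toy_run` are invented for illustration. Unprivileged code runs until it halts, tries a privileged instruction, or uses up its timer budget; in the last two cases control returns to the caller, which plays the role of the kernel.

```c
#include <assert.h>
#include <stddef.h>

enum toy_insn { TOY_NOP, TOY_PRIVILEGED, TOY_JUMP_TO_START, TOY_HALT };
enum toy_exit { TOY_EXIT_HALTED, TOY_EXIT_FAULT, TOY_EXIT_TIMER };

// Run unprivileged toy code. `timer_ticks` models a timer that fires
// after that many instructions have executed.
enum toy_exit toy_run(const enum toy_insn* code, size_t ncode,
                      unsigned timer_ticks) {
    assert(ncode > 0);
    size_t pc = 0;
    while (1) {
        if (timer_ticks == 0) {
            return TOY_EXIT_TIMER;      // timer interrupt: kernel regains control
        }
        --timer_ticks;
        if (pc >= ncode) {
            return TOY_EXIT_FAULT;      // ran past the end of the code
        }
        switch (code[pc]) {
        case TOY_NOP:
            ++pc;
            break;
        case TOY_PRIVILEGED:
            return TOY_EXIT_FAULT;      // illegal at this privilege level
        case TOY_JUMP_TO_START:
            pc = 0;                     // lets a program loop forever
            break;
        case TOY_HALT:
            return TOY_EXIT_HALTED;
        }
    }
}
```

An infinite loop such as `{ TOY_NOP, TOY_JUMP_TO_START }` ends with `TOY_EXIT_TIMER` rather than running forever, which is the "or what?" answer for time protection in this model.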

Exceptions and virtual computers

But the OS must determine how exceptions are handled
- OS policy decides how the virtual computer that runs processes should behave
Maybe the hardware and the OS disagree on how illegal something is
So what happens to the processor’s state on an exception?
- Especially registers, which the kernel will definitely need!
CPL is set to privileged mode
Instruction pointer %rip is set to a kernel instruction
How are the new mode and kernel instruction configured?
What happens to the old CPL and %rip?

Processor configuration

Privileged system registers configure exceptions
- The global descriptor table (GDT) and interrupt descriptor table (IDT)
IDT defines entry points for every possible exception
- What kernel instruction will start handling the exception? (The entry point)
- Can the exception be invoked in software, like the debug breakpoint int3?
- What CPL will be used for handling the exception? (by reference to GDT)
- …
GDT defines available privilege modes
GDT also defines a task state segment
- Preconfigured location at which exception-time registers are saved (e.g., old CPL, old %rip)
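
To show how one IDT entry can answer those questions, here is a sketch of the 16-byte x86-64 gate layout, stored as two 64-bit words. The names `gate_sketch`, `make_gate_sketch`, `gate_entry`, and `gate_dpl` are hypothetical helpers for this illustration; they are not the `set_gate` function used by the course code, and the sketch ignores the interrupt stack table field.

```c
#include <assert.h>
#include <stdint.h>

// A 16-byte x86-64 IDT gate as two 64-bit words.
struct gate_sketch {
    uint64_t gd_low;
    uint64_t gd_high;
};

#define GATE_TYPE_INTERRUPT 0xEULL   // 64-bit interrupt gate type

// `entry` is the kernel entry point, `selector` names a code segment in
// the GDT (which determines the CPL used by the handler), and `dpl` is
// the least privileged CPL allowed to invoke the gate from software.
struct gate_sketch make_gate_sketch(uint64_t entry, uint16_t selector,
                                    unsigned dpl) {
    assert(dpl <= 3);
    struct gate_sketch g;
    g.gd_low = (entry & 0xFFFFULL)              // entry bits 0-15
        | ((uint64_t) selector << 16)           // code segment selector
        | (GATE_TYPE_INTERRUPT << 40)           // gate type
        | ((uint64_t) dpl << 45)                // descriptor privilege level
        | (1ULL << 47)                          // present bit
        | ((entry & 0xFFFF0000ULL) << 32);      // entry bits 16-31
    g.gd_high = entry >> 32;                    // entry bits 32-63
    return g;
}

// Reassemble the entry point from its three pieces.
uint64_t gate_entry(struct gate_sketch g) {
    return (g.gd_low & 0xFFFFULL)
        | ((g.gd_low >> 32) & 0xFFFF0000ULL)
        | (g.gd_high << 32);
}

unsigned gate_dpl(struct gate_sketch g) {
    return (g.gd_low >> 45) & 3;
}
```

A gate built with `dpl` 3 can be invoked by unprivileged software (as with the breakpoint exception); a gate with `dpl` 0 can be reached from unprivileged code only when the hardware itself raises the exception.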

Registers saved by hardware during an exception

// end of struct regstate:
    uint64_t reg_rip;        // instruction pointer
    uint64_t reg_cs;         // CPL
    uint64_t reg_rflags;     // flags (including privilege flags)
    uint64_t reg_rsp;        // stack pointer
    uint64_t reg_ss;

Hardware aims to save as few registers as possible (for efficiency)
Kernel software is responsible for saving everything else
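
The following sketch models the hardware's part of the job in ordinary C: five values are pushed onto a downward-growing stack, and the result has the same layout as the struct fields listed above. The names `hw_frame_sketch`, `push_hw_frame`, and `read_hw_frame` are invented for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// The five hardware-saved values, in increasing address order.
struct hw_frame_sketch {
    uint64_t reg_rip;
    uint64_t reg_cs;
    uint64_t reg_rflags;
    uint64_t reg_rsp;
    uint64_t reg_ss;
};

// Model the hardware's pushes. `stack_top` points one past the highest
// usable stack word; the stack grows toward lower addresses.
// Returns the new stack pointer, which points at the saved %rip.
uint64_t* push_hw_frame(uint64_t* stack_top, uint64_t ss, uint64_t rsp,
                        uint64_t rflags, uint64_t cs, uint64_t rip) {
    assert(stack_top != NULL);
    uint64_t* sp = stack_top;
    *--sp = ss;
    *--sp = rsp;
    *--sp = rflags;
    *--sp = cs;
    *--sp = rip;
    return sp;
}

// Copy the frame at `sp` into a struct so it can be read by field name.
struct hw_frame_sketch read_hw_frame(const uint64_t* sp) {
    struct hw_frame_sketch f;
    memcpy(&f, sp, sizeof(f));
    return f;
}
```

Because the last value pushed lands at the lowest address, `reg_rip` comes first in the struct and `reg_ss` last. Each further `pushq` by kernel software extends the saved state toward lower addresses in the same way.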

// k-exception.S:
exception_entry_3:
    pushq $0
    pushq $3
    jmp exception_entry
exception_entry:
    ...
    push %gs
    push %fs
    pushq %r15
    pushq %r14
    ...
    pushq %rdx
    pushq %rcx
    pushq %rax

There’s more than one way to do it: in ARM, each exception level has its own special registers used only for saving exception-time state (e.g., ELR_EL1, SPSR_EL1)

Example from WeensyOS

// Top of the kernel stack
#define KERNEL_STACK_TOP        0x80000
static uint64_t gdt_segments[7];
static x86_64_taskstate taskstate;
...

    // IDT
    for (int i = 0; i < 256; ++i) {
        uintptr_t handler_function = interrupt_descriptors[i].gd_low;
        set_gate(&interrupt_descriptors[i], handler_function,
                 X86GATE_INTERRUPT, i == INT_BP ? 3 : 0, 0);
    }
    x86_64_pseudodescriptor idt;
    idt.limit = sizeof(interrupt_descriptors) - 1;
    idt.base = (uint64_t) interrupt_descriptors;

    // GDT, TSS
    memset(&taskstate, 0, sizeof(taskstate));
    taskstate.ts_rsp[0] = KERNEL_STACK_TOP; // address to store exception-time registers

    set_app_segment(&gdt_segments[0], X86SEG_X | X86SEG_L, 0); ...
    set_sys_segment(&gdt_segments[0x28 >> 3], (uintptr_t) &taskstate, sizeof(taskstate), X86SEG_TSS, 0);
    x86_64_pseudodescriptor gdt;
    gdt.limit = sizeof(gdt_segments) - 1;
    gdt.base = (uint64_t) gdt_segments;

    // install
    asm volatile("lgdt %0; ltr %1; lidt %2"
                 : : "m" (gdt.limit), "r" ((uint16_t) 0x28), "m" (idt.limit)
                 : "memory");
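
The `limit` and `base` assignments above fill in the operand that `lgdt` and `lidt` read from memory: a 16-bit limit followed by a 64-bit base address. A minimal sketch of that layout follows. The type `pseudodescriptor_sketch` and the helper `describe_table` are hypothetical names for this illustration, and `__attribute__((packed))` assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Memory operand for lgdt/lidt in 64-bit mode.
struct __attribute__((packed)) pseudodescriptor_sketch {
    uint16_t limit;   // size of the table in bytes, minus one
    uint64_t base;    // address of the table's first byte
};

// Describe a table that occupies `size` bytes starting at `table`.
struct pseudodescriptor_sketch describe_table(const void* table, size_t size) {
    assert(size >= 1 && size <= 0x10000);
    struct pseudodescriptor_sketch pd;
    pd.limit = (uint16_t) (size - 1);
    pd.base = (uint64_t) (uintptr_t) table;
    return pd;
}
```

The limit is the offset of the table's last valid byte, which is why the code subtracts one from `sizeof`. Without the packed attribute the compiler would pad the struct and `base` would no longer sit right after `limit`.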

But WeensyOS is a uniprocessor operating system…

Multiprocessor exceptions

In a multicore/multiprocessor computer, software is running on multiple cores, simultaneously and independently
What happens if exceptions occur on different cores at the same time?
Must configure per-CPU locations to store exception-time state!

`struct cpustate`

Chickadee defines one struct cpustate per supported CPU, up to 16
Each cpustate has its own GDT, TSS, and stack area for saving exception-time state
cpustates are stored in a global array, cpus

struct __attribute__((aligned(4096))) cpustate {
    ...
    uint64_t gdt_segments_[7];
    x86_64_taskstate taskstate_;
};

    // GDT, TSS
    memset(&taskstate_, 0, sizeof(taskstate_)); // was global
    taskstate_.ts_rsp[0] = (uintptr_t) this + CPUSTACK_SIZE; // was KERNEL_STACK_TOP

    set_app_segment(&gdt_segments_[0], X86SEG_X | X86SEG_L, 0); ...
    set_sys_segment(&gdt_segments_[0x28 >> 3], (uintptr_t) &taskstate_, sizeof(taskstate_), X86SEG_TSS, 0);
    x86_64_pseudodescriptor gdt;
    gdt.limit = sizeof(gdt_segments_) - 1;
    gdt.base = (uint64_t) gdt_segments_;

    // install
    asm volatile("lgdt %0; ltr %1; lidt %2"
                 : : "m" (gdt.limit), "r" ((uint16_t) 0x28), "m" (idt.limit)
                 : "memory");
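
A small user-space model of the per-CPU layout, to show why exceptions on different cores cannot overwrite each other's saved state. Everything with a `_sketch` or `_SKETCH` suffix is invented for this illustration; in particular the stack size of 4096 is an assumption, and `ts_rsp0` stands in for `taskstate_.ts_rsp[0]`.

```c
#include <assert.h>
#include <stdint.h>

#define NCPU_SKETCH 16              // "up to 16" CPUs
#define CPUSTACK_SIZE_SKETCH 4096   // assumed size for this sketch

// One page-aligned, page-sized block per CPU. Descriptor tables sit at
// the low end; the exception stack grows down from the high end.
struct __attribute__((aligned(4096))) cpustate_sketch {
    uint64_t gdt_segments[7];
    uint64_t ts_rsp0;
    unsigned char rest[CPUSTACK_SIZE_SKETCH - 8 * sizeof(uint64_t)];
};

static struct cpustate_sketch cpus_sketch[NCPU_SKETCH];

// Point this CPU's exception stack at the top of its own block.
void init_cpu_sketch(int cpu) {
    assert(cpu >= 0 && cpu < NCPU_SKETCH);
    struct cpustate_sketch* c = &cpus_sketch[cpu];
    c->ts_rsp0 = (uintptr_t) c + CPUSTACK_SIZE_SKETCH;
}
```

After initialization, each CPU's stack top is the end of its own block, so consecutive CPUs have stack tops exactly one block apart and state pushed on one CPU stays inside that CPU's block (as long as the stack does not overflow).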

How can code tell which CPU took the exception??

When the kernel starts running because of an exception, how can it tell which CPU it’s running on?
- CPU architectures might make different choices!
- Different interrupt handler instructions per CPU? (E.g., CPU \(n\)’s interrupt handler is at \(\texttt{0x...8010202d} + \texttt{0x100000}\times n\))
- Instruction that returns the CPU index? (There is such an instruction, but it’s slow)
x86-64 offers a privileged register, KERNEL_GSBASE, that only the kernel can change
Chickadee sets each CPU’s KERNEL_GSBASE to point at the corresponding cpustate
In assembly, an address like %gs:(0) is interpreted relative to KERNEL_GSBASE, and therefore relative to cpustate
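
The idea can be modeled in plain C. Real kernel code would use a `%gs`-relative load; here a per-core variable stands in for the register, and a helper computes "base plus offset" by hand. All names (`percpu_sketch`, `core_sketch`, `gs_load_int`, `this_cpu_index`) are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Stand-in for a per-CPU state structure.
struct percpu_sketch {
    uint64_t gdt_segments[7];
    int cpuindex;
};

// Stand-in for one core's private KERNEL_GSBASE register.
struct core_sketch {
    uintptr_t kernel_gsbase;
};

// Model an int-sized load from "%gs:(offset)" executed on `core`.
int gs_load_int(const struct core_sketch* core, size_t offset) {
    assert(core->kernel_gsbase != 0);
    int v;
    memcpy(&v, (const char*) core->kernel_gsbase + offset, sizeof(v));
    return v;
}

// The same code runs on every core, yet returns a different answer on
// each, because each core's base register points at a different struct.
int this_cpu_index(const struct core_sketch* core) {
    return gs_load_int(core, offsetof(struct percpu_sketch, cpuindex));
}
```

The design point is that the exception handler's instructions can be identical on all CPUs: the only per-CPU input is the base register, which unprivileged code cannot change.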
