Problem set 2 – CS 161 2019

In each pset, we will release changes to the baseline Chickadee repository with new functionality and tests. You should merge your code with ours. The easiest way:

git pull https://github.com/cs161/chickadee master

You will then have to fix up any conflicts. Follow Git’s instructions and commit the final result.

A. Kernel buddy allocator

The existing Chickadee allocator can only allocate memory in units of single pages, and it cannot free memory. That’s bad!

Freeing memory is not too hard to support; a singly-linked free list of pages could work. But we seek an efficient allocator for variable-sized allocations. For instance, Chickadee kernel stacks are currently limited to a couple thousand bytes because struct procs are allocated in units of single pages: a buddy allocator would let us support deeper stacks. And some later hardware structures, such as the structure for disk communication, require larger amounts of contiguous memory.

In this part, you will therefore implement a buddy allocator for kernel memory. Complete the following functions in k-alloc.cc:

void* kalloc(size_t sz)
void kfree(void* ptr)
void init_kalloc()

kalloc and kfree are the malloc and free equivalents. init_kalloc() should perform any initialization necessary so that the allocator has access to all available physical memory (i.e., physical memory with type mem_available in physical_ranges).

Buddy allocation

The buddy allocation algorithm is one of the oldest general algorithms for memory allocation. (It was invented in 1963!) Buddy allocation supports efficient splitting (breaking large contiguous blocks of free memory into smaller pieces) and coalescing (merging adjacent free blocks into larger contiguous blocks).

Buddy allocation is based on powers of 2 called orders. An allocation request of size \(s\) uses a block of size \(2^o\) and order \(o\), where

\[o = \left\lceil \log_2 s \right\rceil.\]

For example, an allocation request of size 4095 will use a block of size 4096 and order 12, but an allocation request of size 4097 will use a block of size 8192 and order 13. The physical address of a block of order \(o\) is always a multiple of \(2^o\).

Allocation splits large blocks until there’s a free block with the right size.

Let o be the desired order for the allocation.
Find a free block with order o. If found, return that block.
Otherwise, find a free block with order x > o, minimizing x. If found:
1. Split the block in two, creating two free blocks with order x-1. These two blocks are called order-(x-1) buddies because they are adjacent blocks with order x-1, and the address of one is easily calculated from the address of the other.
2. Repeat step 2.
If there’s no larger free block, the allocation fails: return nullptr.

Freeing coalesces freed buddies whenever possible. This can create large contiguous free blocks very efficiently.

Let o be the order of the freed block.
Check whether the freed block’s order-o buddy is also completely free. If it is:
1. Coalesce the freed block and the buddy into a single free block of order o + 1.
2. Repeat step 2.

Buddy allocators have low external fragmentation (unusable space between allocated blocks) because they aggressively coalesce freed memory into larger blocks. Their internal fragmentation (unused space inside allocated blocks) is at most 50%. Many real-world allocators use buddy allocation, often coupled with other techniques (for, e.g., efficient multicore support); for instance, both Linux and jemalloc use buddy allocation.

Designing your allocator

Your buddy allocator will use two constant parameters, min_order and max_order, which are the minimum and maximum orders the allocator supports. Your allocator will have min_order 12, so every allocation request will use at least a page of memory. (Note that this increases the maximum internal fragmentation.) It should have max_order at least 21, allowing it to support contiguous allocations of up to 2MiB.

Your allocator will maintain at least the following state.

An array that tracks the state of every physical page in the system. Specifically, for physical page number p (i.e., the page with physical address p * PAGESIZE) that is the beginning of an allocated or free block, pages[p] should say:
- Whether the page is free or not.
- The order of the block.
One free list per order.
- The list heads will be global.
- You will also need a way to link free blocks together.

The pages array lets you quickly check whether a buddy is free. The free lists lets you quickly find a free block of a given order. Both these operations are O(1), and there are a constant number of orders, so both kalloc and kfree should take O(1) time.

You own the definition of the page structure and the pages array (and you can call the array something else if you want). You will add more information to the array later.

Allocation and sanitizers

Integrate your buddy allocator with Chickadee’s sanitizer support. This will help the sanitizers warn on bugs like double-free and wild-write errors.

Upon allocating B bytes of memory at physical address pa, your kalloc function should call asan_mark_memory(pa, B, false). This tells the sanitizer that B bytes starting at physical address pa are valid to access.
Upon freeing a block with order o at physical address pa, your kfree function should call asan_mark_memory(pa, 1 << o, true). This tells the sanitizer to poison the freed block: accesses to its memory will cause warnings.

The asan_mark_memory function calls will do nothing unless sanitization is enabled.

Testing

Buddy allocation has tricky corner cases, and a testing strategy will be important for success. We expect you to write tests for your buddy allocator as assertions in kalloc and kfree, and/or in the test_kalloc function. Test failures should appear as assertion failures that panic the kernel.

A successful testing strategy has two parts: make bugs detectable, and make them frequent. Your code should make bugs easy to detect (it works best if you design your code with testing in mind), and your tests should be so comprehensive that even rare bugs are likely to occur during testing.

The classic way to make bugs detectable involves invariants and assertions. An invariant is a property that should always hold of correct code; an assertion is an executable statement that checks an invariant. Buddy allocators have many useful invariants; for instance, if an order-o block is free, then either o == max_order or the block’s order-o buddy is wholly or partially allocated. Assertions of this invariant in the kalloc code can potentially detect many bugs. Those assertions can work locally, a bit at a time, or globally, considering all available data. For instance, a local assertion about free blocks might run once, in kfree, to check that the freed block meets the invarant after block coalescing; a global assertion about free blocks might check every free block in the system by searching free lists. Local invariants are typically more efficient to check, and the best ones are so efficient that there’s no harm leaving them on in production builds. Global invariants may be many times more expensive, but they typically catch more bugs.

A good way to make bugs frequent is to use random testing, such as a long sequence of random allocations and frees. Random testing requires art, however: the tester must not do anything illegal (such as a double-free), and naive random testing can “get stuck” in conceptually-uninteresting states without exploring corner cases.

The test_kalloc function runs as part of the boot sequence when you give the SELFCHECK=1 argument to make.

You may be interested in several testing- and assertion-related blog posts by John Regehr of the University of Utah:

Hints

There are some useful math functions in lib.hh. You might especially like msb(x), which returns the index of the most significant 1 bit in x plus one. (For example, msb(1) == 1; msb(4095) == 12; msb(4096) == 13. As a special case, msb(0) == 0.) There are also round_up_pow2 and round_down_pow2.

There is also a generic doubly linked list type in k-list.hh. This would be useful for building free lists; or you can build your own.

Make sure you add assertions to your code. You’ll probably also want some helper functions, such as a function that returns a block’s order-o buddy, and a function that confirms that a physical address is valid as an order-o block.

Slab allocation (advanced optional work)

This style of buddy allocator is great for large objects, but because it relies on a pages array with information about each min_order block, it doesn’t scale to small allocation sizes (e.g., 64-byte allocations). For smaller allocations, most systems switch to another allocation strategy, such as slab allocation. Design and implement such an allocator.

B. Exit

Implement the sys_exit system call.

Preparation

Change process_setup to map the console at address 0xB8000 (i.e., ktext2pa(console)) in the initial process by default.

Implementing exit

Your initial version of exit should do the following:

Remove the current process from the process table.
Free all memory associated with the current process (which will exercise your buddy allocator from Part A).

When you’re done, make run-allocexit should show an endless explosion of different colors as new processes are created and old processes exit (freeing memory). There should be no leaks. (A leak will appear as a “free” physical page—a period in the physical page map—that is never reused.)

To make exit work in Chickadee, you’ll need to obey several invariants.

The process table ptable is protected by the ptable_lock.
The current stack page must not be free.

This means that a kernel task, such as the kernel half of a process, cannot fully free itself! (When the kernel is running on behalf of a process, the kernel’s stack pointer is in that process’s stack page.) The kernel task responsible for the exiting process must delegate its final freeing to some other logical thread of execution in the kernel. You could delay this final freeing until you implement Part E (sys_waitpid), but if you delay, run-allocexit will create at most 15 processes, and you’ll miss some potential bugs until Part E is done. Alternately, there is a kernel function that runs on its own, not on behalf of any kernel task, and thus on a different kind of stack. You can change that function to wipe out exited processes!
The current page table must not be free.
The memviewer examines process page tables, so synchronization is required when freeing process page tables. You can accomplish this one of three ways:
- The memviewer holds the ptable_lock, so you may free page tables while holding the ptable_lock.
- The memviewer only examines processes in ptable, so you may free the page table of a process that is no longer in ptable.
- It is safe to free page tables if the process’s “page table lock” is held. Our handout code has stub functions for implementing a page table lock; you’d need to actually implement them.
Note that adding mappings to a page table works fine, even concurrently with the memviewer. It’s just erasing mappings and especially freeing page tables that cause problems.

And you’ll probably need to change your code for sys_fork, and possibly other code, to avoid leaks.

C. Sleep

Implement the sys_msleep system call, which sleeps (i.e., does not return to its caller) until at least N milliseconds have passed.

For now, sys_msleep should return 0. In a future part, you will add support for system call interruption, and sys_msleep will return the error constant E_INTR if it is interrupted.

To add the system call, you’ll add a system call number to lib.hh, complete the user-level wrapper in u-lib.hh, and add an implementation case to proc::syscall. The kernel implementation will use the kernel’s ticks variable, which is a global variable incremented by the timer interrupt handler every 0.01 seconds. (So you’ll want to round the sleep argument up to the nearest multiple of 10 milliseconds.)

Use make run-testmsleep to test your work.

D. Parent processes

Implement parent processes and the sys_getppid system call.

In Unix, every process has a parent; the getppid system call returns the parent’s process ID. If a parent exits before a child, the child is reparented to have parent process ID 1. Process ID 1 is special: that process, called the init process, corresponds to an unkillable kernel task. Chickadee follows this design.

First implement the easy cases. These can be completed before you figure out performance and synchronization.

Add a parent process ID member to struct proc.
Add the system call: add the system call number to lib.hh, complete the user-level wrapper in u-lib.hh, and add an implementation case to proc::syscall.
Maintain parent process IDs correctly in fork.

Use make run-testppid to test your work. You should succeed through the “tests without exit succeeded” line. (The test program depends on a working sys_msleep implementation from Part C, though the rest of the code in this part is independent of sys_msleep.)

Implementation note: As you add content to struct proc, you may need to update your pset 1 stack canary code, especially if you hard-wired offset constants into assembly. Try to keep your stack canaries working as you go through the problem sets; they have caught real bugs in prior students’ code!

Then, implement exit and process reparenting.

Add a kernel initialization step that creates the init process. Ensure it has pid 1 and ppid 1. Its initial code can be a loop that calls proc::yield().
Change kernel_start so that the initial user process has pid 2 and ppid 1.
Maintain parent process IDs correctly in exit. (This is the hard case.)

Write up in pset2answers.md:

The design of your per-process metadata (see Performance, below).
Your synchronization invariants or other synchronization plan (see Synchronization, below).

Performance

When a process exits, the kernel must reparent its children. This should take \(O(C)\) time for a process with \(C\) children (rather than, say, \(O(P)\) time, where \(P\) is the total number of processes in the system).

Implementing \(O(C)\) exit will require tracking additional per-process metadata, which your writeup should describe.

Synchronization

Chickadee processes so far have been quite isolated: no action performed by one running process ever accessed another running process’s state. Here is a visualization of kernel programming under these conditions [1]:

Starting in this portion, actions taken by one running process can affect other running processes. And Chickadee is a multiprocessor kernel, so these processes might be running simultaneously, in kernel mode, on different CPUs. What happens if one process exits while one of its children is calling sys_getppid? What happens if a process and one of its children are exiting simultaneously? It is incredibly important to implement correct synchronization or the kernel could break. Here is a visualization of kernel programming under these conditions [2]:

Correct synchronization requires planning ahead. A good approach is to write down synchronization invariants: what locks are required to examine or modify particular data items? Chickadee already has a couple synchronization invariants; you will now start adding your own.

The simplest invariants are usually the most coarse-grained. (Early multicore operating systems used a giant lock approach, or “big kernel lock,” in which the entire kernel was protected by a single lock!) For instance, perhaps any change to the process hierarchy should require the ptable_lock. Finer-grained invariants are also possible when you reason about state. For example, a global process_hierarchy_lock could protect access to the process hierarchy (that is, process parent and child links). This is somewhat more scalable than using ptable_lock for everything since some accesses to ptable don’t require the process_hierarchy_lock, and some accesses to the process hierarchy don’t require the ptable_lock. Or you can have a lock per process, if you can make it work. (For what it’s worth, Linux uses a global “task list lock”, which is like a process_hierarchy_lock, but implements it as a readers-writer lock rather than a spinlock. The super-scable sv6 operating system uses a lock per process. Remember that correctness matters way more than performance for us.)

Describe your synchronization invariants in your writeup.

E. Wait and exit status

Implement the sys_waitpid system call and the exit status argument to sys_exit. Use the p-testwaitpid and p-testzombie programs to check your work.

Specification

Chickadee sys_waitpid implements a subset of the Unix waitpid behavior. It implements blocking and polling, exit status return, and selective waiting, but does not (yet) implement signal reporting, job control, or process groups.

pid_t sys_waitpid(pid_t pid, int* stat = nullptr, int options = 0)

pid names the child process to wait for. If pid == 0, then the system call waits for any child (it returns the first child to exit). If pid != 0 and none of this process’s children have that pid, or if pid == 0 and this process has no children at all, then the system call returns E_CHILD.
options determines whether the system call should block until a child exits. If it is 0, then the system call should block; if it is W_NOHANG, then the system call should return E_AGAIN rather than blocking if no relevant child has exited.
If the system call succeeds, then its return value is the process ID of the exited child, and (unless stat == nullptr) *stat is set to the child’s exit status (i.e., the argument given to sys_exit).
It is only possible to wait for a child’s status once. After sys_waitpid successfully reports a child’s exit status, the relevant PID becomes available for reuse. (The PID is not available for reuse before that point.)

Implementation notes

System calls and assembly functions can define their own calling conventions. We recommend that the kernel return both return value and exit status in one register (possible since both process IDs and statuses are 32-bit ints); the user-level system call wrapper in u-lib.hh should unpack that register and do the right things with its parts.

sys_waitpid will change the mechanism for freeing exited processes. Exited processes still become non-runnable, but their exit status must be preserved, and their PIDs not reused, until their parent calls sys_waitpid. You’ll need to add some stuff to struct proc. You’ll also be able to remove stuff from cpustate::schedule that you added in Part B (what?).

The p-testwaitpid program tests W_NOHANG functionality before it tests blocking functionality, so you can test a simple, nonblocking version of the system call first. The p-testzombie program also tests important sys_waitpid functionality.

Since waitpid involves cross-process communication, it requires synchronization. Your writeup should describe your synchronization invariants.

F. True blocking

Chickadee now has two blocking system calls, sys_msleep and sys_waitpid, but (unless you were real ambitious) neither of these system calls has a truly blocking implementation. User processes calling these system calls appear to block, but their corresponding kernel tasks remain runnable and repeatedly yield. This is easy to program but inefficient. It’s far better for the kernel tasks corresponding to blocked processes to become non-runnable.

The problem can be quantified. We added a counter to each process that counts how many times that process resumes. With a proc::yield–based implementation of sys_msleep, the p-testmsleep program results in more than 100,000 total resume events!

In this problem you will add true blocking to Chickadee, while avoiding the dreaded sleep-wakeup race condition, also known as the lost wakeup problem.

Lost wakeups in `sys_waitpid`

To understand lost wakeups, consider broken pseudocode like this for waitpid:

waitpid(proc* parent, ...) {
    ...
    while (/* no child is ready */) {
        parent->state_ = proc::blocked;
        parent->yield();
    }
    ...
}

exit(proc* p, ...) {
    ...
    proc* parent = ptable[p->ppid_];
    // wake up the parent; locking elided
    parent->state_ = proc::runnable;
    cpus[parent->cpu_].enqueue(parent);
    ...
}

Looks OK, right? waitpid blocks by setting proc::state_ to blocked; exit unblocks the parent by setting its state_ to runnable and enqueuing it on its home CPU. Well, it’s not OK. Consider the following sequence of events, where the parent calls waitpid on one CPU while one of its children calls exit on another:

waitpid:                                    exit:
                                                parent->state_ = proc::runnable;
    parent->state_ = proc::blocked;
    parent->yield();
    CPU dequeues `parent`, doesn’t run it
                                                cpus[0].enqueue(parent);
    ...later, CPU dequeues `parent` again,
    but still doesn’t run it, since its
    `state_` is still `blocked`!

The parent blocks indefinitely, even though it has a child with status ready to collect. There’s a race condition between sleeping (waitpid) and wakeup (exit). Correctness requires that when wakeup and sleep occur simultaneously, wakeup must win. But there’s nothing in the code to enforce that order.

Therefore, the sleeping and wakeup code must synchronize using some lock. The sleeping code assigns proc::state_ to blocked with the lock held, and validates that it should go to sleep before releasing that lock. But what lock?

Wait queues

It’s obvious in exit what process to wake: it is the unique parent process of the exiting process. Blocking in sys_msleep is different: which processes should the timer interrupt wake up? We need a queue of processes waiting for a timer interrupt.

Operating system kernels are full of different conditions that must be awaited, so kernels provide data types that simplify the process of waiting. You’ll implement a wait queue data type. To introduce a condition, the kernel declares a wait_queue. A sleeping kernel thread initializes and manipulates a waiter object, in the following way:

waiter w(this);
while (1) {
    w.prepare(&waitq);
    if (this process should wake up) {
        break;
    }
    w.block();
}
w.clear();

The wait_queue object contains a spinlock for synchronization and a list of blocked waiters. The waiter object stores the wait queue and the blocked process, allowing wakeup code to find and wake up every waiting process.

Read more about wait queues

Implementation

procs should remember what CPU they are running on.
Add proc::wake that unblocks this proc and schedules it on its home CPU.

Your method should do nothing for procs with state_ != proc::blocked. You’ll need to reason about synchronization here!
Implement correctly-synchronized true blocking for sys_waitpid. Check that the tests still work; use your own tests to validate that blocking is working correctly (i.e., that there are many fewer proc::resume calls).
Implement a wait queue abstraction. You can start from our skeleton in k-wait.hh or write your own.
Use your wait queue to implement correctly-synchronized true blocking for sys_msleep. Where is the sleeping code? Where is the wakeup code? As before, check that the tests still work and that the number of proc::resume calls has reduced.
A single wait queue for timer interrupts leads to lots of unnecessary wakeups. To reduce this expense, it’s data structure time. Try implementing a timing wheel: \(N\) wait queues, where queue \(i\) holds all waiters that should be awoken at any time \(\equiv i \pmod N\). (Timing wheels blog post; “Hashed and Hierarchical Timing Wheels: Data Structures for the Efficient Implementation of a Timer Facility”) Or get really fancy and implement a timer heap (Go example; libev uses a 4-heap, which has better cache behavior).

G. System call interruption

In the last part of the problem set, you will implement system call interruption. Specifically, if a child exits while a parent is blocked in sys_msleep, then the parent’s sys_msleep should return early with the E_INTR return code.

Use make run-testeintr to check your work.

Note: This part of the problem set models the Unix signal mechanism. Unfortunately, as we discussed in CS61, the signal mechanism has inherent race conditions that are difficult to solve. It is OK for your interruption mechanism to have similar race conditions as Unix signals. For instance:

If a child exits immediately before the parent calls sys_msleep, then sys_msleep need not be interrupted.
If a child exits while the parent is processing sys_msleep, but before sys_msleep decides to block, then sys_msleep need not be interrupted.

However, if a child exits after sys_msleep decides to block, then sys_msleep must return early with the E_INTR return code.

Hint: You may need to add extra member variables to struct proc.

Turnin

Fill out psets/pset2answers.md and psets/pset2collab.md and push to GitHub. Then submit on the grading server.

Intermediate checkin 1: Turn in parts A–B by 11:59pm Monday 2/18.
Intermediate checkin 2: Turn in parts C–D by 11:59pm Monday 2/25.
Final checkin: Turn in all parts by 11:59pm Monday 3/4.

Note about intermediate checkins: We do not grade intermediate checkins; their purpose is simply to keep you working and on track. If you want to discuss your code after an intermediate checkin, make sure it’s pushed to the grading server and see us during office hours.

Problem set 2: Buddy allocation and wait queues