A. Kernel buddy allocator
The existing Chickadee allocator, `kallocpage()`, can only allocate memory in units of single pages, and it cannot free memory. That’s bad!
Freeing memory is not too hard to support; a singly-linked free list of pages could work. But we seek an efficient allocator for variable-sized allocations. This is important for kernel functionality; for instance, Chickadee kernel stacks are limited to 4000 bytes or so because `struct proc`s are allocated in units of single pages.
In this part, you will therefore implement a buddy allocator for kernel memory.
Complete the following functions in `k-alloc.cc`:

```c++
void* kalloc(size_t sz)
void kfree(void* ptr)
void init_kalloc()
void test_kalloc()
```
`kalloc` and `kfree` are the `malloc` and `free` equivalents. `init_kalloc()` should perform any initialization necessary so that the allocator has access to all available physical memory (i.e., physical memory with type `mem_available` in `physical_ranges`), and `test_kalloc()` should perform tests rigorous enough to convince the course staff that your allocator works correctly.
Then replace your `kallocpage()` implementation with this:

```c++
x86_64_page* kallocpage() {
    return reinterpret_cast<x86_64_page*>(kalloc(PAGESIZE));
}
```

and make sure Chickadee still works.
Buddy allocation
The buddy allocation algorithm is one of the oldest general algorithms for memory allocation. (It was invented in 1963!) Buddy allocation supports efficient splitting (breaking large contiguous blocks of free memory into smaller pieces) and coalescing (merging adjacent free blocks into larger contiguous blocks).
Buddy allocation is based on powers of 2 called orders. An allocation request of size \(s\) uses a block of size \(2^o\) and order \(o\), where
\[o = \left\lceil \log_2 s \right\rceil.\]
For example, an allocation request of size 4095 will use a block of size 4096 and order 12, but an allocation request of size 4097 will use a block of size 8192 and order 13. The physical address of a block of order \(o\) is always a multiple of \(2^o\).
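For concreteness, here is the order computation transcribed directly into standalone code (an illustration, not part of the handout):

```c++
#include <cassert>
#include <cstddef>

// Smallest order o with 2^o >= sz, i.e., ceil(log2(sz)).
int order_for(size_t sz) {
    int o = 0;
    while ((size_t(1) << o) < sz) {
        ++o;
    }
    return o;
}

int main() {
    assert(order_for(4095) == 12);   // block size 4096
    assert(order_for(4096) == 12);
    assert(order_for(4097) == 13);   // block size 8192
}
```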
Allocation splits large blocks until there’s a free block with the right size.
1. Let `o` be the desired order for the allocation.
2. Find a free block with order `o`. If found, return that block.
3. Otherwise, find a free block with order `x > o`, minimizing `x`. If found:
   - Split the block in two, creating two free blocks with order `x-1`. These two blocks are called order-`(x-1)` buddies because they are adjacent blocks with order `x-1`, and the address of one is easily calculated from the address of the other.
   - Repeat step 2.
4. If there’s no larger free block, the allocation fails: return `nullptr`.
Freeing coalesces freed buddies whenever possible. This can create large contiguous free blocks very efficiently.
1. Let `o` be the order of the freed block.
2. Check whether the freed block’s order-`o` buddy is also completely free. If it is:
   - Coalesce the freed block and the buddy into a single free block of order `o + 1`.
   - Repeat step 2.
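Putting the two procedures together, the core of the allocator is a pair of short loops. This is a sketch only: `smallest_free_order`, `pop_free_block`, `mark_free`, `remove_free_block`, `buddy_of`, and `buddy_is_free` are hypothetical helpers you would build from the state described below, not handout code.

```c++
void* allocate_block(int o) {
    int x = smallest_free_order(o);   // smallest order x >= o with a free block;
    if (x > max_order) {              //   returns max_order + 1 if none exists
        return nullptr;               // no block large enough: allocation fails
    }
    void* blk = pop_free_block(x);
    while (x > o) {
        --x;
        mark_free(buddy_of(blk, x), x);   // split: free the upper half,
    }                                     //   keep the lower half
    return blk;
}

void free_block(void* blk, int o) {
    while (o < max_order && buddy_is_free(blk, o)) {
        void* b = buddy_of(blk, o);
        remove_free_block(b, o);      // the buddy leaves its free list
        if (b < blk) {
            blk = b;                  // coalesced block starts at the lower address
        }
        ++o;
    }
    mark_free(blk, o);
}
```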
Buddy allocators have low external fragmentation (unusable space between allocated blocks) because they aggressively coalesce freed memory into larger blocks. Their internal fragmentation (unused space inside allocated blocks) is at most 50%. Many real-world allocators use buddy allocation, often coupled with other techniques (e.g., for efficient multicore support); for instance, both Linux and jemalloc use buddy allocation.
Designing your allocator
Your buddy allocator will use two constant parameters, `min_order` and `max_order`, which are the minimum and maximum orders the allocator supports. Your allocator will have `min_order` 12, so every allocation request will use at least a page of memory. (Note that this increases the maximum internal fragmentation.) It should have `max_order` at least 21, allowing it to support contiguous allocations of up to 2MiB.
Your allocator will maintain at least the following state.
- An array that tracks the state of every physical page in the system. Specifically, for physical page number `p` (i.e., the page with physical address `p * PAGESIZE`) that is the beginning of an allocated or free block, `pages[p]` should say:
  - Whether the page is free or not.
  - The order of the block.
- One free list per order.
  - The list heads will be global.
  - You will also need a way to link free blocks together.
The `pages` array lets you quickly check whether a buddy is free. The free lists let you quickly find a free block of a given order. Both these operations are O(1), and there are a constant number of orders, so both `kalloc` and `kfree` should take O(1) time.
You own the definition of the page structure and the `pages` array (and you can call the array something else if you want). You will add more information to the array later.
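For example, the state might be declared along these lines. This is a sketch: the names are ours, `MEMSIZE_PHYSICAL` stands in for whatever bound on physical memory you track, and it assumes the `list_links`/`list` types from `k-list.hh`.

```c++
constexpr int min_order = 12;    // one page
constexpr int max_order = 21;    // 2MiB

struct pageinfo {
    bool free = false;       // meaningful only at the first page of a block
    int order = min_order;   // order of the block that begins at this page
    list_links link_;        // links a free block into its free list
};

pageinfo pages[MEMSIZE_PHYSICAL / PAGESIZE];                 // one entry per page
list<pageinfo, &pageinfo::link_> free_lists[max_order + 1];  // one list per order
```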
Hints
We’ve added some useful math functions to `lib.hh`. You might especially like `msb(x)`, which returns the index of the most significant 1 bit in `x`, plus one. (For example, `msb(1) == 1`; `msb(4095) == 12`; `msb(4096) == 13`. As a special case, `msb(0) == 0`.) There are also `roundup_pow2` and `rounddown_pow2`.
We’ve also added a generic doubly linked list type in `k-list.hh`. This would be useful for building free lists; or you can build your own.
Make sure you add assertions to your code. You’ll probably also want some helper functions, such as a function that returns a block’s order-`o` buddy, and a function that confirms that a physical address is valid as an order-`o` block.
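For instance, helpers along these lines (a sketch with our own names) follow from the fact that the order-`o` buddy of a block differs from it in exactly address bit `o`:

```c++
// Physical address of the order-o buddy of the block at physical address pa.
inline uintptr_t buddy_pa(uintptr_t pa, int o) {
    return pa ^ (uintptr_t(1) << o);
}

// True iff pa could begin an order-o block: the order is in range and pa is
// a multiple of the block size 2^o.
inline bool is_valid_block(uintptr_t pa, int o) {
    return o >= min_order && o <= max_order
        && pa % (uintptr_t(1) << o) == 0;
}
```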
Slab allocation (advanced optional work)
This style of buddy allocator is great for large objects, but because it relies on a `pages` array with information about each `min_order` block, it doesn’t scale to small allocation sizes (e.g., 64-byte allocations). For smaller allocations, most systems switch to another allocation strategy, such as slab allocation. Design and implement such an allocator.
B. Exit
Implement the `sys_exit` system call.
Preparation
- Merge our latest code via `git pull handout master` or equivalent. (You want `91a2b361336b917253082ea3013e4b2ccf228a1b` or later.) Make sure your kernel still runs!
- Change `process_setup` to map the console at address 0xB8000 (i.e., `ktext2pa(console)`) in the initial process by default.
Implementing `exit`
Your initial version of `exit` should do the following:
- Remove the current process from the process table.
- Free all memory associated with the current process (which will exercise your buddy allocator from Part A).
When you’re done, `make run-allocexit` should show an endless explosion of different colors as new processes are created and old processes exit (freeing memory). There should be no leaks. (A leak will appear as a “free” physical page, a period in the physical page map, that is never reused.)
To make `exit` work in Chickadee, you’ll need to obey several invariants.
- The process table is protected by the `ptable_lock`.
- The current stack page must not be free. This means that a kernel task, such as the kernel half of a process, cannot fully free itself! (When the kernel is running on behalf of a process, the kernel’s stack pointer is in that process’s stack page.) The kernel task responsible for the exiting process must delegate its final freeing to some other logical thread of execution in the kernel. What design will you choose?
- The current page table must not be free. The memviewer examines process page tables, so synchronization is required when freeing process page tables. You can accomplish this one of three ways:
  - It is safe to free page tables while the `ptable_lock` is held.
  - It is safe to free page tables from a process not present in `ptable`.
  - It is safe to free page tables if the process’s “page table lock” is held. Our handout code has stub functions for implementing a page table lock; you’d need to actually implement them.

  Note that adding mappings to a page table works fine, even concurrently with the memviewer. It’s just erasing mappings, and especially freeing page tables, that cause problems.
And you’ll probably need to change your code for `sys_fork`, and possibly other code, to avoid leaks.
C. Sleep
Implement the `sys_msleep` system call, which sleeps (i.e., does not return to its caller) until at least `N` milliseconds have passed.
For now, `sys_msleep` should return 0. In a future part, you will add support for system call interruption, and `sys_msleep` will return the error constant `E_INTR` if it is interrupted.
To add the system call, you’ll add a system call number to `lib.hh`, complete the user-level wrapper in `p-lib.hh`, and add an implementation case to `proc::syscall`. The kernel implementation will use the kernel’s `ticks` variable, which is a global variable incremented by the timer interrupt handler every 0.01 seconds. (So you’ll want to round the sleep argument up to the nearest multiple of 10 milliseconds.)
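For example, the kernel side might compute the wakeup time like this (a sketch of the polling version, assuming the millisecond argument is in `msecs`; Part F replaces the yield loop with true blocking):

```c++
// Round the millisecond argument up to whole 10 ms ticks, then spin-yield
// until that tick arrives. The subtraction keeps the comparison correct
// even if `ticks` ever wrapped around.
unsigned long wakeup_tick = ticks + (msecs + 9) / 10;
while (long(wakeup_tick - ticks) > 0) {
    this->yield();
}
return 0;
```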
Use `make run-testmsleep` to test your work.
Warning! If your initial implementation fails, see Lecture 7.
D. Parent processes
Implement parent processes and the `sys_getppid` system call.
In Unix, every process has a parent; the `getppid` system call returns the parent’s process ID. If a parent exits before a child, the child is reparented to have parent process ID 1. Process ID 1 is special; that process, called the init process, corresponds to an unkillable kernel task. Chickadee follows this design.
First implement the easy cases. These can be completed before you figure out performance and synchronization.
- Add a parent process ID member to `struct proc`.
- Add the system call: add the system call number to `lib.hh`, complete the user-level wrapper in `p-lib.hh`, and add an implementation case to `proc::syscall`.
- Maintain parent process IDs correctly in `fork`.
Use `make run-testppid` to test your work. You should succeed through the “tests without exit succeeded” line. (The test program depends on a working `sys_msleep` implementation from Part C, though the rest of the code in this part is independent of `sys_msleep`.)
Implementation note: As you add content to `struct proc`, you may need to update your pset 1 stack canary code, especially if you hard-wired offset constants into assembly. It would be OK to disable your stack canary code for now (though as you make `struct proc` bigger and add kernel functionality, the stack canary might be useful!).
Then, implement `exit` and process reparenting.

- Add a kernel initialization step that creates the init process. Ensure it has pid 1 and ppid 1. Its initial code can be a loop that calls `proc::yield()`.
- Change `kernel_start` so that the initial user process has pid 2 and ppid 1.
- Maintain parent process IDs correctly in `exit`. (This is the hard case.)
Write up in `pset2answers.md`:
- The design of your per-process metadata (see Performance, below).
- Your synchronization invariants or other synchronization plan (see Synchronization, below).
Performance
When a process exits, the kernel must reparent its children. This should take \(O(C)\) time for a process with \(C\) children (rather than, say, \(O(P)\) time, where \(P\) is the total number of processes in the system).
Implementing \(O(C)\) exit will require tracking additional per-process metadata, which your writeup should describe.
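One structure that achieves this (a sketch assuming `k-list.hh`; the member names are ours) threads each process into its parent’s list of children:

```c++
// in struct proc:
pid_t ppid_;                                  // parent's process ID
list_links sibling_links_;                    // membership in the parent's children_
list<proc, &proc::sibling_links_> children_;  // this process's children
```

With sibling links, `exit` can walk `children_` exactly once, setting each child’s `ppid_` to 1 and splicing the remainder onto the init process’s `children_`, which is \(O(C)\).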
Synchronization
Chickadee processes so far have been quite isolated: no action performed by one running process ever accessed another running process’s state. Here is a visualization of kernel programming under these conditions [1].
Starting in this portion, actions taken by one running process can affect other running processes. And Chickadee is a multiprocessor kernel, so these processes might be running simultaneously, in kernel mode, on different CPUs. What happens if one process exits while one of its children is calling `sys_getppid`? What happens if a process and one of its children are exiting simultaneously? It is incredibly important to implement correct synchronization or the kernel could break. Here is a visualization of kernel programming under these conditions [2].
Correct synchronization requires planning ahead. A good approach is to write down synchronization invariants: what locks are required to examine or modify particular data items? Chickadee already has a couple synchronization invariants; you will now start adding your own.
The simplest invariants are usually the most coarse-grained. (Early multicore operating systems used a giant lock approach, or “big kernel lock,” in which the entire kernel was protected by a single lock!) For instance, perhaps “any change to the process hierarchy requires the `ptable_lock`.”
Finer-grained invariants are also possible when you reason about state by purpose. For example, a global `process_hierarchy_lock` could protect access to the process hierarchy (that is, process parent and child links). This is somewhat more scalable than using `ptable_lock` for everything, since some accesses to `ptable` don’t require the `process_hierarchy_lock`, and some accesses to the process hierarchy don’t require the `ptable_lock`. Or you can have a lock per process, if you can make it work. (For what it’s worth, Linux uses a global “task list lock”, which is like a `process_hierarchy_lock`, but implements it as a readers-writer lock rather than a spinlock. The super-scalable sv6 operating system uses a lock per process. Remember that correctness matters way more than performance for us now.)
Describe your synchronization invariants in your writeup.
E. Wait and exit status
Implement the `sys_waitpid` system call and the exit status argument to `sys_exit`. Use the `p-testwaitpid` and `p-testzombie` programs to check your work.
Specification
Chickadee `sys_waitpid` implements a subset of the Unix `waitpid` behavior. It implements blocking and polling, exit status return, and selective waiting, but does not (yet) implement signal reporting, job control, or process groups.
```c++
pid_t sys_waitpid(pid_t pid, int* stat = nullptr, int options = 0)
```

- `pid` names the child process to wait for; if `pid == 0`, then the system call waits for any child (it returns the first child to exit). If `pid != 0` and none of this process’s children have that pid, or if `pid == 0` and this process has no children at all, then the system call returns `E_CHILD`.
- `options` determines whether the system call should block until a child exits. If it is 0, then the system call should block; if it is `W_NOHANG`, then the system call should return `E_AGAIN` rather than blocking if no relevant child has exited.
- If the system call succeeds, then its return value is the process ID of the exited child, and (unless `stat == nullptr`) `*stat` is set to the child’s exit status (i.e., the argument given to `sys_exit`).
- It is only possible to wait for a child’s status once. After `sys_waitpid` successfully reports a child’s exit status, the relevant PID becomes available for reuse. (The PID is not available for reuse before that point.)
Implementation notes
Remember that system calls and assembly functions can define their own calling conventions. We recommend that the kernel use another register, such as `%rcx`, to return the exit status to the user-level wrapper. The user-level wrapper, rather than the kernel, will move the resulting value to `*stat`. This sidesteps issues with cross-context memory checking (the kernel must validate any pointer provided by untrusted user code). Look to `p-lib.hh` for examples of how to associate registers with values.
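For instance, the wrapper might look like the following sketch. It assumes the kernel returns the PID in `%rax` and the exit status in `%rcx`, and `SYSCALL_WAITPID` is a placeholder for whatever number you chose:

```c++
inline pid_t sys_waitpid(pid_t pid, int* stat = nullptr, int options = 0) {
    register uintptr_t rax asm("rax") = SYSCALL_WAITPID;
    register uintptr_t rdi asm("rdi") = pid;
    register uintptr_t rsi asm("rsi") = options;
    register uintptr_t rcx asm("rcx");
    asm volatile ("syscall"
                  : "+a" (rax), "=c" (rcx)
                  : "r" (rdi), "r" (rsi)
                  : "r11", "memory", "cc");
    if (stat != nullptr) {
        *stat = int(rcx);    // the copy happens in user mode, so the kernel
    }                        //   never dereferences the untrusted pointer
    return pid_t(rax);
}
```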
`sys_waitpid` will change the mechanism for freeing exited processes. Exited processes still become non-runnable, but their exit status must be preserved, and their PIDs not reused, until their parent calls `sys_waitpid`. You’ll need to add some stuff to `struct proc`. You’ll also be able to remove stuff from `cpustate::schedule` that you added in Part B (what?).
The `p-testwaitpid` program tests `W_NOHANG` functionality before it tests blocking functionality, so you can test a simple, nonblocking version of the system call first. The `p-testzombie` program also tests important `sys_waitpid` functionality.
Since `waitpid` involves cross-process communication, it requires synchronization. Your writeup should describe your synchronization invariants.
F. True blocking
Chickadee now has two blocking system calls, `sys_msleep` and `sys_waitpid`, but (unless you were really ambitious) neither of these system calls has a truly blocking implementation. User processes calling these system calls appear to block, but their corresponding kernel tasks remain runnable and repeatedly yield. This is easy to program but inefficient. It’s far better for the kernel tasks corresponding to blocked processes to become non-runnable.
The problem can be quantified. We added a counter to each process that counts how many times that process resumes. With a `proc::yield`-based implementation of `sys_msleep`, the `p-testmsleep` program results in more than 100,000 total resume events!
In this problem you will add true blocking to Chickadee, while avoiding the dreaded sleep-wakeup race condition, also known as the lost wakeup problem.
Lost wakeups in `sys_waitpid`

To understand lost wakeups, consider broken pseudocode like this for `waitpid`:
```c++
waitpid(proc* parent, ...) {
    ...
    while (/* no child is ready */) {
        parent->state_ = proc::blocked;
        parent->yield();
    }
    ...
}

exit(proc* p, ...) {
    ...
    proc* parent = ptable[p->ppid_];
    // wake up the parent; locking elided
    parent->state_ = proc::runnable;
    cpus[parent->cpu_].enqueue(parent);
    ...
}
```
Looks OK, right? `waitpid` blocks by setting `proc::state_` to `blocked`; `exit` unblocks the parent by setting its `state_` to `runnable` and enqueuing it on its home CPU. Well, it’s not OK. Consider the following sequence of events, where the parent calls `waitpid` on one CPU while one of its children calls `exit` on another:
```
waitpid:                              exit:
                                      parent->state_ = proc::runnable;
parent->state_ = proc::blocked;
parent->yield();
  CPU dequeues `parent`,
  doesn’t run it
                                      cpus[0].enqueue(parent);
  ...later, CPU dequeues `parent`
  again, but again doesn’t run it,
  since its state_ is still blocked!
```
The parent blocks indefinitely, even though it has a child with status ready to collect. There’s a race condition between sleeping (`waitpid`) and wakeup (`exit`). Correctness requires that when wakeup and sleep occur simultaneously, wakeup must win. But there’s nothing in the code to enforce that order.
Locks to the rescue: the sleeping and wakeup code must synchronize using some lock. The sleeping code assigns `proc::state_` to `blocked` with the lock held, and validates that it should go to sleep before releasing that lock. But what lock?
Wait queues
It’s obvious in `exit` what process to wake: it is the unique parent process of the exiting process. Blocking in `sys_msleep` is different: which processes should the timer interrupt wake up? We need a queue of processes waiting for a timer interrupt.
Operating system kernels are full of different conditions that must be awaited, so kernels provide data types that simplify the process of waiting. You’ll implement a wait queue data type. To introduce a condition, the kernel declares a `wait_queue`. A sleeping kernel thread initializes and manipulates a `waiter` object, in the following way:
```c++
waiter w(this);
while (1) {
    w.prepare(&waitq);
    if (/* this process should wake up */) {
        break;
    }
    w.block();
}
w.clear();
```
The `wait_queue` object contains a spinlock for synchronization and a list of blocked `waiter`s. The `waiter` object stores the wait queue and the blocked process, allowing wakeup code to find and wake up every waiting process.
Read more about wait queues in the Lecture 8 notes.
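Concretely, the two types might be declared like this; this is a guess at the shape, and the skeleton in `k-wait.hh` may differ:

```c++
struct wait_queue;

struct waiter {
    proc* p_;               // the blocking kernel task
    wait_queue* wq_;        // the queue p_ is waiting on
    list_links links_;

    explicit waiter(proc* p);
    void prepare(wait_queue* wq);   // lock the queue, mark p_ blocked, enqueue
    void block();                   // yield; actually sleeps if still blocked
    void clear();                   // dequeue and make p_ runnable if needed
};

struct wait_queue {
    list<waiter, &waiter::links_> q_;   // the blocked waiters
    spinlock lock_;                     // protects q_ and waiters' transitions
};
```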
Implementation
1. `proc`s should remember what CPU they are running on.
2. Add `proc::wake` that unblocks this `proc` and schedules it on its home CPU. Your method should do nothing for `proc`s with `state_ != proc::blocked`. You’ll need to reason about synchronization here! Our initial implementation is in the Lecture 8 notes.
3. Implement correctly-synchronized true blocking for `sys_waitpid`. Check that the tests still work; use your own tests to validate that blocking is working correctly (i.e., that there are many fewer `proc::resume` calls).
4. Implement a wait queue abstraction. You can start from our skeleton in `k-wait.hh` or write your own.
5. Use your wait queue to implement correctly-synchronized true blocking for `sys_msleep`. Where is the sleeping code? Where is the wakeup code? As before, check that the tests still work and that the number of `proc::resume` calls has reduced.
6. A single wait queue for timer interrupts leads to lots of unnecessary wakeups. To reduce this expense, it’s data structure time. Try implementing a timing wheel: \(N\) wait queues, where queue \(i\) holds all `waiter`s that should be awoken at any time \(\equiv i \pmod N\); see the sketch after this list. (Timing wheels blog post; “Hashed and Hierarchical Timing Wheels: Data Structures for the Efficient Implementation of a Timer Facility”) Or get really fancy and implement a timer heap (Go example; libev uses a 4-heap, which has better cache behavior).
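A minimal timing-wheel sketch, under assumptions: the slot count, the array name, and the `wake_all`-style operation are ours (use whatever your wait queue provides for waking every waiter):

```c++
// Slot i holds every waiter whose wakeup tick is congruent to i mod nwheel.
constexpr int nwheel = 8;
wait_queue sleep_wheel[nwheel];

// Sleeping side, in sys_msleep: block on the slot for the wakeup tick.
//     w.prepare(&sleep_wheel[wakeup_tick % nwheel]);

// Waking side, in the timer interrupt: wake only the current tick's slot.
//     sleep_wheel[ticks % nwheel].wake_all();
```

Awakened waiters must still recheck their own deadlines, since one slot mixes waiters for ticks \(t\), \(t + N\), \(t + 2N\), and so on; the standard prepare/block loop handles this naturally.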
G. System call interruption
In the last part of the problem set, you will implement system call interruption. Specifically, if a child exits while a parent is blocked in `sys_msleep`, then the parent’s `sys_msleep` should return early with the `E_INTR` return code.
Use `make run-testeintr` to check your work.
Note: This part of the problem set models the Unix signal mechanism. Unfortunately, as we discussed in CS61, the signal mechanism has inherent race conditions that are difficult to solve. It is OK for your interruption mechanism to have race conditions similar to those of Unix signals. For instance:
- If a child exits immediately before the parent calls `sys_msleep`, then `sys_msleep` need not be interrupted.
- If a child exits while the parent is processing `sys_msleep`, but before `sys_msleep` decides to block, then `sys_msleep` need not be interrupted.
However, if a child exits after `sys_msleep` decides to block, then `sys_msleep` must return early with the `E_INTR` return code.
Hint: You may need to add extra member variables to `struct proc`.
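One possible design, sketched under assumptions (the `interrupted_` flag, `sleep_wq`, and `wakeup_tick` are all our names): `exit` sets a flag on the parent before waking it, and the sleep loop checks that flag each time it resumes.

```c++
// in struct proc:
//     std::atomic<bool> interrupted_ = false;

// in sys_msleep, replacing the pure sleep loop:
waiter w(this);
int r = 0;
while (true) {
    w.prepare(&sleep_wq);
    if (long(wakeup_tick - ticks) <= 0) {
        break;                    // slept long enough
    }
    if (interrupted_) {
        r = E_INTR;               // a child exited while we were blocked
        break;
    }
    w.block();
}
w.clear();
return r;
```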
Turnin
Fill out `psets/pset2answers.md` and `psets/pset2collab.md` and push to GitHub. Then submit on the grading server.
Intermediate checkin 1: Turn in part A by 11:59pm Monday 2/12.
Intermediate checkin 2: Turn in parts B–D by 11:59pm Monday 2/19.
Final checkin: Turn in all parts by 11:59pm Monday 2/26.