Problem set 1: Welcome and buddy allocation

These initial exercises get you acclimated to the Chickadee OS code and our documentation. They focus on kernel memory allocation and virtual memory.

Intermediate checkin: Turn in Parts A and B by 11:59pm Tuesday 2/4.
Final checkin: Turn in all parts by 11:59pm Wednesday 2/12.

Note about intermediate checkins: We do not grade intermediate checkins; they’re intended just to keep you working. If you want to discuss your code after an intermediate checkin, make sure it’s pushed to the grading server and see us during office hours.

Run it

Your first step is to create a personal fork of the Chickadee repository. Follow the setup instructions.

Then clone that Chickadee fork to your development environment (instructions below). When you’re ready, make run should pop up a window like this, with red “1”s gradually marching to the right:

Chickadee original state

Build instructions

The native Chickadee build environment is Linux. To use Linux, you can do one of three things:

  1. Use Docker;
  2. Prepare and install a Linux virtual machine on your preferred computer;
  3. or set up your own Linux machine.

For most students, Docker will be easiest. Though one can run Chickadee on Mac OS X or Windows WSL without Docker, we recommend you set up Docker first.

Docker instructions

  1. Download and install Docker from docker.com.

    On Linux, follow Docker’s instructions. On Mac, follow Docker’s instructions, but make sure you download the correct version for your computer (Apple Silicon (M1/M2/M3 Macs) need the Apple Chip version, not the Intel Chip version). On Windows, use the Windows instructions below.

  2. Clone your chickadee-s25 repository.

  3. Open a terminal and change into your chickadee-s25/docker directory. For instance:

    $ cd ~/chickadee-s25-kohler/docker
    
  4. Run this command. It will take a while—up to ten minutes.

    $ ./build-docker
    

    The command starts up a virtual Linux-based computer running inside your computer. Then, inside that virtual environment, it installs software useful for CS 1610, and takes a snapshot of that running environment. (The snapshot is called chickadee:latest.) Once the snapshot is created, it’ll take just a second or so for Docker to restart it.

  5. Change into the main chickadee-s25 directory and run ./run-docker. This will connect to the Chickadee Linux environment inside your terminal.

Virtual machine instructions

We use VMware Fusion for Mac machines, and anyone can install VMware Fusion Player for free for noncommercial use. There is other virtualization software, such as VirtualBox and Parallels, that some students prefer. On Windows machines, students have had good experiences using VirtualBox.

You need a recent version of your preferred Linux distribution because you want a recent compiler. Chickadee builds without additional configuration on GCC 13 or later. GCC 13 is the native compiler on Ubuntu 24.04. The Docker container uses Ubuntu 24.04.

Mac OS X instructions

You will need a cross-compiler and an installation of QEMU. Note that native Mac OS installation is not officially supported by the course staff. We recommend that you use Docker or install a virtual machine. But!—if you are adventurous, you can try the following, which should work on M1/M2 and x86 Macs:

  1. Install Homebrew.

  2. Install Homebrew’s GCC package, which, at the time of writing, installs gcc-14 and g++-14: brew install gcc

  3. Install Homebrew’s QEMU: brew install qemu

  4. Install a collection of cross-compilers: brew tap messense/homebrew-macos-cross-toolchains; brew install x86_64-unknown-linux-gnu

  5. Edit the file config.mk in your Chickadee directory to contain this:

    CCPREFIX=x86_64-unknown-linux-gnu-
    HOSTCC=gcc-14
    HOSTCXX=g++-14
    

    But do not add config.mk to your repository! It’s meant to configure the build for a specific development environment; the grading server environment, for instance, shouldn’t use those settings.

Chickadee overview

Chickadee is based on WeensyOS, so it may feel familiar to you. But there are important differences. Chickadee, unlike WeensyOS, is a multiprocessor operating system; and Chickadee, unlike WeensyOS, supports kernel task suspension: a Chickadee kernel task processing a system call on behalf of one process can suspend itself, either voluntarily or because of an interrupt, and allow other tasks to run, resuming later in kernel space exactly where it left off. Kernel task suspension has profound effects on operating system structure.

A. Memory allocator

The Chickadee kernel memory allocator is located in k-alloc.cc. It supports allocating physical memory one page at a time. It doesn’t support larger or smaller allocations, or freeing memory—you’ll address those issues later.

Answer the following questions about k-alloc.cc. You may find Chickadee's online documentation useful; see the "Documentation" tab at the top of the course website. Write up your answers in the Markdown file psets/pset1answers.md; you will turn this file in.

  1. What is the maximum size supported by kalloc()?

  2. What is the first address returned by kalloc()? Use log_printf to determine the answer experimentally, then explain this answer from the code. (See Memory iterators for more information on the physical_ranges object.)

  3. What is the largest address returned by kalloc()? (You may want to reduce ALLOC_SLOWDOWN and disable the sys_pause call in p-allocator.cc so the allocation process runs faster, but don’t forget to undo these changes.)

  4. Does kalloc() return physical addresses, high canonical (kernel virtual) addresses, or kernel text addresses? (See Memory layout.) What line of code determines this?

  5. What one-line change to some .cc file other than k-alloc.cc would allow kalloc() to use 0x300000 bytes of physical memory? Test your answer.

  6. Rewrite the “skip over reserved and kernel memory” loop so that it is simpler. You will use physical_ranges.type() (and, perhaps, physical_ranges.limit()) rather than physical_ranges.find(). Paste your loop into your answers file.

  7. The loop using type() is simpler than the loop using find(), but the loop using find() has other advantages. What are they? Give a quantitative answer.

  8. What bad thing could happen if there was no page_lock?

Warning! Future problem sets will frequently ask you to allocate memory for a data structure, such as a proc or an array of files. You should generally not use kalloc to allocate such memory. This is because kalloc allocates plain memory—an array of bytes—without initializing that memory appropriately. Your code should generally allocate an object by calling knew<OBJECT_TYPE>(), passing constructor arguments as appropriate. knew() calls kalloc() under the hood, but it additionally initializes the allocated memory to be a valid object of the allocated type.

Make sure, also, that you match allocation functions with deletion functions. Memory allocated by kalloc should be freed by kfree, but memory allocated by knew should be freed instead by the C++ delete operator. delete deinitializes the object being deleted by calling its destructor, whereas kfree just deallocates the memory.

It’s still safe to call kalloc when plain memory is required, as in the implementation of SYS_PAGE_ALLOC.
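
For illustration, the two pairings look like this (the surrounding function is hypothetical, and proc is just one example of an object type you might allocate):

// Hypothetical illustration of matching allocators with deallocators.
void allocation_example() {
    proc* p = knew<proc>();          // knew calls kalloc() and runs proc's constructor
    if (p) {
        // ... use p ...
        delete p;                    // runs the destructor, then frees the memory
    }

    void* page = kalloc(PAGESIZE);   // plain, uninitialized bytes
    if (page) {
        // ... use page ...
        kfree(page);                 // kalloc pairs with kfree
    }
}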

B. Memory viewer

The Chickadee kernel displays physical and virtual memory maps on the console. In WeensyOS, the memory viewer used the explicit physpages array to track which physical pages were allocated. You’ll introduce a structure like this in Part H (the kernel buddy allocator). In the handout code, though, Chickadee computes a memory map using physical_ranges and process page tables.

Answer the following questions about memusage::refresh() in k-memviewer.cc.

  1. Which line of code marks physical page 0x100000 as kernel-restricted?

  2. Which line of code marks struct proc memory as kernel-restricted?

  3. The ptiter loop marks pages as kernel-restricted, whereas the vmiter loop marks pages as user-accessible. Why this distinction? What kinds of pages are involved in each case? What bad things could occur if the pages marked by the ptiter loop were user-accessible?

  4. All pages marked by the pid loop should have the same mem_ type constant (i.e., physical_ranges.type() will return the same value for each page). What is that memory type? Why is this true?

  5. In the vmiter loop, replace it.next() with it += PAGESIZE. What happens? Give a qualitative answer (what does QEMU do?) and a brief explanation.

  6. memusage::refresh() aims to capture all allocated memory, but it fails to do so: on the top line of the physical memory map you should see three periods ., indicating three allocated pages that memusage missed. (If you make run NCPU=X for X≠2, you’ll see more or fewer periods. Hm!) Where are those pages allocated, and (if you can tell) what are their purposes?

    Hint: Add debugging code to kalloc() to figure this out. The log_backtrace function may be useful, coupled with obj/kernel.sym and/or obj/kernel.asm.

  7. Change memusage::refresh() so that it captures all allocated pages. You should make the necessary updates in the code, and also explain what you did in pset1answers.md.

C. C++ entry transitions

This question is about C++ entry transitions, which we define as places in the code where a C++ function can gain control for the first time on a stack. Put another way, during a C++ entry transition, instructions compiled from C++ source start executing on a previously-empty stack, or on a stack whose contents were initialized by assembly functions or by the hardware. A good understanding of C++ entry transitions will serve you well as you work to understand Chickadee contexts and how kernels work.

Each process has just one C++ entry transition, namely from the kernel’s restore_and_iret to that process’s process_main. Here’s how that’s set up:

  1. The process.ld linker script marks process_main as the program’s entry point. This entry point is recorded in the ELF executable file.
  2. The proc_loader loader functions extract this entry point.
  3. The start_initial_process function in kernel.cc copies the entry point into the newly-initialized process’s rip register, so that the next proc::resume call (which uses assembly) transfers control there via the iret instruction.

So the iret instruction in restore_and_iret in k-exception.S transitions to C++ code at process_main.

Of course, C++ code in processes can resume at many other locations, such as after a system call instruction or after an interrupt. But those resumption locations use a stack that was already set up! The only C++ code that runs on a fresh stack is process_main’s entry point.

Question: The boot loader and kernel have seven C++ entry transitions in total. List them in pset1answers.md and briefly describe the circumstances under which they are called.

Hint: The first transition starts in the boot loader’s assembly code, bootentry.S. The other six start in the assembly code in k-exception.S; look for instructions that can change the instruction pointer, such as call, jmp, and iret. Five of them are pretty clear, but one may be a little harder to puzzle out. 🙃 Look for a while, then ask around.

Going further (optional): The System V Application Binary Interface for x86-64 machines determines the C/C++ calling convention for x86-64 programs on many operating systems, including Chickadee. It states that at function entry, the stack pointer should equal 16*N + 8 for some integer N. This means that when a compiled C++ function begins executing, the stack pointer should be 8 bytes off 16-byte alignment. Though most code will work correctly on any alignment, certain instructions will fail if the alignment is off.

Compilers and operating systems cooperate to enforce this alignment convention. The operating system should set up every C++ entry point to have the correct alignment. Thereafter, the compiler will preserve the correct alignment by making every stack frame a multiple of 16 bytes.

For optional extra work, assert at each C++ entry point that the stack is correctly aligned, and fix any alignment errors you find.

Hint: The Chickadee kernel and applications are compiled with -fno-omit-frame-pointer. This option tells GCC to maintain a base or frame pointer register that can be used to trace through stack frames. On x86-64, the %rbp register is the frame pointer. You’ll notice that all C++-compiled functions start and end with similar assembly code:

prologue: pushq %rbp
          movq %rsp, %rbp
          ... function body ...
epilogue: popq %rbp
          retq

(You might also see the epilogue leaveq; retq, which is equivalent to movq %rbp, %rsp; popq %rbp; retq.) As a result, every C++ function will have a similar stack layout:

~                   ~     ^ higher addresses
|      caller       |
|       data        |
+-------------------+
|   return address  |
+-------------------+  <- %rsp at function entry
|     saved %rbp    |
+-------------------+  <- current %rbp == (%rsp at entry) - 8
|  this function’s  |
|  locals, callee-  |
|  saved regs, etc. |
~   (may be empty)  ~
+-------------------+  <- current %rsp
|    “red zone”     |
|  (locals allowed  |
|   for 128 bytes)  |
~                   ~

In consequence:

  1. All of this function’s stack storage is located at addresses less than %rbp.
  2. %rbp should always be 16-byte aligned. (It equals the entry %rsp minus 8, and the entry %rsp is 8 bytes off 16-byte alignment, so %rbp lands on a 16-byte boundary.)
  3. The value in (%rbp) (in C++ terms, *reinterpret_cast<uintptr_t*>(rdrbp())) is the caller’s %rbp. This lets us trace backwards through callers’ stack frames, as in log_backtrace.
  4. The value in 8(%rbp) (in C++ terms, *reinterpret_cast<uintptr_t*>(rdrbp() + 8)) is the return address.
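
For example, facts 2 through 4 combine into a minimal backtrace-style sketch like the following (illustrative only; log_backtrace is the real implementation, and the stopping condition assumes the outermost frame’s saved %rbp is zero, which you should verify):

// Walk the chain of saved %rbp values, checking alignment along the way.
void sketch_backtrace() {
    uintptr_t rbp = rdrbp();                                      // current frame pointer
    while (rbp != 0) {
        assert(rbp % 16 == 0);                                    // fact 2: %rbp is 16-byte aligned
        uintptr_t ret = *reinterpret_cast<uintptr_t*>(rbp + 8);   // fact 4: return address
        log_printf("  called from %p\n", reinterpret_cast<void*>(ret));
        rbp = *reinterpret_cast<uintptr_t*>(rbp);                 // fact 3: caller's %rbp
    }
}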

D. Console fun

The Chickadee kernel and processes display information on an emulated CGA console in 80x25 text mode. This device parses the physical memory starting at address 0xB8000 as a textual screen. The byte at 0xB8000 + 2*(80*ROW + COL) determines the character to show at ROW,COL; its encoding is code page 437, which is compatible with ASCII. The byte at 0xB8001 + 2*(80*ROW + COL) determines the color of that character. Chickadee provides both the kernel and processes with direct access to the console via the uint16_t console[] array.

Add code to the start of p-allocator.cc that uses console to glam up the allocator. Purple stars would use the code below; surely you can come up with something cooler.

sys_map_console(console);
for (int i = 0; i < CONSOLE_ROWS * CONSOLE_COLUMNS; ++i) {
    console[i] = '*' | 0x5000;   // low byte: the '*' character; high byte 0x50: the color attribute
}
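
As a smaller illustration of the addressing formula above, this single (hypothetical) line would draw one bright white ‘A’ at row 3, column 5:

console[3 * CONSOLE_COLUMNS + 5] = 'A' | 0x0F00;   // low byte: character; high byte 0x0F: bright white on black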

E. Memory usage

This part asks you to introduce a new system call, sys_getusage(usage* u), that fills out a struct usage with statistics about the machine’s current state. This gives you some practice with checking system call arguments for security as well as navigating user and kernel code.

struct usage is defined as follows in lib.hh:

struct usage {
    unsigned long time;
    size_t free_pages;
    size_t allocated_pages;
};

The call sys_getusage(u) should set u->time to the current value of ticks (the kernel’s notion of how much time has elapsed since boot); set u->free_pages to the number of allocatable pages that remain free; set u->allocated_pages to the number of allocatable pages that have been allocated so far; and return 0. The kernel must also validate that u is safe to access; if u points to invalid memory (it has improper alignment or isn’t writable by the calling process), then the kernel should return E_FAULT without making any other changes.
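
The validation step might take roughly the following shape. This is a sketch only: it is written as if inside the kernel’s system call handler, it assumes the first argument arrives in reg_rdi as for the other system calls, and the vmiter accessors it uses (writable() and user()) are assumptions to check against the documentation.

uintptr_t addr = regs->reg_rdi;                     // first system call argument
if (addr % alignof(usage) != 0) {
    return E_FAULT;                                 // improper alignment
}
// The struct might straddle a page boundary, so check every page it touches.
for (uintptr_t a = addr - addr % PAGESIZE; a < addr + sizeof(usage); a += PAGESIZE) {
    vmiter it(this, a);
    if (!it.writable() || !it.user()) {
        return E_FAULT;                             // not writable by the calling process
    }
}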

You will need:

  1. A SYSCALL_GETUSAGE constant (accessible to both processes and kernel).

  2. A sys_getusage(usage* u) wrapper function that executes a syscall instruction with the proper environment (for processes).

  3. A kernel implementation for SYSCALL_GETUSAGE, including some code that extracts memory statistics from k-alloc.cc.

If you’re confused about where to put your new code, try git grep. For instance, try git grep SYSCALL or git grep sys_.
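
For instance, the process-side wrapper (item 2 above) might look structurally like the existing wrappers. A sketch, assuming a make_syscall-style helper such as the existing sys_ wrappers use:

// Hypothetical process-side wrapper; model yours on the existing sys_* functions.
inline int sys_getusage(usage* u) {
    return make_syscall(SYSCALL_GETUSAGE, reinterpret_cast<uintptr_t>(u));
}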

Run make run-testgetusage to test your work. This command line builds the kernel to execute p-testgetusage.cc instead of p-allocator.cc. Look at p-testgetusage.cc to understand its function.

F. Fork

The current kernel’s SYSCALL_FORK implementation always returns E_NOSYS. Now you will make a real fork implementation. Before starting to work on fork, you should read about contexts! You should also read about how to debug Chickadee using gdb and other techniques.

To fork a process, the kernel must start a new context. That involves:

  1. Allocating a new PID.
  2. Allocating a struct proc and a page table.
  3. Copying the parent process’s user-accessible memory and mapping the copies into the new process’s page table.
  4. Initializing the new process’s registers to a copy of the old process’s registers.
  5. Storing the new process in the process table.
  6. Arranging for the new PID to be returned to the parent process and 0 to be returned to the child process.
  7. Enqueueing the new process on some CPU’s run queue.

If any of these steps fails (for example, because the system is out of PIDs or memory), fork should return an error to the parent. Chickadee doesn’t know how to free memory yet, so don’t worry about memory leaks.
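
As one illustration of steps 4 and 6, the register setup might look roughly like the following fragment. The member names regs_, reg_rax, and id_ are assumptions about struct proc and regstate (check kernel.hh and x86-64.h), and regs stands for the parent’s saved register state.

*child->regs_ = *regs;              // step 4: child's registers start as a copy of the parent's
child->regs_->reg_rax = 0;          // step 6: sys_fork() returns 0 in the child...
return child->id_;                  // ...and returns the new PID in the parent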

Things to note:

Once you have implemented fork correctly, your make run-allocator memory map should feature four processes with different allocation speeds, just as in WeensyOS.

Run make run-testforksimple to test sys_fork’s return values. You should see this output (assuming you allocate PIDs in sequential order):

testforksimple 1 [2] 1 [3] 1 succeeded!
testforksimple 1 [0] 2 [4] 2 succeeded!
testforksimple 1 [2] 1 [0] 3 succeeded!
testforksimple 1 [0] 2 [0] 4 succeeded!

G. Stack canaries

Chickadee collocates kernel stack data, such as local variables and stored registers, with other data. Each page holding a struct proc contains a kernel task stack at its upper end, and each page holding a struct cpustate contains a kernel CPU stack at its upper end. (See Contexts.) This means that if a stack grows too much—because of too-deep recursion, or large local variables—the stack will collide with important data and corrupt the kernel.

  1. Demonstrate this problem by adding a system call that corrupts a kernel structure if it is called. (The corruption doesn’t need to be silent; for instance, the kernel might fail an assertion or eventually reboot.) Modify p-nastyalloc.cc to call this system call (make run-nastyalloc will run it).

  2. Add canaries to the relevant kernel structures to detect this problem and turn it into an assertion failure. A canary is a piece of data with a known value, placed in a location vulnerable to inadvertent corruption. To check the canary, you compare its current value with the expected value; any difference indicates disaster. Add code to set the canaries in relevant kernel structures and code to check them. For instance, you might check canaries before system calls return, or you might introduce a kernel task that periodically checks all system canaries. (A minimal sketch of the idea appears after this list.)

    Your canary code should detect the problem induced by p-nastyalloc. It need not detect every possible problem, though (that would be impossible).

  3. You can use GCC options to detect some too-large stacks statically. Check out -Wstack-usage. Does -Wstack-usage detect the problem you added in step 1? (Optional advanced work: Could -fstack-usage detect it?)
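
For step 2, here is a minimal sketch of the canary idea. The constant value, member name, and check location are all illustrative, not required:

// A known value stored in each vulnerable structure, set at construction time.
static constexpr uint64_t kernel_canary = 0xC0FFEE01FEEDFACEUL;

// In the vulnerable structure (for example, struct proc in kernel.hh), add a member:
//     uint64_t canary_ = kernel_canary;
// Then check it at convenient points, such as system call entry:
inline void check_canary(const uint64_t* canary) {
    assert(*canary == kernel_canary);   // fails if a stack grew into the structure
}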

H. Kernel buddy allocator

The existing Chickadee allocator can only allocate memory in units of single pages, and it cannot free memory. That’s bad!

Freeing memory is not too hard to support; a singly-linked free list of pages could work. But we seek an efficient allocator for variable-sized allocations. For instance, Chickadee kernel stacks are currently limited to a couple thousand bytes because struct procs are allocated in units of single pages: a buddy allocator would let us support deeper stacks. And some later hardware structures, such as the structure for disk communication, require larger amounts of contiguous memory.

In this part, you will therefore implement a buddy allocator for kernel memory. Complete the kalloc, kfree, and init_kalloc functions in k-alloc.cc.

kalloc and kfree are the malloc and free equivalents: kalloc(sz) should return a pointer to at least sz contiguous bytes of memory, or nullptr on failure, and kfree(ptr) should free memory previously returned by kalloc. init_kalloc() should perform any initialization necessary so that the allocator has access to all available physical memory (i.e., physical memory with type mem_available in physical_ranges).

Since available physical memory is fragmented into several disjoint ranges, your buddy allocator’s initial state will contain multiple free blocks of different orders.

Also add testing infrastructure for your buddy allocator. Running make run-testkalloc should execute some allocator tests in the kernel. This will likely require adding a new system call as well as the tests themselves.

Testing is super important! Do not wait to implement tests—you should develop the tests in parallel with the allocator. See below for testing hints.

Buddy allocation

The buddy allocation algorithm is one of the oldest general algorithms for memory allocation. (It was invented in 1963!) Buddy allocation supports efficient splitting (breaking large contiguous blocks of free memory into smaller pieces) and coalescing (merging adjacent free blocks into larger contiguous blocks).

Buddy allocation is based on powers-of-2-sized blocks. Every allocatable block has an integer order \(o\). A block of order \(o\) always has size \(2^o\) and address evenly divisible by \(2^o\).

To service an \(s\)-byte allocation, the allocator finds a free block of order \(o\) so that \(2^o \geq s\). For example, a request of size 4095 would use a block of order 12 or more (since \(2^{12} = 4096 \geq 4095\)), but a request of size 4097 would use a block of order 13 or more (since \(2^{12} < 4097 \leq 2^{13}\)). Ideally the allocator will choose the smallest order that fits the requested size, but many implementations (including yours) will enforce a minimum order for all requests.
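
In code, choosing the order for a request might look like this sketch (min_order and max_order are the allocator’s configured bounds, described under “Designing your allocator” below):

// Sketch: the smallest order o with 2^o >= sz, never below the minimum order.
inline int order_for_size(size_t sz) {
    int o = min_order;
    while ((size_t(1) << o) < sz) {
        ++o;                          // callers should fail any request needing o > max_order
    }
    return o;
}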

Allocation splits large blocks until there’s a free block with the right size.

  1. Let \(o\) be the desired order for the allocation.
  2. Find a free block with order \(x \geq o\), minimizing \(x\).
  3. If no block is found, then the allocation fails. Return nullptr.
  4. Otherwise, if \(x = o\), then the allocation succeeds. Mark the free block as allocated and return it.
  5. Otherwise, \(x > o\). Split the free block in half, creating two free blocks with order \(x-1\). These two blocks are called order-\((x-1)\) buddies because they are adjacent blocks with order \(x-1\), and the address of one is easily calculated from the address of the other. Then repeat from step 2.

Freeing coalesces freed buddies whenever possible. This can efficiently recreate large contiguous free blocks.

  1. Let \(o\) be the order of the free block.
  2. Check whether the free block’s order-\(o\) buddy is also completely free. If it is:
    1. Coalesce the free block and its buddy into a single free block of order \(o + 1\).
    2. Repeat from step 1 with this new free block.

Buddy allocators have low external fragmentation (unusable space between allocated blocks) because they aggressively coalesce freed memory into larger blocks. Their internal fragmentation (unused space inside allocated blocks) may be up to 50% (more, if there is a minimum order). Many real-world allocators use buddy allocation, often coupled with other techniques (e.g., for efficient multicore support); for instance, both Linux and jemalloc use buddy allocation.

Designing your allocator

Your buddy allocator will use two constant parameters, min_order and max_order, which are the minimum and maximum orders the allocator supports. Its min_order should be 12. This means every successful allocation will use at least a page of memory—even kalloc(1) should allocate a page of memory! Its max_order should be at least 21, allowing it to support contiguous allocations of up to 2MiB.

Your implementation will be simplest if your blocks are aligned at 0, meaning that the physical address of an order-\(o\) block is always a multiple of \(2^o\). If blocks are aligned at 0, then a block starting at physical address 0x1000 can only have order 12 (because \(\texttt{0x1000} = 2^{12}\) is not an even multiple of any higher power of 2). That said, other designs are possible (some students have a different alignment for each mem_available range).
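
With blocks aligned at 0, the “easily calculated” buddy address from the allocation algorithm is a single bit flip; a minimal sketch:

// For 0-aligned blocks, an order-o block and its buddy differ only in address bit o.
inline uintptr_t buddy_address(uintptr_t block_pa, int order) {
    return block_pa ^ (uintptr_t(1) << order);
}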

Your allocator will maintain at least the following state:

  1. A pages array (or similar) with one entry per physical page or per min_order block, recording information such as the order of the block containing that page and whether that block is currently free.
  2. Free lists, one per order, linking together the free blocks of that order.

The pages array lets you quickly check whether a buddy is free. The free lists let you quickly find a free block of a given order. Both these operations are O(1), and there are a constant number of orders, so both kalloc and kfree should take O(1) time.

You own the definition of the page structure and the pages array (and you can call the array something else if you want). You will add more information to the array later.
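
As one sketch of what that state might look like (every name here is illustrative, and the array bound MAX_PHYSICAL_PAGES is a hypothetical constant that you might instead derive from physical_ranges.limit()):

// Hypothetical allocator state, using the k-list.hh list type mentioned in the hints below.
struct pageinfo {
    list_links link_;                 // links this block into a free list while it is free
    int order = min_order;            // order of the block containing this page
    bool free = false;                // whether that block is currently free
};

static pageinfo pages[MAX_PHYSICAL_PAGES];                            // one entry per page
static list<pageinfo, &pageinfo::link_> free_lists[max_order + 1];    // one list per order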

Allocation and sanitizers

Integrate your buddy allocator with Chickadee’s sanitizer support. This will help the sanitizers warn on bugs like double-free and wild-write errors.

Calls to the asan_mark_memory function do nothing unless sanitization is enabled.

Testing

Buddy allocation has tricky corner cases, and a testing strategy will be important for success. We expect you to write tests for your buddy allocator as assertions in kalloc and kfree, and/or in the test_kalloc function. Test failures should appear as assertion failures that panic the kernel.

A successful testing strategy has two parts: make bugs detectable, and make them frequent. Your code should make bugs easy to detect (it works best if you design your code with testing in mind), and your tests should be so comprehensive that even rare bugs are likely to occur during testing.

The classic way to make bugs detectable involves invariants and assertions. An invariant is a property that should always hold of correct code; an assertion is an executable statement that checks an invariant. Buddy allocators have many useful invariants; for instance, if an order-o block is free, then either o == max_order or the block’s order-o buddy is wholly or partially allocated. Assertions of this invariant in the kalloc code can potentially detect many bugs. Those assertions can work locally, a bit at a time, or globally, considering all available data. For instance, a local assertion about free blocks might run once, in kfree, to check that the freed block meets the invariant after block coalescing; a global assertion about free blocks might check every free block in the system by searching free lists. Local invariants are typically more efficient to check, and the best ones are so efficient that there’s no harm leaving them on in production builds. Global invariants may be many times more expensive, but they typically catch more bugs.

A good way to make bugs frequent is to use random testing, such as a long sequence of random allocations and frees. Random testing requires art, however: the tester must not do anything illegal (such as a double-free), and naive random testing can “get stuck” in conceptually-uninteresting states without exploring corner cases.
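
One hypothetical shape for such a test appears below. Tracking outstanding allocations in a slot array keeps the test legal (no double-frees); the iteration count and size range are arbitrary, and the rand() helper from lib.hh is assumed.

// Hypothetical random test, e.g. called from test_kalloc.
void test_kalloc_random() {
    constexpr int nslots = 64;
    void* slots[nslots] = {};
    for (int step = 0; step < 100000; ++step) {
        int i = rand() % nslots;
        if (slots[i]) {
            kfree(slots[i]);                              // free only pointers we allocated
            slots[i] = nullptr;
        } else {
            size_t sz = size_t(1) << (12 + rand() % 4);   // random sizes from 4KiB to 32KiB
            slots[i] = kalloc(sz);                        // may return nullptr under memory pressure
        }
    }
    for (int i = 0; i != nslots; ++i) {
        if (slots[i]) {
            kfree(slots[i]);                              // clean up so the test can run again
        }
    }
}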

You may be interested in several testing- and assertion-related blog posts by John Regehr of the University of Utah.

Hints

There are some useful math functions in lib.hh; they are documented there, and several of them will come in handy for buddy allocation.

There’s a generic doubly linked list type in k-list.hh. This would be useful for building free lists; or you can build your own.

Make sure you add assertions to your code. You’ll probably also want some helper functions, such as a function that returns a block’s order-o buddy, and a function that confirms that a physical address is valid as an order-o block.

Once you finish implementing and testing your buddy allocator, be sure to modify your fork() implementation so that it properly deallocates memory during failure cases! In other words, if fork() cannot succeed for some reason, it must deallocate any memory that it allocated before it returns the appropriate error code to the calling process.

Don’t forget to make sure that make run-testgetusage still works.

Our kalloc tests run in the kernel and are triggered by a single SYSCALL_TEST_KALLOC system call. This has some advantages, including that the tests can check internal state that should not be accessible to user-level processes. However, some students may want to add other kinds of test infrastructure, such as system calls for allocating and freeing chunks of memory larger than a page. If you do, please be careful of TLB invalidation.

Slab allocation (advanced optional work)

This style of buddy allocator is great for large objects, but because it relies on a pages array with information about each min_order block, it doesn’t scale to small allocation sizes (e.g., 64-byte allocations). For smaller allocations, most systems switch to another allocation strategy, such as slab allocation. Design and implement such an allocator.

Turnin

Fill out psets/pset1answers.md and psets/pset1collab.md and push to GitHub. Then configure our grading server to recognize your code.