Debugging Chickadee – CS 161 doc

High-level tips

When you encounter faulty behavior in your Chickadee kernel, your first hypothesis should not be that the handout code is wrong, or that heavily-used software like gcc is wrong. Yes, very occasionally, the handout code has bugs, and (even more very occasionally) the compilers or virtualization tools have bugs. However, in the overwhelmingly vast majority of situations, misbehavior in a student's Chickadee kernel is caused by buggy code that the student introduced.

A key aspect of bug finding is creating a minimal reproducible test case.

A test case is "minimal" if it specifies the smallest set of of inputs that cause a bug to manifest. Minimality is important because it reduces the possibilities that you must consider as you debug the program. For example:
- Suppose that when you run p-allocator, Chickadee crashes. What happens if you run make NCPU=1 run, such that instead of running Chickadee atop an emulated dual-core machine (the default), Chickadee runs atop an emulated single-core machine? If a bug reproduces in a single-core environment, debug in that environment first! Your debugging job will be simpler because you won't have to worry about concurrent kernel activity on multiple cores. Once you fix the bug on a single-core, try running on 2 or 4 cores to make sure that the bug has really been fixed.
- A test program like p-allocator will typically invoke several different kinds of system calls. For example, p-allocator invokes sys_getpid(), sys_fork(), and sys_yield(). Can you comment out some of those system call invocations while still triggering the kernel bug? If so, debug in that environment first! You'll have fewer kernel behaviors to examine as you hunt for the source of the bug.
A test case is "reproducible" if its inputs always cause a bug to manifest. Reproducible test cases are helpful because they allow you to iteratively (1) create a hypothesis about a bug's cause, (2) add debugging instrumentation to test the hypothesis, and then (3) execute the instrumented test to validate or refute the hypothesis. Some kernel bugs may not be perfectly reproducible. For example, concurrency bugs may exhibit themselves nondeterministically, because minor timing differences between different threads may affect whether race conditions actually lead to inappropriate accesses to program state. [This is one reason why debugging with NCPU = 1 can be so helpful---doing so reduces true concurrency and can often make bugs more deterministic.]

Keep in mind that, as you add new functionality to Chickadee, older features may stop working due to bugs introduced by recent code changes. So, as you add support for new test programs, make sure that you revisit older test programs to ensure that they still work!

Interpreting page faults

The handout Chickadee code does not implement the swapping of virtual memory pages between RAM and a storage device. Thus, Chickadee expects that every valid page in a virtual address space will be present in RAM. If such a page is not in RAM, or if the page is in RAM but has the wrong permissions, the hardware will generate a page fault and the Chickadee kernel will print an error message. Pay close attention to the error message, because it will often give you important hints about the associated kernel bug!

For example, suppose that your implementation of fork() is incorrectly updating the page table of a newly-created process, such that a virtual address v that should be valid is not actually covered by a mapping in the process's page table. If this happens, then user-level code will generate a page fault upon attempting to read or write the memory at v. To see what the associated page fault would look like, we can add the following line of code to the very beginning of process_main in p-allocator:

    void process_main() {
       *(reinterpret_cast<int*>(0xFFFFFFF)) = 42;
       //...rest of code...

That line forcibly simulates a write to a virtual address that won't be mapped into the process's address space. If we do make run, Chickadee will generate the following error message:

    Process 1 page fault for 0xfffffff (write missing page, rip=0x10000b)!

The message is telling us that the instruction at virtual address 0x10000b generated the page fault. In particular, a write to that address failed because no mapped page is associated with the address. Great---but how do we know which instruction is to blame? Well, the virtual address from the error message is located in low canonical memory, so we know that the triggering instruction is in userspace. In other words, the problematic instruction is somewhere in the codebase for p-allocator . . . but which instruction is it? The answer can be found in the obj/ directory that is created during the Chickadee build process. That directory contains .asm files which associate each line of compiled C++ code with the associated assembly instructions (and the virtual addresses of those instructions). Looking at obj/p-allocator.asm, we see that the instruction at 0x10000b is a movl instruction that tries to write the memory address 0x10000b. Finding the triggering instruction allows you to learn the associated C++-level source line; in turn, knowing that line can help you form hypotheses about which piece of kernel functionality is buggy.

Tracing execution paths and viewing state: Three techniques

Debugging a kernel can be challenging. Here are some tricks for identifying the source of a problem:

Logging output: The most straightforward approach is to add log_printf() statements to the kernel, or console_printf() statements to user-level code, to print state and see how it evolves.
Infinite loops: Unfortunately, debug logs might not work in all situations. For example, your kernel might be so messed up during early boot that the kernel cannot properly initialize the console used by log_printf(). In this scenario, you might at least be able to see whether the kernel is reaching a certain point by intentionally inserting an infinite loop into your kernel code; if your kernel hangs, the line of code representing the infinite loop was reached, but if your kernel crashes, the line was not reached. [Keep in mind that, if Chickadee is running on a multicore machine (i.e., if NCPU is greater than 1), the fact that Chickadee is hanging on one core may not force Chickadee to hang on the other cores!]
gdb: Some kinds of bugs may elude both debug logs and infinite loops. For example, deadlocks and other synchronization errors may disappear in the presence of a kernel that outputs a lot of logging statements and/or contains infinite loops---voluminous logging and infinite loops both change thread execution times, which in turn may change how concurrently-executing threads interleave. Perturbing the interleavings may result in the appearance that the concurrency bug has disappeared. [Of course, the bug is still there; the manifestation has just gone away.] In these situations, debugging Chickadee with gdb might be helpful. For example, suppose that you want to debug Chickadee as it runs p-testzombie.cc. From a terminal window in your Linux VM, do make run-gdb-testzombie; this will compile Chickadee and launch qemu as well as gdb. The qemu window will initially be paused. In your terminal window, you'll see the gdb command line. From that command line, you can set breakpoints in both Chickadee code and user-level code like p-testzombie.cc. For example, you can set a breakpoint in p-testzombie.cc by typing break p-testzombie.cc:SOME_LINE_NUMBER into the gdb command line. Then, if you type continue, this will tell gdb to allow qemu to proceed with executing Chickadee. Later, when one of your breakpoints is hit, control will return to gdb. At this point, you can then type things like print SOME_VAR_NAME to see the value for a C++-level variable. You can also do info registers to print all of the register values. If you only want to print one register value, you can do something like print $rsp to just print %rsp.

gdb is very powerful. A variety of tutorials on the Internet discuss how to use gdb. For example, see this page for an overview of the most useful gdb commands.

Debugging memory corruption errors

Sometimes a page fault or an assertion failure will occur because a piece of in-memory kernel state has an incorrect value. Many times, the source of the incorrect value is a straightforward logical error in your kernel code: your code incorrectly maintained a reference count, or accidentally used the pid of a parent process instead of the pid of a child. Alas, some types of state corruptions are more insidious, and are caused by memory corruption bugs. Memory corruption bugs involve erroneous uses of pointers, or incorrect logic in memory management routines like kalloc() and kfree(). Memory corruption bugs can be frustrating to hunt down, because the code which uses the corrupted state may be far away from the code which actually corrupted the state. Here are some common sources of memory errors:

Stack overflows: In Chickadee, each kernel stack is allocated at the top of an "over-allocated" struct proc. The struct proc lives at the bottom of a 4 KB page, with the associated kernel stack living at the top of that page. The consequence is that the kernel stack has less than 4 KB of space to grow! If the stack grows bigger, e.g., due to too many recursive function calls, or due to functions declaring large local variables, then the struct proc (and things beneath the struct proc in adjacent memory pages) may get trampled. This is why pset1 asks you to implement stack canaries. However, note that some kinds of stack corruptions will not be detected by canaries. For example, suppose that a kernel function declares a very large local array, and then only writes to the first element (i.e., the element lowest in memory). That write may "skip over" the stack canary, corrupting memory without corrupting the stack canary!
Incorrect memory copies: A kernel performs a lot of memory copying. For example, during fork(), a kernel has to copy data from parent pages to child pages. As another example, when performing disk IO, the kernel must copy data between kernel buffers and user-level buffers. If a kernel passes incorrect pointers or buffer lengths to memcpy(), or otherwise uses the wrong memory as the source or destination for a copy, the result will be memory corruption!
Errors in memory-management routines: Bugs involving kalloc() and kfree() are incredibly important to eliminate, since those functions are used throughout the kernel. So, in pset1, it is critical that you thoroughly test your allocation code. Chaos will arise if, for example, your kalloc() may sometimes allow "multiple allocations" in which the same physical page is returned by multiple calls to kalloc() even though the page hasn't been kfree()'d inbetween. Such an allocator will likely cause memory corruption errors as multiple processes read and write the same multiply-allocated page. [Launching Chickadee with sanitzers enabled (e.g., make SAN=1 run-testzombie) can help you find some kinds of allocator problems. The sanitizers (1) track which memory is allocated by kalloc() and deallocated by kfree(), and (2) instrument memory reads and memory writes to ensure that those operations target allocated pages. Note that sanitizers will slow your kernel down---the kernel makes a lot of memory references, and each one will trigger the execution of instrumentation code!]
Inappropriate synchronization: If a piece of kernel state is read and written by concurrent threads of execution, but that state is not protected by a lock or another mechanism for ensuring atomicity, the contested state may be corrupted in nondeterminstic ways. If you suspect that your code suffers from synchronization bugs, always try debugging with NCPU = 1 first, to remove true concurrency from the system. Once bugs are stomped out in the single-core case, you can increase concurrency up to 2 and then 4.

Debugging compilation errors and warnings

Chickadee is written in C++, a powerful but complicated language that gives students a lot of opportunities to shoot themselves in the foot. Treat all compiler warnings as if they were errors. Doing so will force you to think more deeply about your code, and will often allow you to fix potential bugs early. Another tip is to add new code in small increments, examining each increment for logical correctness before recompiling, and then recompiling to fix any compiler complaints. A very common (and very bad!) coding methodology often used by students is the following:

drink some coffee and quickly write dozens of lines of new C++ code, without thinking about the overall design goals of the new code;
do not pause to look at the new code and consider whether it satisfies your design goals or has obvious syntactic errors;
recompile Chickadee and be confronted by a blizzard of compiler errors and warnings;
wrestle with the compiler and finally eliminate the compiler errors (but not all the warnings);
run a test program and discover that the barely-compiling kernel is still riddled with semantic bugs.

Don't use this approach! Instead, think about the goals of your new code before you actually write the code. Then, iteratively add a few lines of code, analyze whether the new code is correct using mind power, and recompile, eliminating any compiler errors and warnings that arise. This methodology may seem slow, but it is absolutely guaranteed to save you time and frustration in the long run, particularly if you are new to C++ or to systems-level programming.