High-level tips
When you encounter faulty behavior in your Chickadee kernel, your
first hypothesis should not be that the handout code is wrong, or
that heavily-used software like gcc
is wrong. Yes, very occasionally,
the handout code has bugs, and (even more very occasionally) the
compilers or virtualization tools have bugs. However, in the overwhelmingly
vast majority of situations, misbehavior in a student's Chickadee
kernel is caused by buggy code that the student introduced.
A key aspect of bug finding is creating a minimal reproducible test case.
- A test case is "minimal" if it specifies the smallest set of
of inputs that cause a bug to manifest. Minimality is important
because it reduces the possibilities that you must consider
as you debug the program. For example:
- Suppose that when you run
p-allocator
, Chickadee crashes. What happens if you runmake NCPU=1 run
, such that instead of running Chickadee atop an emulated dual-core machine (the default), Chickadee runs atop an emulated single-core machine? If a bug reproduces in a single-core environment, debug in that environment first! Your debugging job will be simpler because you won't have to worry about concurrent kernel activity on multiple cores. Once you fix the bug on a single-core, try running on2
or4
cores to make sure that the bug has really been fixed. - A test program like
p-allocator
will typically invoke several different kinds of system calls. For example,p-allocator
invokessys_getpid()
,sys_fork()
, andsys_yield()
. Can you comment out some of those system call invocations while still triggering the kernel bug? If so, debug in that environment first! You'll have fewer kernel behaviors to examine as you hunt for the source of the bug.
- Suppose that when you run
- A test case is "reproducible" if its inputs always cause a
bug to manifest. Reproducible test cases are helpful because
they allow you to iteratively (1) create a hypothesis about
a bug's cause, (2) add debugging instrumentation to test the
hypothesis, and then (3) execute the instrumented test to
validate or refute the hypothesis. Some kernel bugs may not
be perfectly reproducible. For example, concurrency bugs may
exhibit themselves nondeterministically, because minor
timing differences between different threads may affect whether
race conditions actually lead to inappropriate accesses to
program state. [This is one reason why debugging with
NCPU = 1
can be so helpful---doing so reduces true concurrency and can often make bugs more deterministic.]
Keep in mind that, as you add new functionality to Chickadee, older features may stop working due to bugs introduced by recent code changes. So, as you add support for new test programs, make sure that you revisit older test programs to ensure that they still work!
Interpreting page faults
The handout Chickadee code does not implement the swapping of virtual memory pages between RAM and a storage device. Thus, Chickadee expects that every valid page in a virtual address space will be present in RAM. If such a page is not in RAM, or if the page is in RAM but has the wrong permissions, the hardware will generate a page fault and the Chickadee kernel will print an error message. Pay close attention to the error message, because it will often give you important hints about the associated kernel bug!
For example, suppose that your implementation of fork()
is
incorrectly updating the page table of a newly-created process,
such that a virtual address v
that should be valid is
not actually covered by a mapping in the process's page table.
If this happens, then user-level code will generate a page
fault upon attempting to read or write the memory at v
. To
see what the associated page fault would look like, we can
add the following line of code to the very beginning of
process_main
in p-allocator
:
void process_main() {
*(reinterpret_cast<int*>(0xFFFFFFF)) = 42;
//...rest of code...
That line forcibly simulates a write to a virtual address that
won't be mapped into the process's address space. If we do
make run
, Chickadee will generate the following error message:
Process 1 page fault for 0xfffffff (write missing page, rip=0x10000b)!
The message is telling us that the instruction at virtual address
0x10000b
generated the page fault. In particular, a write to that
address failed because no mapped page is associated with the address.
Great---but how do we know which instruction is to blame? Well, the
virtual address from the error message is located in low canonical
memory, so we know that the triggering instruction is in userspace.
In other words, the problematic instruction is somewhere in the
codebase for p-allocator
. . . but which instruction is it? The answer
can be found in the obj/
directory that is created during the
Chickadee build process. That directory contains .asm
files which
associate each line of compiled C++ code with the associated assembly
instructions (and the virtual addresses of those instructions).
Looking at obj/p-allocator.asm
, we see that the instruction at
0x10000b
is a movl
instruction that tries to write the memory
address 0x10000b
. Finding the triggering instruction allows you
to learn the associated C++-level source line; in turn, knowing that
line can help you form hypotheses about which piece of kernel
functionality is buggy.
Tracing execution paths and viewing state: Three techniques
Debugging a kernel can be challenging. Here are some tricks for identifying the source of a problem:
-
Logging output: The most straightforward approach is to add
log_printf()
statements to the kernel, orconsole_printf()
statements to user-level code, to print state and see how it evolves. -
Infinite loops: Unfortunately, debug logs might not work in all situations. For example, your kernel might be so messed up during early boot that the kernel cannot properly initialize the console used by
log_printf()
. In this scenario, you might at least be able to see whether the kernel is reaching a certain point by intentionally inserting an infinite loop into your kernel code; if your kernel hangs, the line of code representing the infinite loop was reached, but if your kernel crashes, the line was not reached. [Keep in mind that, if Chickadee is running on a multicore machine (i.e., ifNCPU
is greater than1
), the fact that Chickadee is hanging on one core may not force Chickadee to hang on the other cores!] -
gdb: Some kinds of bugs may elude both debug logs and infinite loops. For example, deadlocks and other synchronization errors may disappear in the presence of a kernel that outputs a lot of logging statements and/or contains infinite loops---voluminous logging and infinite loops both change thread execution times, which in turn may change how concurrently-executing threads interleave. Perturbing the interleavings may result in the appearance that the concurrency bug has disappeared. [Of course, the bug is still there; the manifestation has just gone away.] In these situations, debugging Chickadee with
gdb
might be helpful. For example, suppose that you want to debug Chickadee as it runsp-testzombie.cc
. From a terminal window in your Linux VM, domake run-gdb-testzombie
; this will compile Chickadee and launchqemu
as well asgdb
. Theqemu
window will initially be paused. In your terminal window, you'll see thegdb
command line. From that command line, you can set breakpoints in both Chickadee code and user-level code likep-testzombie.cc
. For example, you can set a breakpoint inp-testzombie.cc
by typingbreak p-testzombie.cc:SOME_LINE_NUMBER
into thegdb
command line. Then, if you typecontinue
, this will tellgdb
to allowqemu
to proceed with executing Chickadee. Later, when one of your breakpoints is hit, control will return togdb
. At this point, you can then type things likeprint SOME_VAR_NAME
to see the value for a C++-level variable. You can also doinfo registers
to print all of the register values. If you only want to print one register value, you can do something likeprint $rsp
to justprint %rsp
.
gdb
is very powerful. A variety of tutorials on the Internet discuss
how to use gdb
. For example, see this page for an
overview of the most useful gdb
commands.
Debugging memory corruption errors
Sometimes a page fault or an assertion failure will occur because
a piece of in-memory kernel state has an incorrect value. Many times,
the source of the incorrect value is a straightforward logical
error in your kernel code: your code incorrectly maintained a
reference count, or accidentally used the pid
of a parent process
instead of the pid
of a child. Alas, some types of state corruptions
are more insidious, and are caused by memory corruption bugs. Memory
corruption bugs involve erroneous uses of pointers, or incorrect
logic in memory management routines like kalloc()
and kfree()
.
Memory corruption bugs can be frustrating to hunt down, because the
code which uses the corrupted state may be far away from the code
which actually corrupted the state. Here are some common sources
of memory errors:
- Stack overflows: In Chickadee, each kernel stack is allocated
at the top of an "over-allocated"
struct proc
. Thestruct proc
lives at the bottom of a 4 KB page, with the associated kernel stack living at the top of that page. The consequence is that the kernel stack has less than 4 KB of space to grow! If the stack grows bigger, e.g., due to too many recursive function calls, or due to functions declaring large local variables, then thestruct proc
(and things beneath thestruct proc
in adjacent memory pages) may get trampled. This is why pset1 asks you to implement stack canaries. However, note that some kinds of stack corruptions will not be detected by canaries. For example, suppose that a kernel function declares a very large local array, and then only writes to the first element (i.e., the element lowest in memory). That write may "skip over" the stack canary, corrupting memory without corrupting the stack canary! - Incorrect memory copies: A kernel performs a lot of memory
copying. For example, during
fork()
, a kernel has to copy data from parent pages to child pages. As another example, when performing disk IO, the kernel must copy data between kernel buffers and user-level buffers. If a kernel passes incorrect pointers or buffer lengths tomemcpy()
, or otherwise uses the wrong memory as the source or destination for a copy, the result will be memory corruption! - Errors in memory-management routines: Bugs involving
kalloc()
andkfree()
are incredibly important to eliminate, since those functions are used throughout the kernel. So, in pset1, it is critical that you thoroughly test your allocation code. Chaos will arise if, for example, yourkalloc()
may sometimes allow "multiple allocations" in which the same physical page is returned by multiple calls tokalloc()
even though the page hasn't beenkfree()
'd inbetween. Such an allocator will likely cause memory corruption errors as multiple processes read and write the same multiply-allocated page. [Launching Chickadee with sanitzers enabled (e.g.,make SAN=1 run-testzombie
) can help you find some kinds of allocator problems. The sanitizers (1) track which memory is allocated bykalloc()
and deallocated bykfree()
, and (2) instrument memory reads and memory writes to ensure that those operations target allocated pages. Note that sanitizers will slow your kernel down---the kernel makes a lot of memory references, and each one will trigger the execution of instrumentation code!] - Inappropriate synchronization: If a piece of kernel state is
read and written by concurrent threads of execution, but that state
is not protected by a lock or another mechanism for ensuring
atomicity, the contested state may be corrupted in nondeterminstic
ways. If you suspect that your code suffers from synchronization
bugs, always try debugging with
NCPU = 1
first, to remove true concurrency from the system. Once bugs are stomped out in the single-core case, you can increase concurrency up to2
and then4
.
Debugging compilation errors and warnings
Chickadee is written in C++, a powerful but complicated language that gives students a lot of opportunities to shoot themselves in the foot. Treat all compiler warnings as if they were errors. Doing so will force you to think more deeply about your code, and will often allow you to fix potential bugs early. Another tip is to add new code in small increments, examining each increment for logical correctness before recompiling, and then recompiling to fix any compiler complaints. A very common (and very bad!) coding methodology often used by students is the following:
- drink some coffee and quickly write dozens of lines of new C++ code, without thinking about the overall design goals of the new code;
- do not pause to look at the new code and consider whether it satisfies your design goals or has obvious syntactic errors;
- recompile Chickadee and be confronted by a blizzard of compiler errors and warnings;
- wrestle with the compiler and finally eliminate the compiler errors (but not all the warnings);
- run a test program and discover that the barely-compiling kernel is still riddled with semantic bugs.
Don't use this approach! Instead, think about the goals of your new code before you actually write the code. Then, iteratively add a few lines of code, analyze whether the new code is correct using mind power, and recompile, eliminating any compiler errors and warnings that arise. This methodology may seem slow, but it is absolutely guaranteed to save you time and frustration in the long run, particularly if you are new to C++ or to systems-level programming.