Problem set 4 – CS 161

Handout

Our handout code offers:

A file system definition (chickadeefs.hh, described here and here).
A class ahcistate (in k-ahci.hh) that accesses disks that speak the modern SATA standard over Intel’s AHCI. An important aspect of this standard is NCQ support, which allows the operating system to post multiple disk requests in parallel, and allows the disk to service those requests in the fastest order it can.
A class bufcache (in k-chkfs.hh and k-chkfs.cc) that implements a buffer cache.
A class chkfsstate (in k-chkfs.hh and k-chkfs.cc) that accesses the sata_disk’s Chickadee file system via the buffer cache.
A class chkfs_fileiter (in k-chkfsiter.hh) that iterates over the blocks in a ChickadeeFS file (somewhat like how vmiter iterates over the pages in an x86-64 page table).
A class chkfs::journalreplayer that knows how to replay journal records.
An implementation of a system call sys_readdiskfile that uses bufcache, chkfsstate, and chkfs_fileiter to read disk files.
Programs p-readdiskfile.cc and p-wcdiskfile.cc that use sys_readdiskfile to examine disk files.

You'll see that init_hardware already initializes the local SATA disk:

    // initialize SATA drive
    sata_disk = ahcistate::find();
    if (sata_disk && sata_disk->irq_ > 0) {
        cpus[ncpu - 1].enable_irq(sata_disk->irq_);
    }

Try running make run-readdiskfile! You will see that reading disk files is painfully slow. This is because Chickadee’s buffer cache and disk programming interfaces are woefully underpowered. In this problem set, you will fix the buffer cache and ahcistate; hook the file system up to your VFS from problem set 3; and add write support.

A. Buffer cache

In this part, you’ll fix the buffer cache so it can speed up read requests. The current buffer cache immediately evicts all unreferenced blocks. Your buffer cache should not evict immediately, and it should also contain blocks that will likely be used in the future.

To implement a good buffer cache, you will need an eviction policy, which decides which (unreferenced) block to evict when the cache is full but a new block should be added. LRU (least recently used) is a good first choice, but there are other good choices, such as policies that treat different kinds of blocks differently. For example, it’s probably a good idea to always keep the superblock in memory—or to introduce a separate variable, such as a chkfsstate member variable, that holds the superblock’s data. And it might be a good idea to treat inode blocks differently from data blocks.

You will also want to raise the number of entries in the buffer cache, although your buffer cache should work for any number of entries greater than or equal to 10.

Your buffer cache must obey the invariant that only unreferenced blocks can be evicted. That is, any block with bcentry::ref_ > 0 must not be evicted. It must also obey the buffer cache synchronization invariants. You may change the synchronization invariants if you’d like, but make sure to avoid undefined behavior.

You will also want a prefetching policy, which decides which blocks to read in advance of their being needed.

Before working too much on the code, you should specify your policies in a short text document and discuss your policies with a TF. Among other things, your write-up should discuss how you will synchronize between prefetching activity and eviction activity.

Test your work first with make run-readdiskfile, which reads a very short file one byte at a time, and with make run-wcdiskfile, which reads longer files 125 bytes at a time. You can also run the readdiskfile and wcdiskfile programs with arguments after make run-sh.

Prefetching hints

Do eviction first. You must do eviction before progressing to Part B, but you can delay prefetching until much later—in fact, you can delay it until the rest of the problem set is done. Prefetching will require more code surgery than eviction.

You will need to change the ahcistate::read_or_write function to use more than just command slot 0. You also don’t want a process to block just because its kernel task is prefetching data! You could implement a separate kernel task that just does prefetching. But there are other possibilities. Consider, for example, adding a int fetch_status_ to bcentry, and adding a new function to ahcistate a signature something like this:

// returns true iff command was queued
bool read_or_write_nonblocking(idecommand cmd, void* buf, size_t sz, size_t off,
                               volatile int& fetch_status, /* maybe more arugments */);

You can change the existing read_or_write function to call read_or_write_nonblocking.

The simplest Chickadee prefetcher will fetch at most 32 blocks at a time, one per NCQ slot on the disk. Most real operating systems instead implement a software I/O queue capable of storing many hundreds of I/O requests. As soon as the disk satisfies one request, the interrupt handler pops another I/O request off the I/O queue and adds it to the disk hardware queue. Various algorithms are used to schedule these software requests, subject to requirements such as fairness (one process’s prefetch requests should not starve another process’s requests) and priority (write requests might be more important than read requests). Think about how you would implement such a queue in Chickadee; and if you’re feeling ambitious, implement one!

B. VFS integration and reads

In this part, you’ll hook up the Chickadee file system to your VFS for reads. This takes what the VFS you developed in the last problem set and applies it to a real on-disk file system.

Background: Chickadeefs in-memory inodes

Chickadeefs inodes have a built-in synchronization plan that you should understand. This plan involves the buffer cache and the inode field mlock (which is only meaningful in memory).

The chkfsstate::get_inode and put_inode functions manipulate buffer cache reference counts to ensure that the returned inode remains in memory. That means that callers must call put_inode to “free” an inode when done with it.

The mlock field is a read/write lock for the inode’s contents. The lock can be held for either reading or writing. While the read lock is held, the inode’s contents cannot change; the write lock is required to modify the corresponding file (e.g., to add blocks to it or modify its contents). mlock should not be held across system calls: it should be acquired immediately before a kernel read or write operation, and released immediately after. Do not, for example, call inode::lock_write simply because a process opened the corresponding file for writing. mlock is manipulated by inode::lock_read, inode::lock_write, inode::unlock_read, and inode::unlock_write. We provide these functions for you.

mlock is cleared to 0 when an inode block is read from disk; see clean_inode_block().

`execv`

First, change your execv implementation to use the disk file system. Model your code on syscall_readdiskfile. Use make run-execallocexit—or make run-sh, followed by some commands like echo foo—to test your work.

You’ll likely define a new derived class of proc_loader that calls chkfs_fileiter::get_disk_entry and bcentry::put.

VFS

Next, add a new VFS type for Chickadee file system files. You will add a new derived type of vnode. Most likely, its constructor will take a chkfs::inode* argument and its operations will use chkfs_fileiter; closing the vnode will put the corresponding inode. Remember to think about your synchronization plan.

`open`

Finally, change your open implementation to read the Chickadee file system rather than memfiles. (Don’t support OF_WRITE or OF_TRUNC yet.)

Use make run-sh, and then cat thoreau.txt, to check your work. (The thoreau.txt file is included only on the disk file system.)

C. Simple writes and writeback

In the next parts, you’ll extend your Chickadee file system VFS to support writes. We have divided write support into multiple subparts. You don’t need to worry about file system crash safety yet. Read all of the Part C description before starting your design and implementation!

Writes

First, support opening disk files for writing (i.e., the OF_WRITE flag without OF_TRUNC or OF_CREATE), and support writing to such files.

You’ll need to change your VFS and the buffer cache to make this work. However, you don’t need to actually write to disk yet. In the first stage, writes will simply modify buffer cache contents.

In this stage, you need not support writes that grow or shrink files. There is thus no need to write inodes or to allocate or free disk blocks.

Use make run-testwritefs to test this functionality.

Writeback

Next, support writing data back to the disk via the sys_sync system call, which is implemented by bufcache::sync().

Change the bufcache::sync() function to write all “dirty buffers” (i.e., buffer cache entries whose data is modified relative to the disk version) back to disk.

Use make cleanfs run-testwritefs to test this functionality.

The cleanfs command rebuilds the disk image, which matters because testwritefs assumes it starts with a clean disk image.

Writeback hints

Basics: You’ll need to track which buffers are dirty, meaning modified in memory relative to the on-disk version. For example, you could add a state bcentry::estate_t state_dirty that is set when data is modified and cleared when modifications are written back to disk. Your eviction policy from Part A must not evict dirty entries.

Synchronization: The writeback process must write an internally consistent version of each dirty buffer, meaning that buffer contents must not be modified or freed while the buffer is in flight to the disk. This requires both delaying writeback until concurrent writes complete, and delaying new concurrent writes and evictions until writeback completes.

A simple synchronization method that works is a per-entry write reference count that can be either zero or one. Complete these functions:

// Obtain a write reference for this entry. Blocks until there are no other
// write references, then obtains a reference and returns. (Can also be a
// good place to maintain dirty flags.)
void bcentry::get_write();

// Release the write reference obtained by a prior call to `get_write()`.
void bcentry::put_write();

Your VFS will obtain a write reference before modifying a block, and release that reference afterward. (The VFS should not hold write references long term.) Your bufcache::sync function will likewise obtain write references. You may also want to add one or more wait queues to facilitate efficient blocking.

Note that a write reference must not prevent other kernel tasks from holding normal references to the same entry (it’s not a readers/writer lock).

Dirty lists: A first attempt at a writeback procedure might look like this. (You would additionally need locking to protect access to the buffer cache and/or bufentries.)

for each bcentry in the buffer cache {
    if (bcentry is dirty) {
        get write reference;
        write bcentry to disk, blocking until write is complete;
        set state to clean and put write reference;
    }
}

But walking the whole buffer cache takes O(bufcache::ne) time, even if very little cached data is dirty. It’s faster to maintain a linked list of dirty bcentry structures. The writeback procedure would look something like this (locking still elided):

while (bcentry* e = dirty_list_.pop_front()) {
    get write reference;
    write bcentry to disk, blocking until write is complete;
    set state to clean and put write reference;
}

But what if other processes keep writing to new blocks while the writeback procedure is running? The writeback procedure might never reach the end of the dirty_list_! A simple swapping trick avoids the problem:

list<...> mydirty;
mydirty.swap(dirty_list_);
// Now `dirty_list_` is empty; `mydirty` contains its previous contents!
// Any newly-dirty blocks will be put onto `dirty_list_`. `mydirty`, a
// local variable, will not grow.
while (bcentry* e = mydirty.pop_front()) {
    // ...
}

This trick might also let you release a long-term lock on the buffer cache during the sync process.

Advanced optional work: Parallel writes. All of these loops end up writing blocks one at a time. That’s OK, but can you instead use all available SATA slots for writes? That swill be faster!

Invariants and concurrency in the buffer cache and the file system

When thinking about synchronization in pset4, it's helpful to consider that synchronization invariants exist at multiple levels in the kernel.

The buffer cache (i.e., bcentry and bufcache) deal with disk blocks. At the buffer cache level, synchronization is needed to ensure that (for example), at any given time, a particular disk block can exist in at most one location in the buffer cache. The buffer cache synchronization invariants are described here.
You can think of the file system proper (e.g., chkfsstate and chkfs_fileiter and your Chickadee VFS implementation) as a separate piece of software that leverages the buffer cache to create higher-level semantics over disk blocks. The buffer cache is unenlightened about whether a disk block contains a superblock or inode or data block; in contrast, the file system knows about these entities, and has to enforce invariants involving those entities. An example of such an invariant is that, at any given moment, an inode can be associated with multiple concurrent reads of the file's data, or a single writer of that file's data.

So, let's consider what happens when writing a ChickadeeFS file via the VFS layer. Ignoring a few details, the Chickadee vnode might be initialized with a reference to the appropriate inode. That reference would be generated by a call to chkfsstate::get_inode(inum_t inum). Inside chkfsstate::get_inode(inum_t inum), several calls are made to bufcache::get_disk_entry(), and eventually this code executes:

auto inode_entry = bc.get_disk_entry(bn, clean_inode_block);
ino = reinterpret_cast<inode*>(inode_entry->buf_);
return ino;

Note that the call to bc::get_disk_entry() involved the maintenance of buffer-cache-level invariants, such as the the buffer now having a positive bcentry::ref_ count. This buffer-cache-level invariant involving reference counts is helpful for implementing buffer-cache-level behaviors like dropping buffer cache entries: an entry is safe to drop if it has a reference count of zero. This invariant doesn't require the buffer cache to know what is inside a particular disk block (e.g., an inode or a data block or whatever).

So, back to the issue of a ChickadeeFS vnode trying to write a file. The vnode has a reference to the appropriate inode. The vnode's write() implementation must do the following stuff:

Call inode::lock_write() to enforce the file-system-level invariant that no concurrent reads or writes can be happening to a file.
Do some calculations to determine which disk block to issue the write to; if you're extending the file, you'll have to perform some extent management.
Get a pointer to the appropriate disk block to update, e.g., using chkfs_fileiter::get_disk_entry(). [Strong hint: chkfs_fileiter is your friend!] The only way to get a pointer to a disk block is to go through the buffer cache directly or indirectly (e.g., via chkfs_fileiter). So, the pointer to the disk block to write to is a bcentry*. The bcentry will already have a positive reference count because we fetched it via the buffer cache interface (which properly increments reference counts). However, we will need to decrement the reference count at some point :-D.
Call bcentry::get_write() to get exclusive write access to the block. You have to implement the code inside of bcentry::get_write(). At a high level, it needs to associate a write_ref flag with the bcentry, such that the write reference can be taken (and the write_ref flag flipped) only if the flag is not currently held; otherwise, the task should block until the flag can be flipped. Note that acquisition of a write reference should not prevent other processes from getting normal references to the bcentry via buffer cache interfaces! This is what the pset4 description means when it says that "a write reference must not prevent other kernel tasks from holding normal references to the same entry (it’s not a readers/writer lock)." [Something to consider: How does calling bcentry::get_write() interact with a concurrent attempt by another process to call bufcache::sync()?]
Now that you have a write reference, you can do what needs to be done to update the buffer. So, do this stuff.
Release the write reference on the buffer by calling bcentry::put_write(). [Things to consider: Can the buffer cache now flush the bcentry to disk? Can another process issue a write to the file, or does that process have to wait for something else to happen?]
Call bcentry::put() to decrement the reference count (which is important for maintaining buffer-cache-level semantics).
Call inode::unlock_write().

Nota bene that, depending on your implementation, you will likely need to do additional stuff than what is described above. However, this overview hopefully provides more insights into the interactions between the buffer cache and the file system code.

Debugging writeback errors

Some writeback errors, such as checksum failures, can be hard to debug from within the operating system. Use our helper programs to investigate the disk’s state from outside. For instance, say you experience a checksum failure on file thoreau.txt. Try the following command:

$ obj/chickadeefsck -e thoreau.txt chickadeefs.img | less

The -e thoreau.txt option tells chickadeefsck to search the file system image for the thoreau.txt file, and write its contents to standard output. Look through the contents; do they look reasonable? For instance, do they contain any weird-looking sequences of ^@ (null characters)?

If the contents look unreasonable, try using hexdump to drill down further.

$ obj/chickadeefsck -e thoreau.txt chickadeefs.img | hexdump -C | less

Do problems appear to start on a page boundary (an offset multiple of 0x1000 == 4096)? If so, at which offset? Can you add printouts to your write code that trigger near that problematic offset?

Use the -V option to investigate the file system’s specific layout, including the extents used by each file. This can help you discover which data block corresponds to some offset in a file.

$ obj/chickadeefsck -V chickadeefs.img

journal: no metablocks found
inode 1 @root directory: size 4096, type directory, nlink 1
  [0]: extent 177+1
    #0 "kernel": inode 2
    #1 "emerson.txt": inode 3
    #2 "dickinson.txt": inode 4
    #3 "thoreau.txt": inode 5
    ...
    #31 "wcdiskfile": inode 33
inode 2 @kernel: size 355512, type data, nlink 1
  [0]: extent 16+87
inode 3 @emerson.txt: size 130, type data, nlink 1
  [0]: extent 103+1
inode 4 @dickinson.txt: size 649, type data, nlink 1
  [0]: extent 104+1
inode 5 @thoreau.txt: size 685, type data, nlink 1
  [0]: extent 105+1
inode 6 @wheatley.txt: size 1958, type data, nlink 1
  [0]: extent 106+1
inode 7 @allocator: size 6432, type data, nlink 1
  [0]: extent 107+2
...

D. File sizes and extension

In this section, you’ll support making files bigger and smaller and seeking within files. Use make cleanfs run-testwritefs2 to test.

`OF_TRUNC`

Implement the OF_TRUNC flag to sys_open. When OF_TRUNC and OF_WRITE are both given, the file size should be set to zero.

This step does not actually require that you free any extents. A ChickadeeFS inode can have more or fewer data blocks allocated than its size would indicate (link). But the tests will ensure that you mark modified inode blocks as dirty.

`sys_lseek`

Implement the sys_lseek system call, which changes a file’s position and returns the new position.

Chickadee’s sys_lseek should behave generally like Linux’s lseek. sys_lseek(fd, off, LSEEK_SET) sets the file position to off; sys_lseek(fd, off, LSEEK_CUR) should adjust the file position by off; and sys_lseek(fd, off, LSEEK_END) should set the file position to the end of the file plus off. In all these cases, the system call returns the new file position. Chickadee also supports sys_lseek(fd, off, LSEEK_SIZE), which returns the file’s size without changing the position (off is ignored). Return E_SPIPE for unseekable files, such as pipes and the keyboard/console.

Unlike Linux, you only need to support seeks within a file for now: you may return E_INVAL for seeks outside a file’s allocated size.

File extension

Finally, implement file extension. Writing additional data to the end of file should make the file bigger. This is a big deal!

Use make cleanfs run-testwritefs2 fsck to test. When testwritefs2 completes successfully, press q to quit QEMU; then Chickadee’s file system checker should run. You should see something like this in the terminal:

kohler@ubuntu:~/chickadee$ make cleanfs run-testwritefs2 fsck
  RM chickadeefs.img
  CREATE chickadeefs.img
* Run `gdb -x build/chickadee.gdb` to connect gdb to qemu.
  QEMU chickadeeboot.img
warning: TCG doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5]
warning: TCG doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5]
unknown keycodes `(unnamed)', please report to qemu-devel@nongnu.org
  FSCK chickadeefs.img
* File system OK

File extension hints

Complete the chkfsstate::allocate_extent() function so it allocates a free extent. This function will load the free block bitmap into the buffer cache; find a contiguous range of 1 bits (indicating a free extent), likely using bitset_view; set those bits to 0; and return the first block number in the extent.

Although ChickadeeFS can support more than one FBB block, one FBB block suffices to handle file systems with up to 2¹⁵ blocks (128 MiB). It’s OK to not handle larger file systems.
Your buffer cache may fill up with dirty blocks. One way to handle this case: if bufcache::get_disk_entry cannot allocate an entry but observed an unreferenced dirty entry, it could release its locks, call sync, and then try again.

E. Creating files

Add support for the OF_CREATE flag to sys_open. When specified in combination with OF_WRITE, this flag should create a file of the supplied name if no such file exists.

To create a file, you’ll need to:

Find an empty direntry in the directory. This may require extending the size of the directory, which may require allocating one or more blocks (just as in extending files).
Allocate an inode.
Store the inode number and filename in the direntry.
Mark the inode as allocated (set its type, size, and nlink count).

Use make cleanfs run-testwritefs3 fsck to test your work. You should also now be able to do cool things like create new files from your shell.

F. Optional features

Want to extend your file system further? There’s a lot of stuff you can add if you want and have time.

Add support for the sys_unlink system call, which removes a file. Use make cleanfs run-testwritefs4 fsck to test your work.
Add support for the sys_rename system call, which changes a file’s name.
Add subdirectory support: sys_mkdir and sys_rmdir system calls, and support for moving files between directories.
Add support for special kinds of file, such as /dev/null or /dev/random.

To receive extra credit for implementing one or more of these features, you must create new test programs (e.g., p-testmkrmdir.cc) which demonstrate that your kernel code is correct!

Turnin

Fill out psets/pset4answers.md and psets/pset4collab.md and push to GitHub. Then submit on the grading server.

Checkin: You should complete all parts by 11:59pm on Monday 4/5.

Problem set 4: Disk and buffer cache