Problem set 4: Disk and buffer cache

Turnin

Fill out psets/pset4answers.md and psets/pset4collab.md and push to GitHub. Then submit on the grading server.

You should complete all parts by 11:59pm on Wednesday 4/3.

Handout

Our handout code offers:

You'll see that init_hardware already initializes the local SATA disk:

    // initialize SATA drive
    sata_disk = ahcistate::find();
    if (sata_disk && sata_disk->irq_ > 0) {
        cpus[ncpu - 1].enable_irq(sata_disk->irq_);
    }

Try running make run-readdiskfile! You will see that reading disk files is painfully slow. This is because Chickadee’s buffer cache stinks. In this problem set, you will fix the buffer cache and ahcistate; hook the file system up to your VFS from problem set 3; and add write support.

Background: Buffer cache smart pointers

All Chickadee kernel disk accesses use the buffer cache, which holds in-memory versions of blocks stored on disk. The buffer cache is a fully-associative cache: each of its slots might hold any disk block.1 Over time, entries will be loaded, evicted, and then later reused, possibly for different blocks. This introduces synchronization issues: the buffer cache must not overwrite an entry that a kernel task is actively using. Chickadee therefore tracks which entries are currently in use using reference counts.

Reference counting can be tedious, so Chickadee represents references to buffer cache slots with smart pointers. A smart pointer can be treated like a regular pointer, but it also holds a reference on the relevant slot. When the smart pointer goes out of scope, it automatically decrements the corresponding reference count. (This resembles the way spinlock guards release a spinlock when they go out of scope.)

The bufcache::load function, which loads a disk block into the buffer cache, returns a bcref object, which is a smart pointer to a bcslot. Here’s how you can use the bcref:

bcref slot = bufcache::get().load(5);  // obtain reference to slot holding block 5
if (!slot) {                           // test if slot loaded successfully
    panic();
}
assert(slot->bn_ == 5);                // can treat `slot` like a pointer to `bcslot`
bcslot* slotptr = slot.get();          // can get true pointer to underlying `bcslot`

However, you cannot copy a bcref object. The bcref represents ownership of a reference, and ownership can’t be copied. This will fail:

bcref slot_copy = slot;                // fails at compile time

You can, however, transfer ownership using a C++ construct called std::move. After transferring ownership, the original bcref will become empty:

bcref new_owner{std::move(slot)};     // `std::move` transfers ownership
assert(new_owner->bn_ == 5);          // now `new_owner` has the reference
assert(!slot);                        // and `slot` is empty

bcref semantics resemble Rust ownership in some ways.

Background: In-memory inodes

The ChickadeeFS physical file system design features inodes, which are structures that represent a file’s contents (size, extents, etc.) independent of the file name used to access that file. Your virtual file system will find it useful to work with inodes as well. The vnode structure for an open ChickadeeFS file will typically contain a reference to the corresponding ChickadeeFS inode. While the file remains open, the buffer cache slot containing that inode must remain in memory.

Another smart pointer class, chkfs_iref, represents references to ChickadeeFS inodes. The chkfsstate::inode() and chkfsstate::lookup_inode() functions return chkfs_iref objects. A chkfs_iref encapsulates a reference to the corresponding buffer cache slot, and when the chkfs_iref goes out of scope, it automatically decrements the corresponding slot’s reference count. As with bcref, chkfs_iref objects cannot be copied, but they can be moved.

Since chkfs_iref objects hold references to buffer cache slots, you can obtain a pointer to the corresponding slot with the iref->slot() function. It’s always true that iref->slot()->contains(iref.get()).

For ease of programming reasons, chkfs::inode defines some functions that are only meaningful on inodes in the buffer cache, and the ChickadeeFS inode structure reserves some space for locking and reference information that’s only meaningful in memory. A production kernel would generally implement a layer of indirection that separated on-disk inode structures from in-memory structures.

Background: Locks and buffer cache data

Smart pointers automate reference counting for buffer cache slots. While a buffer cache slot reference exists, some aspects of the slot are constant, and can be referenced without a lock:

Other invariants also hold; for instance, a slot must not be evicted while a reference is held, so slot->state_ will be either s_clean or s_dirty.

But the contents of the slot->buf_ memory page are not constant, and as you implement file system writing, you will need to add mutual-exclusion locks to your code to protect access to slot->buf_ and to inode contents.

Your code should use three related sets of locking calls. Inodes have a readers–writer lock, and buffer cache slots have a mutual exclusion lock for their contents.

Acquiring these locks might yield control of the CPU. All of these locks do not disable interrupts (so there’s no irqstate involved). Furthermore, all of these locks require the caller to hold a reference to the corresponding inode or slot. A task may yield or block while holding an inode or contents lock (but it should not do so forever!).

In order to modify the contents of an inode, you need both the inode’s write lock and the underlying contents lock:

// increment an inode’s link count
iref->lock_write();
iref->slot()->lock_buffer();     // `iref->slot()` is the slot containing this inode block
++iref->nlink;
iref->slot()->unlock_buffer();
iref->unlock_write();

// erase first data block
iref->lock_write();
{
    auto slot = chkfs_fileiter(iref.get()).load();
    slot->lock_buffer();
    memcpy(slot->buf_, 0, chkfs::blocksize);
    slot->unlock_buffer();
}
iref->unlock_write();

Both kinds of lock are necessary because of write-back caching in Part C. When your write-back caching code wants to write a dirty block to the disk, it will call lock_buffer() on the corresponding slot. This ensures that the block data remains unchanged while the block is in flight to the disk, and thus that the disk image stays consistent. The write-back caching code must use the slot lock and not an inode lock because (1) not all blocks have a corresponding inode, and (2) there is no way to find the inode that points to a given data-block slot.

That said, assuming you use inode locks correctly, any inode lock makes the corresponding buffer data safe to read. The write-back process does not modify buffer data.

Inode locks precede contents locks in the lock order.

Inode locks are more finely grained than contents locks. An inode block contains 64 inodes, and each of those inodes is in a distinct region of memory. Therefore, it’s safe to read the data in one inode even while another inode is being modified. The fine granularity is helpful because a single inode block (and, therefore, buffer cache slot) contains up to 64 independent inodes.

A. Buffer cache eviction

In this part, you’ll fix the buffer cache so it can speed up read requests. The current buffer cache immediately evicts all unreferenced blocks. Your buffer cache should not evict immediately, thus allowing some read requests to succeed without accessing the disk.

The code that implements immediate eviction is in bcslot::decrement_reference_count():

    spinlock_guard guard(lock_);    // needed in case we `clear()`
    assert(ref_ != 0);
    if (--ref_ == 0) {
        clear();
    }

That is, as soon as a slot becomes unreferenced, it is cleared. Your first step will be to change this code to just decrement the reference count:

    assert(ref_ != 0);
    --ref_;

After that simple change, the bufcache::load() function will be able to find and reuse unreferenced slots containing previously-loaded blocks, and things will run much faster.

But don’t stop there! At this point, slots are never evicted, so only a handful of reads will succeed before the buffer cache fills up with garbage. You need an eviction policy, which selects unreferenced block(s) to evict when the cache is full. That policy can be located in its own function, but you will likely call it from bufcache::load.

Which eviction policy should you choose? At first, keep it simple. You must obey the invariant that only clean, unreferenced blocks may be evicted—only blocks with ref_ == 0 && state_ == s_clean are safe to clear. You must also obey the buffer cache synchronization invariants (although you may add to or change these). Other than that, pick any block you’d like, such as the first unreferenced block in the cache.

Test your work first with make run-readdiskfile, which reads a very short file one byte at a time, and with make run-wcdiskfile, which reads longer files 125 bytes at a time. You can also run the readdiskfile and wcdiskfile programs with arguments after make run-sh.

Optimizations

Once you’ve got a simple eviction policy working, consider improving it.

B. VFS integration and reads

In this part, you’ll hook up the Chickadee file system to your VFS for reads. This takes what the VFS you developed in the last problem set and applies it to a real on-disk file system.

execv

First, change your execv implementation to use the disk file system. Model your code on syscall_readdiskfile. Use make run-execallocexit—or make run-sh, followed by some commands like echo foo—to test your work.

You’ll likely define a new derived class of proc_loader that calls chkfs_fileiter::load.

VFS

Next, add a new VFS type for Chickadee file system files. You will add a new derived type of vnode. Most likely, its constructor will take a chkfs_iref argument (and use std::move) and its operations will use chkfs_fileiter; closing the vnode must dereference the corresponding inode (likely by destroying the chkfs_iref member variable). Remember to think about your synchronization plan.

open

Finally, change your open implementation to read the Chickadee file system rather than memfiles. (Don’t support OF_WRITE or OF_TRUNC yet.)

Use make run-sh, and then cat thoreau.txt, to check your work. (The thoreau.txt file is included only on the disk file system.)

C. Simple writes and writeback

In the next parts, you’ll extend your Chickadee file system VFS to support writes. You don’t need to worry about file system crash safety yet.

Writes

First, support opening disk files for writing (i.e., the OF_WRITE flag without OF_TRUNC or OF_CREATE), and support writing to such files.

You’ll need to change your VFS and the buffer cache to make this work. However, you don’t need to actually write to disk yet. In the first stage, writes will simply modify buffer cache contents.

In this stage, you need not support writes that grow or shrink files. There is thus no need to write inodes or to allocate or free disk blocks. You will need to implement synchronization; see the background sections above.

Use make run-testwritefs to test this functionality.

Writeback

Next, support writing data back to the disk via the sys_sync system call, which is implemented by bufcache::sync().

Change the bufcache::sync() function to write all dirty slots (state_ == s_dirty) back to disk, marking those slots as clean when it completes.

Use make cleanfs run-testwritefs fsck to test this functionality.

The cleanfs command rebuilds the disk image, which matters because testwritefs assumes it starts with a clean disk image.

Writeback hints

A slot is called dirty when it has been modified relative to the on-disk version. In our handout code, the bcslot::state_ member is s_dirty for possibly-dirty slots, and the bcslot::lock_buffer() locking function marks a slot as dirty.

The writeback process must write an internally consistent version of each dirty buffer, meaning that buffer contents must not be modified or freed while the buffer is in flight to the disk. This requires both delaying writeback until concurrent writes complete, and delaying new concurrent writes and evictions until writeback completes. The bcslot::lock_buffer() function can be useful here; you may also want to add one or more wait queues to facilitate efficient blocking.

A first attempt at a writeback procedure might look like this. (You would additionally need locking to protect access to the buffer cache and/or slots.)

for each slot in the buffer cache {
    if (slot->state_ == s_dirty) {
        slot->lock_buffer();
        write block to disk, blocking until write is complete;
        slot->state_ = s_clean;
        slot->unlock_buffer();
    }
}

But walking the whole buffer cache takes O(bufcache::ne) time, even if very little cached data is dirty. It’s faster to maintain a linked list of dirty slots. The writeback procedure would look something like this (locking still elided):

while (slot = dirty_list_.pop_front()) {
    slot->lock_buffer();
    write block to disk, blocking until write is complete;
    slot->state_ = s_clean;
    slot->unlock_buffer();
}

But what if other processes keep writing to new blocks while the writeback procedure is running? The writeback procedure might never reach the end of the dirty_list_! A simple swapping trick avoids the problem:

list<...> mydirty;
mydirty.swap(dirty_list_);
// Now `dirty_list_` is empty; `mydirty` contains its previous contents!
// Any newly-dirty blocks will be put onto `dirty_list_`. `mydirty`, a
// local variable, will not grow.
while (bcslot* e = mydirty.pop_front()) {
    // ...
}

This trick might also let you release a long-term lock on the buffer cache during the sync process.

Advanced optional work: Parallel writes. All of these loops end up writing blocks one at a time. That’s OK, but can you instead use all available SATA slots for writes? That will be faster!

Debugging writeback errors

Some writeback errors, such as checksum failures, can be hard to debug from within the operating system. Use our helper programs to investigate the disk’s state from outside. For instance, say you experience a checksum failure on file thoreau.txt. Try the following command:

$ obj/chickadeefsck -e thoreau.txt chickadeefs.img | less

The -e thoreau.txt option tells chickadeefsck to search the file system image for the thoreau.txt file, and write its contents to standard output. Look through the contents; do they look reasonable? For instance, do they contain any weird-looking sequences of ^@ (null characters)?

If the contents look unreasonable, try using hexdump to drill down further.

$ obj/chickadeefsck -e thoreau.txt chickadeefs.img | hexdump -C | less

Do problems appear to start on a page boundary (an offset multiple of 0x1000 == 4096)? If so, at which offset? Can you add printouts to your write code that trigger near that problematic offset?

Use the -V option to investigate the file system’s specific layout, including the extents used by each file. This can help you discover which data block corresponds to some offset in a file.

$ obj/chickadeefsck -V chickadeefs.img
superblock: nblocks 32768, ninodes 896
  blocks: self 1, swap 0, fbb 1, inodes 14, data 32688, journal 64
journal: no metablocks found
inode 1 (root directory): size 4096, type directory, nlink 1
  [0]: extent 177+1
    dirent 0: name kernel, inode 2
    dirent 1: name emerson.txt, inode 3
    dirent 2: name dickinson.txt, inode 4
    dirent 3: name thoreau.txt, inode 5
    ...
    dirent 31: name wcdiskfile, inode 33
inode 2 (kernel): size 355512, type data, nlink 1
  [0]: extent 16+87
inode 3 (emerson.txt): size 130, type data, nlink 1
  [0]: extent 103+1
inode 4 (dickinson.txt): size 649, type data, nlink 1
  [0]: extent 104+1
inode 5 (thoreau.txt): size 685, type data, nlink 1
  [0]: extent 105+1
inode 6 (wheatley.txt): size 1958, type data, nlink 1
  [0]: extent 106+1
inode 7 (allocator): size 6432, type data, nlink 1
  [0]: extent 107+2
...

D. File sizes and extension

In this section, you’ll support making files bigger and smaller and seeking within files. Use make cleanfs run-testwritefs2 to test.

OF_TRUNC

Implement the OF_TRUNC flag to sys_open. When OF_TRUNC and OF_WRITE are both given, the file size should be set to zero.

This step does not actually require that you free any extents. A ChickadeeFS inode can have more or fewer data blocks allocated than its size would indicate (link). But the tests will ensure that you mark modified inode blocks as dirty.

sys_lseek

Implement the sys_lseek system call, which changes a file’s position and returns the new position.

Chickadee’s sys_lseek should behave generally like Linux’s lseek. sys_lseek(fd, off, LSEEK_SET) sets the file position to off; sys_lseek(fd, off, LSEEK_CUR) should adjust the file position by off; and sys_lseek(fd, off, LSEEK_END) should set the file position to the end of the file plus off. In all these cases, the system call returns the new file position. Chickadee also supports sys_lseek(fd, off, LSEEK_SIZE), which returns the file’s size without changing the position (off is ignored). Return E_SPIPE for unseekable files, such as pipes and the keyboard/console.

Unlike Linux, you only need to support seeks within a file for now: you may return E_INVAL for seeks outside a file’s allocated size.

File extension

Finally, implement file extension. Writing additional data to the end of file should make the file bigger. This is a big deal!

Use make cleanfs run-testwritefs2 fsck to test. When testwritefs2 completes successfully, press q to quit QEMU; then Chickadee’s file system checker should run. You should see something like this in the terminal:

kohler@ubuntu:~/chickadee$ make cleanfs run-testwritefs2 fsck
  RM chickadeefs.img
  CREATE chickadeefs.img
* Run `gdb -x build/chickadee.gdb` to connect gdb to qemu.
  QEMU chickadeeboot.img
  FSCK chickadeefs.img
* File system OK

File extension hints

E. Creating files

Add support for the OF_CREATE flag to sys_open. When specified in combination with OF_WRITE, this flag should create a file of the supplied name if no such file exists.

To create a file, you’ll need to:

Use make cleanfs run-testwritefs3 fsck to test your work. You should also now be able to do cool things like create new files from your shell.

Optional: Prefetching

Another way to speed up reads is to update the buffer cache to load blocks that will likely be used in the future. This will require a prefetching policy, which decides which blocks to read in advance of their being needed.

Prefetching is very useful, but may require more code surgery than eviction. You will need to change the ahcistate::read_or_write function to use more than just command slot 0. You also don’t want a process to block just because its kernel task is prefetching data! You could implement a separate kernel task that just does prefetching, but there are other possibilities. Consider, for example, adding a int fetch_status_ to bcslot, and adding a new function to ahcistate a signature something like this:

// returns true iff command was queued
bool read_or_write_nonblocking(idecommand cmd, void* buf, size_t sz, size_t off,
                               volatile int& fetch_status, /* maybe more arguments */);

You can change the existing read_or_write function to call read_or_write_nonblocking.

The simplest Chickadee prefetcher will fetch at most 32 blocks at a time, one per NCQ slot on the disk. Most real operating systems instead implement a software I/O queue capable of storing many hundreds of I/O requests. As soon as the disk satisfies one request, the interrupt handler pops another I/O request off the I/O queue and adds it to the disk hardware queue. Various algorithms are used to schedule these software requests, subject to requirements such as fairness (one process’s prefetch requests should not starve another process’s requests) and priority (write requests might be more important than read requests). Think about how you would implement such a queue in Chickadee; and if you’re feeling ambitious, implement one!

Optional: More file system features

Want to extend your file system further? There’s a lot of stuff you can add if you want and have time.

To receive extra credit for implementing one or more of these features, you must create new test programs (e.g., p-testmkrmdir.cc) which demonstrate that your kernel code is correct!


  1. Yes, this means the buffer cache is a hashmap. ↩︎