Problem set 4 – CS 161 psets

A. Buffer cache eviction
B. VFS integration and reads
C. Simple writes and writeback
D. File sizes and extension
E. Creating files

Turnin

Fill out psets/pset4answers.md and psets/pset4collab.md and push to GitHub. Then submit on the grading server.

You should complete all parts by 11:59pm on Wednesday 4/3.

Handout

Our handout code offers:

A file system definition (chickadeefs.hh), described here.
- chickadeefs.hh also defines the Chickadee file system journal format, which is not part of this pset.
A class ahcistate (in k-ahci.hh) that accesses disks that speak the SATA standard over Intel’s AHCI. An important aspect of this standard is NCQ support, which allows the operating system to post multiple disk requests in parallel, and allows the disk to service those requests in the fastest order it can.
A class bufcache (in k-chkfs.hh and k-chkfs.cc) that implements a buffer cache.
- The buffer cache is a fully-associative cache with ten slots of type bcslot. (A full-featured operating system would have millions of slots in its buffer cache; we set it to 10 for testing purposes.)
A class chkfsstate (in k-chkfs.hh and k-chkfs.cc) that accesses the sata_disk’s Chickadee file system via the buffer cache.
A class chkfs_fileiter (in k-chkfsiter.hh) that iterates over the blocks in a ChickadeeFS file (somewhat like how vmiter iterates over the pages in an x86-64 page table).
A class chkfs::journalreplayer that knows how to replay journal records.
An implementation of a system call sys_readdiskfile that uses bufcache, chkfsstate, and chkfs_fileiter to read disk files.
Programs p-readdiskfile.cc and p-wcdiskfile.cc that use sys_readdiskfile to examine disk files.

You'll see that init_hardware already initializes the local SATA disk:

k-init.cc:58

    // initialize SATA drive
    sata_disk = ahcistate::find();
    if (sata_disk && sata_disk->irq_ > 0) {
        cpus[ncpu - 1].enable_irq(sata_disk->irq_);
    }

Try running make run-readdiskfile! You will see that reading disk files is painfully slow. This is because Chickadee’s buffer cache stinks. In this problem set, you will fix the buffer cache and ahcistate; hook the file system up to your VFS from problem set 3; and add write support.

Background: Buffer cache smart pointers

All Chickadee kernel disk accesses use the buffer cache, which holds in-memory versions of blocks stored on disk. The buffer cache is a fully-associative cache: each of its slots might hold any disk block.¹ Over time, entries will be loaded, evicted, and then later reused, possibly for different blocks. This introduces synchronization issues: the buffer cache must not overwrite an entry that a kernel task is actively using. Chickadee therefore tracks which entries are currently in use using reference counts.

Reference counting can be tedious, so Chickadee represents references to buffer cache slots with smart pointers. A smart pointer can be treated like a regular pointer, but it also holds a reference on the relevant slot. When the smart pointer goes out of scope, it automatically decrements the corresponding reference count. (This resembles the way spinlock guards release a spinlock when they go out of scope.)

The bufcache::load function, which loads a disk block into the buffer cache, returns a bcref object, which is a smart pointer to a bcslot. Here’s how you can use the bcref:

bcref slot = bufcache::get().load(5);  // obtain reference to slot holding block 5
if (!slot) {                           // test if slot loaded successfully
    panic();
}
assert(slot->bn_ == 5);                // can treat `slot` like a pointer to `bcslot`
bcslot* slotptr = slot.get();          // can get true pointer to underlying `bcslot`

However, you cannot copy a bcref object. The bcref represents ownership of a reference, and ownership can’t be copied. This will fail:

bcref slot_copy = slot;                // fails at compile time

You can, however, transfer ownership using a C++ construct called std::move. After transferring ownership, the original bcref will become empty:

bcref new_owner{std::move(slot)};     // `std::move` transfers ownership
assert(new_owner->bn_ == 5);          // now `new_owner` has the reference
assert(!slot);                        // and `slot` is empty

bcref semantics resemble Rust ownership in some ways.

Background: In-memory inodes

The ChickadeeFS physical file system design features inodes, which are structures that represent a file’s contents (size, extents, etc.) independent of the file name used to access that file. Your virtual file system will find it useful to work with inodes as well. The vnode structure for an open ChickadeeFS file will typically contain a reference to the corresponding ChickadeeFS inode. While the file remains open, the buffer cache slot containing that inode must remain in memory.

Another smart pointer class, chkfs_iref, represents references to ChickadeeFS inodes. The chkfsstate::inode() and chkfsstate::lookup_inode() functions return chkfs_iref objects. A chkfs_iref encapsulates a reference to the corresponding buffer cache slot, and when the chkfs_iref goes out of scope, it automatically decrements the corresponding slot’s reference count. As with bcref, chkfs_iref objects cannot be copied, but they can be moved.

Since chkfs_iref objects hold references to buffer cache slots, you can obtain a pointer to the corresponding slot with the iref->slot() function. It’s always true that iref->slot()->contains(iref.get()).

For ease of programming reasons, chkfs::inode defines some functions that are only meaningful on inodes in the buffer cache, and the ChickadeeFS inode structure reserves some space for locking and reference information that’s only meaningful in memory. A production kernel would generally implement a layer of indirection that separated on-disk inode structures from in-memory structures.

Background: Locks and buffer cache data

Smart pointers automate reference counting for buffer cache slots. While a buffer cache slot reference exists, some aspects of the slot are constant, and can be referenced without a lock:

slot->bn_, the corresponding block number
slot->buf_, the location of the memory page holding the block data

Other invariants also hold; for instance, a slot must not be evicted while a reference is held, so slot->state_ will be either s_clean or s_dirty.

But the contents of the slot->buf_ memory page are not constant, and as you implement file system writing, you will need to add mutual-exclusion locks to your code to protect access to slot->buf_ and to inode contents.

Your code should use three related sets of locking calls. Inodes have a readers–writer lock, and buffer cache slots have a mutual exclusion lock for their contents.

iref->lock_read() acquires a read lock on a chkfs_iref. This read lock should prevent other tasks from writing the inode or modifying the contents of the corresponding file. For instance, you should acquire a read lock before reading an inode’s size or inode’s extents, and/or reading the corresponding data blocks.
- iref->unlock_read() releases a read lock previously acquired with lock_read().
- Several tasks can simultaneously hold read locks on the same inode.
iref->lock_write() acquires a write lock on a chkfs_iref. This write lock should prevent other tasks from writing or reading the inode or associated file contents. You must acquire a write lock before changing an inode’s data or modifying the corresponding data blocks.
- iref->unlock_write() releases a write lock previously acquired with lock_write().
- While any task holds a write lock on an inode, no other task can hold a read or write lock on the same inode.
- It is impossible to upgrade a read lock to a write lock, so no task should call iref->lock_read(); iref->lock_write() on the same inode (that will cause deadlock).
slot->lock_buffer() acquires a mutual-exclusion lock corresponding to the contents of a buffer cache slot (the “contents lock”). For instance, you must call lock_buffer() before modifying inode data or a file data block.
- slot->unlock_buffer() releases a lock previously acquired with slot->lock_buffer().
- slot->lock_buffer() also changes the slot’s state_ to s_dirty.
- If slot was obtained via an iref, and that iref was locked for reading, then it is not necessary to call slot->lock_buffer() before reading the underlying contents.

Acquiring these locks might yield control of the CPU. All of these locks do not disable interrupts (so there’s no irqstate involved). Furthermore, all of these locks require the caller to hold a reference to the corresponding inode or slot. A task may yield or block while holding an inode or contents lock (but it should not do so forever!).

In order to modify the contents of an inode, you need both the inode’s write lock and the underlying contents lock:

// increment an inode’s link count
iref->lock_write();
iref->slot()->lock_buffer();     // `iref->slot()` is the slot containing this inode block
++iref->nlink;
iref->slot()->unlock_buffer();
iref->unlock_write();

// erase first data block
iref->lock_write();
{
    auto slot = chkfs_fileiter(iref.get()).load();
    slot->lock_buffer();
    memcpy(slot->buf_, 0, chkfs::blocksize);
    slot->unlock_buffer();
}
iref->unlock_write();

Both kinds of lock are necessary because of write-back caching in Part C. When your write-back caching code wants to write a dirty block to the disk, it will call lock_buffer() on the corresponding slot. This ensures that the block data remains unchanged while the block is in flight to the disk, and thus that the disk image stays consistent. The write-back caching code must use the slot lock and not an inode lock because (1) not all blocks have a corresponding inode, and (2) there is no way to find the inode that points to a given data-block slot.

That said, assuming you use inode locks correctly, any inode lock makes the corresponding buffer data safe to read. The write-back process does not modify buffer data.

Inode locks precede contents locks in the lock order.

Inode locks are more finely grained than contents locks. An inode block contains 64 inodes, and each of those inodes is in a distinct region of memory. Therefore, it’s safe to read the data in one inode even while another inode is being modified. The fine granularity is helpful because a single inode block (and, therefore, buffer cache slot) contains up to 64 independent inodes.

A. Buffer cache eviction

In this part, you’ll fix the buffer cache so it can speed up read requests. The current buffer cache immediately evicts all unreferenced blocks. Your buffer cache should not evict immediately, thus allowing some read requests to succeed without accessing the disk.

The code that implements immediate eviction is in bcslot::decrement_reference_count():

    spinlock_guard guard(lock_);    // needed in case we `clear()`
    assert(ref_ != 0);
    if (--ref_ == 0) {
        clear();
    }

That is, as soon as a slot becomes unreferenced, it is cleared. Your first step will be to change this code to just decrement the reference count:

    assert(ref_ != 0);
    --ref_;

After that simple change, the bufcache::load() function will be able to find and reuse unreferenced slots containing previously-loaded blocks, and things will run much faster.

But don’t stop there! At this point, slots are never evicted, so only a handful of reads will succeed before the buffer cache fills up with garbage. You need an eviction policy, which selects unreferenced block(s) to evict when the cache is full. That policy can be located in its own function, but you will likely call it from bufcache::load.

Which eviction policy should you choose? At first, keep it simple. You must obey the invariant that only clean, unreferenced blocks may be evicted—only blocks with ref_ == 0 && state_ == s_clean are safe to clear. You must also obey the buffer cache synchronization invariants (although you may add to or change these). Other than that, pick any block you’d like, such as the first unreferenced block in the cache.

Test your work first with make run-readdiskfile, which reads a very short file one byte at a time, and with make run-wcdiskfile, which reads longer files 125 bytes at a time. You can also run the readdiskfile and wcdiskfile programs with arguments after make run-sh.

Optimizations

Once you’ve got a simple eviction policy working, consider improving it.

You might want to treat different kinds of blocks in different ways. For example, you could always keep the superblock in memory, or you could treat inode blocks differently from data blocks.
Implement a least recently used policy or another policy of your choice.
Cache important data outside the buffer cache. For example, you could store a copy of the superblock in a member variable of chkfsstate.
A simple improvement is to raise the number of slots in the buffer cache. However, your buffer cache must work correctly for any number of slots greater than or equal to 10. (The small slot count stresses your eviction logic.)

B. VFS integration and reads

In this part, you’ll hook up the Chickadee file system to your VFS for reads. This takes what the VFS you developed in the last problem set and applies it to a real on-disk file system.

`execv`

First, change your execv implementation to use the disk file system. Model your code on syscall_readdiskfile. Use make run-execallocexit—or make run-sh, followed by some commands like echo foo—to test your work.

You’ll likely define a new derived class of proc_loader that calls chkfs_fileiter::load.

VFS

Next, add a new VFS type for Chickadee file system files. You will add a new derived type of vnode. Most likely, its constructor will take a chkfs_iref argument (and use std::move) and its operations will use chkfs_fileiter; closing the vnode must dereference the corresponding inode (likely by destroying the chkfs_iref member variable). Remember to think about your synchronization plan.

`open`

Finally, change your open implementation to read the Chickadee file system rather than memfiles. (Don’t support OF_WRITE or OF_TRUNC yet.)

Use make run-sh, and then cat thoreau.txt, to check your work. (The thoreau.txt file is included only on the disk file system.)

C. Simple writes and writeback

In the next parts, you’ll extend your Chickadee file system VFS to support writes. You don’t need to worry about file system crash safety yet.

Writes

First, support opening disk files for writing (i.e., the OF_WRITE flag without OF_TRUNC or OF_CREATE), and support writing to such files.

You’ll need to change your VFS and the buffer cache to make this work. However, you don’t need to actually write to disk yet. In the first stage, writes will simply modify buffer cache contents.

In this stage, you need not support writes that grow or shrink files. There is thus no need to write inodes or to allocate or free disk blocks. You will need to implement synchronization; see the background sections above.

Use make run-testwritefs to test this functionality.

Writeback

Next, support writing data back to the disk via the sys_sync system call, which is implemented by bufcache::sync().

Change the bufcache::sync() function to write all dirty slots (state_ == s_dirty) back to disk, marking those slots as clean when it completes.

Use make cleanfs run-testwritefs fsck to test this functionality.

The cleanfs command rebuilds the disk image, which matters because testwritefs assumes it starts with a clean disk image.

Writeback hints

A slot is called dirty when it has been modified relative to the on-disk version. In our handout code, the bcslot::state_ member is s_dirty for possibly-dirty slots, and the bcslot::lock_buffer() locking function marks a slot as dirty.

The writeback process must write an internally consistent version of each dirty buffer, meaning that buffer contents must not be modified or freed while the buffer is in flight to the disk. This requires both delaying writeback until concurrent writes complete, and delaying new concurrent writes and evictions until writeback completes. The bcslot::lock_buffer() function can be useful here; you may also want to add one or more wait queues to facilitate efficient blocking.

A first attempt at a writeback procedure might look like this. (You would additionally need locking to protect access to the buffer cache and/or slots.)

for each slot in the buffer cache {
    if (slot->state_ == s_dirty) {
        slot->lock_buffer();
        write block to disk, blocking until write is complete;
        slot->state_ = s_clean;
        slot->unlock_buffer();
    }
}

But walking the whole buffer cache takes O(bufcache::ne) time, even if very little cached data is dirty. It’s faster to maintain a linked list of dirty slots. The writeback procedure would look something like this (locking still elided):

while (slot = dirty_list_.pop_front()) {
    slot->lock_buffer();
    write block to disk, blocking until write is complete;
    slot->state_ = s_clean;
    slot->unlock_buffer();
}

But what if other processes keep writing to new blocks while the writeback procedure is running? The writeback procedure might never reach the end of the dirty_list_! A simple swapping trick avoids the problem:

list<...> mydirty;
mydirty.swap(dirty_list_);
// Now `dirty_list_` is empty; `mydirty` contains its previous contents!
// Any newly-dirty blocks will be put onto `dirty_list_`. `mydirty`, a
// local variable, will not grow.
while (bcslot* e = mydirty.pop_front()) {
    // ...
}

This trick might also let you release a long-term lock on the buffer cache during the sync process.

Advanced optional work: Parallel writes. All of these loops end up writing blocks one at a time. That’s OK, but can you instead use all available SATA slots for writes? That will be faster!

Debugging writeback errors

Some writeback errors, such as checksum failures, can be hard to debug from within the operating system. Use our helper programs to investigate the disk’s state from outside. For instance, say you experience a checksum failure on file thoreau.txt. Try the following command:

$ obj/chickadeefsck -e thoreau.txt chickadeefs.img | less

The -e thoreau.txt option tells chickadeefsck to search the file system image for the thoreau.txt file, and write its contents to standard output. Look through the contents; do they look reasonable? For instance, do they contain any weird-looking sequences of ^@ (null characters)?

If the contents look unreasonable, try using hexdump to drill down further.

$ obj/chickadeefsck -e thoreau.txt chickadeefs.img | hexdump -C | less

Do problems appear to start on a page boundary (an offset multiple of 0x1000 == 4096)? If so, at which offset? Can you add printouts to your write code that trigger near that problematic offset?

Use the -V option to investigate the file system’s specific layout, including the extents used by each file. This can help you discover which data block corresponds to some offset in a file.

$ obj/chickadeefsck -V chickadeefs.img

superblock: nblocks 32768, ninodes 896
  blocks: self 1, swap 0, fbb 1, inodes 14, data 32688, journal 64
journal: no metablocks found
inode 1 (root directory): size 4096, type directory, nlink 1
  [0]: extent 177+1
    dirent 0: name kernel, inode 2
    dirent 1: name emerson.txt, inode 3
    dirent 2: name dickinson.txt, inode 4
    dirent 3: name thoreau.txt, inode 5
    ...
    dirent 31: name wcdiskfile, inode 33
inode 2 (kernel): size 355512, type data, nlink 1
  [0]: extent 16+87
inode 3 (emerson.txt): size 130, type data, nlink 1
  [0]: extent 103+1
inode 4 (dickinson.txt): size 649, type data, nlink 1
  [0]: extent 104+1
inode 5 (thoreau.txt): size 685, type data, nlink 1
  [0]: extent 105+1
inode 6 (wheatley.txt): size 1958, type data, nlink 1
  [0]: extent 106+1
inode 7 (allocator): size 6432, type data, nlink 1
  [0]: extent 107+2
...

D. File sizes and extension

In this section, you’ll support making files bigger and smaller and seeking within files. Use make cleanfs run-testwritefs2 to test.

`OF_TRUNC`

Implement the OF_TRUNC flag to sys_open. When OF_TRUNC and OF_WRITE are both given, the file size should be set to zero.

This step does not actually require that you free any extents. A ChickadeeFS inode can have more or fewer data blocks allocated than its size would indicate (link). But the tests will ensure that you mark modified inode blocks as dirty.

`sys_lseek`

Implement the sys_lseek system call, which changes a file’s position and returns the new position.

Chickadee’s sys_lseek should behave generally like Linux’s lseek. sys_lseek(fd, off, LSEEK_SET) sets the file position to off; sys_lseek(fd, off, LSEEK_CUR) should adjust the file position by off; and sys_lseek(fd, off, LSEEK_END) should set the file position to the end of the file plus off. In all these cases, the system call returns the new file position. Chickadee also supports sys_lseek(fd, off, LSEEK_SIZE), which returns the file’s size without changing the position (off is ignored). Return E_SPIPE for unseekable files, such as pipes and the keyboard/console.

Unlike Linux, you only need to support seeks within a file for now: you may return E_INVAL for seeks outside a file’s allocated size.

File extension

Finally, implement file extension. Writing additional data to the end of file should make the file bigger. This is a big deal!

Use make cleanfs run-testwritefs2 fsck to test. When testwritefs2 completes successfully, press q to quit QEMU; then Chickadee’s file system checker should run. You should see something like this in the terminal:

kohler@ubuntu:~/chickadee$ make cleanfs run-testwritefs2 fsck
  RM chickadeefs.img
  CREATE chickadeefs.img
* Run `gdb -x build/chickadee.gdb` to connect gdb to qemu.
  QEMU chickadeeboot.img
  FSCK chickadeefs.img
* File system OK

File extension hints

Complete the chkfsstate::allocate_extent() function so it allocates a free extent. This function will load the free block bitmap into the buffer cache; find a contiguous range of 1 bits (indicating a free extent), likely using bitset_view; set those bits to 0; and return the first block number in the extent.

Although ChickadeeFS can support more than one FBB block, one FBB block suffices to handle file systems with up to 2¹⁵ blocks (128 MiB). It’s OK to not handle larger file systems.
Your buffer cache may fill up with dirty blocks. One way to handle this case: if bufcache::load cannot allocate an entry but observed an unreferenced dirty entry, it could release its locks, call sync, and then try again.

E. Creating files

Add support for the OF_CREATE flag to sys_open. When specified in combination with OF_WRITE, this flag should create a file of the supplied name if no such file exists.

To create a file, you’ll need to:

Find an empty direntry in the directory. This may require extending the size of the directory, which may require allocating one or more blocks (just as in extending files).
Allocate an inode.
Store the inode number and filename in the direntry.
Mark the inode as allocated (set its type, size, and nlink count).

Use make cleanfs run-testwritefs3 fsck to test your work. You should also now be able to do cool things like create new files from your shell.

Optional: Prefetching

Another way to speed up reads is to update the buffer cache to load blocks that will likely be used in the future. This will require a prefetching policy, which decides which blocks to read in advance of their being needed.

Prefetching is very useful, but may require more code surgery than eviction. You will need to change the ahcistate::read_or_write function to use more than just command slot 0. You also don’t want a process to block just because its kernel task is prefetching data! You could implement a separate kernel task that just does prefetching, but there are other possibilities. Consider, for example, adding a int fetch_status_ to bcslot, and adding a new function to ahcistate a signature something like this:

// returns true iff command was queued
bool read_or_write_nonblocking(idecommand cmd, void* buf, size_t sz, size_t off,
                               volatile int& fetch_status, /* maybe more arguments */);

You can change the existing read_or_write function to call read_or_write_nonblocking.

The simplest Chickadee prefetcher will fetch at most 32 blocks at a time, one per NCQ slot on the disk. Most real operating systems instead implement a software I/O queue capable of storing many hundreds of I/O requests. As soon as the disk satisfies one request, the interrupt handler pops another I/O request off the I/O queue and adds it to the disk hardware queue. Various algorithms are used to schedule these software requests, subject to requirements such as fairness (one process’s prefetch requests should not starve another process’s requests) and priority (write requests might be more important than read requests). Think about how you would implement such a queue in Chickadee; and if you’re feeling ambitious, implement one!

Optional: More file system features

Want to extend your file system further? There’s a lot of stuff you can add if you want and have time.

Add support for the sys_unlink system call, which removes a file. Use make cleanfs run-testwritefs4 fsck to test your work.
Add support for the sys_rename system call, which changes a file’s name.
Add subdirectory support: sys_mkdir and sys_rmdir system calls, and support for moving files between directories.
Add support for special kinds of file, such as /dev/null or /dev/random.

To receive extra credit for implementing one or more of these features, you must create new test programs (e.g., p-testmkrmdir.cc) which demonstrate that your kernel code is correct!

Yes, this means the buffer cache is a hashmap. ↩︎

Problem set 4: Disk and buffer cache