- A. Buffer cache eviction
- B. VFS integration and reads
- C. Simple writes and writeback
- D. File sizes and extension
- E. Creating files
Turnin
Fill out psets/pset4answers.md
and psets/pset4collab.md
and push to
GitHub. Then submit on the grading server.
You should complete all parts by 11:59pm on Wednesday 4/3.
Handout
Our handout code offers:
- A file system definition (
chickadeefs.hh
), described here.chickadeefs.hh
also defines the Chickadee file system journal format, which is not part of this pset.
- A class
ahcistate
(ink-ahci.hh
) that accesses disks that speak the SATA standard over Intel’s AHCI. An important aspect of this standard is NCQ support, which allows the operating system to post multiple disk requests in parallel, and allows the disk to service those requests in the fastest order it can. - A class
bufcache
(ink-chkfs.hh
andk-chkfs.cc
) that implements a buffer cache.- The buffer cache is a fully-associative cache with ten slots of type
bcslot
. (A full-featured operating system would have millions of slots in its buffer cache; we set it to 10 for testing purposes.)
- The buffer cache is a fully-associative cache with ten slots of type
- A class
chkfsstate
(ink-chkfs.hh
andk-chkfs.cc
) that accesses thesata_disk
’s Chickadee file system via the buffer cache. - A class
chkfs_fileiter
(ink-chkfsiter.hh
) that iterates over the blocks in a ChickadeeFS file (somewhat like howvmiter
iterates over the pages in an x86-64 page table). - A class
chkfs::journalreplayer
that knows how to replay journal records. - An implementation of a system call
sys_readdiskfile
that usesbufcache
,chkfsstate
, andchkfs_fileiter
to read disk files. - Programs
p-readdiskfile.cc
andp-wcdiskfile.cc
that usesys_readdiskfile
to examine disk files.
You'll see that init_hardware
already initializes the local SATA disk:
// initialize SATA drive
sata_disk = ahcistate::find();
if (sata_disk && sata_disk->irq_ > 0) {
cpus[ncpu - 1].enable_irq(sata_disk->irq_);
}
Try running make run-readdiskfile
! You will see that reading disk files is
painfully slow. This is because Chickadee’s buffer cache stinks. In this
problem set, you will fix the buffer cache and ahcistate
; hook the file
system up to your VFS from problem set 3; and add write support.
Background: Buffer cache smart pointers
All Chickadee kernel disk accesses use the buffer cache, which holds in-memory versions of blocks stored on disk. The buffer cache is a fully-associative cache: each of its slots might hold any disk block.1 Over time, entries will be loaded, evicted, and then later reused, possibly for different blocks. This introduces synchronization issues: the buffer cache must not overwrite an entry that a kernel task is actively using. Chickadee therefore tracks which entries are currently in use using reference counts.
Reference counting can be tedious, so Chickadee represents references to buffer cache slots with smart pointers. A smart pointer can be treated like a regular pointer, but it also holds a reference on the relevant slot. When the smart pointer goes out of scope, it automatically decrements the corresponding reference count. (This resembles the way spinlock guards release a spinlock when they go out of scope.)
The bufcache::load
function, which loads a disk block into the buffer cache,
returns a bcref
object, which is a smart pointer to a bcslot
. Here’s how
you can use the bcref
:
bcref slot = bufcache::get().load(5); // obtain reference to slot holding block 5
if (!slot) { // test if slot loaded successfully
panic();
}
assert(slot->bn_ == 5); // can treat `slot` like a pointer to `bcslot`
bcslot* slotptr = slot.get(); // can get true pointer to underlying `bcslot`
However, you cannot copy a bcref
object. The bcref
represents ownership of
a reference, and ownership can’t be copied. This will fail:
bcref slot_copy = slot; // fails at compile time
You can, however, transfer ownership using a C++ construct called
std::move
. After transferring ownership, the original bcref
will become
empty:
bcref new_owner{std::move(slot)}; // `std::move` transfers ownership
assert(new_owner->bn_ == 5); // now `new_owner` has the reference
assert(!slot); // and `slot` is empty
bcref
semantics resemble Rust ownership in some ways.
Background: In-memory inodes
The ChickadeeFS physical file system design features inodes, which are structures that represent a file’s contents (size, extents, etc.) independent of the file name used to access that file. Your virtual file system will find it useful to work with inodes as well. The vnode structure for an open ChickadeeFS file will typically contain a reference to the corresponding ChickadeeFS inode. While the file remains open, the buffer cache slot containing that inode must remain in memory.
Another smart pointer class, chkfs_iref
, represents references to
ChickadeeFS inodes. The chkfsstate::inode()
and chkfsstate::lookup_inode()
functions return chkfs_iref
objects. A chkfs_iref
encapsulates a reference
to the corresponding buffer cache slot, and when the chkfs_iref
goes out of
scope, it automatically decrements the corresponding slot’s reference count.
As with bcref
, chkfs_iref
objects cannot be copied, but they can be moved.
Since chkfs_iref
objects hold references to buffer cache slots, you can
obtain a pointer to the corresponding slot with the iref->slot()
function.
It’s always true that iref->slot()->contains(iref.get())
.
For ease of programming reasons,
chkfs::inode
defines some functions that are only meaningful on inodes in the buffer cache, and the ChickadeeFS inode structure reserves some space for locking and reference information that’s only meaningful in memory. A production kernel would generally implement a layer of indirection that separated on-disk inode structures from in-memory structures.
Background: Locks and buffer cache data
Smart pointers automate reference counting for buffer cache slots. While a buffer cache slot reference exists, some aspects of the slot are constant, and can be referenced without a lock:
slot->bn_
, the corresponding block numberslot->buf_
, the location of the memory page holding the block data
Other invariants also hold; for instance, a slot must not be evicted while a
reference is held, so slot->state_
will be either s_clean
or s_dirty
.
But the contents of the slot->buf_
memory page are not constant, and as
you implement file system writing, you will need to add mutual-exclusion locks
to your code to protect access to slot->buf_
and to inode contents.
Your code should use three related sets of locking calls. Inodes have a readers–writer lock, and buffer cache slots have a mutual exclusion lock for their contents.
iref->lock_read()
acquires a read lock on achkfs_iref
. This read lock should prevent other tasks from writing the inode or modifying the contents of the corresponding file. For instance, you should acquire a read lock before reading an inode’s size or inode’s extents, and/or reading the corresponding data blocks.iref->unlock_read()
releases a read lock previously acquired withlock_read()
.- Several tasks can simultaneously hold read locks on the same inode.
iref->lock_write()
acquires a write lock on achkfs_iref
. This write lock should prevent other tasks from writing or reading the inode or associated file contents. You must acquire a write lock before changing an inode’s data or modifying the corresponding data blocks.iref->unlock_write()
releases a write lock previously acquired withlock_write()
.- While any task holds a write lock on an inode, no other task can hold a read or write lock on the same inode.
- It is impossible to upgrade a read lock to a write lock, so no task
should call
iref->lock_read(); iref->lock_write()
on the same inode (that will cause deadlock).
slot->lock_buffer()
acquires a mutual-exclusion lock corresponding to the contents of a buffer cache slot (the “contents lock”). For instance, you must calllock_buffer()
before modifying inode data or a file data block.slot->unlock_buffer()
releases a lock previously acquired withslot->lock_buffer()
.slot->lock_buffer()
also changes the slot’sstate_
tos_dirty
.- If
slot
was obtained via aniref
, and thatiref
was locked for reading, then it is not necessary to callslot->lock_buffer()
before reading the underlying contents.
Acquiring these locks might yield control of the CPU. All of these locks do
not disable interrupts (so there’s no irqstate
involved). Furthermore, all
of these locks require the caller to hold a reference to the corresponding
inode or slot. A task may yield or block while holding an inode or contents
lock (but it should not do so forever!).
In order to modify the contents of an inode, you need both the inode’s write lock and the underlying contents lock:
// increment an inode’s link count
iref->lock_write();
iref->slot()->lock_buffer(); // `iref->slot()` is the slot containing this inode block
++iref->nlink;
iref->slot()->unlock_buffer();
iref->unlock_write();
// erase first data block
iref->lock_write();
{
auto slot = chkfs_fileiter(iref.get()).load();
slot->lock_buffer();
memcpy(slot->buf_, 0, chkfs::blocksize);
slot->unlock_buffer();
}
iref->unlock_write();
Both kinds of lock are necessary because of write-back caching in Part
C. When your write-back caching code wants to write a dirty block to
the disk, it will call lock_buffer()
on the corresponding slot. This ensures
that the block data remains unchanged while the block is in flight to the
disk, and thus that the disk image stays consistent. The write-back caching
code must use the slot lock and not an inode lock because (1) not all blocks
have a corresponding inode, and (2) there is no way to find the inode that
points to a given data-block slot.
That said, assuming you use inode locks correctly, any inode lock makes the corresponding buffer data safe to read. The write-back process does not modify buffer data.
Inode locks precede contents locks in the lock order.
Inode locks are more finely grained than contents locks. An inode block contains 64 inodes, and each of those inodes is in a distinct region of memory. Therefore, it’s safe to read the data in one inode even while another inode is being modified. The fine granularity is helpful because a single inode block (and, therefore, buffer cache slot) contains up to 64 independent inodes.
A. Buffer cache eviction
In this part, you’ll fix the buffer cache so it can speed up read requests. The current buffer cache immediately evicts all unreferenced blocks. Your buffer cache should not evict immediately, thus allowing some read requests to succeed without accessing the disk.
The code that implements immediate eviction is in
bcslot::decrement_reference_count()
:
spinlock_guard guard(lock_); // needed in case we `clear()`
assert(ref_ != 0);
if (--ref_ == 0) {
clear();
}
That is, as soon as a slot becomes unreferenced, it is cleared. Your first step will be to change this code to just decrement the reference count:
assert(ref_ != 0);
--ref_;
After that simple change, the bufcache::load()
function will be able to find
and reuse unreferenced slots containing previously-loaded blocks, and things
will run much faster.
But don’t stop there! At this point, slots are never evicted, so only a
handful of reads will succeed before the buffer cache fills up with garbage.
You need an eviction policy, which selects unreferenced block(s) to evict
when the cache is full. That policy can be located in its own function, but
you will likely call it from bufcache::load
.
Which eviction policy should you choose? At first, keep it simple. You must
obey the invariant that only clean, unreferenced blocks may be
evicted—only blocks with ref_ == 0 && state_ == s_clean
are safe to clear.
You must also obey the buffer cache synchronization
invariants (although you may
add to or change these). Other than that, pick any block you’d like, such as
the first unreferenced block in the cache.
Test your work first with make run-readdiskfile
, which reads a very short
file one byte at a time, and with make run-wcdiskfile
, which reads longer
files 125 bytes at a time. You can also run the readdiskfile
and
wcdiskfile
programs with arguments after make run-sh
.
Optimizations
Once you’ve got a simple eviction policy working, consider improving it.
- You might want to treat different kinds of blocks in different ways. For example, you could always keep the superblock in memory, or you could treat inode blocks differently from data blocks.
- Implement a least recently used policy or another policy of your choice.
- Cache important data outside the buffer cache. For example, you could
store a copy of the superblock in a member variable of
chkfsstate
. - A simple improvement is to raise the number of slots in the buffer cache. However, your buffer cache must work correctly for any number of slots greater than or equal to 10. (The small slot count stresses your eviction logic.)
B. VFS integration and reads
In this part, you’ll hook up the Chickadee file system to your VFS for reads. This takes what the VFS you developed in the last problem set and applies it to a real on-disk file system.
execv
First, change your execv
implementation to use the disk file system. Model
your code on syscall_readdiskfile
. Use make run-execallocexit
—or make run-sh
, followed by some commands like echo foo
—to test your work.
You’ll likely define a new derived class of proc_loader
that calls
chkfs_fileiter::load
.
VFS
Next, add a new VFS type for Chickadee file system files. You will add a new
derived type of vnode. Most likely, its constructor will take a chkfs_iref
argument (and use std::move
) and its operations will use chkfs_fileiter
;
closing the vnode must dereference the corresponding inode (likely by
destroying the chkfs_iref
member variable). Remember to think about your
synchronization plan.
open
Finally, change your open
implementation to read the Chickadee file system
rather than memfile
s. (Don’t support OF_WRITE
or OF_TRUNC
yet.)
Use make run-sh
, and then cat thoreau.txt
, to check your work. (The
thoreau.txt
file is included only on the disk file system.)
C. Simple writes and writeback
In the next parts, you’ll extend your Chickadee file system VFS to support writes. You don’t need to worry about file system crash safety yet.
Writes
First, support opening disk files for writing (i.e., the OF_WRITE
flag
without OF_TRUNC
or OF_CREATE
), and support writing to such files.
You’ll need to change your VFS and the buffer cache to make this work. However, you don’t need to actually write to disk yet. In the first stage, writes will simply modify buffer cache contents.
In this stage, you need not support writes that grow or shrink files. There is thus no need to write inodes or to allocate or free disk blocks. You will need to implement synchronization; see the background sections above.
Use make run-testwritefs
to test this functionality.
Writeback
Next, support writing data back to the disk via the sys_sync
system call,
which is implemented by bufcache::sync()
.
Change the bufcache::sync()
function to write all dirty slots (state_ == s_dirty
) back to disk, marking those slots as clean when it completes.
Use make cleanfs run-testwritefs fsck
to test this functionality.
The
cleanfs
command rebuilds the disk image, which matters becausetestwritefs
assumes it starts with a clean disk image.
Writeback hints
A slot is called dirty when it has been modified relative to the on-disk
version. In our handout code, the bcslot::state_
member is s_dirty
for
possibly-dirty slots, and the bcslot::lock_buffer()
locking function marks a
slot as dirty.
The writeback process must write an internally consistent version of each
dirty buffer, meaning that buffer contents must not be modified or freed while
the buffer is in flight to the disk. This requires both delaying writeback
until concurrent writes complete, and delaying new concurrent writes and
evictions until writeback completes. The bcslot::lock_buffer()
function can
be useful here; you may also want to add one or more wait queues to facilitate
efficient blocking.
A first attempt at a writeback procedure might look like this. (You would additionally need locking to protect access to the buffer cache and/or slots.)
for each slot in the buffer cache {
if (slot->state_ == s_dirty) {
slot->lock_buffer();
write block to disk, blocking until write is complete;
slot->state_ = s_clean;
slot->unlock_buffer();
}
}
But walking the whole buffer cache takes O(bufcache::ne
) time, even if
very little cached data is dirty. It’s faster to maintain a linked list of
dirty slots. The writeback procedure would look something like this (locking
still elided):
while (slot = dirty_list_.pop_front()) {
slot->lock_buffer();
write block to disk, blocking until write is complete;
slot->state_ = s_clean;
slot->unlock_buffer();
}
But what if other processes keep writing to new blocks while the writeback
procedure is running? The writeback procedure might never reach the end of the
dirty_list_
! A simple swapping trick avoids the problem:
list<...> mydirty;
mydirty.swap(dirty_list_);
// Now `dirty_list_` is empty; `mydirty` contains its previous contents!
// Any newly-dirty blocks will be put onto `dirty_list_`. `mydirty`, a
// local variable, will not grow.
while (bcslot* e = mydirty.pop_front()) {
// ...
}
This trick might also let you release a long-term lock on the buffer cache during the sync process.
Advanced optional work: Parallel writes. All of these loops end up writing blocks one at a time. That’s OK, but can you instead use all available SATA slots for writes? That will be faster!
Debugging writeback errors
Some writeback errors, such as checksum failures, can be hard to debug from
within the operating system. Use our helper programs to investigate the disk’s
state from outside. For instance, say you experience a checksum failure on
file thoreau.txt
. Try the following command:
$ obj/chickadeefsck -e thoreau.txt chickadeefs.img | less
The -e thoreau.txt
option tells chickadeefsck
to search the file system
image for the thoreau.txt
file, and write its contents to standard output.
Look through the contents; do they look reasonable? For instance, do they
contain any weird-looking sequences of ^@
(null characters)?
If the contents look unreasonable, try using hexdump
to drill down further.
$ obj/chickadeefsck -e thoreau.txt chickadeefs.img | hexdump -C | less
Do problems appear to start on a page boundary (an offset multiple of 0x1000 == 4096)? If so, at which offset? Can you add printouts to your write code that trigger near that problematic offset?
Use the -V
option to investigate the file system’s specific layout,
including the extents used by each file. This can help you discover which data
block corresponds to some offset in a file.
$ obj/chickadeefsck -V chickadeefs.img
superblock: nblocks 32768, ninodes 896
blocks: self 1, swap 0, fbb 1, inodes 14, data 32688, journal 64
journal: no metablocks found
inode 1 (root directory): size 4096, type directory, nlink 1
[0]: extent 177+1
dirent 0: name kernel, inode 2
dirent 1: name emerson.txt, inode 3
dirent 2: name dickinson.txt, inode 4
dirent 3: name thoreau.txt, inode 5
...
dirent 31: name wcdiskfile, inode 33
inode 2 (kernel): size 355512, type data, nlink 1
[0]: extent 16+87
inode 3 (emerson.txt): size 130, type data, nlink 1
[0]: extent 103+1
inode 4 (dickinson.txt): size 649, type data, nlink 1
[0]: extent 104+1
inode 5 (thoreau.txt): size 685, type data, nlink 1
[0]: extent 105+1
inode 6 (wheatley.txt): size 1958, type data, nlink 1
[0]: extent 106+1
inode 7 (allocator): size 6432, type data, nlink 1
[0]: extent 107+2
...
D. File sizes and extension
In this section, you’ll support making files bigger and smaller and seeking
within files. Use make cleanfs run-testwritefs2
to test.
OF_TRUNC
Implement the OF_TRUNC
flag to sys_open
. When OF_TRUNC
and OF_WRITE
are both given, the file size should be set to zero.
This step does not actually require that you free any extents. A ChickadeeFS inode can have more or fewer data blocks allocated than its size would indicate (link). But the tests will ensure that you mark modified inode blocks as dirty.
sys_lseek
Implement the sys_lseek
system call, which changes a file’s position and
returns the new position.
Chickadee’s sys_lseek
should behave generally like Linux’s lseek
.
sys_lseek(fd, off, LSEEK_SET)
sets the file position to off
;
sys_lseek(fd, off, LSEEK_CUR)
should adjust the file position by off
; and
sys_lseek(fd, off, LSEEK_END)
should set the file position to the end of the
file plus off
. In all these cases, the system call returns the new file
position. Chickadee also supports sys_lseek(fd, off, LSEEK_SIZE)
, which
returns the file’s size without changing the position (off
is ignored).
Return E_SPIPE
for unseekable files, such as pipes and the keyboard/console.
Unlike Linux, you only need to support seeks within a file for now: you may
return E_INVAL
for seeks outside a file’s allocated size.
File extension
Finally, implement file extension. Writing additional data to the end of file should make the file bigger. This is a big deal!
Use make cleanfs run-testwritefs2 fsck
to test. When testwritefs2
completes successfully, press q
to quit QEMU; then Chickadee’s file system
checker should run. You should see something like this in the terminal:
kohler@ubuntu:~/chickadee$ make cleanfs run-testwritefs2 fsck
RM chickadeefs.img
CREATE chickadeefs.img
* Run `gdb -x build/chickadee.gdb` to connect gdb to qemu.
QEMU chickadeeboot.img
FSCK chickadeefs.img
* File system OK
File extension hints
-
Complete the
chkfsstate::allocate_extent()
function so it allocates a free extent. This function will load the free block bitmap into the buffer cache; find a contiguous range of 1 bits (indicating a free extent), likely usingbitset_view
; set those bits to 0; and return the first block number in the extent.Although ChickadeeFS can support more than one FBB block, one FBB block suffices to handle file systems with up to 215 blocks (128 MiB). It’s OK to not handle larger file systems.
-
Your buffer cache may fill up with dirty blocks. One way to handle this case: if
bufcache::load
cannot allocate an entry but observed an unreferenced dirty entry, it could release its locks, callsync
, and then try again.
E. Creating files
Add support for the OF_CREATE
flag to sys_open
. When specified in
combination with OF_WRITE
, this flag should create a file of the supplied
name if no such file exists.
To create a file, you’ll need to:
- Find an empty direntry in the directory. This may require extending the size of the directory, which may require allocating one or more blocks (just as in extending files).
- Allocate an inode.
- Store the inode number and filename in the direntry.
- Mark the inode as allocated (set its type, size, and
nlink
count).
Use make cleanfs run-testwritefs3 fsck
to test your work. You should also
now be able to do cool things like create new files from your shell.
Optional: Prefetching
Another way to speed up reads is to update the buffer cache to load blocks that will likely be used in the future. This will require a prefetching policy, which decides which blocks to read in advance of their being needed.
Prefetching is very useful, but may require more code surgery than eviction.
You will need to change the ahcistate::read_or_write
function to use more
than just command slot 0. You also don’t want a process to block just because
its kernel task is prefetching data! You could implement a separate kernel
task that just does prefetching, but there are other possibilities. Consider,
for example, adding a int fetch_status_
to bcslot
, and adding a new
function to ahcistate
a signature something like this:
// returns true iff command was queued
bool read_or_write_nonblocking(idecommand cmd, void* buf, size_t sz, size_t off,
volatile int& fetch_status, /* maybe more arguments */);
You can change the existing read_or_write
function to call
read_or_write_nonblocking
.
The simplest Chickadee prefetcher will fetch at most 32 blocks at a time, one per NCQ slot on the disk. Most real operating systems instead implement a software I/O queue capable of storing many hundreds of I/O requests. As soon as the disk satisfies one request, the interrupt handler pops another I/O request off the I/O queue and adds it to the disk hardware queue. Various algorithms are used to schedule these software requests, subject to requirements such as fairness (one process’s prefetch requests should not starve another process’s requests) and priority (write requests might be more important than read requests). Think about how you would implement such a queue in Chickadee; and if you’re feeling ambitious, implement one!
Optional: More file system features
Want to extend your file system further? There’s a lot of stuff you can add if you want and have time.
- Add support for the
sys_unlink
system call, which removes a file. Usemake cleanfs run-testwritefs4 fsck
to test your work. - Add support for the
sys_rename
system call, which changes a file’s name. - Add subdirectory support:
sys_mkdir
andsys_rmdir
system calls, and support for moving files between directories. - Add support for special kinds of file, such as
/dev/null
or/dev/random
.
To receive extra credit for implementing one or more of these features,
you must create new test programs (e.g., p-testmkrmdir.cc
) which
demonstrate that your kernel code is correct!
-
Yes, this means the buffer cache is a hashmap. ↩︎