Handout
Our handout code offers:
- A file system definition (
chickadeefs.hh
, described here and here (although note that this pset won't involve the journaling aspect of Chickadee's file system)). - A class
ahcistate
(ink-ahci.hh
) that accesses disks that speak the modern SATA standard over Intel’s AHCI. An important aspect of this standard is NCQ support, which allows the operating system to post multiple disk requests in parallel, and allows the disk to service those requests in the fastest order it can. - A class
bufcache
(ink-chkfs.hh
andk-chkfs.cc
) that implements a buffer cache. - A class
chkfsstate
(ink-chkfs.hh
andk-chkfs.cc
) that accesses thesata_disk
’s Chickadee file system via the buffer cache. - A class
chkfs_fileiter
(ink-chkfsiter.hh
) that iterates over the blocks in a ChickadeeFS file (somewhat like howvmiter
iterates over the pages in an x86-64 page table). - A class
chkfs::journalreplayer
that knows how to replay journal records. - An implementation of a system call
sys_readdiskfile
that usesbufcache
,chkfsstate
, andchkfs_fileiter
to read disk files. - Programs
p-readdiskfile.cc
andp-wcdiskfile.cc
that usesys_readdiskfile
to examine disk files.
You'll see that init_hardware
already initializes the local
SATA disk:
// initialize SATA drive
sata_disk = ahcistate::find();
if (sata_disk && sata_disk->irq_ > 0) {
cpus[ncpu - 1].enable_irq(sata_disk->irq_);
}
Try running make run-readdiskfile
! You will see that reading disk files
is painfully slow. This is because Chickadee’s buffer cache and disk
programming interfaces are woefully underpowered. In this problem set,
you will fix the buffer cache and ahcistate
; hook the file system up
to your VFS from problem set 3; and add write support.
A. Buffer cache
In this part, you’ll fix the buffer cache so it can speed up read requests. The current buffer cache immediately evicts all unreferenced blocks. Your buffer cache should not evict immediately, and it should also contain blocks that will likely be used in the future.
To implement a good buffer cache, you will need an eviction policy, which
decides which (unreferenced) block to evict when the cache is full but a new
block should be added. LRU (least recently used) is a good first choice, but
there are other good choices, such as policies that treat different kinds of
blocks differently. For example, it’s probably a good idea to always keep
the superblock in memory—or to introduce a separate variable, such as a
chkfsstate
member variable, that holds the superblock’s data. And it might
be a good idea to treat inode blocks differently from data blocks.
You will also want to raise the number of entries in the buffer cache, although your buffer cache should work for any number of entries greater than or equal to 10.
Your buffer cache must obey the invariant that only unreferenced blocks can
be evicted. That is, any block with bcentry::ref_ > 0
must not be evicted.
It must also obey the buffer cache synchronization invariants. You may
change the synchronization invariants if you’d like, but make sure to avoid
undefined behavior.
You will also want a prefetching policy, which decides which blocks to read in advance of their being needed.
Before working too much on the code, you should specify your policies in a short text document and discuss your policies with a TF. Among other things, your write-up should discuss how you will synchronize between prefetching activity and eviction activity.
Test your work first with make run-readdiskfile
, which reads a very short
file one byte at a time, and with make run-wcdiskfile
, which reads longer
files 125 bytes at a time. You can also run the readdiskfile
and
wcdiskfile
programs with arguments after make run-sh
.
Prefetching hints
Do eviction first. You must do eviction before progressing to Part B, but you can delay prefetching until much later—in fact, you can delay it until the rest of the problem set is done. Prefetching will require more code surgery than eviction.
You will need to change the ahcistate::read_or_write
function to use more
than just command slot 0. You also don’t want a process to block just because
its kernel task is prefetching data! You could implement a separate kernel
task that just does prefetching. But there are other possibilities. Consider,
for example, adding a int fetch_status_
to bcentry
, and adding a new
function to ahcistate
a signature something like this:
// returns true iff command was queued
bool read_or_write_nonblocking(idecommand cmd, void* buf, size_t sz, size_t off,
volatile int& fetch_status, /* maybe more arugments */);
You can change the existing read_or_write
function to call
read_or_write_nonblocking
.
The simplest Chickadee prefetcher will fetch at most 32 blocks at a time, one per NCQ slot on the disk. Most real operating systems instead implement a software I/O queue capable of storing many hundreds of I/O requests. As soon as the disk satisfies one request, the interrupt handler pops another I/O request off the I/O queue and adds it to the disk hardware queue. Various algorithms are used to schedule these software requests, subject to requirements such as fairness (one process’s prefetch requests should not starve another process’s requests) and priority (write requests might be more important than read requests). Think about how you would implement such a queue in Chickadee; and if you’re feeling ambitious, implement one!
B. VFS integration and reads
In this part, you’ll hook up the Chickadee file system to your VFS for reads. This takes what the VFS you developed in the last problem set and applies it to a real on-disk file system.
Background: Chickadeefs in-memory inodes
Chickadeefs inodes have a built-in synchronization plan that you should
understand. This plan involves the buffer cache and the inode field mlock
(which is only meaningful in memory).
The chkfsstate::get_inode
and put_inode
functions manipulate buffer cache
reference counts to ensure that the returned inode remains in memory. That
means that callers must call put_inode
to “free” an inode when done with
it.
The mlock
field is a read/write lock for the inode’s contents. The lock can
be held for either reading or writing. While the read lock is held, the
inode’s contents cannot change; the write lock is required to modify the
corresponding file (e.g., to add blocks to it or modify its contents). mlock
should not be held across system calls: it should be acquired immediately before a kernel
read or write operation, and released immediately after. Do not, for
example, call inode::lock_write
simply because a process opened the
corresponding file for writing. mlock
is manipulated by inode::lock_read
,
inode::lock_write
, inode::unlock_read
, and inode::unlock_write
. We
provide these functions for you.
mlock
is cleared to 0 when an inode block is read from disk; see
clean_inode_block()
.
execv
First, change your execv
implementation to use the disk file system. Model
your code on syscall_readdiskfile
. Use make run-execallocexit
—or make run-sh
, followed by some commands like echo foo
—to test your work.
You’ll likely define a new derived class of proc_loader
that calls
chkfs_fileiter::get_disk_entry
and bcentry::put
.
VFS
Next, add a new VFS type for Chickadee file system files. You will add a new
derived type of vnode. Most likely, its constructor will take a
chkfs::inode*
argument and its operations will use chkfs_fileiter
; closing
the vnode will put
the corresponding inode. Remember to think about your
synchronization plan.
open
Finally, change your open
implementation to read the Chickadee file system
rather than memfile
s. (Don’t support OF_WRITE
or OF_TRUNC
yet.)
Use make run-sh
, and then cat thoreau.txt
, to check your work. (The
thoreau.txt
file is included only on the disk file system.)
C. Simple writes and writeback
In the next parts, you’ll extend your Chickadee file system VFS to support writes. We have divided write support into multiple subparts. You don’t need to worry about file system crash safety yet. Read all of the Part C description before starting your design and implementation!
Writes
First, support opening disk files for writing (i.e., the OF_WRITE
flag
without OF_TRUNC
or OF_CREATE
), and support writing to such files.
You’ll need to change your VFS and the buffer cache to make this work. However, you don’t need to actually write to disk yet. In the first stage, writes will simply modify buffer cache contents.
In this stage, you need not support writes that grow or shrink files. There is thus no need to write inodes or to allocate or free disk blocks.
Use make run-testwritefs
to test this functionality.
Writeback
Next, support writing data back to the disk via the sys_sync
system call,
which is implemented by bufcache::sync()
.
Change the bufcache::sync()
function to write all “dirty buffers” (i.e.,
buffer cache entries whose data is modified relative to the disk version) back
to disk.
Use make cleanfs run-testwritefs
to test this functionality.
The
cleanfs
command rebuilds the disk image, which matters becausetestwritefs
assumes it starts with a clean disk image.
Writeback hints
Basics: You’ll need to track which buffers are dirty, meaning modified
in memory relative to the on-disk version. For example, you could add a state
bcentry::estate_t state_dirty
that is set when data is modified and cleared when
modifications are written back to disk. Your eviction policy from Part A must
not evict dirty entries.
Synchronization: The writeback process must write an internally consistent version of each dirty buffer, meaning that buffer contents must not be modified or freed while the buffer is in flight to the disk. This requires both delaying writeback until concurrent writes complete, and delaying new concurrent writes and evictions until writeback completes.
A simple synchronization method that works is a per-entry write reference count that can be either zero or one. Complete these functions:
// Obtain a write reference for this entry. Blocks until there are no other
// write references, then obtains a reference and returns. (Can also be a
// good place to maintain dirty flags.)
void bcentry::get_write();
// Release the write reference obtained by a prior call to `get_write()`.
void bcentry::put_write();
Your VFS will obtain a write reference before modifying a block, and release
that reference afterward. (The VFS should not hold write references long
term.) Your bufcache::sync
function will likewise obtain write references.
You may also want to add one or more wait queues to facilitate efficient
blocking.
Note that a write reference must not prevent other kernel tasks from holding normal references to the same entry (it’s not a readers/writer lock).
Dirty lists: A first attempt at a writeback procedure might look like this. (You would additionally need locking to protect access to the buffer cache and/or bufentries.)
for each bcentry in the buffer cache {
if (bcentry is dirty) {
get write reference;
write bcentry to disk, blocking until write is complete;
set state to clean and put write reference;
}
}
But walking the whole buffer cache takes O(bufcache::ne
)
time, even if very little cached data is dirty. It’s faster to maintain
a linked list of dirty bcentry
structures. The writeback procedure would
look something like this (locking still elided):
while (bcentry* e = dirty_list_.pop_front()) {
get write reference;
write bcentry to disk, blocking until write is complete;
set state to clean and put write reference;
}
But what if other processes keep writing to new blocks while the writeback
procedure is running? The writeback procedure might never reach the end of the
dirty_list_
! A simple swapping trick avoids the problem:
list<...> mydirty;
mydirty.swap(dirty_list_);
// Now `dirty_list_` is empty; `mydirty` contains its previous contents!
// Any newly-dirty blocks will be put onto `dirty_list_`. `mydirty`, a
// local variable, will not grow.
while (bcentry* e = mydirty.pop_front()) {
// ...
}
This trick might also let you release a long-term lock on the buffer cache during the sync process.
Advanced optional work: Parallel writes. All of these loops end up writing blocks one at a time. That’s OK, but can you instead use all available SATA slots for writes? That will be faster!
Invariants and concurrency in the buffer cache and the file system
When thinking about synchronization in pset4, it's helpful to consider that synchronization invariants exist at multiple levels in the kernel.
-
The buffer cache (i.e.,
bcentry
andbufcache
) deal with disk blocks. At the buffer cache level, synchronization is needed to ensure that (for example), at any given time, a particular disk block can exist in at most one location in the buffer cache. The buffer cache synchronization invariants are described here. -
You can think of the file system proper (e.g.,
chkfsstate
andchkfs_fileiter
and your Chickadee VFS implementation) as a separate piece of software that leverages the buffer cache to create higher-level semantics over disk blocks. The buffer cache is unenlightened about whether a disk block contains a superblock or inode or data block; in contrast, the file system knows about these entities, and has to enforce invariants involving those entities. An example of such an invariant is that, at any given moment, an inode can be associated with multiple concurrent reads of the file's data, or a single writer of that file's data.
So, let's consider what happens when writing a ChickadeeFS file via
the VFS layer. Ignoring a few details, the Chickadee vnode
might be
initialized with a reference to the appropriate inode
. That reference
would be generated by a call to chkfsstate::get_inode(inum_t inum)
.
Inside chkfsstate::get_inode(inum_t inum)
, several calls are made to
bufcache::get_disk_entry()
, and eventually this code executes:
auto inode_entry = bc.get_disk_entry(bn, clean_inode_block);
ino = reinterpret_cast<inode*>(inode_entry->buf_);
return ino;
Note that the call to bc::get_disk_entry()
involved the maintenance of
buffer-cache-level invariants, such as the the buffer now having a
positive bcentry::ref_ count.
This buffer-cache-level invariant
involving reference counts is helpful for implementing
buffer-cache-level behaviors like dropping buffer cache entries: an
entry is safe to drop if it has a reference count of zero. This invariant
doesn't require the buffer cache to know what is inside a particular
disk block (e.g., an inode or a data block or whatever).
So, back to the issue of a ChickadeeFS vnode
trying to write a file.
The vnode
has a reference to the appropriate inode
. The vnode
's
write()
implementation must do the following stuff:
-
Call
inode::lock_write()
to enforce the file-system-level invariant that no concurrent reads or writes can be happening to a file. -
Do some calculations to determine which disk block to issue the write to; if you're extending the file, you'll have to perform some extent management.
-
Get a pointer to the appropriate disk block to update, e.g., using
chkfs_fileiter::get_disk_entry().
[Strong hint:chkfs_fileiter
is your friend!] The only way to get a pointer to a disk block is to go through the buffer cache directly or indirectly (e.g., viachkfs_fileiter
). So, the pointer to the disk block to write to is abcentry*
. Thebcentry
will already have a positive reference count because we fetched it via the buffer cache interface (which properly increments reference counts). However, we will need to decrement the reference count at some point :-D. -
Call
bcentry::get_write()
to get exclusive write access to the block. You have to implement the code inside ofbcentry::get_write()
. At a high level, it needs to associate awrite_ref
flag with thebcentry
, such that the write reference can be taken (and thewrite_ref
flag flipped) only if the flag is not currently held; otherwise, the task should block until theflag
can be flipped. Note that acquisition of a write reference should not prevent other processes from getting normal references to thebcentry
via buffer cache interfaces! This is what the pset4 description means when it says that "a write reference must not prevent other kernel tasks from holding normal references to the same entry (it’s not a readers/writer lock)." [Something to consider: How does callingbcentry::get_write()
interact with a concurrent attempt by another process to callbufcache::sync()
?] -
Now that you have a write reference, you can do what needs to be done to update the buffer. So, do this stuff.
-
Release the write reference on the buffer by calling
bcentry::put_write()
. [Things to consider: Can the buffer cache now flush thebcentry
to disk? Can another process issue a write to the file, or does that process have to wait for something else to happen?] -
Call
bcentry::put()
to decrement the reference count (which is important for maintaining buffer-cache-level semantics). -
Call
inode::unlock_write().
Nota bene that, depending on your implementation, you will likely need to do additional stuff than what is described above. However, this overview hopefully provides more insights into the interactions between the buffer cache and the file system code.
Debugging writeback errors
Some writeback errors, such as checksum failures, can be hard to debug from
within the operating system. Use our helper programs to investigate the disk’s
state from outside. For instance, say you experience a checksum failure on
file thoreau.txt
. Try the following command:
$ obj/chickadeefsck -e thoreau.txt chickadeefs.img | less
The -e thoreau.txt
option tells chickadeefsck
to search the file system
image for the thoreau.txt
file, and write its contents to standard output.
Look through the contents; do they look reasonable? For instance, do they
contain any weird-looking sequences of ^@
(null characters)?
If the contents look unreasonable, try using hexdump
to drill down further.
$ obj/chickadeefsck -e thoreau.txt chickadeefs.img | hexdump -C | less
Do problems appear to start on a page boundary (an offset multiple of 0x1000 == 4096)? If so, at which offset? Can you add printouts to your write code that trigger near that problematic offset?
Use the -V
option to investigate the file system’s specific layout,
including the extents used by each file. This can help you discover which data
block corresponds to some offset in a file.
$ obj/chickadeefsck -V chickadeefs.img
journal: no metablocks found
inode 1 @root directory: size 4096, type directory, nlink 1
[0]: extent 177+1
#0 "kernel": inode 2
#1 "emerson.txt": inode 3
#2 "dickinson.txt": inode 4
#3 "thoreau.txt": inode 5
...
#31 "wcdiskfile": inode 33
inode 2 @kernel: size 355512, type data, nlink 1
[0]: extent 16+87
inode 3 @emerson.txt: size 130, type data, nlink 1
[0]: extent 103+1
inode 4 @dickinson.txt: size 649, type data, nlink 1
[0]: extent 104+1
inode 5 @thoreau.txt: size 685, type data, nlink 1
[0]: extent 105+1
inode 6 @wheatley.txt: size 1958, type data, nlink 1
[0]: extent 106+1
inode 7 @allocator: size 6432, type data, nlink 1
[0]: extent 107+2
...
D. File sizes and extension
In this section, you’ll support making files bigger and smaller and seeking
within files. Use make cleanfs run-testwritefs2
to test.
OF_TRUNC
Implement the OF_TRUNC
flag to sys_open
. When OF_TRUNC
and OF_WRITE
are both given, the file size should be set to zero.
This step does not actually require that you free any extents. A ChickadeeFS inode can have more or fewer data blocks allocated than its size would indicate (link). But the tests will ensure that you mark modified inode blocks as dirty.
sys_lseek
Implement the sys_lseek
system call, which changes a file’s position and
returns the new position.
Chickadee’s sys_lseek
should behave generally like Linux’s lseek
.
sys_lseek(fd, off, LSEEK_SET)
sets the file position to off
;
sys_lseek(fd, off, LSEEK_CUR)
should adjust the file position by off
; and
sys_lseek(fd, off, LSEEK_END)
should set the file position to the end of the
file plus off
. In all these cases, the system call returns the new file
position. Chickadee also supports sys_lseek(fd, off, LSEEK_SIZE)
, which
returns the file’s size without changing the position (off
is ignored).
Return E_SPIPE
for unseekable files, such as pipes and the keyboard/console.
Unlike Linux, you only need to support seeks within a file for now: you may
return E_INVAL
for seeks outside a file’s allocated size.
File extension
Finally, implement file extension. Writing additional data to the end of file should make the file bigger. This is a big deal!
Use make cleanfs run-testwritefs2 fsck
to test. When testwritefs2
completes successfully, press q
to quit QEMU; then Chickadee’s file system
checker should run. You should see something like this in the terminal:
kohler@ubuntu:~/chickadee$ make cleanfs run-testwritefs2 fsck
RM chickadeefs.img
CREATE chickadeefs.img
* Run `gdb -x build/chickadee.gdb` to connect gdb to qemu.
QEMU chickadeeboot.img
warning: TCG doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5]
warning: TCG doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5]
unknown keycodes `(unnamed)', please report to qemu-devel@nongnu.org
FSCK chickadeefs.img
* File system OK
File extension hints
-
Complete the
chkfsstate::allocate_extent()
function so it allocates a free extent. This function will load the free block bitmap into the buffer cache; find a contiguous range of 1 bits (indicating a free extent), likely usingbitset_view
; set those bits to 0; and return the first block number in the extent.Although ChickadeeFS can support more than one FBB block, one FBB block suffices to handle file systems with up to 215 blocks (128 MiB). It’s OK to not handle larger file systems.
-
Your buffer cache may fill up with dirty blocks. One way to handle this case: if
bufcache::get_disk_entry
cannot allocate an entry but observed an unreferenced dirty entry, it could release its locks, callsync
, and then try again.
E. Creating files
Add support for the OF_CREATE
flag to sys_open
. When specified in
combination with OF_WRITE
, this flag should create a file of the supplied
name if no such file exists.
To create a file, you’ll need to:
- Find an empty direntry in the directory. This may require extending the size of the directory, which may require allocating one or more blocks (just as in extending files).
- Allocate an inode.
- Store the inode number and filename in the direntry.
- Mark the inode as allocated (set its type, size, and
nlink
count).
Use make cleanfs run-testwritefs3 fsck
to test your work. You should also
now be able to do cool things like create new files from your shell.
F. Optional features
Want to extend your file system further? There’s a lot of stuff you can add if you want and have time.
- Add support for the
sys_unlink
system call, which removes a file. Usemake cleanfs run-testwritefs4 fsck
to test your work. - Add support for the
sys_rename
system call, which changes a file’s name. - Add subdirectory support:
sys_mkdir
andsys_rmdir
system calls, and support for moving files between directories. - Add support for special kinds of file, such as
/dev/null
or/dev/random
.
To receive extra credit for implementing one or more of these features,
you must create new test programs (e.g., p-testmkrmdir.cc
) which
demonstrate that your kernel code is correct!
Turnin
Fill out psets/pset4answers.md
and psets/pset4collab.md
and push to
GitHub. Then submit on the grading server.
Checkin: You should complete all parts by 11:59pm on Monday 4/4.