Handout
Our handout code offers:
- A file system definition (
chickadeefs.hh
, described here). - A class
chickadeefs::journalreplayer
that knows how to replay journal records. - A class
ahcistate
(ink-devices.hh
) that interfaces to disks that speak the modern SATA standard over Intel’s AHCI. An important aspect of this standard is NCQ support, which allows the operating system to post multiple disk requests in parallel, and allows the disk to service those requests in the fastest order it can. - A class
bufcache
(ink-chkfs.hh
andk-chkfs.cc
) that implements a buffer cache. - A class
chkfsstate
(ink-chkfs.hh
andk-chkfs.cc
) that accesses thesata_disk
’s Chickadee file system via the buffer cache. - A class
chkfs_fileiter
(ink-chkfsiter.hh
) that iterates over the blocks in a ChickadeeFS file (somewhat like howvmiter
iterates over the pages in an x86-64 page table). - A function
chickadeefs_read_file_data
(defined ink-chkfs.cc
) that usesbufcache
,chkfsstate
, andchkfs_fileiter
to read disk files. - A system call implementation of a new system call,
sys_readdiskfile
, that callschickadeefs_read_file_data
. - A program
p-readdiskfile.cc
that usessys_readdiskfile
to read the contents of a disk file.
Before you do anything else, uncomment this line in init_hardware
:
/* sata_disk = ahcistate::find(); */
Then give make run-readdiskfile
a try!
You will see that reading disk files is woefully slow. This is because
Chickadee’s buffer cache and disk programming interfaces are woefully
underpowered. In this problem set, you will fix the buffer cache and
ahcistate
; hook the file system up to your VFS from problem set 3;
and add write support.
A. Buffer cache
In this part, you’ll make a real buffer cache capable of speeding up read requests. The current buffer cache only contains blocks that are currently being used: as soon as a block becomes unreferenced, it is evicted. Your buffer cache should not evict immediately, and it should also contain blocks that will likely be used in the future.
To implement a good buffer cache, you will need an eviction policy, which
decides which (unreferenced) block to evict when the cache is full but a new
block should be added. LRU (least recently used) is a good first choice, but
there are other good choices, such as policies that treat different kinds of
blocks differently. (For example, it’s probably a good idea to always keep
the superblock in memory—or to introduce a separate variable, such as a
chkfsstate
member variable, that holds the superblock’s data. And it might
be a good idea to treat inode blocks differently from data blocks.)
You will also want a prefetching policy, which decides which blocks to read in advance of their being needed.
You will also want to raise the number of entries in the buffer cache, although your buffer cache should work for any number of entries greater than or equal to 10.
Your buffer cache must obey the invariant that only unreferenced blocks can
be evicted. That is, any block with bufentry::ref_ > 0
must not be evicted.
Before working too much on the code, you should specify your policies in a short text document and talk over your policies with a TF.
Test your work first with make run-readdiskfile
, which reads a very short
file one byte at a time, and with make run-wcdiskfile
, which reads longer
files 125 bytes at a time. You can also run the readdiskfile
and
wcdiskfile
programs with arguments after make run-sh
.
Hints
Prefetching will require more code surgery than eviction. You will need to
change the ahcistate::read_or_write
function to use more than just command
slot 0. You also don’t want a process to block just because its kernel task
is prefetching data!
You could implement a separate kernel task that just does prefetching. But
there are other possibilities. Consider, for example, adding a volatile
int fetch_status_
to bufentry
, and adding a new function to ahcistate
with the following signature:
// returns true iff command was queued
bool read_or_write_nonblocking(idecommand cmd, void* buf, size_t sz,
size_t off, volatile int* status);
You can change the existing read_or_write
function to call
read_or_write_nonblocking
.
The simplest Chickadee prefetcher will fetch at most 32 blocks at a time, one per NCQ slot on the disk. Most real operating systems instead implement a software I/O queue capable of storing many hundreds of I/O requests. As soon as the disk satisfies one request, the interrupt handler pops another I/O request off the I/O queue and adds it to the disk hardware queue. Various algorithms are used to schedule these software requests, subject to requirements such as fairness (one process’s prefetch requests should not starve another process’s requests) and priority (write requests might be more important than read requests). Think about how you would implement such a queue in Chickadee; and if you’re feeling ambitious, implement one!
B. VFS integration
In this part, you’ll hook up the Chickadee file system to your VFS for reads. This takes what the VFS you developed in the last problem set and applies it to a real on-disk file system.
Background: Chickadeefs in-memory inodes
Chickadeefs inodes have a built-in synchronization plan that you should
understand. This plan involves the buffer cache and two inode fields, mlock
and mref
, that are only meaningful in memory.
The chkfsstate::get_inode
and put_inode
functions manipulate buffer cache
reference counts to ensure that the returned inode remains in memory. That
means that callers must call put_inode
to “free” an inode when done with
it.
The mlock
field is a read/write lock for the inode’s contents. The lock can
be held for either reading or writing. While the read lock is held, the
inode’s contents cannot change; the write lock is required to modify the
corresponding file (e.g., to add blocks to it or modify its contents).
mlock
should not be held for long; it should be acquired immediately before
a kernel read or write operation, and released immediately after. Do not,
for example, call inode::lock_write()
simply because a process opened the
corresponding file for writing. mlock
is manipulated by inode::lock_read
,
inode::lock_write
, inode::unlock_read
, and inode::unlock_write
.
The mref
field is available for your use. For instance, you can use mref
to store the address of the in-memory vnode corresponding to the inode.
(mref
is only 32 bits wide, so you might store the address with ino->mref =
uint32_t(ka2pa(vnode))
and extract the address with something like
pa2ka<vnode_type*>(ino->mref)
.) Or you can use it as a reference count,
indicating that some operating system task or structure holds a reference to
the inode and that the inode should not be deleted or reused. (You’d add
++ino->mref
to chkfsstate::get_inode
and --ino->mref
to
chkfsstate::put_inode
).
Both fields are cleared to 0 when an inode block is read.
execv
First, change your execv
implementation to use the disk file system, rather
than memfile
s. Use make run-execallocexit
—or make run-sh
, followed by
some commands like echo foo
—to test your work.
We’ve made some changes to the process loader interface to make this easier.
You’ll likely define a new derived class of proc::loader
; your derived
class will call chkfsstate::get_data_block
and bufcache::put_block
.
VFS
Next, add a new VFS type for Chickadee file system files. You will add a new
derived type of vnode. This vnode will hold a pointer to a
chickadeefs::inode
, and will use chkfs_fileiter
and/or chkfsstate
functions to read from it. Closing the vnode will put_inode
the
corresponding inode.
open
Finally, change your open
implementation to read the Chickadee file system
rather than memfile
s.
Use make run-sh
, and then cat thoreau.txt
, to check your work. (The
thoreau.txt
file is included only on the disk file system.)
C. Simple writes and writeback
In the next parts, you’ll extend your Chickadee file system VFS to support writes. We have divided write support into multiple subparts, each of which should take several days. You don’t need to worry too much about file system crash safety yet.
Writes
First, support opening disk files for writing (i.e., the OF_WRITE
flag
without OF_TRUNC
or OF_CREAT
), and support writing to such files.
You’ll need to change your VFS and the buffer cache to make this work. However, you don’t need to actually write to disk yet. In the first stage, writes will simply modify buffer cache contents.
In this stage, you need not support writes that grow or shrink files. There is thus no need to write inodes or to allocate or free disk blocks.
The initial part of make run-testwritefs
tests this functionality.
Writeback
Next, support writing data back to the disk via the sys_sync
system call,
which is implemented by bufcache::sync()
.
Change the bufcache::sync()
function to write all “dirty buffers” (i.e.,
buffer cache entries whose data is modified relative to the disk version) back
to disk, at the proper locations.
Use make cleanfs run-testwritefs
to test this functionality.
The
cleanfs
command rebuilds the disk image, which matters becausetestwritefs
assumes it starts with a clean disk image.
Writeback hints
Basics: You’ll need to track which buffers are dirty, meaning modified
in memory relative to the on-disk version. For example, you could add a flag
bufentry::f_dirty
that is set when data is modified and cleared when
modifications are written back to disk. Your eviction policy from Part A must
not evict dirty buffers.
Synchronization: The writeback process must write an internally consistent version of each dirty buffer, meaning that buffer contents must not be modified or freed while the buffer is in flight to the disk. This requires both delaying writeback until concurrent writes complete, and delaying new concurrent writes and evictions until writeback completes.
A simple synchronization method that works is a per-bufentry write reference count that can be either zero or one. Consider completing these functions:
// Obtain a write reference on `e`. Blocks until there are no other write
// references, then obtains a reference. (Can also be a good place to
// maintain dirty flags.)
void bufcache::get_write(bufentry* e);
// Release the write reference on `e` obtained by a prior call to `get_write(e)`.
void bufcache::put_write(bufentry* e);
Your VFS will obtain a write reference before modifying a block, and release
that reference afterward. (The VFS should not hold write references long
term.) Your bufcache::sync
function will likewise obtain write references.
You may also want to add one or more wait queues to facilitate efficient
blocking.
Note that a write reference should not prevent other kernel tasks from holding read references to the entry.
If you want to use a different synchronization plan than get/put_write
, you
will need to change chkfs_fileiter
.
Dirty lists: A first attempt at a writeback procedure might look like this. (You would additionally need locking to protect access to the buffer cache and/or bufentries.)
for each bufentry in the buffer cache {
if (bufentry is dirty) {
get write reference;
write bufentry to disk, blocking until write is complete;
clear f_dirty flag and put write reference;
}
}
But walking the whole buffer cache takes O(bufcache::ne
)
time, even if very little cached data is dirty. It’s faster to maintain
a linked list of dirty bufentry
structures. The writeback procedure would
look something like this (locking still elided):
while (bufentry* e = dirty_list_.pop_front()) {
get write reference;
write bufentry to disk, blocking until write is complete;
clear f_dirty flag and put write reference;
}
But what if other processes keep writing to new blocks while the writeback
procedure is running? The writeback procedure might never reach the end of the
dirty_list_
! A simple swapping trick avoids the problem:
list<...> mydirty;
mydirty.swap(dirty_list_);
// Now `dirty_list_` is empty; `mydirty` contains its previous contents!
// Any newly-dirty blocks will be put onto `dirty_list_`; `mydirty`, a
// local variable, will not grow.
while (bufentry* e = mydirty.pop_front()) {
// ...
}
This trick might also let you release a long-term lock on the buffer cache during the sync process.
Parallel writes: All of these loops end up writing blocks one at a time. That’s OK, but can you instead use all available SATA slots for writes? That will be faster!
D. File sizes and extension
In this section, you’ll support making files bigger and smaller and seeking
within files. Use make cleanfs run-testwritefs2
to test.
OF_TRUNC
Implement the OF_TRUNC
flag to sys_open
. When OF_TRUNC
and OF_WRITE
are both given, the file size should be set to zero.
This step does not actually require that you free any data blocks! A ChickadeeFS inode can have more or fewer data blocks allocated than its size would indicate (link). But the tests will ensure that you mark modified inode blocks as dirty.
sys_lseek
Implement the sys_lseek
system call, which changes a file’s position and
returns the new position.
Chickadee’s sys_lseek
should behave generally like Linux’s lseek
.
sys_lseek(fd, off, LSEEK_SET)
sets the file position to off
;
sys_lseek(fd, off, LSEEK_CUR)
should adjust the file position by off
; and
sys_lseek(fd, off, LSEEK_END)
should set the file position to the end of the
file plus off
. In all these cases, the system call returns the new file
position. Chickadee also supports sys_lseek(fd, off, LSEEK_SIZE)
, which
returns the file’s size without changing the position (off
is ignored).
Return E_SPIPE
for unseekable files, such as pipes and the keyboard/console.
Unlike Linux, you only need to support seeks within a file for now.
File extension
Finally, implement file extension. Writing additional data to the end of file should make the file bigger. This is a big deal!
Use make cleanfs run-testwritefs2 fsck
to test. When testwritefs2
completes successfully, press q
to quit QEMU; then Chickadee’s file system
checker should run. You should see something like this in the terminal:
kohler@ubuntu:~/chickadee$ make cleanfs run-testwritefs2 fsck
RM chickadeefs.img
CREATE chickadeefs.img
* Run `gdb -x build/chickadee.gdb` to connect gdb to qemu.
QEMU chickadeeboot.img
warning: TCG doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5]
warning: TCG doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5]
unknown keycodes `(unnamed)', please report to qemu-devel@nongnu.org
FSCK chickadeefs.img
* File system OK
File extension hints
Complete the
chkfsstate::allocate_block()
function so it allocates a free block. This function will load the free block bitmap into the buffer cache, find a bit that is 1 (indicating a free block), set that bit to 0, and return the corresponding block number.Although ChickadeeFS can support more than one FBB block, one FBB block suffices to handle file systems with up to 215 blocks (128 MiB). It’s OK to not handle larger file systems.
Your buffer cache may fill up with dirty blocks. One way to handle this case: make
bufcache::get_disk_entry
release its locks, callsync
, and then try again if the buffer cache has any unreferenced dirty entries.
E. Creating files
Add support for the OF_CREATE
flag to sys_open
. When specified in
combination with OF_WRITE
, this flag should create a file of the supplied
name if no such file exists.
To create a file, you’ll need to:
- Find an empty direntry in the directory. This may require extending the size of the directory, which may require allocating one or more blocks (just as in extending files).
- Allocate an inode.
- Store the inode number and filename in the direntry.
- Mark the inode as allocated (set its type, size, and
nlink
count).
Use make cleanfs run-testwritefs3 fsck
to test your work. You should also
now be able to do cool things like create new files from your shell.
F. Optional features
Want to extend your file system further? There’s a lot of stuff you can add if you want and have time.
- Add support for the
sys_unlink
system call, which removes a file. Usemake cleanfs run-testwritefs4 fsck
to test your work. - Add support for the
sys_rename
system call, which changes a file’s name. - Add subdirectory support:
sys_mkdir
andsys_rmdir
system calls, and support for moving files between directories. - Add support for special kinds of file, such as
/dev/null
or named pipes.