Handout
Our handout code offers:
- A file system definition (
chickadeefs.hh
, described here and here). - A class
ahcistate
(ink-ahci.hh
) that accesses disks that speak the modern SATA standard over Intel’s AHCI. An important aspect of this standard is NCQ support, which allows the operating system to post multiple disk requests in parallel, and allows the disk to service those requests in the fastest order it can. - A class
bufcache
(ink-chkfs.hh
andk-chkfs.cc
) that implements a buffer cache. - A class
chkfsstate
(ink-chkfs.hh
andk-chkfs.cc
) that accesses thesata_disk
’s Chickadee file system via the buffer cache. - A class
chkfs_fileiter
(ink-chkfsiter.hh
) that iterates over the blocks in a ChickadeeFS file (somewhat like howvmiter
iterates over the pages in an x86-64 page table). - A class
chkfs::journalreplayer
that knows how to replay journal records. - A system call implementation of a new system call,
sys_readdiskfile
, that usesbufcache
,chkfsstate
, andchkfs_fileiter
to read disk files. - Programs
p-readdiskfile.cc
andp-wcdiskfile.cc
that usesys_readdiskfile
to examine disk files.
Before you do anything else, uncomment this line in init_hardware
:
/* sata_disk = ahcistate::find(); */
Then give make run-readdiskfile
a try!
You will see that reading disk files is woefully slow. This is because
Chickadee’s buffer cache and disk programming interfaces are woefully
underpowered. In this problem set, you will fix the buffer cache and
ahcistate
; hook the file system up to your VFS from problem set 3;
and add write support.
A. Buffer cache
In this part, you’ll fix the buffer cache so it can speed up read requests. The current buffer cache immediately evicts all unreferenced blocks. Your buffer cache should not evict immediately, and it should also contain blocks that will likely be used in the future.
To implement a good buffer cache, you will need an eviction policy, which
decides which (unreferenced) block to evict when the cache is full but a new
block should be added. LRU (least recently used) is a good first choice, but
there are other good choices, such as policies that treat different kinds of
blocks differently. For example, it’s probably a good idea to always keep
the superblock in memory—or to introduce a separate variable, such as a
chkfsstate
member variable, that holds the superblock’s data. And it might
be a good idea to treat inode blocks differently from data blocks.
You will also want to raise the number of entries in the buffer cache, although your buffer cache should work for any number of entries greater than or equal to 10.
Your buffer cache must obey the invariant that only unreferenced blocks can
be evicted. That is, any block with bcentry::ref_ > 0
must not be evicted.
It must also obey the buffer cache synchronization invariants. You may
change the synchronization invariants if you’d like, but make sure to avoid
undefined behavior.
You will also want a prefetching policy, which decides which blocks to read in advance of their being needed.
Before working too much on the code, you should specify your policies in a short text document and talk over your policies with a TF.
Test your work first with make run-readdiskfile
, which reads a very short
file one byte at a time, and with make run-wcdiskfile
, which reads longer
files 125 bytes at a time. You can also run the readdiskfile
and
wcdiskfile
programs with arguments after make run-sh
.
Prefetching hints
Do eviction first. You must do eviction before progressing to Part B, but you can delay prefetching until much later—in fact, you can delay it until the rest of the problem set is done. Prefetching will require more code surgery than eviction.
You will need to change the ahcistate::read_or_write
function to use more
than just command slot 0. You also don’t want a process to block just because
its kernel task is prefetching data! You could implement a separate kernel
task that just does prefetching. But there are other possibilities. Consider,
for example, adding a int fetch_status_
to bcentry
, and adding a new
function to ahcistate
a signature something like this:
// returns true iff command was queued
bool read_or_write_nonblocking(idecommand cmd, void* buf, size_t sz, size_t off,
volatile int& fetch_status, /* maybe more arugments */);
You can change the existing read_or_write
function to call
read_or_write_nonblocking
.
The simplest Chickadee prefetcher will fetch at most 32 blocks at a time, one per NCQ slot on the disk. Most real operating systems instead implement a software I/O queue capable of storing many hundreds of I/O requests. As soon as the disk satisfies one request, the interrupt handler pops another I/O request off the I/O queue and adds it to the disk hardware queue. Various algorithms are used to schedule these software requests, subject to requirements such as fairness (one process’s prefetch requests should not starve another process’s requests) and priority (write requests might be more important than read requests). Think about how you would implement such a queue in Chickadee; and if you’re feeling ambitious, implement one!
B. VFS integration and reads
In this part, you’ll hook up the Chickadee file system to your VFS for reads. This takes what the VFS you developed in the last problem set and applies it to a real on-disk file system.
Background: Chickadeefs in-memory inodes
Chickadeefs inodes have a built-in synchronization plan that you should
understand. This plan involves the buffer cache and two inode fields, mlock
and mref
, that are only meaningful in memory.
The chkfsstate::get_inode
and put_inode
functions manipulate buffer cache
reference counts to ensure that the returned inode remains in memory. That
means that callers must call put_inode
to “free” an inode when done with
it.
The mlock
field is a read/write lock for the inode’s contents. The lock can
be held for either reading or writing. While the read lock is held, the
inode’s contents cannot change; the write lock is required to modify the
corresponding file (e.g., to add blocks to it or modify its contents). mlock
should not be held across system calls: it should be acquired immediately before a kernel
read or write operation, and released immediately after. Do not, for
example, call inode::lock_write
simply because a process opened the
corresponding file for writing. mlock
is manipulated by inode::lock_read
,
inode::lock_write
, inode::unlock_read
, and inode::unlock_write
. We
provide these functions for you.
The mref
field is available for your use. You likely will not need it until
you implement more advanced features, but if you need it now, it’s available
for you. For instance, you could use mref
to store the address of the
in-memory vnode corresponding to the inode. (mref
is only 32 bits wide, so
you might store the address with ino->mref = uint32_t(ka2pa(vnode))
and
extract the address with something like pa2ka<vnode_type*>(ino->mref)
.) Or
you could use it as a reference count, indicating that some operating system
task or structure holds a reference to the inode and that the inode should not
be deleted or reused. (You’d add ++ino->mref
to chkfsstate::get_inode
and
--ino->mref
to chkfs::inode::put
.)
Both mlock
and mref
are cleared to 0 when an inode block is read from
disk.
execv
First, change your execv
implementation to use the disk file system. Model
your code on syscall_readdiskfile
. Use make run-execallocexit
—or make
run-sh
, followed by some commands like echo foo
—to test your work.
You’ll likely define a new derived class of proc::loader
that calls
chkfs_fileiter::get_disk_entry
and bcentry::put
.
VFS
Next, add a new VFS type for Chickadee file system files. You will add a new
derived type of vnode. Most likely, its constructor will take a
chkfs::inode*
argument and its operations will use chkfs_fileiter
; closing
the vnode will put
the corresponding inode. Remember to think about your
synchronization plan.
open
Finally, change your open
implementation to read the Chickadee file system
rather than memfile
s. (Don’t support OF_WRITE
or OF_TRUNC
yet.)
Use make run-sh
, and then cat thoreau.txt
, to check your work. (The
thoreau.txt
file is included only on the disk file system.)
C. Simple writes and writeback
In the next parts, you’ll extend your Chickadee file system VFS to support writes. We have divided write support into multiple subparts. You don’t need to worry about file system crash safety yet.
Writes
First, support opening disk files for writing (i.e., the OF_WRITE
flag
without OF_TRUNC
or OF_CREATE
), and support writing to such files.
You’ll need to change your VFS and the buffer cache to make this work. However, you don’t need to actually write to disk yet. In the first stage, writes will simply modify buffer cache contents.
In this stage, you need not support writes that grow or shrink files. There is thus no need to write inodes or to allocate or free disk blocks.
Use make run-testwritefs
to test this functionality.
Writeback
Next, support writing data back to the disk via the sys_sync
system call,
which is implemented by bufcache::sync()
.
Change the bufcache::sync()
function to write all “dirty buffers” (i.e.,
buffer cache entries whose data is modified relative to the disk version) back
to disk.
Use make cleanfs run-testwritefs
to test this functionality.
The
cleanfs
command rebuilds the disk image, which matters becausetestwritefs
assumes it starts with a clean disk image.
Writeback hints
Basics: You’ll need to track which buffers are dirty, meaning modified
in memory relative to the on-disk version. For example, you could add a state
bcentry::state_dirty
that is set when data is modified and cleared when
modifications are written back to disk. Your eviction policy from Part A must
not evict dirty entries.
Synchronization: The writeback process must write an internally consistent version of each dirty buffer, meaning that buffer contents must not be modified or freed while the buffer is in flight to the disk. This requires both delaying writeback until concurrent writes complete, and delaying new concurrent writes and evictions until writeback completes.
A simple synchronization method that works is a per-entry write reference count that can be either zero or one. Consider completing these functions:
// Obtain a write reference for this entry. Blocks until there are no other
// write references, then obtains a reference and returns. (Can also be a
// good place to maintain dirty flags.)
void bcentry::get_write();
// Release the write reference obtained by a prior call to `get_write()`.
void bcentry::put_write();
Your VFS will obtain a write reference before modifying a block, and release
that reference afterward. (The VFS should not hold write references long
term.) Your bufcache::sync
function will likewise obtain write references.
You may also want to add one or more wait queues to facilitate efficient
blocking.
Note that a write reference must not prevent other kernel tasks from holding normal references to the same entry (it’s not a readers/writer lock).
If you want to use a different synchronization plan than get/put_write
, you
will need to change chkfs_fileiter
.
Dirty lists: A first attempt at a writeback procedure might look like this. (You would additionally need locking to protect access to the buffer cache and/or bufentries.)
for each bcentry in the buffer cache {
if (bcentry is dirty) {
get write reference;
write bcentry to disk, blocking until write is complete;
set state to clean and put write reference;
}
}
But walking the whole buffer cache takes O(bufcache::ne
)
time, even if very little cached data is dirty. It’s faster to maintain
a linked list of dirty bcentry
structures. The writeback procedure would
look something like this (locking still elided):
while (bcentry* e = dirty_list_.pop_front()) {
get write reference;
write bcentry to disk, blocking until write is complete;
set state to clean and put write reference;
}
But what if other processes keep writing to new blocks while the writeback
procedure is running? The writeback procedure might never reach the end of the
dirty_list_
! A simple swapping trick avoids the problem:
list<...> mydirty;
mydirty.swap(dirty_list_);
// Now `dirty_list_` is empty; `mydirty` contains its previous contents!
// Any newly-dirty blocks will be put onto `dirty_list_`. `mydirty`, a
// local variable, will not grow.
while (bcentry* e = mydirty.pop_front()) {
// ...
}
This trick might also let you release a long-term lock on the buffer cache during the sync process.
Parallel writes: All of these loops end up writing blocks one at a time. That’s OK, but can you instead use all available SATA slots for writes? That will be faster!
Debugging writeback errors
Some writeback errors, such as checksum failures, can be hard to debug from
within the operating system. Use our helper programs to investigate the disk’s
state from outside. For instance, say you experience a checksum failure on
file thoreau.txt
. Try the following command:
$ obj/chickadeefsck -e thoreau.txt chickadeefs.img | less
The -e thoreau.txt
option tells chickadeefsck
to search the file system
image for the thoreau.txt
file, and write its contents to standard output.
Look through the contents; do they look reasonable? For instance, do they
contain any weird-looking sequences of ^@
(null characters)?
If the contents look unreasonable, try using hexdump
to drill down further.
$ obj/chickadeefsck -e thoreau.txt chickadeefs.img | hexdump -C | less
Do problems appear to start on a page boundary (an offset multiple of 0x1000 == 4096)? If so, at which offset? Can you add printouts to your write code that trigger near that problematic offset?
Use the -V
option to investigate the file system’s specific layout,
including the extents used by each file. This can help you discover which data
block corresponds to some offset in a file.
$ obj/chickadeefsck -V chickadeefs.img
journal: no metablocks found
inode 1 @root directory: size 4096, type directory, nlink 1
[0]: extent 177+1
#0 "kernel": inode 2
#1 "emerson.txt": inode 3
#2 "dickinson.txt": inode 4
#3 "thoreau.txt": inode 5
...
#31 "wcdiskfile": inode 33
inode 2 @kernel: size 355512, type data, nlink 1
[0]: extent 16+87
inode 3 @emerson.txt: size 130, type data, nlink 1
[0]: extent 103+1
inode 4 @dickinson.txt: size 649, type data, nlink 1
[0]: extent 104+1
inode 5 @thoreau.txt: size 685, type data, nlink 1
[0]: extent 105+1
inode 6 @wheatley.txt: size 1958, type data, nlink 1
[0]: extent 106+1
inode 7 @allocator: size 6432, type data, nlink 1
[0]: extent 107+2
...
D. File sizes and extension
In this section, you’ll support making files bigger and smaller and seeking
within files. Use make cleanfs run-testwritefs2
to test.
OF_TRUNC
Implement the OF_TRUNC
flag to sys_open
. When OF_TRUNC
and OF_WRITE
are both given, the file size should be set to zero.
This step does not actually require that you free any extents. A ChickadeeFS inode can have more or fewer data blocks allocated than its size would indicate (link). But the tests will ensure that you mark modified inode blocks as dirty.
sys_lseek
Implement the sys_lseek
system call, which changes a file’s position and
returns the new position.
Chickadee’s sys_lseek
should behave generally like Linux’s lseek
.
sys_lseek(fd, off, LSEEK_SET)
sets the file position to off
;
sys_lseek(fd, off, LSEEK_CUR)
should adjust the file position by off
; and
sys_lseek(fd, off, LSEEK_END)
should set the file position to the end of the
file plus off
. In all these cases, the system call returns the new file
position. Chickadee also supports sys_lseek(fd, off, LSEEK_SIZE)
, which
returns the file’s size without changing the position (off
is ignored).
Return E_SPIPE
for unseekable files, such as pipes and the keyboard/console.
Unlike Linux, you only need to support seeks within a file for now: you may
return E_INVAL
for seeks outside a file’s allocated size.
File extension
Finally, implement file extension. Writing additional data to the end of file should make the file bigger. This is a big deal!
Use make cleanfs run-testwritefs2 fsck
to test. When testwritefs2
completes successfully, press q
to quit QEMU; then Chickadee’s file system
checker should run. You should see something like this in the terminal:
kohler@ubuntu:~/chickadee$ make cleanfs run-testwritefs2 fsck
RM chickadeefs.img
CREATE chickadeefs.img
* Run `gdb -x build/chickadee.gdb` to connect gdb to qemu.
QEMU chickadeeboot.img
warning: TCG doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5]
warning: TCG doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5]
unknown keycodes `(unnamed)', please report to qemu-devel@nongnu.org
FSCK chickadeefs.img
* File system OK
File extension hints
Complete the
chkfsstate::allocate_extent()
function so it allocates a free extent. This function will load the free block bitmap into the buffer cache; find a contiguous range of 1 bits (indicating a free extent), likely usingbitset_view
; set those bits to 0; and return the first block number in the extent.Although ChickadeeFS can support more than one FBB block, one FBB block suffices to handle file systems with up to 215 blocks (128 MiB). It’s OK to not handle larger file systems.
Your buffer cache may fill up with dirty blocks. One way to handle this case: if
bufcache::get_disk_entry
cannot allocate an entry but observed an unreferenced dirty entry, it could release its locks, callsync
, and then try again.
E. Creating files
Add support for the OF_CREATE
flag to sys_open
. When specified in
combination with OF_WRITE
, this flag should create a file of the supplied
name if no such file exists.
To create a file, you’ll need to:
- Find an empty direntry in the directory. This may require extending the size of the directory, which may require allocating one or more blocks (just as in extending files).
- Allocate an inode.
- Store the inode number and filename in the direntry.
- Mark the inode as allocated (set its type, size, and
nlink
count).
Use make cleanfs run-testwritefs3 fsck
to test your work. You should also
now be able to do cool things like create new files from your shell.
F. Optional features
Want to extend your file system further? There’s a lot of stuff you can add if you want and have time.
- Add support for the
sys_unlink
system call, which removes a file. Usemake cleanfs run-testwritefs4 fsck
to test your work. - Add support for the
sys_rename
system call, which changes a file’s name. - Add subdirectory support:
sys_mkdir
andsys_rmdir
system calls, and support for moving files between directories. - Add support for special kinds of file, such as
/dev/null
or named pipes.
Turnin
Fill out psets/pset4answers.md
and psets/pset4collab.md
and push to
GitHub. Then submit on the grading server.
Checkin: You should be able to complete all parts by 11:59pm Monday 4/8.