ChickadeeFS, or chkfs for short, is the file system design you’ll implement in problem set 4. This page describes it.
Structure
A chkfs is composed of 4096-byte blocks. (Thus, each chkfs block corresponds to a memory page.) A chkfs can contain up to nine kinds of block:
- The superblock.
- Swap blocks.
- Free-block-bitmap blocks.
- Inode blocks.
- Indirect blocks.
- Doubly-indirect (or indirect2) blocks.
- Data blocks.
- Directory blocks.
- Journal blocks.
Of these, only data blocks contain file data (the other kinds contain file metadata, such as file size information and directory listings). User programs can directly read and write data blocks, and can read (but not directly write) directory blocks. All other kinds of block are read and modified only by special-purpose system calls.
The chkfs groups the block types into regions, which appear in a fixed order.
- First comes exactly one superblock, in block 0.
- Then, zero or more swap blocks (zero in the handout).
- Then, one or more free-block-bitmap blocks (one in the handout).
- Then, one or more inode blocks (14 in the handout).
- Then, one or more blocks in the data region, which comprises data, indirect, indirect2, directory, and free blocks (~32000 in the handout).
- Finally, zero or more blocks in the journal region (128 in the handout).
ChickadeeFS structures are declared in `chickadeefs.hh`. The structures are part of a namespace `chickadeefs`, so, for example, you refer to an inode structure using the notation `chickadeefs::inode`. It may be useful to add `using namespace chickadeefs;` to a function definition or scope; this brings all the `chickadeefs` namespace definitions into scope so you can leave off the `chickadeefs::` prefix.
Superblock
The superblock contains overall file system information, such as the total number of blocks and the location of the first inode block. It is stored in disk block 0, `chickadeefs::superblock_offset` bytes into the block (that is, at offset 512, though you should use the symbolic constant). This is because the first 512 bytes of the disk are reserved for the boot sector.
struct superblock {
    uint64_t magic;           // must equal `chickadeefs::magic`
    blocknum_t nblocks;       // # blocks in file system
    blocknum_t nswap;         // # blocks in swap space (unused)
    inum_t ninodes;           // # inodes in file system
    blocknum_t njournal;      // # journal blocks in file system
    blocknum_t swap_bn;       // first swap space block
    blocknum_t fbb_bn;        // first free block bitmap block
    blocknum_t inode_bn;      // first inode block
    blocknum_t data_bn;       // first data-area block
    blocknum_t journal_bn;    // first block in journal
};
Superblock components named `*_bn` are block numbers. For instance, in our handout file system, `sb.fbb_bn == 1`. This indicates that the first block in the free block bitmap is block number 1, which covers disk bytes 4096–8191 (and disk sectors 8–15).
The superblock need never be written.
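As a concrete illustration, here is one way the superblock could be pulled out of a raw disk image. This is a sketch, not the kernel's actual code path (the real kernel reads block 0 through its buffer cache); the `memcpy`-based read and the in-memory disk pointer are assumptions, while the struct layout and `superblock_offset == 512` follow the declarations above.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

using blocknum_t = uint32_t;
using inum_t = uint32_t;

// Mirrors `chickadeefs::superblock` from chickadeefs.hh.
struct superblock {
    uint64_t magic;
    blocknum_t nblocks, nswap;
    inum_t ninodes;
    blocknum_t njournal, swap_bn, fbb_bn, inode_bn, data_bn, journal_bn;
};

constexpr size_t superblock_offset = 512;    // after the boot sector

// Copy the superblock out of an in-memory disk image (sketch only; the
// real kernel reads disk block 0 through its buffer cache).
superblock read_superblock(const uint8_t* disk) {
    superblock sb;
    memcpy(&sb, disk + superblock_offset, sizeof(sb));
    return sb;
}
```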
Free block bitmap
The free block bitmap contains one bit per file system block, for a total of `sb.nblocks` bits. The bit for block `bn` is 1 iff that block is in the data area and is free for allocation. This function looks up the value of the bit for block `bn`:
bool fbb_lookup(blocknum_t bn) {
    unsigned char* fbb = /* read FBB into contiguous memory */;
    return (fbb[bn / 8] & (1 << (bn % 8))) != 0;
}
A single FBB block can represent up to 32768 blocks. Your code need not support larger chkfs file systems.
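For symmetry with `fbb_lookup`, allocating or freeing a block flips the corresponding bit. This sketch operates on the FBB read into contiguous memory, as above; in the real file system the modified FBB block must also be journaled.

```cpp
#include <cstdint>

using blocknum_t = uint32_t;

// Clear block `bn`'s bit to mark it allocated (0 = in use or non-data).
void fbb_mark_allocated(unsigned char* fbb, blocknum_t bn) {
    fbb[bn / 8] &= ~(1 << (bn % 8));
}

// Set block `bn`'s bit to mark it free for allocation.
void fbb_mark_free(unsigned char* fbb, blocknum_t bn) {
    fbb[bn / 8] |= 1 << (bn % 8);
}
```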
Inodes
The inode area is an array of inodes. An inode block contains space for 64 inodes; inode number `inum` is located in block number `sb.inode_bn + inum/inodesperblock`, at byte offset `sizeof(inode) * (inum%inodesperblock)`.
Two inode numbers are special: inode number 0 is always free (it cannot be used), and inode number 1 is the root directory.
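The location arithmetic above can be sketched directly. The 64-byte inode size (so `inodesperblock == 64`) follows from 64 inodes per 4096-byte block; use the real constants from `chickadeefs.hh` in your code, and note that the `inode_bn` value in the check is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

using blocknum_t = uint32_t;
using inum_t = uint32_t;

constexpr size_t blocksize = 4096;
constexpr size_t inodesperblock = 64;
constexpr size_t inodesize = blocksize / inodesperblock;   // 64 bytes

struct inode_location {
    blocknum_t bn;         // disk block containing the inode
    size_t byte_offset;    // offset of the inode within that block
};

// Translate inode number `inum` into its on-disk position, given the
// first inode block `inode_bn` from the superblock.
inode_location locate_inode(blocknum_t inode_bn, inum_t inum) {
    return inode_location{
        blocknum_t(inode_bn + inum / inodesperblock),
        inodesize * (inum % inodesperblock)
    };
}
```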
struct inode {
    uint32_t type;                  // file type (regular, directory, or 0/none)
    uint32_t size;                  // file size
    uint32_t nlink;                 // # hard links to file
    std::atomic<uint32_t> mlock;    // used in memory
    std::atomic<uint32_t> mref;     // used in memory
    blocknum_t direct[ndirect];     // block pointers
    blocknum_t indirect;
    blocknum_t indirect2;
};
The `inode::type` member is either 0, meaning the inode is free, or `type_directory` or `type_file`, indicating the inode holds directory data or file data. `inode::size` is the number of bytes in the file. `inode::nlink` is the number of hard links to the file.

`inode::direct` contains 9 direct block pointers. Files of size 9 blocks or less (i.e., 36,864 bytes or less) will use only direct block pointers. Block number 0 means no pointer. `inode::indirect` and `inode::indirect2` contain the indirect and indirect2 pointers, if any.

The `inode::mlock` and `inode::mref` members are available for in-memory use; they are cleared to 0 whenever an inode is loaded into memory. The current plan is described in problem set 4.
Inode size
In the common case, inode sizes and block pointers match up. That is, if the file size is `sz`, then the file contains `nb = ROUNDUP(sz, blocksize) / blocksize` blocks, where every data-block pointer with index less than `nb` is nonzero (including both direct pointers and pointers stored in indirect and indirect2 blocks), and every data-block pointer with index greater than or equal to `nb` is zero.
But it is also OK for the inode's size to be bigger or smaller than its block pointers suggest. If some block pointers with index < `nb` are zero, then the file is a sparse file. When read, missing regions of the file should appear to contain zero bytes; when written, the file system should allocate blocks for file data on the fly. If, on the other hand, some block pointers with index >= `nb` are nonzero, then the file has some spare preallocated data in case the file is extended.
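The size-to-block-count arithmetic and the pointer-region boundaries can be sketched as follows. Here `ndirect == 9` comes from the inode description above; `nindirect == 1024` (pointers per 4096-byte indirect block) is an assumption based on 4-byte block numbers.

```cpp
#include <cstddef>
#include <cstdint>

constexpr size_t blocksize = 4096;
constexpr size_t ndirect = 9;                               // direct pointers per inode
constexpr size_t nindirect = blocksize / sizeof(uint32_t);  // 1024 pointers per indirect block

// The `nb` from the text: number of data blocks implied by file size `sz`.
size_t size_to_nblocks(size_t sz) {
    return (sz + blocksize - 1) / blocksize;   // ROUNDUP(sz, blocksize) / blocksize
}

enum class ptr_kind { direct, indirect, indirect2 };

// Which pointer region covers data-block index `bi` of a file.
ptr_kind classify(size_t bi) {
    if (bi < ndirect) {
        return ptr_kind::direct;
    } else if (bi < ndirect + nindirect) {
        return ptr_kind::indirect;
    } else {
        return ptr_kind::indirect2;
    }
}
```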
In the common case, the `nlink` field is zero iff the inode is free (i.e., `type == 0`). However, in some cases, an allocated inode will have a zero link count. This happens if the last link to a file is removed while some program has a reference to the inode. The inode should be freed and the file data recycled only after the last reference to the inode is dropped.
Directories
A directory (that is, an inode with `type == type_directory`) comprises an array of `dirent` structures. Its size must be a multiple of `sizeof(dirent) == 128`.
struct dirent {
    inum_t inum;                  // inode # (0 = none)
    char name[maxnamelen + 1];    // file name (null terminated)
};
A directory entry with `inum == 0` is ignored. An entry with `inum != 0` says that this directory contains a link to the given file under the specified name. A directory should not contain duplicate names.
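For illustration, a name lookup might scan a directory's data like this. It is a sketch: `maxnamelen == 123` is an assumption chosen so that `sizeof(dirent) == 128`, matching the text, and the real kernel would read the directory's data blocks through its buffer cache rather than from one contiguous array.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

using inum_t = uint32_t;
constexpr size_t maxnamelen = 123;   // assumed, so sizeof(dirent) == 128

struct dirent {
    inum_t inum;                  // inode # (0 = none)
    char name[maxnamelen + 1];    // file name (null terminated)
};
static_assert(sizeof(dirent) == 128, "directory entries are 128 bytes");

// Return the inode number linked under `name` in this directory's data,
// or 0 if no entry matches. `sz` is the directory inode's size in bytes.
inum_t dir_lookup(const dirent* entries, size_t sz, const char* name) {
    for (size_t off = 0; off + sizeof(dirent) <= sz; off += sizeof(dirent)) {
        const dirent* de = &entries[off / sizeof(dirent)];
        if (de->inum != 0 && strcmp(de->name, name) == 0) {
            return de->inum;      // skip empty (`inum == 0`) entries
        }
    }
    return 0;
}
```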
Journal
The journal consists of `sb.njournal` blocks, stored starting at `sb.journal_bn`. Our handout file system has 128 blocks in its journal area.
The journal comprises a series of records, each of which contains exactly one metablock and zero or more associated datablocks. The records are stored in circular fashion, so the `i`th block written to the journal is stored at block number `sb.journal_bn + i % sb.njournal`. A journal transaction consists of one or more records.
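The circular placement rule is just modular arithmetic; the numbers in the check below are hypothetical.

```cpp
#include <cstdint>

using blocknum_t = uint32_t;

// Block number at which the i-th journal block write lands, given the
// journal's start and length from the superblock.
blocknum_t journal_block(blocknum_t journal_bn, blocknum_t njournal,
                         uint64_t i) {
    return journal_bn + i % njournal;
}
```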
Metablocks follow this format:
static constexpr uint64_t journalmagic = 0xFBBFBB009EEBCEEDUL;
typedef uint16_t tid_t; ...

struct jblockref {            // component of `jmetablock`
    blocknum_t bn;            // destination block number
    uint32_t bchecksum;       // CRC32C checksum of block data
    uint16_t bflags;          // see `jbf_` constants
};
struct jmetablock {
    uint64_t magic;           // must equal `chickadeefs::journalmagic`
    uint32_t checksum;        // CRC32C checksum of block starting at `seq`
    uint32_t padding;         //   (not including magic, checksum, or padding)
    tid_t seq;                // sequence number
    tid_t tid;                // associated tid
    tid_t commit_boundary;    // first noncommitted tid
    tid_t complete_boundary;  // first noncompleted tid
    uint16_t flags;           // see `jf_` constants
    uint16_t nref;            // # valid entries in `ref`
    jblockref ref[ref_size];
};
The `jmb.nref` component says how many `ref` entries are valid. Each `ref` corresponds to a following datablock; the `bn` component says where the corresponding datablock contents belong on the disk, and the `bchecksum` equals the `crc32c` of the block data. A metablock can refer to at most `chickadeefs::ref_size` = 339 datablocks; a transaction containing more than 339 datablocks must use multiple records.
Metablock validity
The metablock design aims to survive arbitrary disk corruption of writes in flight, which requires a bunch of crosschecks. Here’s how it works.
Metablocks carry a magic number and a checksum: the `jmb.magic` field (the first 8 bytes of the metablock) must equal `journalmagic`, and the `jmb.checksum` field must equal `crc32c(&jmb.seq, 4080)`. (But see journal cheats below.)
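A self-contained sketch of this check follows. A slow bitwise CRC32C stands in for the kernel's `crc32c()`, and the field offsets (magic at byte 0, checksum at byte 8, 4080 checksummed bytes starting at byte 16, i.e., at `seq`) follow the `jmetablock` layout above.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr uint64_t journalmagic = 0xFBBFBB009EEBCEEDUL;

// Bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78); slow but
// dependency-free, for illustration only.
uint32_t crc32c(const void* buf, size_t len) {
    const uint8_t* p = static_cast<const uint8_t*>(buf);
    uint32_t crc = ~0U;
    for (size_t i = 0; i != len; ++i) {
        crc ^= p[i];
        for (int k = 0; k != 8; ++k) {
            crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78U : 0);
        }
    }
    return ~crc;
}

// Validity check for a 4096-byte journal block interpreted as a metablock:
// magic must match and `checksum` must cover the 4080 bytes from `seq` on.
bool metablock_valid(const uint8_t block[4096]) {
    uint64_t magic;
    uint32_t checksum;
    memcpy(&magic, block, sizeof(magic));
    memcpy(&checksum, block + 8, sizeof(checksum));
    return magic == journalmagic
        && checksum == crc32c(block + 16, 4080);
}
```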
Metablocks must be reliably distinguishable from datablocks for replay to work. Unfortunately, a datablock can contain anything! What if it looks like a metablock? Block escaping solves this problem. If a datablock starts with `journalmagic`, then the journal writer must modify the block before writing it: the datablock's first eight bytes are replaced with 0 in the journaled version. The corresponding `ref` record has the `jbf_escaped` flag set; this flag tells the replayer to replace those 0 bytes with `journalmagic` before writing the block to the main file system. (Note that the `ref.bchecksum` field is computed over the journaled version, i.e., the escaped version. This is annoying, so see journal cheats below.)
Metablocks are written with sequential `seq`s. The first metablock written should have `jmb.seq == 0`, the second `jmb.seq == 1`, and so forth. If more than 65,536 metablocks are written, the numbers will wrap around.
Transactions
Journal records are grouped into transactions. These are the atomic units of file system change. The replay process figures out the set of committed, but incomplete, transactions in the journal, and replays only those.
The `jmb.tid` component marks the transaction to which a record belongs. `tid`s are assigned sequentially (the first `tid` is 0, the second 1, etc., with wraparound), but typically each transaction will contain several records, so `tid` will rarely equal `seq`.
The first record in a transaction must be marked with `jf_start` (that is, the `jmb.flags` component contains `jf_start`). Transaction commit and completion are calculated using the `jmb.commit_boundary` and `jmb.complete_boundary` fields, which are cumulative acknowledgments. All tids less than `jmb.commit_boundary` have committed, and all tids less than `jmb.complete_boundary` have completed. (Tid comparison is computed using modular arithmetic.)
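Since `tid_t` is a `uint16_t`, "less than" here means wraparound-safe sequence-number comparison. One common sketch:

```cpp
#include <cstdint>

using tid_t = uint16_t;

// `a` is before `b` in modular order iff the 16-bit signed difference is
// negative (standard sequence-number arithmetic).
bool tid_lt(tid_t a, tid_t b) {
    return int16_t(a - b) < 0;
}

// A tid has committed iff tid < commit_boundary in this modular order.
bool tid_committed(tid_t tid, tid_t commit_boundary) {
    return tid_lt(tid, commit_boundary);
}
```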
The process of writing a transaction works as follows.

1. First, the file system picks a new tid, greater than any previously-written tid. (The new tid will most likely equal the current `commit_boundary`.) It then writes a record for that tid with `jf_start` set, optionally containing datablocks. Only the first record in the tid can have `jf_start`.
2. The file system then writes zero or more new records and datablocks for that tid.
3. To commit, the file system waits for all prior transactions to commit, then writes one record with `commit_boundary` set to `tid + 1`. (The `jf_commit` flag can also be set for clarity.) This is the commit record. The commit record may contain datablocks (but all future records for the tid must not contain datablocks). It is OK for the commit record to also start the transaction.
4. The file system must then wait for all prior journal writes for that tid to complete (a write barrier).
5. When the transaction's journal writes complete, the file system performs the actual writes to the main file system, in any order.
6. The file system must then wait for all associated main file system writes to complete (another write barrier).
7. When the transaction's main file system writes complete, the file system writes one record for the tid with `complete_boundary` set to `tid + 1`. (The `jf_complete` flag can also be set for clarity.) This record marks the transaction as complete.
The journal blocks associated with a transaction must not be overwritten until the transaction successfully completes (i.e., all its writes are successfully written to the main file system). This might delay later file system changes.
Typical journal usage
A minimal journal transaction consists of two records. The first record contains one metablock and `n` datablocks; the metablock both starts and commits the transaction (so it has `jf_start` set and its `commit_boundary` equals `tid+1`). The second metablock has no datablocks; it just marks the transaction as complete (so its `complete_boundary` equals `tid+1`).
The simplest journal will contain journal records in strictly increasing tid order, where tid `i+1` is not started until tid `i` is complete. You can, however, change this if it's convenient for you, or you want to experiment with higher-performance journal writes. It is OK for transaction `i+1` to start before transaction `i` commits, and it is OK for transaction `i+1` to commit before transaction `i` completes. It is also OK for a single record to commit or complete multiple transactions at once. However, transactions must commit and complete in monotonically increasing order.
Journal modes
Several journaling modes exist, including full (also called data), ordered, and writeback journaling; they are distinguished by how much data they write to the journal and how much data is recoverable after a crash. You may implement any journal mode.
The most important purpose of journaling is to preserve the important file system invariants, namely:
- Every block is used for exactly one purpose.
- All referenced blocks are initialized.
- Every referenced block is marked allocated.
- Every unreferenced block is marked free.
Breaks in the first three invariants can cause total file system corruption. (Breaks in invariant 4 cause storage leaks, which are less serious.)
To preserve these invariants, every ChickadeeFS transaction must contain at minimum all metadata blocks touched by the transaction—that is, it must contain all free-block-bitmap, inode, directory-data, indirect, and indirect2 blocks modified by the transaction. The “writeback” journal mode writes only these blocks to the journal.
The second most important purpose of journaling is to make sure data writes survive system crashes—in other words, that after reboot, the user’s data is preserved. Writeback mode is insufficient for this purpose. The simplest way to recover user data is full journaling, where a journal transaction also contains all data blocks modified by the transaction. This can be very expensive, since data writes vastly outnumber metadata writes in most workloads. The ordered mode instead uses ordering to ensure data preservation. In an ordered-mode transaction, the file system writes data blocks directly to the main file system, but waits for those writes to complete before committing the associated transaction.
For more information on journaling modes, you might be interested in Section 3.1 of “Analysis and Evolution of Journaling File Systems”.
Block reuse
Block reuse represents a special problem for journaled file systems. Say a block is used for metadata (e.g., directory data or an indirect block), but the associated file is deleted. That block should be freed. However, the block must not be reused until the associated transaction commits! (If it were reused before commit, a badly-timed crash could violate the file system invariants.) Your file system should take care to not reuse metadata blocks until it is safe to do so. This will likely affect your block allocator.
Advanced: In some journaling modes, a transaction might need to mark a block as irrelevant, preventing that block from being overwritten at replay time. An example is if an ordered journal contains a block `bn` written as metadata by committed transaction `i-2`, then reused for data in committed transaction `i`. The version of `bn` in `i-2` should not be written over the version in `i` (since both transactions have committed); but the version in `i` wasn't written to the journal! To mark such blocks, add a `ref` entry for `bn` to transaction `i` with the `jbf_nonjournaled` flag and no corresponding datablock. This flag tells the replayer to ignore previously-committed versions of the datablock.
Replay
The replay process examines all blocks in the journal and figures out what transactions are committed, but not complete. This process uses metablock sequence numbers, tids, and checksums.
The replayer first finds the valid metablock with the largest `seq`. This metablock's `commit_boundary` and `complete_boundary` are used to distinguish which transactions have committed and/or completed.
A valid journal contains the following transactions, in increasing tid order:

- Zero or more complete transactions. These transactions have `tid < complete_boundary`. Some of them may be invalid, because their contents have been partially overwritten by later records; some of them may be valid.
- Zero or more committed transactions. These transactions have `complete_boundary <= tid < commit_boundary`. Furthermore, the journal replayer can find these transactions' `jf_start` metablocks, and all succeeding records up to a commit record, with no gaps in sequence number, and with all checksums matching. These transactions must be replayed.
- Zero or more pseudo-committed transactions. These transactions have `complete_boundary <= tid < commit_boundary` (so their tids appear to be committed), but some records are missing or have invalid checksums (so they are not actually committed). This can happen because we allow commit records to be written in parallel with other writes for the same transaction. Pseudo-committed transactions must not be replayed.
- Zero or more uncommitted transactions with `commit_boundary <= tid`. These transactions must not be replayed.
It is a serious error, indicating a problem with your journaling logic, if transactions don’t appear in this order (e.g., a committed transaction has a higher tid than a pseudo-committed transaction).
We’ve given you the replay logic in `journalreplayer.cc`. To implement journal replay, you will write a derived class of `chickadeefs::journalreplayer`, providing your own implementations of the `write_block` and `write_replay_complete` virtual functions.
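The shape of such a subclass looks roughly like this. The base class here is a hypothetical stand-in so the sketch is self-contained; the real interface in `chickadeefs.hh` differs in its exact `write_block` parameters, so check the header before writing yours.

```cpp
#include <cstdint>
#include <cstring>

using blocknum_t = uint32_t;

// Hypothetical stand-in for the `chickadeefs::journalreplayer` interface;
// only the two virtual hooks you must override are modeled.
struct journalreplayer {
    virtual ~journalreplayer() {}
    // Called once per block that must be replayed into the main file system.
    virtual void write_block(uint16_t tid, uint8_t* buf, unsigned len,
                             blocknum_t bn, unsigned flags) = 0;
    // Called after all committed-but-incomplete transactions are replayed.
    virtual void write_replay_complete() = 0;
};

// Example subclass: copy replayed blocks into an in-memory disk image.
struct memreplayer : public journalreplayer {
    uint8_t* disk;
    bool done = false;
    explicit memreplayer(uint8_t* d) : disk(d) {}
    void write_block(uint16_t, uint8_t* buf, unsigned len,
                     blocknum_t bn, unsigned) override {
        memcpy(disk + size_t(bn) * 4096, buf, len);
    }
    void write_replay_complete() override {
        done = true;    // e.g., flush the device or mark replay finished
    }
};
```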
Journal cheats
It’s a pain in the butt to get all aspects of journaling correct at once, so the journal format allows some cheating.
- If a `checksum` or `bchecksum` field equals `chickadeefs::nochecksum`, then the corresponding checksum is not checked. So you can write that value for checksums at first.
- In practice, we don't care if you implement block escaping: it's really unlikely that any journal block will start with `journalmagic` and have a valid checksum, and the journal replayer interprets blocks with invalid checksums as datablocks, no matter how they start. Of course, if you implement escaping, that's better!