ChickadeeFS, or chkfs for short, is the file system design you’ll implement in problem set 4. This page describes it.
Structure
A chkfs is composed of 4096-byte blocks. (Thus, each chkfs block corresponds to a memory page.) A chkfs can contain up to nine kinds of block:
- The superblock.
- Swap blocks.
- Free-block-bitmap blocks.
- Inode blocks.
- Indirect blocks.
- Doubly-indirect (or indirect2) blocks.
- Data blocks.
- Directory blocks.
- Journal blocks.
Of these, only data blocks contain file data (the other kinds contain file metadata, such as file size information and directory listings). User programs can directly read and write data blocks, and can read (but not directly write) directory blocks. All other kinds of block are read and modified only by special-purpose system calls.
The chkfs groups the block types into regions, which appear in a fixed order.
- First comes exactly one superblock, in block 0.
- Then, zero or more swap blocks (zero in the handout).
- Then, one or more free-block-bitmap blocks (one in the handout).
- Then, one or more inode blocks (14 in the handout).
- Then, one or more blocks in the data region, which comprises data, indirect, indirect2, directory, and free blocks (~32000 in the handout).
- Finally, zero or more blocks in the journal region (128 in the handout).
ChickadeeFS structures are declared in `chickadeefs.hh`. The structures are part of a namespace `chickadeefs`, so, for example, you refer to an inode structure using the notation `chickadeefs::inode`. It may be useful to add `using namespace chickadeefs;` to a function definition or scope; this brings all the `chickadeefs` namespace definitions into scope so you can leave off the `chickadeefs::` prefix.
Superblock
The superblock contains overall file system information, such as the total number of blocks and the location of the first inode block. It is stored in disk block 0, `chickadeefs::superblock_offset` bytes into the block (that is, at offset 512, though you should use the symbolic constant). This is because the first 512 bytes of the disk are reserved for the boot sector.
struct superblock {
    uint64_t magic;           // must equal `chickadeefs::magic`
    blocknum_t nblocks;       // # blocks in file system
    blocknum_t nswap;         // # blocks in swap space (unused)
    inum_t ninodes;           // # inodes in file system
    blocknum_t njournal;      // # journal blocks in file system
    blocknum_t swap_bn;       // first swap space block
    blocknum_t fbb_bn;        // first free block bitmap block
    blocknum_t inode_bn;      // first inode block
    blocknum_t data_bn;       // first data-area block
    blocknum_t journal_bn;    // first block in journal
};
Superblock components named `*_bn` are block numbers. For instance, in our handout file system, `sb.fbb_bn == 1`. This indicates that the first block in the free block bitmap is block number 1, which covers disk bytes 4096–8191 (and disk sectors 8–15).
The superblock need never be written.
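As a concrete illustration, here is one way the superblock could be pulled out of a raw disk image. This is a sketch, not the kernel's actual code path (the real kernel reads block 0 through its buffer cache); the `memcpy`-based read and the in-memory disk pointer are assumptions, while the struct layout and `superblock_offset == 512` follow the declarations above.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

using blocknum_t = uint32_t;
using inum_t = uint32_t;

// Mirrors `chickadeefs::superblock` from chickadeefs.hh.
struct superblock {
    uint64_t magic;
    blocknum_t nblocks, nswap;
    inum_t ninodes;
    blocknum_t njournal, swap_bn, fbb_bn, inode_bn, data_bn, journal_bn;
};

constexpr size_t superblock_offset = 512;    // after the boot sector

// Copy the superblock out of an in-memory disk image (sketch only; the
// real kernel reads disk block 0 through its buffer cache).
superblock read_superblock(const uint8_t* disk) {
    superblock sb;
    memcpy(&sb, disk + superblock_offset, sizeof(sb));
    return sb;
}
```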
Free block bitmap
The free block bitmap contains one bit per file system block, for a total of `sb.nblocks` bits. The bit for block `bn` is 1 iff that block is in the data area and is free for allocation. This function looks up the value of the bit for block `bn`:
bool fbb_lookup(blocknum_t bn) {
    unsigned char* fbb = /* read FBB into contiguous memory */;
    return (fbb[bn / 8] & (1 << (bn % 8))) != 0;
}
A single FBB block can represent up to 32768 blocks. Your code need not support larger chkfs file systems.
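For symmetry with `fbb_lookup`, allocating or freeing a block flips the corresponding bit. This sketch operates on the FBB read into contiguous memory, as above; in the real file system the modified FBB block must also be journaled.

```cpp
#include <cstdint>

using blocknum_t = uint32_t;

// Clear block `bn`'s bit to mark it allocated (0 = in use or non-data).
void fbb_mark_allocated(unsigned char* fbb, blocknum_t bn) {
    fbb[bn / 8] &= ~(1 << (bn % 8));
}

// Set block `bn`'s bit to mark it free for allocation.
void fbb_mark_free(unsigned char* fbb, blocknum_t bn) {
    fbb[bn / 8] |= 1 << (bn % 8);
}
```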
Inodes
The inode area is an array of inodes. An inode block contains space for 64 inodes; inode number `inum` is located in block number `sb.inode_bn + inum/inodesperblock`, at byte offset `sizeof(inode) * (inum%inodesperblock)`.
Two inode numbers are special: inode number 0 is always free (it cannot be used), and inode number 1 is the root directory.
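The location arithmetic above can be sketched directly. The 64-byte inode size (so `inodesperblock == 64`) follows from 64 inodes per 4096-byte block; use the real constants from `chickadeefs.hh` in your code, and note that the `inode_bn` value in the check is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

using blocknum_t = uint32_t;
using inum_t = uint32_t;

constexpr size_t blocksize = 4096;
constexpr size_t inodesperblock = 64;
constexpr size_t inodesize = blocksize / inodesperblock;   // 64 bytes

struct inode_location {
    blocknum_t bn;         // disk block containing the inode
    size_t byte_offset;    // offset of the inode within that block
};

// Translate inode number `inum` into its on-disk position, given the
// first inode block `inode_bn` from the superblock.
inode_location locate_inode(blocknum_t inode_bn, inum_t inum) {
    return inode_location{
        blocknum_t(inode_bn + inum / inodesperblock),
        inodesize * (inum % inodesperblock)
    };
}
```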
struct inode {
    uint32_t type;                  // file type (regular, directory, or 0/none)
    uint32_t size;                  // file size
    uint32_t nlink;                 // # hard links to file
    std::atomic<uint32_t> mlock;    // used in memory
    std::atomic<uint32_t> mref;     // used in memory
    blocknum_t direct[ndirect];     // block pointers
    blocknum_t indirect;
    blocknum_t indirect2;
};
The `inode::type` member is either 0, meaning the inode is free, or `type_directory` or `type_file`, indicating the inode holds directory data or file data. `inode::size` is the number of bytes in the file. `inode::nlink` is the number of hard links to the file.

`inode::direct` contains 9 direct block pointers. Files of size 9 blocks or less (i.e., 36,864 bytes or less) will use only direct block pointers. Block number 0 means no pointer. `inode::indirect` and `inode::indirect2` contain the indirect and indirect2 pointers, if any.

The `inode::mlock` and `inode::mref` members are available for in-memory use; they are cleared to 0 whenever an inode is loaded into memory. The current plan is described in problem set 4.
Inode size
In the common case, inode sizes and block pointers match up. That is, if the file size is `sz`, then the file contains `nb = ROUNDUP(sz, blocksize) / blocksize` blocks, where every data-block pointer with index less than `nb` is nonzero (including both direct pointers and pointers stored in indirect and indirect2 blocks), and every data-block pointer with index greater than or equal to `nb` is zero.
But it is also OK for the inode's size to be bigger or smaller than its block pointers suggest. If some block pointers with index < `nb` are zero, then the file is a sparse file. When read, missing regions of the file should appear to contain zero bytes; when written, the file system should allocate blocks for file data on the fly. If, on the other hand, some block pointers with index >= `nb` are nonzero, then the file has some spare preallocated data in case the file is extended.
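The size-to-block-count arithmetic and the pointer-region boundaries can be sketched as follows. Here `ndirect == 9` comes from the inode description above; `nindirect == 1024` (pointers per 4096-byte indirect block) is an assumption based on 4-byte block numbers.

```cpp
#include <cstddef>
#include <cstdint>

constexpr size_t blocksize = 4096;
constexpr size_t ndirect = 9;                               // direct pointers per inode
constexpr size_t nindirect = blocksize / sizeof(uint32_t);  // 1024 pointers per indirect block

// The `nb` from the text: number of data blocks implied by file size `sz`.
size_t size_to_nblocks(size_t sz) {
    return (sz + blocksize - 1) / blocksize;   // ROUNDUP(sz, blocksize) / blocksize
}

enum class ptr_kind { direct, indirect, indirect2 };

// Which pointer region covers data-block index `bi` of a file.
ptr_kind classify(size_t bi) {
    if (bi < ndirect) {
        return ptr_kind::direct;
    } else if (bi < ndirect + nindirect) {
        return ptr_kind::indirect;
    } else {
        return ptr_kind::indirect2;
    }
}
```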
In the common case, the `nlink` field is zero iff the inode is free (i.e., `type == 0`). However, in some cases, an allocated inode will have a zero link count. This happens if the last link to a file is removed while some program has a reference to the inode. The inode should be freed and the file data recycled only after the last reference to the inode is dropped.
Directories
A directory (that is, an inode with `type == type_directory`) comprises an array of `dirent` structures. Its size must be a multiple of `sizeof(dirent) == 128`.
struct dirent {
    inum_t inum;                  // inode # (0 = none)
    char name[maxnamelen + 1];    // file name (null terminated)
};
A directory entry with `inum == 0` is ignored. An entry with `inum != 0` says that this directory contains a link to the given file under the specified name. A directory should not contain duplicate names.
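For illustration, a name lookup might scan a directory's data like this. It is a sketch: `maxnamelen == 123` is an assumption chosen so that `sizeof(dirent) == 128`, matching the text, and the real kernel would read the directory's data blocks through its buffer cache rather than from one contiguous array.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

using inum_t = uint32_t;
constexpr size_t maxnamelen = 123;   // assumed, so sizeof(dirent) == 128

struct dirent {
    inum_t inum;                  // inode # (0 = none)
    char name[maxnamelen + 1];    // file name (null terminated)
};
static_assert(sizeof(dirent) == 128, "directory entries are 128 bytes");

// Return the inode number linked under `name` in this directory's data,
// or 0 if no entry matches. `sz` is the directory inode's size in bytes.
inum_t dir_lookup(const dirent* entries, size_t sz, const char* name) {
    for (size_t off = 0; off + sizeof(dirent) <= sz; off += sizeof(dirent)) {
        const dirent* de = &entries[off / sizeof(dirent)];
        if (de->inum != 0 && strcmp(de->name, name) == 0) {
            return de->inum;      // skip empty (`inum == 0`) entries
        }
    }
    return 0;
}
```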
Journal
The journal consists of `sb.njournal` blocks, stored starting at `sb.journal_bn`. Our handout file system has 128 blocks in its journal area.
The journal comprises a series of records, each of which contains exactly one metablock and zero or more associated datablocks. The records are stored in circular fashion, so the `i`th block written to the journal is stored at block number `sb.journal_bn + i % sb.njournal`. A journal transaction consists of one or more records.
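The circular placement rule is just modular arithmetic; the numbers in the check below are hypothetical.

```cpp
#include <cstdint>

using blocknum_t = uint32_t;

// Block number at which the i-th journal block write lands, given the
// journal's start and length from the superblock.
blocknum_t journal_block(blocknum_t journal_bn, blocknum_t njournal,
                         uint64_t i) {
    return journal_bn + i % njournal;
}
```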
Metablocks follow this format:
static constexpr uint64_t journalmagic = 0xFBBFBB009EEBCEEDUL;
typedef uint16_t tid_t; ...

struct jblockref {            // component of `jmetablock`
    blocknum_t bn;            // destination block number
    uint32_t bchecksum;       // CRC32C checksum of block data
    uint16_t bflags;          // see `jbf_` constants
};
struct jmetablock {
    uint64_t magic;           // must equal `chickadeefs::journalmagic`
    uint32_t checksum;        // CRC32C checksum of block starting at `seq`
    uint32_t padding;         //   (not including magic, checksum, or padding)
    tid_t seq;                // sequence number
    tid_t tid;                // associated tid
    tid_t commit_boundary;    // first noncommitted tid
    tid_t complete_boundary;  // first noncompleted tid
    uint16_t flags;           // see `jf_` constants
    uint16_t nref;            // # valid entries in `ref`
    jblockref ref[ref_size];
};
The `jmb.nref` component says how many `ref` entries are valid. Each `ref` corresponds to a following datablock; the `bn` component says where the corresponding datablock contents belong on the disk, and the `bchecksum` equals the `crc32c` of the block data. A metablock can refer to at most `chickadeefs::ref_size` = 339 datablocks; a transaction containing more than 339 datablocks must use multiple records.
Metablock validity
The metablock design aims to survive arbitrary disk corruption of writes in flight, which requires a bunch of crosschecks. Here’s how it works.
Metablocks carry a magic number and a checksum: the `jmb.magic` field (the first 8 bytes of the metablock) must equal `journalmagic`, and the `jmb.checksum` field must equal `crc32c(&jmb.seq, 4080)`. (But see journal cheats below.)
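A self-contained sketch of this check follows. A slow bitwise CRC32C stands in for the kernel's `crc32c()`, and the field offsets (magic at byte 0, checksum at byte 8, 4080 checksummed bytes starting at byte 16, i.e., at `seq`) follow the `jmetablock` layout above.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr uint64_t journalmagic = 0xFBBFBB009EEBCEEDUL;

// Bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78); slow but
// dependency-free, for illustration only.
uint32_t crc32c(const void* buf, size_t len) {
    const uint8_t* p = static_cast<const uint8_t*>(buf);
    uint32_t crc = ~0U;
    for (size_t i = 0; i != len; ++i) {
        crc ^= p[i];
        for (int k = 0; k != 8; ++k) {
            crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78U : 0);
        }
    }
    return ~crc;
}

// Validity check for a 4096-byte journal block interpreted as a metablock:
// magic must match and `checksum` must cover the 4080 bytes from `seq` on.
bool metablock_valid(const uint8_t block[4096]) {
    uint64_t magic;
    uint32_t checksum;
    memcpy(&magic, block, sizeof(magic));
    memcpy(&checksum, block + 8, sizeof(checksum));
    return magic == journalmagic
        && checksum == crc32c(block + 16, 4080);
}
```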
Metablocks must be reliably distinguishable from datablocks for replay to work. Unfortunately, a datablock can contain anything! What if it looks like a metablock? Block escaping solves this problem. If a datablock starts with `journalmagic`, then the journal writer must modify the block before writing it: the datablock's first eight bytes are replaced with 0 in the journaled version. The corresponding `ref` record has the `jbf_escaped` flag set; this flag tells the replayer to replace those 0 bytes with `journalmagic` before writing the block to the main file system. (Note that the `ref.bchecksum` field is computed over the journaled version, i.e., the escaped version. This is annoying, so see journal cheats below.)
Metablocks are written with sequential `seq`s. The first metablock written should have `jmb.seq == 0`, the second `jmb.seq == 1`, and so forth. If more than 65,536 metablocks are written, the numbers will wrap around.
Transactions
Journal records are grouped into transactions. These are the atomic units of file system change. The replay process figures out the set of committed, but incomplete, transactions in the journal, and replays only those.
The `jmb.tid` component marks the transaction to which a record belongs. `tid`s are assigned sequentially (the first `tid` is 0, the second 1, etc., with wraparound), but typically each transaction will contain several records, so `tid` will rarely equal `seq`.
The first record in a transaction must be marked with `jf_start` (that is, the `jmb.flags` component contains `jf_start`). Transaction commit and completion are calculated using the `jmb.commit_boundary` and `jmb.complete_boundary` fields, which are cumulative acknowledgments. All tids less than `jmb.commit_boundary` have committed, and all tids less than `jmb.complete_boundary` have completed. (Tid comparison is computed using modular arithmetic.)
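Since `tid_t` is a `uint16_t`, "less than" here means wraparound-safe sequence-number comparison. One common sketch:

```cpp
#include <cstdint>

using tid_t = uint16_t;

// `a` is before `b` in modular order iff the 16-bit signed difference is
// negative (standard sequence-number arithmetic).
bool tid_lt(tid_t a, tid_t b) {
    return int16_t(a - b) < 0;
}

// A tid has committed iff tid < commit_boundary in this modular order.
bool tid_committed(tid_t tid, tid_t commit_boundary) {
    return tid_lt(tid, commit_boundary);
}
```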
The process of writing a transaction works as follows.

1. First, the file system picks a new tid, greater than any previously-written tid. (The new tid will most likely equal the current `commit_boundary`.) It then writes a record for that tid with `jf_start` set, optionally containing datablocks. Only the first record in the tid can have `jf_start`.
2. The file system then writes zero or more new records and datablocks for that tid.
3. To commit, the file system waits for all prior transactions to commit, then writes one record with `commit_boundary` set to `tid + 1`. (The `jf_commit` flag can also be set for clarity.) This is the commit record. The commit record may contain datablocks (but all future records for the tid must not contain datablocks). It is OK for the commit record to also start the transaction.
4. The file system must then wait for all prior journal writes for that tid to complete (a write barrier).
5. When the transaction's journal writes complete, the file system performs the actual writes to the main file system, in any order.
6. The file system must then wait for all associated main file system writes to complete (another write barrier).
7. When the transaction's main file system writes complete, the file system writes one record for the tid with `complete_boundary` set to `tid + 1`. (The `jf_complete` flag can also be set for clarity.) This record marks the transaction as complete.
The journal blocks associated with a transaction must not be overwritten until the transaction successfully completes (i.e., all its writes are successfully written to the main file system). This might delay later file system changes.
Typical journal usage
A minimal journal transaction consists of two records. The first record contains one metablock and `n` datablocks; the metablock both starts and commits the transaction (so it has `jf_start` set and its `commit_boundary` equals `tid+1`). The second metablock has no datablocks; it just marks the transaction as complete (so its `complete_boundary` equals `tid+1`).
The simplest journal will contain journal records in strictly increasing tid order, where tid `i+1` is not started until tid `i` is complete. You can, however, change this if it's convenient for you, or you want to experiment with higher-performance journal writes. It is OK for transaction `i+1` to start before transaction `i` commits, and it is OK for transaction `i+1` to commit before transaction `i` completes. It is also OK for a single record to commit or complete multiple transactions at once. However, transactions must commit and complete in monotonically increasing order.
Journal modes
Several journaling modes exist, including full (also called data), ordered, and writeback journaling; they are distinguished by how much data they write to the journal and how much data is recoverable after a crash. You may implement any journal mode.
The most important purpose of journaling is to preserve the important file system invariants, namely:
- Every block is used for exactly one purpose.
- All referenced blocks are initialized.
- Every referenced block is marked allocated.
- Every unreferenced block is marked free.
Breaks in the first three invariants can cause total file system corruption. (Breaks in invariant 4 cause storage leaks, which are less serious.)
To preserve these invariants, every ChickadeeFS transaction must contain at minimum all metadata blocks touched by the transaction—that is, it must contain all free-block-bitmap, inode, directory-data, indirect, and indirect2 blocks modified by the transaction. The “writeback” journal mode writes only these blocks to the journal.
The second most important purpose of journaling is to make sure data writes survive system crashes—in other words, that after reboot, the user’s data is preserved. Writeback mode is insufficient for this purpose. The simplest way to recover user data is full journaling, where a journal transaction also contains all data blocks modified by the transaction. This can be very expensive, since data writes vastly outnumber metadata writes in most workloads. The ordered mode instead uses ordering to ensure data preservation. In an ordered-mode transaction, the file system writes data blocks directly to the main file system, but waits for those writes to complete before committing the associated transaction.
For more information on journaling modes, you might be interested in Section 3.1 of “Analysis and Evolution of Journaling File Systems”.
Block reuse
Block reuse represents a special problem for journaled file systems. Say a block is used for metadata (e.g., directory data or an indirect block), but the associated file is deleted. That block should be freed. However, the block must not be reused until the associated transaction commits! (If it were reused before commit, a badly-timed crash could violate the file system invariants.) Your file system should take care to not reuse metadata blocks until it is safe to do so. This will likely affect your block allocator.
Advanced: In some journaling modes, a transaction might need to mark a block as irrelevant, preventing that block from being overwritten at replay time. An example is if an ordered journal contains a block `bn` written as metadata by committed transaction `i-2`, then reused for data in committed transaction `i`. The version of `bn` in `i-2` should not be written over the version in `i` (since both transactions have committed); but the version in `i` wasn't written to the journal! To mark such blocks, add a `ref` entry for `bn` to transaction `i` with the `jbf_nonjournaled` flag and no corresponding datablock. This flag tells the replayer to ignore previously-committed versions of the datablock.
Replay
The replay process examines all blocks in the journal and figures out what transactions are committed, but not complete. This process uses metablock sequence numbers, tids, and checksums.
The replayer first finds the valid metablock with the largest `seq`. This metablock's `commit_boundary` and `complete_boundary` are used to distinguish which transactions have committed and/or completed.
A valid journal contains the following transactions, in increasing tid order:

- Zero or more complete transactions. These transactions have `tid < complete_boundary`. Some of them may be invalid, because their contents have been partially overwritten by later records; some of them may be valid.
- Zero or more committed transactions. These transactions have `complete_boundary <= tid < commit_boundary`. Furthermore, the journal replayer can find these transactions' `jf_start` metablocks, and all succeeding records up to a commit record, with no gaps in sequence number, and with all checksums matching. These transactions must be replayed.
- Zero or more pseudo-committed transactions. These transactions have `complete_boundary <= tid < commit_boundary` (so their tids appear to be committed), but some records are missing or have invalid checksums (so they are not actually committed). This can happen because we allow commit records to be written in parallel with other writes for the same transaction. Pseudo-committed transactions must not be replayed.
- Zero or more uncommitted transactions with `commit_boundary <= tid`. These transactions must not be replayed.
It is a serious error, indicating a problem with your journaling logic, if transactions don’t appear in this order (e.g., a committed transaction has a higher tid than a pseudo-committed transaction).
We’ve given you the replay logic in `journalreplayer.cc`. To implement journal replay, you will write a derived class of `chickadeefs::journalreplayer`, providing your own implementations of the `write_block` and `write_replay_complete` virtual functions.
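The shape of such a subclass looks roughly like this. The base class here is a hypothetical stand-in so the sketch is self-contained; the real interface in `chickadeefs.hh` differs in its exact `write_block` parameters, so check the header before writing yours.

```cpp
#include <cstdint>
#include <cstring>

using blocknum_t = uint32_t;

// Hypothetical stand-in for the `chickadeefs::journalreplayer` interface;
// only the two virtual hooks you must override are modeled.
struct journalreplayer {
    virtual ~journalreplayer() {}
    // Called once per block that must be replayed into the main file system.
    virtual void write_block(uint16_t tid, uint8_t* buf, unsigned len,
                             blocknum_t bn, unsigned flags) = 0;
    // Called after all committed-but-incomplete transactions are replayed.
    virtual void write_replay_complete() = 0;
};

// Example subclass: copy replayed blocks into an in-memory disk image.
struct memreplayer : public journalreplayer {
    uint8_t* disk;
    bool done = false;
    explicit memreplayer(uint8_t* d) : disk(d) {}
    void write_block(uint16_t, uint8_t* buf, unsigned len,
                     blocknum_t bn, unsigned) override {
        memcpy(disk + size_t(bn) * 4096, buf, len);
    }
    void write_replay_complete() override {
        done = true;    // e.g., flush the device or mark replay finished
    }
};
```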
Journal cheats
It’s a pain in the butt to get all aspects of journaling correct at once, so the journal format allows some cheating.
- If a `checksum` or `bchecksum` field equals `chickadeefs::nochecksum`, then the corresponding checksum is not checked. So you can write that value for checksums at first.
- In practice, we don't care if you implement block escaping: it's really unlikely that any journal block will start with `journalmagic` and have a valid checksum, and the journal replayer interprets blocks with invalid checksums as datablocks, no matter how they start. Of course, if you implement escaping, that's better!