The ChickadeeFS journal is used to make ChickadeeFS crash resistant. The journal consists of `sb.njournal` blocks, stored starting at block `sb.journal_bn`. Our handout file system has 128 blocks in its journal area.
ChickadeeFS uses physical journaling: the journal contains full copies of the disk blocks that should be written. Physical journaling and checksumming can resist aggressive “Ragnarok” failures, where a disk block in the process of being written can be arbitrarily corrupted by a crash.
The journal consists of a series of records, each of which contains exactly one metablock followed by zero or more associated datablocks. The records are stored in circular fashion, so the *i*th block written to the journal is stored at block number `sb.journal_bn + i % sb.njournal`. A journal transaction consists of one or more records. The journal size limits the size of transactions: no transaction can contain more than `sb.njournal` blocks.
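The circular placement rule can be sketched as a small helper. The struct here is a hypothetical stand-in whose fields mirror `sb.journal_bn` and `sb.njournal`:

```cpp
#include <cstdint>

using blocknum_t = uint32_t;  // assumed 32-bit block numbers

// Hypothetical stand-in for the superblock fields used here.
struct superblock_info {
    blocknum_t journal_bn;    // first block of the journal area
    blocknum_t njournal;      // number of blocks in the journal area
};

// Disk location of the `i`th block ever written to the journal.
// The journal is circular, so positions wrap modulo `njournal`.
inline blocknum_t journal_block_position(const superblock_info& sb, uint64_t i) {
    return sb.journal_bn + blocknum_t(i % sb.njournal);
}
```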
Metablocks
Metablocks follow this format:
```cpp
static constexpr uint64_t journalmagic = 0xFBBFBB009EEBCEEDUL;
typedef uint16_t tid_t; ...

struct jmetablock {
    uint64_t magic;          // must equal `chkfs::journalmagic`
    uint32_t checksum;       // CRC32C checksum of block starting at `seq`
    uint32_t padding;
    // checksum starts here:
    tid_t seq;               // sequence number
    tid_t tid;               // associated tid
    tid_t commit_boundary;   // first noncommitted tid
    tid_t complete_boundary; // first noncompleted tid
    uint16_t flags;          // see `jf_` constants
    uint16_t nref;           // # valid entries in `ref`
    jblockref ref[ref_size];
};
struct jblockref {
    blocknum_t bn;           // destination block number
    uint32_t bchecksum;      // CRC32C checksum of block data
    uint16_t bflags;         // see `jbf_` constants
};
```
The `jmb.nref` component says how many `ref` entries are valid. Each `ref` corresponds to a following datablock; the `bn` component says where the corresponding datablock contents belong on the disk, and the `bchecksum` equals the `crc32c` of the block data. A metablock can refer to at most `chkfs::ref_size` = 339 datablocks; a transaction containing more than 339 datablocks must use multiple records.
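The record count for a transaction follows directly from that limit; a minimal sketch:

```cpp
#include <cstddef>

// `chkfs::ref_size` from the metablock format: at most 339 refs per metablock.
constexpr size_t ref_size = 339;

// Minimum number of journal records needed to carry `ndata` datablocks.
// Even a transaction with no datablocks needs at least one record.
constexpr size_t records_needed(size_t ndata) {
    return ndata == 0 ? 1 : (ndata + ref_size - 1) / ref_size;
}
```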
Metablock validity
ChickadeeFS journaling aims to survive arbitrary disk corruption of writes in flight, which requires a bunch of crosschecks. Here’s how it works.
Metablocks have a magic number and checksum: the `jmb.magic` field (the first 8 bytes of the metablock) must equal `journalmagic`, and the `jmb.checksum` field must equal `crc32c(&jmb.seq, 4080)`. (But see journal cheats below.)
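The validity check can be sketched as follows. The CRC32C here is a bitwise implementation of the Castagnoli polynomial for illustration (a real implementation would be table-driven or hardware-accelerated), and `metablock_checksum_ok` is a hypothetical helper name. Note the covered region: a 4096-byte block minus the 16 bytes before `seq` leaves 4080 bytes.

```cpp
#include <cstdint>
#include <cstddef>
#include <cstring>

// Bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78).
inline uint32_t crc32c(const void* data, size_t len) {
    const uint8_t* p = static_cast<const uint8_t*>(data);
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i != len; ++i) {
        crc ^= p[i];
        for (int b = 0; b != 8; ++b) {
            crc = (crc & 1) ? (crc >> 1) ^ 0x82F63B78u : (crc >> 1);
        }
    }
    return ~crc;
}

// A metablock's checksum covers everything from `seq` onward: the magic
// (8 bytes), checksum (4), and padding (4) are excluded, so the covered
// region is block bytes [16, 4096), i.e. 4080 bytes.
inline bool metablock_checksum_ok(const uint8_t block[4096]) {
    uint32_t stored;
    memcpy(&stored, block + 8, sizeof(stored));   // the `checksum` field
    return stored == crc32c(block + 16, 4080);    // starts at `seq`
}
```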
Metablocks must be reliably distinguishable from datablocks for replay to work. Unfortunately, a datablock can contain anything! What if it looks like a metablock? Block escaping solves this problem. If a datablock’s first eight bytes equal `journalmagic`, then the journal writer must replace the block’s first eight bytes with 0 before writing it to the journal. The corresponding `ref` record has the `jbf_escaped` flag set; this flag tells the replayer to replace those 0 bytes with `journalmagic` before writing the block to the main file system. (Note that the `ref.bchecksum` field is computed on the journaled version, i.e., the escaped version. Since block escaping is annoying, see journal cheats below.)
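Escaping and unescaping can be sketched like this (the function names are hypothetical; the real `jbf_escaped` bookkeeping lives in the journal writer and replayer):

```cpp
#include <cstdint>
#include <cstring>

constexpr uint64_t journalmagic = 0xFBBFBB009EEBCEEDUL;

// If a datablock's first 8 bytes happen to equal `journalmagic`, zero them
// before writing the block to the journal. Returns true iff the block was
// escaped, in which case the caller must set `jbf_escaped` on the ref.
inline bool escape_datablock(uint8_t block[4096]) {
    uint64_t head;
    memcpy(&head, block, sizeof(head));
    if (head != journalmagic) {
        return false;
    }
    memset(block, 0, sizeof(uint64_t));
    return true;
}

// During replay, a ref marked `jbf_escaped` gets its magic restored before
// the block is written to the main file system.
inline void unescape_datablock(uint8_t block[4096]) {
    memcpy(block, &journalmagic, sizeof(journalmagic));
}
```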
Metablocks are written with sequential `seq`s. The first metablock written should have `jmb.seq == 0`, the second `jmb.seq == 1`, and so forth. If more than 65,536 metablocks are written, the numbers wrap around.
Transactions
Journal records are grouped into transactions. These are the atomic units of file system change. The replay process figures out the set of committed, but incomplete, transactions in the journal, and replays only those.
The `jmb.tid` component marks the transaction to which a record belongs. `tid`s are assigned sequentially (the first `tid` is 0, the second 1, etc., with wraparound), but typically each transaction will contain several records, so `tid` will rarely equal `seq`.

The first record in a transaction must be marked with `jf_start` (that is, the `jmb.flags` component contains `jf_start`). Transaction commit and completion are calculated using the `jmb.commit_boundary` and `jmb.complete_boundary` fields, which are cumulative acknowledgments. All tids less than `jmb.commit_boundary` have committed, and all tids less than `jmb.complete_boundary` have completed. (Tid comparison is computed using modular arithmetic.)
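The modular comparison works by reading the 16-bit difference as signed, so ordering survives wraparound. A sketch with hypothetical helper names:

```cpp
#include <cstdint>

using tid_t = uint16_t;

// Tids wrap at 65536, so `a` precedes `b` exactly when the 16-bit
// difference, interpreted as signed, is negative.
inline bool tid_lt(tid_t a, tid_t b) {
    return int16_t(a - b) < 0;
}

// Cumulative acknowledgment: a transaction has committed when its tid
// falls (modularly) below the commit boundary. Completion is analogous.
inline bool tid_committed(tid_t tid, tid_t commit_boundary) {
    return tid_lt(tid, commit_boundary);
}
```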
The process of writing a transaction works as follows.

- First, the file system picks a new tid, greater than any previously-written tid. (The new tid will most likely equal the current `commit_boundary`.) It then writes a record for that tid with `jf_start` set, optionally containing datablocks. Only the first record in the tid can have `jf_start`.
- The file system then writes zero or more new records and datablocks for that tid.
- To commit, the file system waits for all prior transactions to commit, then writes one record with `commit_boundary` set to `tid + 1`. (The `jf_commit` flag can also be set for clarity.) This is the commit record. The commit record may contain datablocks (but all future records for the tid must not contain datablocks). It is OK for the `jf_start` record to also commit the transaction.
- The file system must then wait for all prior journal writes for that tid to complete (a write barrier).
- When the transaction’s journal writes complete, the file system performs the actual writes to the main file system, in any order.
- The file system must then wait for all associated main file system writes to complete (another write barrier).
- When the transaction’s main file system writes complete, the file system writes one record for the tid with `complete_boundary` set to `tid + 1`. (The `jf_complete` flag can also be set for clarity.) This record marks the transaction as complete.
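The protocol above can be sketched against an in-memory stand-in for the journal. Everything here is hypothetical: the real implementation issues disk writes with write barriers between the phases, and the real `jf_` constants come from the ChickadeeFS headers.

```cpp
#include <cstdint>
#include <vector>

using tid_t = uint16_t;

// Hypothetical flag values standing in for the real `jf_` constants.
enum : uint16_t { jf_start = 0x1, jf_commit = 0x2, jf_complete = 0x4 };

// A stripped-down metablock: just the fields the protocol manipulates.
struct mini_metablock {
    tid_t seq;
    tid_t tid;
    tid_t commit_boundary;
    tid_t complete_boundary;
    uint16_t flags;
    unsigned ndata;          // datablocks carried by this record
};

// In-memory stand-in for the journal device.
struct mini_journal {
    std::vector<mini_metablock> records;
    tid_t next_seq = 0;
    tid_t commit_boundary = 0;
    tid_t complete_boundary = 0;

    void append(tid_t tid, uint16_t flags, unsigned ndata) {
        records.push_back({next_seq++, tid, commit_boundary,
                           complete_boundary, flags, ndata});
    }

    // One transaction, following the protocol in the text.
    void run_transaction(tid_t tid, unsigned ndata) {
        append(tid, jf_start, ndata);      // start record, carrying data
        // ... write barrier: wait for journal writes to land ...
        commit_boundary = tid + 1;         // commit record advances boundary
        append(tid, jf_commit, 0);
        // ... write barrier: perform and await main file system writes ...
        complete_boundary = tid + 1;       // completion record
        append(tid, jf_complete, 0);
    }
};
```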
The journal blocks associated with a transaction must not be overwritten until the transaction successfully completes (i.e., all its writes are successfully written to the main file system). This might delay later file system changes.
Typical usage
A minimal journal transaction consists of two records. The first record contains one metablock and *n* datablocks; the metablock both starts and commits the transaction (so it has `jf_start` set and its `commit_boundary` equals `tid + 1`). The second metablock has no datablocks; it just marks the transaction as complete (so its `complete_boundary` equals `tid + 1`).
The simplest journal will contain journal records in strictly increasing tid order, where tid `i + 1` is not started until tid `i` is complete. You can, however, change this if it’s convenient for you, or if you want to experiment with higher-performance journal writes. It is OK for transaction `i + 1` to start before transaction `i` commits, and it is OK for transaction `i + 1` to commit before transaction `i` completes. It is also OK for a single record to commit or complete multiple transactions at once. However, transactions must commit and complete in monotonically increasing order.
Journal modes
Several journaling modes exist, including full (also called data), ordered, and writeback journaling, that are distinguished by how much data they write to the journal and how much data is recoverable after a crash. You may implement any journal mode.
The most important purpose of journaling is to preserve the important file system invariants, namely:
- Every block is used for exactly one purpose.
- All referenced blocks are initialized.
- Every referenced block is marked allocated.
- Every unreferenced block is marked free.
Breaks in the first three invariants can cause total file system corruption. (Breaks in invariant 4 cause storage leaks, which are less serious.)
To preserve these invariants, every ChickadeeFS transaction must contain at minimum all metadata blocks touched by the transaction—that is, it must contain all free-block-bitmap, inode, directory-data, and indirect-extent blocks modified by the transaction. The “writeback” journal mode writes only these blocks to the journal.
The second most important purpose of journaling is to make sure data writes survive system crashes—in other words, that after reboot, the user’s data is preserved. Writeback mode is insufficient for this. The simplest way to recover user data is full journaling, where a journal transaction also contains all data blocks modified by the transaction. This can be very expensive, since data writes vastly outnumber metadata writes in most workloads. The ordered mode instead uses ordering to preserve user data: in an ordered-mode transaction, the file system writes data blocks directly to the main file system, but waits for those writes to complete before committing the associated transaction.
For more information on journaling modes, you might be interested in Section 3.1 of “Analysis and Evolution of Journaling File Systems”.
Block reuse
Block reuse represents a special problem for journaled file systems. Say a block is used for metadata (e.g., directory data or an indirect-extent block), but the associated file is deleted. That block should be freed. However, the block must not be reused until the associated transaction commits! (If it were reused before commit, a badly-timed crash could violate the file system invariants.) Your file system should take care to not reuse metadata blocks until it is safe to do so. This will likely affect your block allocator.
Advanced: In some journaling modes, a transaction might need to mark a block as irrelevant, preventing that block from being overwritten at replay time. An example is if an ordered journal contains a block `bn` written as metadata by committed transaction `i - 2`, then reused for data in committed transaction `i`. The version of `bn` in `i - 2` should not be written over the version in `i` (since both transactions have committed); but the version in `i` wasn’t written to the journal! To mark such blocks, add a `ref` entry for `bn` to transaction `i` with the `jbf_nonjournaled` flag and no corresponding datablock. This flag tells the replayer to ignore previously-committed versions of the datablock.
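Constructing such a ref might look like this (the `jbf_` value and helper name are hypothetical placeholders; the real constants live in the ChickadeeFS headers):

```cpp
#include <cstdint>

using blocknum_t = uint32_t;   // assumed 32-bit block numbers

// Hypothetical stand-ins for the real `jbf_` constants.
enum : uint16_t { jbf_escaped = 0x1, jbf_nonjournaled = 0x2 };

struct jblockref {
    blocknum_t bn;         // destination block number
    uint32_t bchecksum;    // CRC32C checksum of block data
    uint16_t bflags;       // see `jbf_` constants
};

// A nonjournaled ref names `bn` but is followed by *no* datablock in the
// journal: `nref` grows, but the record's datablock count does not. The
// replayer uses it to suppress earlier committed versions of `bn`.
inline jblockref make_nonjournaled_ref(blocknum_t bn) {
    return jblockref{bn, 0 /* checksum unused */, jbf_nonjournaled};
}
```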
Replay
The replay process examines all blocks in the journal and figures out what transactions are committed, but not complete. This process uses metablock sequence numbers, tids, and checksums.
A valid journal contains the following transactions, in increasing tid order:

- Zero or more complete transactions. These transactions have `tid < complete_boundary`. Some of them may be invalid, because their contents have been partially overwritten by later records; some of them may be valid.
- Zero or more committed transactions. These transactions have `complete_boundary <= tid < commit_boundary`. Furthermore, the journal replayer can find these transactions’ `jf_start` metablocks, and all succeeding records up to a commit record, with no gaps in sequence number, and with all checksums matching. These transactions must be replayed.
- Zero or more pseudo-committed transactions. These transactions have `complete_boundary <= tid < commit_boundary` (so their tids appear to be committed), but some records are missing or have invalid checksums (so they are not actually committed). This can happen because we allow commit records to be written in parallel with other writes for the same transaction. Pseudo-committed transactions must not be replayed.
- Zero or more uncommitted transactions with `commit_boundary <= tid`. These transactions must not be replayed.
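A first pass at classification uses only the boundaries; a sketch with hypothetical names. Note that a tid in the committed range is still only a replay *candidate*: the replayer must verify that all of its records are present with matching checksums, or the transaction is pseudo-committed and must be skipped.

```cpp
#include <cstdint>

using tid_t = uint16_t;

enum class tx_status { complete, committed, uncommitted };

// Modular tid comparison: `a` precedes `b` when the signed 16-bit
// difference is negative.
inline bool tid_lt(tid_t a, tid_t b) {
    return int16_t(a - b) < 0;
}

// Classify a transaction by its position relative to the boundaries.
// `committed` here means "candidate for replay, pending record checks".
inline tx_status classify(tid_t tid, tid_t commit_boundary,
                          tid_t complete_boundary) {
    if (tid_lt(tid, complete_boundary)) {
        return tx_status::complete;
    } else if (tid_lt(tid, commit_boundary)) {
        return tx_status::committed;
    } else {
        return tx_status::uncommitted;
    }
}
```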
It is a serious error, indicating a problem with your journaling logic, if transactions don’t appear in this order (e.g., a committed transaction has a higher tid than a pseudo-committed transaction).
But what are `commit_boundary` and `complete_boundary`? To find them, a journal replayer must first find the valid metablock with the largest `seq`. This metablock’s cumulatively-acknowledged `commit_boundary` and `complete_boundary` are used to solve the replay problem.
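Choosing the metablock with the largest `seq` also requires modular comparison, since sequence numbers wrap. A sketch over the sequence numbers of the valid metablocks found in the journal:

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

using tid_t = uint16_t;

// Sequence numbers wrap like tids, so compare them modularly.
inline bool seq_lt(tid_t a, tid_t b) {
    return int16_t(a - b) < 0;
}

// Given the `seq`s of the valid metablocks, return the index of the one
// with the (modularly) largest sequence number. Its `commit_boundary` and
// `complete_boundary` drive replay.
inline size_t latest_metablock(const std::vector<tid_t>& seqs) {
    size_t best = 0;
    for (size_t i = 1; i != seqs.size(); ++i) {
        if (seq_lt(seqs[best], seqs[i])) {
            best = i;
        }
    }
    return best;
}
```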
The `chkfs::journalreplayer` class, defined in `journalreplayer.cc`, implements a lot of this journal logic. To implement journal replay, you will write a derived class of `chkfs::journalreplayer`, providing your own implementations of the `write_block` and `write_replay_complete` virtual functions.
Journal cheats
It’s a pain in the butt to get all aspects of journaling correct at once, so the journal format allows some cheating.
- If a `checksum` or `bchecksum` field equals `chkfs::nochecksum`, then the corresponding checksum is not checked. So you can write that value for checksums at first.
- In practice, we don’t care if you implement block escaping: it’s really unlikely that any journal block will start with `journalmagic` and have a valid checksum, and the journal replayer interprets blocks with invalid checksums as datablocks, no matter how they start. Of course, if you implement escaping, that’s better!