The ChickadeeFS journal is used to make ChickadeeFS crash resistant. The journal consists of `sb.njournal` blocks, stored starting at block `sb.journal_bn`. Our handout file system has 128 blocks in its journal area.
ChickadeeFS uses physical journaling: the journal contains full copies of the disk blocks that should be written. Physical journaling and checksumming can resist aggressive “Ragnarok” failures, where a disk block in the process of being written can be arbitrarily corrupted by a crash.
The journal consists of a series of records, each of which contains exactly one metablock followed by zero or more associated datablocks. The records are stored in circular fashion, so the *i*th block written to the journal is stored at block number `sb.journal_bn + i % sb.njournal`. A journal transaction consists of one or more records. The journal size limits the size of transactions: no transaction can contain more than `sb.njournal` blocks.
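The circular placement rule can be sketched as a small helper. The struct here is a hypothetical stand-in whose fields mirror `sb.journal_bn` and `sb.njournal`:

```cpp
#include <cstdint>

using blocknum_t = uint32_t;  // assumed 32-bit block numbers

// Hypothetical stand-in for the superblock fields used here.
struct superblock_info {
    blocknum_t journal_bn;    // first block of the journal area
    blocknum_t njournal;      // number of blocks in the journal area
};

// Disk location of the `i`th block ever written to the journal.
// The journal is circular, so positions wrap modulo `njournal`.
inline blocknum_t journal_block_position(const superblock_info& sb, uint64_t i) {
    return sb.journal_bn + blocknum_t(i % sb.njournal);
}
```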
Metablocks
Metablocks follow this format:
```cpp
static constexpr uint64_t journalmagic = 0xFBBFBB009EEBCEEDUL;
typedef uint16_t tid_t; ...

struct jmetablock {
    uint64_t magic;          // must equal `chkfs::journalmagic`
    uint32_t checksum;       // CRC32C checksum of block starting at `seq`
    uint32_t padding;
    // checksum starts here:
    tid_t seq;               // sequence number
    tid_t tid;               // associated tid
    tid_t commit_boundary;   // first noncommitted tid
    tid_t complete_boundary; // first noncompleted tid
    uint16_t flags;          // see `jf_` constants
    uint16_t nref;           // # valid entries in `ref`
    jblockref ref[ref_size];
};
struct jblockref {
    blocknum_t bn;           // destination block number
    uint32_t bchecksum;      // CRC32C checksum of block data
    uint16_t bflags;         // see `jbf_` constants
};
```
The `jmb.nref` component says how many `ref` entries are valid. Each `ref` corresponds to a following datablock; the `bn` component says where the corresponding datablock contents belong on the disk, and the `bchecksum` equals the `crc32c` of the block data. A metablock can refer to at most `chkfs::ref_size` = 339 datablocks; a transaction containing more than 339 datablocks must use multiple records.
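The record count for a transaction follows directly from that limit; a minimal sketch:

```cpp
#include <cstddef>

// `chkfs::ref_size` from the metablock format: at most 339 refs per metablock.
constexpr size_t ref_size = 339;

// Minimum number of journal records needed to carry `ndata` datablocks.
// Even a transaction with no datablocks needs at least one record.
constexpr size_t records_needed(size_t ndata) {
    return ndata == 0 ? 1 : (ndata + ref_size - 1) / ref_size;
}
```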
Metablock validity
ChickadeeFS journaling aims to survive arbitrary disk corruption of writes in flight, which requires a bunch of crosschecks. Here’s how it works.
Metablocks have a magic number and checksum: the `jmb.magic` field (the first 8 bytes of the metablock) must equal `journalmagic`, and the `jmb.checksum` field must equal `crc32c(&jmb.seq, 4080)`. (But see journal cheats below.)
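The validity check can be sketched as follows. The CRC32C here is a bitwise implementation of the Castagnoli polynomial for illustration (a real implementation would be table-driven or hardware-accelerated), and `metablock_checksum_ok` is a hypothetical helper name. Note the covered region: a 4096-byte block minus the 16 bytes before `seq` leaves 4080 bytes.

```cpp
#include <cstdint>
#include <cstddef>
#include <cstring>

// Bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78).
inline uint32_t crc32c(const void* data, size_t len) {
    const uint8_t* p = static_cast<const uint8_t*>(data);
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i != len; ++i) {
        crc ^= p[i];
        for (int b = 0; b != 8; ++b) {
            crc = (crc & 1) ? (crc >> 1) ^ 0x82F63B78u : (crc >> 1);
        }
    }
    return ~crc;
}

// A metablock's checksum covers everything from `seq` onward: the magic
// (8 bytes), checksum (4), and padding (4) are excluded, so the covered
// region is block bytes [16, 4096), i.e. 4080 bytes.
inline bool metablock_checksum_ok(const uint8_t block[4096]) {
    uint32_t stored;
    memcpy(&stored, block + 8, sizeof(stored));   // the `checksum` field
    return stored == crc32c(block + 16, 4080);    // starts at `seq`
}
```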
Metablocks must be reliably distinguishable from datablocks for replay to work. Unfortunately, a datablock can contain anything! What if it looks like a metablock? Block escaping solves this problem. If a datablock’s first eight bytes equal `journalmagic`, then the journal writer must replace the block’s first eight bytes with 0 before writing it to the journal. The corresponding `ref` record has the `jbf_escaped` flag set; this flag tells the replayer to replace those 0 bytes with `journalmagic` before writing the block to the main file system. (Note that the `ref.bchecksum` field is computed on the journaled version, i.e., the escaped version. Since block escaping is annoying, see journal cheats below.)
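Escaping and unescaping can be sketched like this (the function names are hypothetical; the real `jbf_escaped` bookkeeping lives in the journal writer and replayer):

```cpp
#include <cstdint>
#include <cstring>

constexpr uint64_t journalmagic = 0xFBBFBB009EEBCEEDUL;

// If a datablock's first 8 bytes happen to equal `journalmagic`, zero them
// before writing the block to the journal. Returns true iff the block was
// escaped, in which case the caller must set `jbf_escaped` on the ref.
inline bool escape_datablock(uint8_t block[4096]) {
    uint64_t head;
    memcpy(&head, block, sizeof(head));
    if (head != journalmagic) {
        return false;
    }
    memset(block, 0, sizeof(uint64_t));
    return true;
}

// During replay, a ref marked `jbf_escaped` gets its magic restored before
// the block is written to the main file system.
inline void unescape_datablock(uint8_t block[4096]) {
    memcpy(block, &journalmagic, sizeof(journalmagic));
}
```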
Metablocks are written with sequential `seq`s. The first metablock written should have `jmb.seq == 0`, the second `jmb.seq == 1`, and so forth. If more than 65,536 metablocks are written, the numbers wrap around.
Transactions
Journal records are grouped into transactions. These are the atomic units of file system change. The replay process figures out the set of committed, but incomplete, transactions in the journal, and replays only those.
The `jmb.tid` component marks the transaction to which a record belongs. `tid`s are assigned sequentially (the first `tid` is 0, the second 1, etc., with wraparound), but typically each transaction will contain several records, so `tid` will rarely equal `seq`.

The first record in a transaction must be marked with `jf_start` (that is, the `jmb.flags` component contains `jf_start`). Transaction commit and completion are calculated using the `jmb.commit_boundary` and `jmb.complete_boundary` fields, which are cumulative acknowledgments. All tids less than `jmb.commit_boundary` have committed, and all tids less than `jmb.complete_boundary` have completed. (Tid comparison is computed using modular arithmetic.)
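The modular comparison works by reading the 16-bit difference as signed, so ordering survives wraparound. A sketch with hypothetical helper names:

```cpp
#include <cstdint>

using tid_t = uint16_t;

// Tids wrap at 65536, so `a` precedes `b` exactly when the 16-bit
// difference, interpreted as signed, is negative.
inline bool tid_lt(tid_t a, tid_t b) {
    return int16_t(a - b) < 0;
}

// Cumulative acknowledgment: a transaction has committed when its tid
// falls (modularly) below the commit boundary. Completion is analogous.
inline bool tid_committed(tid_t tid, tid_t commit_boundary) {
    return tid_lt(tid, commit_boundary);
}
```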
The process of writing a transaction works as follows.

- First, the file system picks a new tid, greater than any previously-written tid. (The new tid will most likely equal the current `commit_boundary`.) It then writes a record for that tid with `jf_start` set, optionally containing datablocks. Only the first record in the tid can have `jf_start`.
- The file system then writes zero or more new records and datablocks for that tid.
- To commit, the file system waits for all prior transactions to commit, then writes one record with `commit_boundary` set to `tid + 1`. (The `jf_commit` flag can also be set for clarity.) This is the commit record. The commit record may contain datablocks (but all future records for the tid must not contain datablocks). It is OK for the `jf_start` record to also commit the transaction.
- The file system must then wait for all prior journal writes for that tid to complete (a write barrier).
- When the transaction’s journal writes complete, the file system performs the actual writes to the main file system, in any order.
- The file system must then wait for all associated main file system writes to complete (another write barrier).
- When the transaction’s main file system writes complete, the file system writes one record for the tid with `complete_boundary` set to `tid + 1`. (The `jf_complete` flag can also be set for clarity.) This record marks the transaction as complete.
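The protocol above can be sketched against an in-memory stand-in for the journal. Everything here is hypothetical: the real implementation issues disk writes with write barriers between the phases, and the real `jf_` constants come from the ChickadeeFS headers.

```cpp
#include <cstdint>
#include <vector>

using tid_t = uint16_t;

// Hypothetical flag values standing in for the real `jf_` constants.
enum : uint16_t { jf_start = 0x1, jf_commit = 0x2, jf_complete = 0x4 };

// A stripped-down metablock: just the fields the protocol manipulates.
struct mini_metablock {
    tid_t seq;
    tid_t tid;
    tid_t commit_boundary;
    tid_t complete_boundary;
    uint16_t flags;
    unsigned ndata;          // datablocks carried by this record
};

// In-memory stand-in for the journal device.
struct mini_journal {
    std::vector<mini_metablock> records;
    tid_t next_seq = 0;
    tid_t commit_boundary = 0;
    tid_t complete_boundary = 0;

    void append(tid_t tid, uint16_t flags, unsigned ndata) {
        records.push_back({next_seq++, tid, commit_boundary,
                           complete_boundary, flags, ndata});
    }

    // One transaction, following the protocol in the text.
    void run_transaction(tid_t tid, unsigned ndata) {
        append(tid, jf_start, ndata);      // start record, carrying data
        // ... write barrier: wait for journal writes to land ...
        commit_boundary = tid + 1;         // commit record advances boundary
        append(tid, jf_commit, 0);
        // ... write barrier: perform and await main file system writes ...
        complete_boundary = tid + 1;       // completion record
        append(tid, jf_complete, 0);
    }
};
```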
The journal blocks associated with a transaction must not be overwritten until the transaction successfully completes (i.e., all its writes are successfully written to the main file system). This might delay later file system changes.
Typical usage
A minimal journal transaction consists of two records. The first record contains one metablock and *n* datablocks; the metablock both starts and commits the transaction (so it has `jf_start` set and its `commit_boundary` equals `tid + 1`). The second metablock has no datablocks; it just marks the transaction as complete (so its `complete_boundary` equals `tid + 1`).
The simplest journal will contain journal records in strictly increasing tid order, where tid `i + 1` is not started until tid `i` is complete. You can, however, change this if it’s convenient for you, or if you want to experiment with higher-performance journal writes. It is OK for transaction `i + 1` to start before transaction `i` commits, and it is OK for transaction `i + 1` to commit before transaction `i` completes. It is also OK for a single record to commit or complete multiple transactions at once. However, transactions must commit and complete in monotonically increasing order.
Journal modes
Several journaling modes exist, including full (also called data), ordered, and writeback journaling, that are distinguished by how much data they write to the journal and how much data is recoverable after a crash. You may implement any journal mode.
The most important purpose of journaling is to preserve the important file system invariants, namely:
- Every block is used for exactly one purpose.
- All referenced blocks are initialized.
- Every referenced block is marked allocated.
- Every unreferenced block is marked free.
Breaks in the first three invariants can cause total file system corruption. (Breaks in invariant 4 cause storage leaks, which are less serious.)
To preserve these invariants, every ChickadeeFS transaction must contain at minimum all metadata blocks touched by the transaction—that is, it must contain all free-block-bitmap, inode, directory-data, and indirect-extent blocks modified by the transaction. The “writeback” journal mode writes only these blocks to the journal.
The second most important purpose of journaling is to make sure data writes survive system crashes—in other words, that after reboot, the user’s data is preserved. Writeback mode is insufficient for this. The simplest way to recover user data is full journaling, where a journal transaction also contains all data blocks modified by the transaction. This can be very expensive, since data writes vastly outnumber metadata writes in most workloads. The ordered mode instead uses ordering to preserve user data: in an ordered-mode transaction, the file system writes data blocks directly to the main file system, but waits for those writes to complete before committing the associated transaction.
For more information on journaling modes, you might be interested in Section 3.1 of “Analysis and Evolution of Journaling File Systems”.
Block reuse
Block reuse represents a special problem for journaled file systems. Say a block is used for metadata (e.g., directory data or an indirect-extent block), but the associated file is deleted. That block should be freed. However, the block must not be reused until the associated transaction commits! (If it were reused before commit, a badly-timed crash could violate the file system invariants.) Your file system should take care to not reuse metadata blocks until it is safe to do so. This will likely affect your block allocator.
Advanced: In some journaling modes, a transaction might need to mark a block as irrelevant, preventing that block from being overwritten at replay time. An example is if an ordered journal contains a block `bn` written as metadata by committed transaction `i - 2`, then reused for data in committed transaction `i`. The version of `bn` in `i - 2` should not be written over the version in `i` (since both transactions have committed); but the version in `i` wasn’t written to the journal! To mark such blocks, add a `ref` entry for `bn` to transaction `i` with the `jbf_nonjournaled` flag and no corresponding datablock. This flag tells the replayer to ignore previously-committed versions of the datablock.
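Constructing such a ref might look like this (the `jbf_` value and helper name are hypothetical placeholders; the real constants live in the ChickadeeFS headers):

```cpp
#include <cstdint>

using blocknum_t = uint32_t;   // assumed 32-bit block numbers

// Hypothetical stand-ins for the real `jbf_` constants.
enum : uint16_t { jbf_escaped = 0x1, jbf_nonjournaled = 0x2 };

struct jblockref {
    blocknum_t bn;         // destination block number
    uint32_t bchecksum;    // CRC32C checksum of block data
    uint16_t bflags;       // see `jbf_` constants
};

// A nonjournaled ref names `bn` but is followed by *no* datablock in the
// journal: `nref` grows, but the record's datablock count does not. The
// replayer uses it to suppress earlier committed versions of `bn`.
inline jblockref make_nonjournaled_ref(blocknum_t bn) {
    return jblockref{bn, 0 /* checksum unused */, jbf_nonjournaled};
}
```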
Replay
The replay process examines all blocks in the journal and figures out what transactions are committed, but not complete. This process uses metablock sequence numbers, tids, and checksums.
A valid journal contains the following transactions, in increasing tid order:

- Zero or more complete transactions. These transactions have `tid < complete_boundary`. Some of them may be invalid, because their contents have been partially overwritten by later records; some of them may be valid.
- Zero or more committed transactions. These transactions have `complete_boundary <= tid < commit_boundary`. Furthermore, the journal replayer can find these transactions’ `jf_start` metablocks, and all succeeding records up to a commit record, with no gaps in sequence number, and with all checksums matching. These transactions must be replayed.
- Zero or more pseudo-committed transactions. These transactions have `complete_boundary <= tid < commit_boundary` (so their tids appear to be committed), but some records are missing or have invalid checksums (so they are not actually committed). This can happen because we allow commit records to be written in parallel with other writes for the same transaction. Pseudo-committed transactions must not be replayed.
- Zero or more uncommitted transactions with `commit_boundary <= tid`. These transactions must not be replayed.
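A first pass at classification uses only the boundaries; a sketch with hypothetical names. Note that a tid in the committed range is still only a replay *candidate*: the replayer must verify that all of its records are present with matching checksums, or the transaction is pseudo-committed and must be skipped.

```cpp
#include <cstdint>

using tid_t = uint16_t;

enum class tx_status { complete, committed, uncommitted };

// Modular tid comparison: `a` precedes `b` when the signed 16-bit
// difference is negative.
inline bool tid_lt(tid_t a, tid_t b) {
    return int16_t(a - b) < 0;
}

// Classify a transaction by its position relative to the boundaries.
// `committed` here means "candidate for replay, pending record checks".
inline tx_status classify(tid_t tid, tid_t commit_boundary,
                          tid_t complete_boundary) {
    if (tid_lt(tid, complete_boundary)) {
        return tx_status::complete;
    } else if (tid_lt(tid, commit_boundary)) {
        return tx_status::committed;
    } else {
        return tx_status::uncommitted;
    }
}
```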
It is a serious error, indicating a problem with your journaling logic, if transactions don’t appear in this order (e.g., a committed transaction has a higher tid than a pseudo-committed transaction).
But what are `commit_boundary` and `complete_boundary`? To find them, a journal replayer must first find the valid metablock with the largest `seq`. This metablock’s cumulatively-acknowledged `commit_boundary` and `complete_boundary` are used to solve the replay problem.
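Choosing the metablock with the largest `seq` also requires modular comparison, since sequence numbers wrap. A sketch over the sequence numbers of the valid metablocks found in the journal:

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

using tid_t = uint16_t;

// Sequence numbers wrap like tids, so compare them modularly.
inline bool seq_lt(tid_t a, tid_t b) {
    return int16_t(a - b) < 0;
}

// Given the `seq`s of the valid metablocks, return the index of the one
// with the (modularly) largest sequence number. Its `commit_boundary` and
// `complete_boundary` drive replay.
inline size_t latest_metablock(const std::vector<tid_t>& seqs) {
    size_t best = 0;
    for (size_t i = 1; i != seqs.size(); ++i) {
        if (seq_lt(seqs[best], seqs[i])) {
            best = i;
        }
    }
    return best;
}
```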
The `chkfs::journalreplayer` class, defined in `journalreplayer.cc`, implements a lot of this journal logic. To implement journal replay, you will write a derived class of `chkfs::journalreplayer`, providing your own implementations of the `write_block` and `write_replay_complete` virtual functions.
Journal cheats
It’s a pain in the butt to get all aspects of journaling correct at once, so the journal format allows some cheating.
- If a `checksum` or `bchecksum` field equals `chkfs::nochecksum`, then the corresponding checksum is not checked. So you can write that value for checksums at first.
- In practice, we don’t care if you implement block escaping: it’s really unlikely that any journal block will start with `journalmagic` and have a valid checksum, and the journal replayer interprets blocks with invalid checksums as datablocks, no matter how they start. Of course, if you implement escaping, that’s better!