- SSD performance review: a recent SSD achieves roughly 20x better 4 KB random-read throughput at queue depth 32 (i.e., NCQ issuing up to 32 commands in parallel) than at queue depth 1
- Theodore Ts'o on disk corruption: why block writes are not atomic, and why physical journaling is better than logical
(Notes by Abby Lyons)
(these notes start 40 minutes in, sry)
Flash memory is weird. You can't just overwrite a single block; you must erase and rewrite a whole group of blocks (here, 16 blocks, 64 KB) at once. Blocks also wear out over time.
Modern flash drives have a remapping layer. It marks blocks as bad as failures arise, so only the drive knows where blocks are physically located; the OS doesn't.
The goal is a journal format that tolerates disk corruption caused by power failure.
Disk failure model:
w w w w w w
The last two writes are in-flight. What happened to them?
- Academic failure model: either a write 100% happened or it 100% didn't; block writes are atomic. In practice, crazier things happen. We can either buy expensive hardware that guarantees atomicity, or find a workaround in software.
- Aggressive failure model: in-flight writes can be arbitrarily corrupt. This is more reasonable to assume.
Our solution to the aggressive failure model is physical redo journaling.
- physical: This means that entire physical blocks are written into the journal.
- redo: the journal is designed in a way that disk recovery redoes the operations that made it into the journal but not the disk.
One common alternative to physical journaling is logical journaling, which doesn't store entire blocks; rather, it has records that track changes like "change byte 0xabc from 3 to 5". This means the journal is smaller, but can't protect against the most aggressive failures which turn entire blocks to garbage.
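The size difference between the two record styles can be sketched with two hypothetical record types (the field names and 4 KB block size are assumptions for illustration):

```python
from dataclasses import dataclass

BLOCK_SIZE = 4096  # assumed 4 KB filesystem blocks

@dataclass
class PhysicalRecord:
    # Physical journaling: the entire new contents of the block.
    block_no: int
    data: bytes  # all BLOCK_SIZE bytes

@dataclass
class LogicalRecord:
    # Logical journaling: just the edit, e.g. "change byte 0xabc from 3 to 5".
    block_no: int
    offset: int
    old: int
    new: int

phys = PhysicalRecord(block_no=7, data=bytes(BLOCK_SIZE))
logi = LogicalRecord(block_no=7, offset=0xabc, old=3, new=5)
```

A physical record costs a whole block of journal space, but recovery can restore the block no matter how badly the in-flight write garbled it; a logical record is a few bytes, but is useless if the whole block turned to garbage.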
To use the journal, one must:
- write blocks to disk in a special journal location
- write commit record
- when journal writes complete, write blocks to their final locations
- when those writes complete, write a completion record to the journal. Journal can now be overwritten.
What about parallelism? As it turns out, steps 1 and 2 can be done in parallel. Then every part of step 3 can be done in parallel. Only after everything else is complete can step 4 take place.
But we must wait after step 2 to start step 3, and wait after step 3 to start step 4. These mandatory waits are called barriers. A barrier is a kind of synchronization primitive.
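The four steps and their barriers can be sketched as follows. This is a toy model, not real filesystem code: `disk` and `journal` are stand-ins (a dict and a list) for the actual devices, and a real implementation would issue cache flushes at each barrier comment.

```python
def journaled_write(disk, journal, updates):
    """Apply `updates` ({block_no: bytes}) with physical redo journaling."""
    # Steps 1 and 2 (writes can be issued in parallel):
    # journal copies of the blocks, plus the commit record.
    for block_no, data in updates.items():
        journal.append(("data", block_no, data))
    journal.append(("commit", sorted(updates)))
    # BARRIER: journal writes must be durable before we touch final locations.
    # Step 3 (all writes can be issued in parallel): write to final locations.
    for block_no, data in updates.items():
        disk[block_no] = data
    # BARRIER: in-place writes must be durable before the completion record.
    # Step 4: completion record; this journal space can now be overwritten.
    journal.append(("complete", sorted(updates)))

disk, journal = {}, []
journaled_write(disk, journal, {7: b"new contents"})
```

If the machine crashes before the commit record is durable, recovery ignores the transaction; if it crashes after, recovery redoes the in-place writes from the journal copies.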
The journal in memory
- The entire journal consists of a circular buffer with a series of records.
- A record consists of:
- One metablock, which stores the block numbers and checksums of data blocks
- 0 or more data blocks
Checksums help us figure out whether writes happened successfully inside the journal: if the checksum in the metablock matches the checksum of the corresponding data block, that write completed without corruption. The metablock also stores a checksum of itself, in case the metablock gets corrupted.
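A minimal sketch of this check, using CRC32 as the (assumed) checksum and a dict as the metablock layout:

```python
import zlib

def make_metablock(data_blocks):
    """data_blocks: {block_no: bytes}. Build a metablock with checksums."""
    meta = {"sums": {no: zlib.crc32(d) for no, d in data_blocks.items()}}
    # Checksum of the metablock's own contents, to detect a corrupted metablock.
    meta["self_sum"] = zlib.crc32(repr(sorted(meta["sums"].items())).encode())
    return meta

def journal_write_ok(meta, data_blocks):
    """True iff the metablock and every data block survived intact."""
    body = zlib.crc32(repr(sorted(meta["sums"].items())).encode())
    if body != meta["self_sum"]:
        return False  # the metablock itself was corrupted in flight
    return all(zlib.crc32(data_blocks[no]) == s
               for no, s in meta["sums"].items())
```

During recovery, a mismatch on any block means the transaction's journal write never fully happened, so it is skipped rather than replayed.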
Each journal block has a sequence number, which increases by one with each block written. This is useful because:
- If there's a missing sequence number, we know a corruption happened.
- It helps us figure out where the beginning of the circular journal buffer is.
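Both uses boil down to scanning for the place where the sequence numbers stop increasing by one. A hypothetical sketch (assuming the journal fits in a list, in on-disk circular order):

```python
def find_head(seqs):
    """Return the index where the circular journal logically begins.

    seqs: sequence numbers of journal blocks in on-disk order. The head is
    wherever the number fails to be exactly one more than its predecessor;
    a larger gap in the live region would indicate a corrupted/lost write.
    """
    for i in range(len(seqs)):
        nxt = seqs[(i + 1) % len(seqs)]
        if nxt != seqs[i] + 1:
            return (i + 1) % len(seqs)
    return 0

# On-disk order [5, 6, 7, 3, 4]: the oldest live block (seq 3) is at index 3.
```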
Each metablock has a commit boundary and a complete boundary. This may seem redundant, but it enables cumulative acknowledgement: "I have heard everything up to point X." We use this to figure out which transactions were actually completed.
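One way recovery might use the two boundaries (my reading of the notes, sketched with sequence numbers standing in for transactions): everything at or below the complete boundary is fully on disk, while anything above it but at or below the commit boundary made it into the journal and must be redone.

```python
def needs_redo(seq, commit_boundary, complete_boundary):
    # Committed but not yet complete: its final-location writes may never
    # have happened, so recovery replays it from the journal copies.
    return complete_boundary < seq <= commit_boundary
```

Anything above the commit boundary never fully reached the journal and is discarded.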