- The point is to talk about how you can do prefetching
- Getting there will take a while
Chickadee devices
- x86-64 CPU
- Interrupt controllers
- Primary memory (vmiter)
- AT keyboard controller (keyboardstate)
- Console (console)
- Power management controllers (PIIX4/ICH9) (poweroff)
- Parallel port (log_printf)
- SATA (Serial ATA) AHCI (Advanced Host Controller Interface) (ahcistate)
Device interaction
- How can the CPU send commands to the device and read results?
- How can the device send commands to the CPU?
Port-mapped I/O
- Special instructions for communicating with devices
- Involves CPU in every device transaction
Example: Reading a character from the keyboard
if ((inb(KEYBOARD_STATUSREG) & KEYBOARD_STATUS_READY) == 0) {
    return -1;
}
uint8_t data = inb(KEYBOARD_DATAREG);
- This interface is very, very old
- Present in IBM’s 1984 PC/AT
- inb is an instruction
- KEYBOARD_STATUSREG == 0x64 is a fixed address
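The polling pattern above can be tried outside a kernel by substituting a fake port space for the `inb` instruction. This is a minimal sketch: `FakePorts` and `keyboard_readc` are hypothetical names, and while 0x64 comes from the slide, the data port 0x60 and the ready bit 0x01 are assumed here.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// 0x64 comes from the notes above; 0x60 and the ready bit are assumptions
// of this sketch.
constexpr uint16_t KEYBOARD_STATUSREG = 0x64;
constexpr uint16_t KEYBOARD_DATAREG = 0x60;
constexpr uint8_t KEYBOARD_STATUS_READY = 0x01;

// Hypothetical stand-in for the I/O port space. A real kernel would
// execute the `inb` instruction instead.
struct FakePorts {
    std::map<uint16_t, uint8_t> regs;
    uint8_t inb(uint16_t port) const {
        auto it = regs.find(port);
        return it == regs.end() ? 0 : it->second;
    }
};

// Return the next byte from the keyboard, or -1 if none is ready.
int keyboard_readc(const FakePorts& ports) {
    if ((ports.inb(KEYBOARD_STATUSREG) & KEYBOARD_STATUS_READY) == 0) {
        return -1;
    }
    return ports.inb(KEYBOARD_DATAREG);
}
```

Every call costs at least one port read by the CPU, even when no key is waiting.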
Memory-mapped I/O
- Device presents CPU with an interface that looks like primary memory
- Writing/reading memory sends commands/reads results
- Device defines the memory layout
- Still involves CPU in every device transaction
- Example: Color Graphics Adapter (CGA) console
console = (uint16_t*) 0xB8000; // fixed physical address
console[0] = 'H' | 0xC000;
console[1] = 'i' | 0xC000;
console[2] = '!' | 0xC000;
- Devices can combine modes; here, PMIO moves the console cursor
void console_show_cursor(int cpos) { …
    outb(0x3D4, 14);
    outb(0x3D5, cpos / 256);
    outb(0x3D4, 15);
    outb(0x3D5, cpos % 256);
}
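The cell encoding and the cursor byte split can be checked against an ordinary array standing in for the memory at 0xB8000. This is a sketch only; `cga_cell`, `cga_puts`, and `cursor_split` are hypothetical helpers, not part of the handout code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Build one console cell: character in the low byte, color attribute in
// the high byte (attribute 0xC0 matches the `| 0xC000` used above).
constexpr uint16_t cga_cell(char ch, uint8_t attr) {
    return uint16_t(uint8_t(ch)) | uint16_t(attr << 8);
}

// Write `s` into `cells` (a stand-in for console memory), stopping after
// `ncells` cells. Returns the number of cells written.
size_t cga_puts(volatile uint16_t* cells, size_t ncells,
                const char* s, uint8_t attr) {
    size_t i = 0;
    for (; s[i] != '\0' && i < ncells; ++i) {
        cells[i] = cga_cell(s[i], attr);
    }
    return i;
}

// Split a cursor position into the two bytes written to port 0x3D5.
struct cursor_bytes {
    uint8_t high;   // sent after outb(0x3D4, 14)
    uint8_t low;    // sent after outb(0x3D4, 15)
};
constexpr cursor_bytes cursor_split(unsigned cpos) {
    return {uint8_t(cpos / 256), uint8_t(cpos % 256)};
}
```

The `volatile` qualifier on `cells` reflects that on real hardware each store is a device command and must not be elided or merged by the compiler.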
Discoverable I/O
- Device uses MMIO and/or PMIO, but different layout for different PCs
- CPU must first discover this machine’s layout by searching configuration state
- Example: Turning off the computer by finding a power controller using the PCI bus
int addr = pci.find([&] (int a) {
    uint32_t vd = pci.readl(a + pci.config_vendor);
    return vd == 0x71138086U /* PIIX4 Power Management Controller */
        || vd == 0x29188086U /* ICH9 LPC Interface Controller */;
});
assert(addr >= 0);
// Read I/O base register from controller's PCI configuration space.
int pm_io_base = pci.readl(addr + 0x40) & 0xFFC0;
// Write `suspend enable` to the power management control register.
outw(pm_io_base + 4, 0x2000);
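The constants compared against `vd` pack two 16-bit PCI identifiers into one 32-bit word: the vendor ID in the low half (0x8086 is Intel) and the device ID in the high half. A small sketch of that packing, using hypothetical helper names:

```cpp
#include <cassert>
#include <cstdint>

// The 32-bit value at offset 0 of PCI configuration space holds the
// vendor ID in the low 16 bits and the device ID in the high 16 bits.
constexpr uint16_t pci_vendor(uint32_t vd) { return vd & 0xFFFF; }
constexpr uint16_t pci_device(uint32_t vd) { return vd >> 16; }

constexpr uint32_t pci_vendor_device(uint16_t vendor, uint16_t device) {
    return (uint32_t(device) << 16) | vendor;
}

// The mask applied above to the register at offset 0x40: it clears the
// low 6 bits, so the resulting I/O base is 64-byte aligned.
constexpr uint16_t pm_io_base_from(uint32_t reg) { return reg & 0xFFC0; }
```

For example, `pci_device(0x71138086U)` yields 0x7113, the device half of the PIIX4 constant above.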
Interrupts
- Device sends an interrupt to the CPU to indicate a change of state
- Handled in proc::exception
Direct memory access
- Device can access the actual primary memory independent of the CPU
- Software sets up structures for controlling the device, based on device specifications
- Software configures the device with the addresses of these structures, using MMIO and/or PMIO
- Often using physical addresses
- Allowed addresses may be constrained
- Software writes commands to structures, may alert device
- Device reads commands from structures and writes results back, may alert CPU
- Example: AHCI for disks
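Two of the points above, physical addresses and constrained addresses, can be sketched as plain arithmetic. The sketch assumes the kernel maps all physical memory at one fixed virtual offset; `DIRECT_MAP_BASE`, `sketch_ka2pa`, `sketch_pa2ka`, and `dma_addr_ok` are hypothetical names, and the alignment and address-width limits are parameters because they come from each device's specification.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical direct-map base: this sketch assumes the kernel maps all
// physical memory at one fixed virtual offset.
constexpr uint64_t DIRECT_MAP_BASE = 0xFFFF800000000000ULL;

constexpr uint64_t sketch_ka2pa(uint64_t ka) { return ka - DIRECT_MAP_BASE; }
constexpr uint64_t sketch_pa2ka(uint64_t pa) { return pa + DIRECT_MAP_BASE; }

// Check whether a structure at [pa, pa + size) can be handed to a device
// that requires `align`-byte alignment (a power of two) and can address
// only `addr_bits` bits of physical memory.
constexpr bool dma_addr_ok(uint64_t pa, uint64_t size,
                           uint64_t align, unsigned addr_bits) {
    return size > 0
        && pa + (size - 1) >= pa                      // no wraparound
        && (pa & (align - 1)) == 0                    // aligned
        && (addr_bits >= 64
            || ((pa + (size - 1)) >> addr_bits) == 0); // reachable
}
```

A driver could assert a check like this before writing a structure's physical address into a device register, since the device bypasses the CPU's page tables and will not fault on a bad address.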
AHCI overall layout
AHCI initialization
- Search configuration state for an ATA disk
- Read address of “memory registers” (MMIO) from configuration state (PMIO)
- Memory registers use reserved physical addresses, live on device
- Configure memory registers with DMA structure addresses
- DMA structures are kernel memory
- Part of struct ahcistate
ahcistate* ahcistate::find(int addr, int port) {
    auto& pci = pcistate::get();
    for (; addr >= 0; addr = pci.next(addr), port = 0) {
        if (pci.readw(addr + pci.config_subclass) != 0x0106) {
            continue;
        }
        uint32_t pa = pci.readl(addr + pci.config_bar5);
        if (pa == 0) {
            continue;
        }
        auto dr = pa2ka<volatile regs*>(pa); // memory-mapped I/O
        dr->ghc = ghc_ahci_enable; // indicate host is AHCI aware
        // assume port 0 is available
        assert((dr->port_mask & 1U) && dr->p[0].sstatus);
        return knew<ahcistate>(addr, 0, dr);
    }
    return nullptr; // no AHCI controller found
}
struct ahcistate { …
    // DMA and memory-mapped I/O state
    dmastate dma_;
    int pci_addr_;
    int sata_port_;
    volatile regs* dr_;
    volatile portregs* pr_;
    …
};
ahcistate::ahcistate(...) { …
    // place port in idle state
    pr_->command &= ~uint32_t(pcmd_rfis_enable | pcmd_start);
    while (pr_->command & (pcmd_command_running | pcmd_rfis_running)) {
        pause();
    }
    // set up DMA area
    memset(&dma_, 0, sizeof(dma_));
    for (int i = 0; i != 32; ++i) {
        dma_.ch[i].cmdtable_pa = ka2pa(&dma_.ct[i]);
    }
    pr_->cmdlist_pa = ka2pa(&dma_.ch[0]);
    pr_->rfis_pa = ka2pa(&dma_.rfis);
    …
}
AHCI port structure (one per drive)
- Each drive can process up to 32 commands in parallel
- Each command has an entry in the command list and an entry in the command table
Why 32 commands in parallel?
- Allow disk to schedule and reorder requests using information only it knows
- Serious performance benefits
AHCI command list structure
AHCI command table structure
Reading or writing a block, abstractly
- Read: “Dear disk, please read K bytes from disk starting at sector N and write them to memory at address A, and let me know when you have done so”
- Write: “Dear disk, please read K bytes from memory starting at address A and write them to disk starting at sector N, and let me know when you have done so”
Reading or writing a block, concretely
- Read and write are commands
- CPU must:
- Find a slot for the command
- Prepare the slot
- Issue the command
- Wait for response
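The first step, finding a slot, amounts to bookkeeping over a 32-bit mask. A minimal sketch, assuming the driver tracks in-use slots itself; `slot_tracker` is a hypothetical name:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bookkeeping for the 32 command slots of one drive.
struct slot_tracker {
    uint32_t outstanding_mask = 0;   // bit i set: slot i is in use

    // Find a free slot and mark it in use; returns -1 if all are busy.
    int acquire() {
        for (int slot = 0; slot != 32; ++slot) {
            if ((outstanding_mask & (1U << slot)) == 0) {
                outstanding_mask |= 1U << slot;
                return slot;
            }
        }
        return -1;
    }

    // Mark `slot` as available again after its command completes.
    void release(int slot) {
        assert(slot >= 0 && slot < 32);
        assert(outstanding_mask & (1U << slot));
        outstanding_mask &= ~(1U << slot);
    }
};
```

In a kernel, both operations would run with the driver lock held, and a caller that gets -1 would block until an interrupt releases a slot.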
Reading or writing a block, in code
clear(0);
push_buffer(0, buf, sz);
issue_ncq(0, command, off / sectorsize);
- Requires ahcistate::lock_
- Expanded (for command cmd_read_fpdma_queued, i.e., read):
int slot = 0;
assert(/* `slot` is not currently an active command */);
// clear
dma_.ch[slot].nbuf = 0;
dma_.ch[slot].buf_byte_pos = 0;
// push_buffer
dma_.ct[slot].buf[0].pa = ka2pa(buf);
dma_.ct[slot].buf[0].maxbyte = sz - 1;
dma_.ch[slot].nbuf = 1;
dma_.ch[slot].buf_byte_pos = sz;
// issue_ncq
size_t first_sector = off / sectorsize;
size_t nsectors = sz / sectorsize;
dma_.ct[slot].cfis[0] = cfis_command
    | (unsigned(cmd_read_fpdma_queued) << 16)
    | ((nsectors & 0xFF) << 24);
dma_.ct[slot].cfis[1] = (first_sector & 0xFFFFFF)
    | (uint32_t(fua) << 31) | 0x40000000U;
dma_.ct[slot].cfis[2] = (first_sector >> 24)
    | ((nsectors & 0xFF00) << 16);
dma_.ct[slot].cfis[3] = (slot << 3);
dma_.ch[slot].flags = 4 /* # words in `cfis` */
    | ch_clear_flag; // would also contain `ch_write_flag` for writes
dma_.ch[slot].buf_byte_pos = 0;
// ensure all previous writes have made it out to memory
std::atomic_thread_fence(std::memory_order_release);
// tell interface NCQ slot used
pr_->ncq_active_mask = 1U << slot; // NB!
// tell interface command available
pr_->command_mask = 1U << slot;
// The write to `command_mask` wakes up the device.
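The bit packing in the four `cfis` words is easier to check with a round trip: pack the fields, then unpack them again. The sketch below mirrors the layout in the expanded code. The numeric values of `cfis_command` and `cmd_read_fpdma_queued` are not given in these notes and are assumed here, and `ncq_fis`, `pack_ncq_read`, and the `fis_*` decoders are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Assumed values (not given in the notes): 0x8027 marks a host-to-device
// command FIS, 0x60 is the READ FPDMA QUEUED command code.
constexpr uint32_t cfis_command = 0x8027;
constexpr uint32_t cmd_read_fpdma_queued = 0x60;

struct ncq_fis {
    uint32_t w[4];
};

// Pack an NCQ read using the same field layout as the expanded code.
ncq_fis pack_ncq_read(unsigned slot, uint64_t first_sector,
                      uint32_t nsectors, bool fua) {
    ncq_fis f;
    f.w[0] = cfis_command | (cmd_read_fpdma_queued << 16)
        | ((nsectors & 0xFF) << 24);                // sector count, low byte
    f.w[1] = uint32_t(first_sector & 0xFFFFFF)      // sector number, low 24 bits
        | (uint32_t(fua) << 31) | 0x40000000U;
    f.w[2] = uint32_t((first_sector >> 24) & 0xFFFFFF) // next 24 bits
        | ((nsectors & 0xFF00) << 16);              // sector count, high byte
    f.w[3] = slot << 3;                             // NCQ tag
    return f;
}

// Decoders that invert the packing above.
uint64_t fis_first_sector(const ncq_fis& f) {
    return uint64_t(f.w[1] & 0xFFFFFF) | (uint64_t(f.w[2] & 0xFFFFFF) << 24);
}
uint32_t fis_nsectors(const ncq_fis& f) {
    return (f.w[0] >> 24) | ((f.w[2] >> 24) << 8);
}
unsigned fis_slot(const ncq_fis& f) {
    return f.w[3] >> 3;
}
```

Unlike the expanded code, the sketch masks the upper sector bits to 24 bits so that a very large sector number cannot spill into the sector-count byte.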
I/O parallelism and synchronization
- So the command has been issued to the disk. What happens next?
- At some point in the future, the disk will perform the command.
- When it is done, it signals completion by modifying the MMIO area and alerting the CPU with an interrupt.
- While the command is pending, the operating system had better not mess up the command memory!
- Don’t reuse the slot
- Don’t modify the command structures in DMA memory
- Don’t modify the buf
Interrupt processing
- Acknowledge to the disk that the interrupt has been received
- Mark commands as completed, and therefore available for reuse
- Inform kernel thread waiting for result of command
// obtain lock, read data
auto irqs = lock_.lock();
// check interrupt reason, clear interrupt
assert(/* not error */);
pr_->interrupt_status = ~0U;
dr_->interrupt_status = ~0U;
// acknowledge completed commands
uint32_t still_active = pr_->ncq_active_mask;
uint32_t acks = slots_outstanding_mask_ & ~still_active;
for (int slot = 0; acks != 0; ++slot, acks >>= 1) {
    if (acks & 1) {
        // mark `slot` as successfully completed and available for reuse
        assert(slots_outstanding_mask_ & (1U << slot));
        slots_outstanding_mask_ &= ~(1U << slot);
        ++nslots_available_;
        inform_thread_waiting_for_slot();
    }
}
// acknowledge errored commands
if (is_error) {
    handle_error_interrupt();
}
lock_.unlock(irqs); …
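The core of the handler is a mask computation: a slot has completed if the driver issued it and the device no longer reports it as active. Isolated as a pure function (a sketch; `completed_slots` is a hypothetical name):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Given the slots the driver has issued (`outstanding`) and the slots the
// device still reports as active, return the slots that have completed.
std::vector<int> completed_slots(uint32_t outstanding, uint32_t still_active) {
    std::vector<int> done;
    uint32_t acks = outstanding & ~still_active;
    for (int slot = 0; acks != 0; ++slot, acks >>= 1) {
        if (acks & 1) {
            done.push_back(slot);
        }
    }
    return done;
}
```

Because one interrupt can report several completions, the loop walks every set bit of the mask and does not assume exactly one finished command.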
Informing the kernel thread: handout code
- A volatile int per command communicates command completion
- volatile because an interrupt can happen whenever
- ahcistate::wq_ blocks until some command completes
Enqueuing command:
// send command, record buffer and status storage
int slot = 0;
...clear, push_buffer, issue_ncq...
volatile int r = E_AGAIN;
slot_status_[0] = &r; // NB kernel-only, not DMA or MMIO
…
waiter(p).block_until(wq_, [&] () { return r != E_AGAIN; });
Processing interrupt:
…
slots_outstanding_mask_ &= ~(1U << slot);
++nslots_available_;
if (slot_status_[slot]) {
    *slot_status_[slot] = 0; // means success
    slot_status_[slot] = nullptr;
}
What does prefetching mean?
- Blocking read: A kernel thread needs data and cannot continue until it is available
- Prefetching read: Read in advance of the data being required
- No kernel thread is blocking on the data!
Separating concerns in prefetching
- Prefetching policy
- What to prefetch and when?
- Often file system- and workload-specific
- See posix_fadvise, madvise
- Prefetching implementation
- How to prefetch?
Prefetching thread
- Add a new prefetching kernel thread to the system
- It prefetches blocks using blocking calls
- Prefetching policy
- Stores block numbers to prefetch somewhere (lock-protected)
- Wakes up the prefetching thread
- Prefetching thread
- Blocks waiting for block numbers
- Prefetches blocks as they arrive
- Advantages?
- Disadvantages?
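The division of labor described above (policy enqueues block numbers under a lock and wakes the thread; the thread blocks, then reads) can be sketched at user level with standard C++ threads. This is an analogy, not kernel code: `prefetcher` and its members are hypothetical names, and `read_block` stands in for whatever blocking read the kernel provides.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// User-level sketch of a single prefetching thread.
class prefetcher {
public:
    explicit prefetcher(std::function<void(uint32_t)> read_block)
        : read_block_(std::move(read_block)),
          thread_([this] { run(); }) {
    }

    // Drain any queued requests, then stop the thread.
    ~prefetcher() {
        {
            std::lock_guard<std::mutex> guard(lock_);
            stopping_ = true;
        }
        cv_.notify_one();
        thread_.join();
    }

    // Called by the prefetching policy: record a block, wake the thread.
    void request(uint32_t bn) {
        {
            std::lock_guard<std::mutex> guard(lock_);
            queue_.push_back(bn);
        }
        cv_.notify_one();
    }

private:
    void run() {
        std::unique_lock<std::mutex> guard(lock_);
        while (true) {
            // Block waiting for block numbers (or shutdown).
            cv_.wait(guard, [this] { return stopping_ || !queue_.empty(); });
            if (queue_.empty()) {
                return;             // stopping, and nothing left to fetch
            }
            uint32_t bn = queue_.front();
            queue_.pop_front();
            guard.unlock();
            read_block_(bn);        // blocking read; lock not held
            guard.lock();
        }
    }

    std::function<void(uint32_t)> read_block_;
    std::mutex lock_;
    std::condition_variable cv_;
    std::deque<uint32_t> queue_;
    bool stopping_ = false;
    std::thread thread_;            // declared last: starts after the rest
};
```

Note that the thread handles one block at a time: while `read_block_` is blocked, later requests wait in the queue even though the disk could accept more commands.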
Multiple prefetching threads
- Modify ahcistate::read_or_write so it can use any NCQ slot
- Add multiple prefetching kernel threads to the system
- Advantages?
- Disadvantages?
Nonblocking prefetching
- Add a new function ahcistate::read_or_write_nonblocking that can use any NCQ slot
- Prefetching policy
- Fires off new nonblocking prefetching requests
- Must ensure a mechanism to determine prefetching completion
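One possible completion mechanism is a caller-supplied status word that the interrupt handler overwrites. The single-threaded simulation below sketches that contract under stated assumptions: `sim_disk` is a hypothetical stand-in for the driver, `complete` plays the role of the interrupt handler, and `E_AGAIN_SKETCH` is an arbitrary placeholder for the kernel's `E_AGAIN`.

```cpp
#include <cassert>
#include <cstdint>

constexpr int E_AGAIN_SKETCH = -11;   // hypothetical "still pending" value

// Hypothetical simulated disk with 32 slots. A real driver would program
// DMA structures; here `read_nonblocking` only records where to report
// completion.
struct sim_disk {
    volatile int* slot_status[32] = {};
    uint32_t outstanding_mask = 0;

    // Start a request without blocking. Returns the slot used, or -1 if
    // every slot is busy (the caller may retry later).
    int read_nonblocking(uint32_t bn, volatile int* status) {
        (void) bn;                    // block number unused in the simulation
        for (int slot = 0; slot != 32; ++slot) {
            if ((outstanding_mask & (1U << slot)) == 0) {
                outstanding_mask |= 1U << slot;
                *status = E_AGAIN_SKETCH;
                slot_status[slot] = status;
                return slot;
            }
        }
        return -1;
    }

    // Stand-in for the interrupt handler: report success for `slot`.
    void complete(int slot) {
        assert(outstanding_mask & (1U << slot));
        outstanding_mask &= ~(1U << slot);
        *slot_status[slot] = 0;       // 0 means success
        slot_status[slot] = nullptr;
    }
};
```

The status word must outlive the request. A local variable works for a blocking caller that waits before returning, but a nonblocking caller returns immediately, so the word has to live in longer-lived storage.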
Example: Extend bcentry
struct bcentry {
    std::atomic<int> state_ = state_empty;
    spinlock lock_;
    blocknum_t bn_;
    std::atomic<unsigned> ref_ = 0;
    volatile int ready_; // new
    …
};
- Why volatile int?
- What code is writing this value?
- What code is reading this value?
Handout code: Waiting for a bcentry to read
- Problem: Multiple kernel threads might want to read the same block
- Solution: Synchronization within bcentry::load
// load block, or wait for concurrent reader to load it
while (true) {
    assert(state_ != state_empty);
    if (state_ == state_allocated) { …
        state_ = state_loading;
        lock_.unlock(irqs);
        sata_disk->read(buf_, chkfs::blocksize,
                        bn_ * chkfs::blocksize);
        // which means:
        //   sata_disk->read_or_write
        //       (ahcistate::cmd_read_fpdma_queued,
        //        buf_, chkfs::blocksize, bn_ * chkfs::blocksize);
        irqs = lock_.lock();
        …
    } else if (state_ == state_loading) {
        waiter(current()).block_until(bc.read_wq_, [&] () {
            return state_ != state_loading;
        }, lock_, irqs);
    } else …
}
Nonblocking bcentry::load?
while (true) {
    if (state_ == state_allocated) { …
        state_ = state_loading;
        ready_ = E_AGAIN;
        sata_disk->read_or_write_nonblocking
            (ahcistate::cmd_read_fpdma_queued,
             buf_, chkfs::blocksize, bn_ * chkfs::blocksize,
             &ready_);
        // ????????
        …