Devices and prefetching
- The point is to talk about how you can do prefetching
- Getting there will take a while
Chickadee devices
- x86-64 CPU
- Interrupt controllers
- Primary memory (vmiter)
- AT keyboard controller (keyboardstate)
- Console (console)
- Power management controllers (PIIX4/ICH9) (poweroff)
- Parallel port (log_printf)
- SATA (Serial ATA) AHCI (Advanced Host Controller Interface) (ahcistate)
Device interaction
- How can the CPU send commands to the device and read results?
- How can the device send commands to the CPU?
Port-mapped I/O
- Special instructions for communicating with devices
- Involves CPU in every device transaction
- Example: Reading a character from the keyboard
    if ((inb(KEYBOARD_STATUSREG) & KEYBOARD_STATUS_READY) == 0) {
        return -1;
    }
    uint8_t data = inb(KEYBOARD_DATAREG);
- This interface is very, very old
- Present in IBM’s 1984 PC/AT
- inb is an instruction
- KEYBOARD_STATUSREG == 0x64 is a fixed address
Memory-mapped I/O
- Device presents CPU with an interface that looks like primary memory
- Writing/reading memory sends commands/reads results
- Device defines the memory layout
- Still involves CPU in every device transaction
- Example: Color Graphics Adapter (CGA) console
    console = (uint16_t*) 0xB8000; // fixed physical address
    console[0] = 'H' | 0xC000;
    console[1] = 'i' | 0xC000;
    console[2] = '!' | 0xC000;
- Devices can combine modes; here, PMIO moves the console cursor
    void console_show_cursor(int cpos) { …
        outb(0x3D4, 14);
        outb(0x3D5, cpos / 256);
        outb(0x3D4, 15);
        outb(0x3D5, cpos % 256);
    }
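The constant 0xC000 and the cpos / 256, cpos % 256 arithmetic can be unpacked with a small sketch. It assumes the standard CGA/VGA text-mode cell layout (character in the low byte, attribute in the high byte, background color in the attribute's high nibble). The helper names make_cell and split_cursor are hypothetical, and nothing here touches real hardware.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: build one 16-bit text-mode cell.
// Low byte = character code; high byte = attribute (background << 4 | foreground).
constexpr uint16_t make_cell(char ch, unsigned fg, unsigned bg) {
    unsigned attr = ((bg & 0xF) << 4) | (fg & 0xF);
    return uint16_t((attr << 8) | uint8_t(ch));
}

// Hypothetical helper: the two bytes that console_show_cursor sends.
// Index 14 selects the high byte of the cursor position, index 15 the low byte.
struct cursor_bytes {
    uint8_t high;
    uint8_t low;
};

constexpr cursor_bytes split_cursor(int cpos) {
    return { uint8_t(cpos / 256), uint8_t(cpos % 256) };
}
```

Under these assumptions, 'H' | 0xC000 is the character 'H' with foreground 0 and background 0xC, and a cursor position of 1000 is sent as the bytes 3 and 232.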
Discoverable I/O
- Device uses MMIO and/or PMIO, but different layout for different PCs
- CPU must first discover this machine’s layout by searching configuration state
- Example: Turning off the computer by finding a power controller using the PCI bus
    int addr = pci.find([&] (int a) {
            uint32_t vd = pci.readl(a + pci.config_vendor);
            return vd == 0x71138086U /* PIIX4 Power Management Controller */
                || vd == 0x29188086U /* ICH9 LPC Interface Controller */;
        });
    assert(addr >= 0);
    // Read I/O base register from controller's PCI configuration space.
    int pm_io_base = pci.readl(addr + 0x40) & 0xFFC0;
    // Write `suspend enable` to the power management control register.
    outw(pm_io_base + 4, 0x2000);
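The magic numbers in the search can be decoded with a short sketch. It relies on the standard PCI configuration header, where the first 32-bit word holds the vendor ID in its low 16 bits and the device ID in its high 16 bits. The function names are hypothetical and are not part of the pci object used above.

```cpp
#include <cassert>
#include <cstdint>

// First 32-bit word of PCI configuration space:
// low 16 bits = vendor ID, high 16 bits = device ID.
constexpr uint16_t pci_vendor_id(uint32_t vd) {
    return uint16_t(vd & 0xFFFF);
}
constexpr uint16_t pci_device_id(uint32_t vd) {
    return uint16_t(vd >> 16);
}

// Mirrors the `& 0xFFC0` mask in the listing above: the low 6 bits of the
// register are not part of the I/O base address.
constexpr uint16_t pm_io_base_from_register(uint32_t reg) {
    return uint16_t(reg & 0xFFC0);
}
```

Both constants in the listing share vendor ID 0x8086 (Intel); the device IDs are 0x7113 and 0x2918.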
Interrupts
- Device sends an interrupt to the CPU to indicate a change of state
- proc::exception
Direct memory access
- Device can access the actual primary memory independent of the CPU
- Software sets up structures for controlling the device, based on device specifications
- Software configures the device with the addresses of these structures, using MMIO and/or PMIO
- Often using physical addresses
- Allowed addresses may be constrained
 
- Software writes commands to structures, may alert device
- Device reads commands from structures and writes results back, may alert CPU
- Example: AHCI for block devices
AHCI overall layout

AHCI initialization
- Search configuration state for an ATA block device
- Read address of “memory registers” (MMIO) from configuration state (PMIO)
- Memory registers use reserved physical addresses, live on device
 
- Configure memory registers with DMA structure addresses
- DMA structures are kernel memory
- Part of struct ahcistate
 
ahcistate* ahcistate::find(int addr, int port) {
    auto& pci = pcistate::get();
    for (; addr >= 0; addr = pci.next(addr), port = 0) {
        if (pci.readw(addr + pci.config_subclass) != 0x0106) {
            continue;
        }
        uint32_t pa = pci.readl(addr + pci.config_bar5);
        if (pa == 0) {
            continue;
        }
        auto dr = pa2ka<volatile regs*>(pa); // memory-mapped I/O
        dr->ghc = ghc_ahci_enable; // indicate host is AHCI aware
        // assume port 0 is available
        assert((dr->port_mask & 1U) && dr->p[0].sstatus);
        return knew<ahcistate>(addr, 0, dr);
    }
    return nullptr; // no usable AHCI controller found
}
struct ahcistate { …
    // DMA and memory-mapped I/O state
    dmastate dma_;
    int pci_addr_;
    int sata_port_;
    volatile regs* dr_;
    volatile portregs* pr_;
    …
};
ahcistate::ahcistate(...) { …
    // place port in idle state
    pr_->command &= ~uint32_t(pcmd_rfis_enable | pcmd_start);
    while (pr_->command & (pcmd_command_running | pcmd_rfis_running)) {
        pause();
    }
    // set up DMA area
    memset(&dma_, 0, sizeof(dma_));
    for (int i = 0; i != 32; ++i) {
        dma_.ch[i].cmdtable_pa = ka2pa(&dma_.ct[i]);
    }
    pr_->cmdlist_pa = ka2pa(&dma_.ch[0]);
    pr_->rfis_pa = ka2pa(&dma_.rfis);
    …
}
AHCI port structure (one per drive)
- Each drive can process up to 32 commands in parallel
- Each command has an entry in the command list and an entry in the command table

Why 32 commands in parallel?
- Allow block device to schedule and reorder requests using information only it knows
- Serious performance benefits


AHCI command list structure

AHCI command table structure

Reading or writing a block, abstractly
- Read: “Dear device, please read K bytes from the device starting at sector N and write them to memory at address A, and let me know when you have done so”
- Write: “Dear device, please read K bytes from memory starting at address A and write them to the device starting at sector N, and let me know when you have done so”
Reading or writing a block, concretely
- Read and write are commands
- CPU must:
- Find a slot for the command
- Prepare the slot
- Issue the command
- Wait for response
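The first step, finding a slot, might look like the sketch below. It assumes the driver tracks in-flight commands in a 32-bit mask with one bit per slot, in the style of the slots_outstanding_mask_ member used by the interrupt handler. The functions find_free_slot and claim_slot are hypothetical, not handout code.

```cpp
#include <cassert>
#include <cstdint>

constexpr int nslots = 32;   // one bit per command slot

// Return the lowest-numbered free slot, or -1 if all 32 slots are in use.
inline int find_free_slot(uint32_t outstanding_mask) {
    for (int slot = 0; slot != nslots; ++slot) {
        if ((outstanding_mask & (1U << slot)) == 0) {
            return slot;
        }
    }
    return -1;
}

// Mark `slot` as in use and return the updated mask.
// The caller is assumed to hold whatever lock protects the mask.
inline uint32_t claim_slot(uint32_t outstanding_mask, int slot) {
    assert(slot >= 0 && slot < nslots);
    assert((outstanding_mask & (1U << slot)) == 0);
    return outstanding_mask | (1U << slot);
}
```

When find_free_slot returns -1, a real driver would have to block or fail the request until some command completes.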
 
Reading or writing a block, in code
    clear(0);
    push_buffer(0, buf, sz);
    issue_ncq(0, command, off / sectorsize);
- Requires ahcistate::lock_
- Expanded (for command cmd_read_fpdma_queued, i.e., read):
    int slot = 0;
    assert(/* `slot` is not currently an active command */);
    // clear
    dma_.ch[slot].nbuf = 0;
    dma_.ch[slot].buf_byte_pos = 0;
    // push_buffer
    dma_.ct[slot].buf[0].pa = kptr2pa(buf);
    dma_.ct[slot].buf[0].maxbyte = sz - 1;
    dma_.ch[slot].nbuf = 1;
    dma_.ch[slot].buf_byte_pos = sz;
    // issue_ncq
    size_t first_sector = off / sectorsize;
    size_t nsectors = sz / sectorsize;
    dma_.ct[slot].cfis[0] = cfis_command
        | (unsigned(cmd_read_fpdma_queued) << 16)
        | ((nsectors & 0xFF) << 24);
    dma_.ct[slot].cfis[1] = (first_sector & 0xFFFFFF)
        | (uint32_t(fua) << 31) | 0x40000000U;
    dma_.ct[slot].cfis[2] = (first_sector >> 24)
        | ((nsectors & 0xFF00) << 16);
    dma_.ct[slot].cfis[3] = (slot << 3);
    dma_.ch[slot].flags = 4 /* # words in `cfis` */
        | ch_clear_flag; // would also contain `ch_write_flag` for writes
    dma_.ch[slot].buf_byte_pos = 0;
    // ensure all previous writes have made it out to memory
    std::atomic_thread_fence(std::memory_order_release);
    // tell interface NCQ slot used
    pr_->ncq_active_mask = 1U << slot; // NB!
    // tell interface command available
    pr_->command_mask = 1U << slot;
    // The write to `command_mask` wakes up the device.
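The shifts that scatter first_sector and nsectors across cfis[0] through cfis[2] are easy to get wrong, so a round-trip check can help. The sketch below reproduces only those sector-related bits and leaves out the command, fua, and slot fields. The struct and function names are hypothetical, and unlike the listing it masks the upper sector bits to 24 bits so they cannot spill into the count byte.

```cpp
#include <cassert>
#include <cstdint>

// Only the sector-related bits of cfis[0], cfis[1], and cfis[2].
struct ncq_words {
    uint32_t w0;
    uint32_t w1;
    uint32_t w2;
};

inline ncq_words pack_sectors(uint64_t first_sector, uint32_t nsectors) {
    ncq_words w;
    w.w0 = (nsectors & 0xFF) << 24;                     // sector count, low byte
    w.w1 = uint32_t(first_sector & 0xFFFFFF);           // sector bits 0-23
    w.w2 = uint32_t((first_sector >> 24) & 0xFFFFFF)    // sector bits 24-47
        | ((nsectors & 0xFF00) << 16);                  // sector count, high byte
    return w;
}

inline uint64_t unpack_first_sector(const ncq_words& w) {
    return uint64_t(w.w1 & 0xFFFFFF) | (uint64_t(w.w2 & 0xFFFFFF) << 24);
}

inline uint32_t unpack_nsectors(const ncq_words& w) {
    return (w.w0 >> 24) | ((w.w2 >> 16) & 0xFF00);
}
```

For example, a 4096-byte read at byte offset 12288 with 512-byte sectors has first_sector 24 and nsectors 8, so the count contributes 0x08000000 to the first word.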
I/O parallelism and synchronization
- So the command has been issued to the disk. What happens next?
- At some point in the future, the disk will perform the command.
- When it is done, it signals completion by modifying the MMIO area and alerting the CPU with an interrupt.
- While the command is pending, the operating system had better not mess up the command memory!
- Don’t reuse the slot
- Don’t modify the command structures in DMA memory
- Don’t modify the buf
 
Interrupt processing
- Acknowledge to the disk that the interrupt has been received
- Mark commands as completed, and therefore available for reuse
- Inform kernel thread waiting for result of command
    // obtain lock, read data
    auto irqs = lock_.lock();
    // check interrupt reason, clear interrupt
    assert(/* not error */);
    pr_->interrupt_status = ~0U;
    dr_->interrupt_status = ~0U;
    // acknowledge completed commands
    uint32_t still_active = pr_->ncq_active_mask;
    uint32_t acks = slots_outstanding_mask_ & ~still_active;
    for (int slot = 0; acks != 0; ++slot, acks >>= 1) {
        if (acks & 1) {
            // mark `slot` as successfully completed and available for reuse
            assert(slots_outstanding_mask_ & (1U << slot));
            slots_outstanding_mask_ &= ~(1U << slot);
            ++nslots_available_;
            inform_thread_waiting_for_slot();
        }
    }
    // acknowledge errored commands
    if (is_error) {
        handle_error_interrupt();
    }
    lock_.unlock(irqs); …
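The mask arithmetic in the acknowledgment loop can be tried in isolation. This sketch takes the two masks as plain integers instead of reading pr_->ncq_active_mask, and returns the completed slot numbers rather than waking threads. The function completed_slots is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// A slot has completed if the driver issued it (bit set in `outstanding`)
// and the device no longer reports it as active (bit clear in `still_active`).
inline std::vector<int> completed_slots(uint32_t outstanding,
                                        uint32_t still_active) {
    std::vector<int> done;
    uint32_t acks = outstanding & ~still_active;
    for (int slot = 0; acks != 0; ++slot, acks >>= 1) {
        if (acks & 1) {
            done.push_back(slot);
        }
    }
    return done;
}
```

With slots 0, 1, and 3 outstanding and only slot 1 still active, the function reports slots 0 and 3 as complete.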
Informing the kernel thread: handout code
- A std::atomic<int> per command communicates command completion
- ahcistate::wq_ blocks until some command completes
- Enqueuing command:
    // send command, record buffer and status storage
    std::atomic<int> r = E_AGAIN;
    ...clear, push_buffer, issue_ncq...
    slot_status_[0] = &r; // NB kernel-only, not DMA or MMIO
    …
    waiter(p).block_until(wq_, [&] () { return r != E_AGAIN; });
- Handling interrupt:
    …
    slots_outstanding_mask_ &= ~(1U << slot);
    ++nslots_available_;
    if (slot_status_[slot]) {
        *slot_status_[slot] = 0; // means success
        slot_status_[slot] = nullptr;
    }
    …
    wq_.wake_all();
What does prefetching mean?
- Blocking read: A kernel thread needs data and cannot continue until it is available
- Prefetching read: Read in advance of the data being required
- No kernel thread is blocking on the data!
 
Separating concerns in prefetching
- Prefetching policy
- What to prefetch and when?
- Often file system- and workload-specific
- See posix_fadvise, madvise
 
- Prefetching implementation
- How to prefetch?
 
Prefetching thread
- Add a new prefetching kernel thread to the system
- It can read blocks using blocking calls
- Prefetching policy
- Other threads pass it block numbers to prefetch
 
- Prefetching implementation
- Blocks waiting for block numbers
- Prefetches blocks as they arrive
 
- Advantages?
- Disadvantages?
Multiple prefetching threads
- Modify ahcistate::read_or_write so it can use any NCQ slot
- Add multiple prefetching kernel threads to the system
- Advantages?
- Disadvantages?
Nonblocking prefetching
- Add a new function ahcistate::read_or_write_nonblocking that can use any NCQ slot
- Prefetching policy
- Fires off new nonblocking prefetching requests
 
- Must ensure a mechanism to determine prefetching completion
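One possible completion mechanism is a per-request status word that the issuer polls or waits on. The single-threaded sketch below models that idea with a fake disk. The fake_disk type, its methods, and the value chosen for E_AGAIN are all assumptions made for illustration; complete_one stands in for the interrupt handler and no real I/O happens.

```cpp
#include <atomic>
#include <cassert>
#include <deque>

constexpr int E_AGAIN = -11;   // assumed value; means "not finished yet"

// Hypothetical stand-in for a disk that accepts requests without blocking.
struct fake_disk {
    std::deque<std::atomic<int>*> pending_;

    // Queue a request and return immediately; `*status` stays E_AGAIN
    // until the request completes.
    void read_nonblocking(std::atomic<int>* status) {
        status->store(E_AGAIN);
        pending_.push_back(status);
    }

    // Plays the role of the interrupt handler: complete the oldest request
    // and publish success (0) through its status word.
    bool complete_one() {
        if (pending_.empty()) {
            return false;
        }
        pending_.front()->store(0);
        pending_.pop_front();
        return true;
    }
};

// The issuer can check completion at any time without blocking.
inline bool is_ready(const std::atomic<int>& status) {
    return status.load() != E_AGAIN;
}
```

The key property is that read_nonblocking returns before the data is available, so the caller must keep the status word and the buffer alive until the completion is observed.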
Example: Extend bcslot
struct bcentry {
    spinlock lock_;
    std::atomic<int> state_ = state_empty;
    std::atomic<unsigned> ref_ = 0;
    blocknum_t bn_;
    std::atomic<int> ready_;  // new
- Why atomic?
- What code is writing this value?
- What code is reading this value?
Handout code: Waiting for a bcslot to read
- Problem: Multiple kernel threads might want to read the same block
- Solution: Synchronization within bcslot::load
    // load block, or wait for concurrent reader to load it
    while (true) {
        assert(state_ != state_empty);
        if (state_ == state_allocated) { …
            state_ = state_loading;
            lock_.unlock(irqs);
            sata_disk->read(buf_, chkfs::blocksize,
                            bn_ * chkfs::blocksize);
            // which means:
            sata_disk->read_or_write
                (ahcistate::cmd_read_fpdma_queued,
                 buf_, chkfs::blocksize, bn_ * chkfs::blocksize);
            irqs = lock_.lock();
            …
        } else if (state_ == state_loading) {
            waiter(current()).block_until(bc.read_wq_, [&] () {
                    return state_ != state_loading;
                }, lock_, irqs);
        } else …
    }
Nonblocking bcslot::load?
    while (true) {
        if (state_ == state_allocated) { …
            state_ = state_loading;
            ready_ = E_AGAIN;
            sata_disk->read_or_write_nonblocking
                (ahcistate::cmd_read_fpdma_queued,
                 buf_, chkfs::blocksize, bn_ * chkfs::blocksize,
                 &ready_);
            // ????????
            …