Lecture 17: Device interaction and prefetching

Goal: explain how to implement prefetching
Getting there takes a while: device interaction comes first

Chickadee devices

x86-64 CPU
Interrupt controllers
Primary memory (vmiter)
AT keyboard controller (keyboardstate)
Console (console)
Power management controllers (PIIX4/ICH9) (poweroff)
Parallel port (log_printf)
SATA (Serial ATA) AHCI (Advanced Host Controller Interface) (ahcistate)

Device interaction

How can the CPU send commands to the device and read results?
How can the device send commands to the CPU?

Port-mapped I/O

Special instructions for communicating with devices
Involves CPU in every device transaction
Example: Reading a character from the keyboard
```
if ((inb(KEYBOARD_STATUSREG) & KEYBOARD_STATUS_READY) == 0) {
    return -1;
}

uint8_t data = inb(KEYBOARD_DATAREG);
```
- This interface is very, very old
- Present in IBM’s 1984 PC/AT
- inb is an instruction
- KEYBOARD_STATUSREG == 0x64 is a fixed address
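
The polling pattern above can be tried outside a kernel with a stand-in for the port address space. This is only a sketch: `fake_ports` and `keyboard_try_read` are hypothetical names, and the data port (0x60) and ready bit (0x01) are assumed from the conventional AT keyboard controller layout. Real code executes the privileged `inb` instruction instead of a map lookup.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical stand-in for the x86 port address space.
struct fake_ports {
    std::map<uint16_t, uint8_t> regs;

    // Mimics `inb`: read one byte from a port; unset ports read as 0.
    uint8_t inb(uint16_t port) const {
        auto it = regs.find(port);
        return it == regs.end() ? 0 : it->second;
    }
};

constexpr uint16_t KEYBOARD_STATUSREG = 0x64;     // from the notes
constexpr uint16_t KEYBOARD_DATAREG = 0x60;       // assumed
constexpr uint8_t KEYBOARD_STATUS_READY = 0x01;   // assumed

// Returns the pending data byte, or -1 if the device has nothing ready.
// The CPU is involved in both transactions: one read to poll, one to fetch.
int keyboard_try_read(const fake_ports& io) {
    if ((io.inb(KEYBOARD_STATUSREG) & KEYBOARD_STATUS_READY) == 0) {
        return -1;
    }
    return io.inb(KEYBOARD_DATAREG);
}
```

Setting the status and data entries in `regs` plays the role of the device making a byte available.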

Memory-mapped I/O

Device presents CPU with an interface that looks like primary memory
Writing/reading memory sends commands/reads results
Device defines the memory layout
Still involves CPU in every device transaction
Example: Color Graphics Adapter (CGA) console

    uint16_t* console = (uint16_t*) 0xB8000; // fixed physical address
    console[0] = 'H' | 0xC000;
    console[1] = 'i' | 0xC000;
    console[2] = '!' | 0xC000;
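
The `| 0xC000` above sets the attribute byte of each 16-bit cell. A minimal sketch of how a cell is assembled, writing into an ordinary array rather than the real buffer at 0xB8000; `cga_cell` and `cga_write` are hypothetical helpers, not Chickadee functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// One CGA text cell: low byte is the character code, high byte is the
// attribute (foreground color in bits 8-11, background in bits 12-15).
constexpr uint16_t cga_cell(char ch, unsigned fg, unsigned bg) {
    return uint16_t((unsigned char) ch | ((fg & 0xF) << 8) | ((bg & 0xF) << 12));
}

// Write `s` into a console-shaped buffer of `ncells` cells starting at
// `pos`. Stops at the end of the buffer; returns the number of cells written.
size_t cga_write(uint16_t* console, size_t ncells, size_t pos,
                 const char* s, unsigned fg, unsigned bg) {
    assert(console != nullptr);
    size_t n = 0;
    for (; *s && pos < ncells; ++s, ++pos, ++n) {
        console[pos] = cga_cell(*s, fg, bg);
    }
    return n;
}
```

With foreground 0 and background 0xC, `cga_cell('H', 0, 0xC)` yields the same value as `'H' | 0xC000`.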

Devices can combine modes; here, PMIO moves the console cursor

    void console_show_cursor(int cpos) { …
        outb(0x3D4, 14);
        outb(0x3D5, cpos / 256);
        outb(0x3D4, 15);
        outb(0x3D5, cpos % 256);
    }

Discoverable I/O

Device uses MMIO and/or PMIO, but different layout for different PCs
CPU must first discover this machine’s layout by searching configuration state
Example: Turning off the computer by finding a power controller using the PCI bus

    int addr = pci.find([&] (int a) {
            uint32_t vd = pci.readl(a + pci.config_vendor);
            return vd == 0x71138086U /* PIIX4 Power Management Controller */
                || vd == 0x29188086U /* ICH9 LPC Interface Controller */;
        });
    assert(addr >= 0);
    // Read I/O base register from controller's PCI configuration space.
    int pm_io_base = pci.readl(addr + 0x40) & 0xFFC0;
    // Write `suspend enable` to the power management control register.
    outw(pm_io_base + 4, 0x2000);
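
The constants compared against `vd` pack two 16-bit PCI identifiers into one 32-bit read: vendor ID in the low half, device ID in the high half. A small sketch that makes the split explicit; the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// The first 32 bits of PCI configuration space hold the vendor ID
// (low 16 bits) and the device ID (high 16 bits).
constexpr uint16_t pci_vendor(uint32_t vd) {
    return uint16_t(vd & 0xFFFF);
}
constexpr uint16_t pci_device(uint32_t vd) {
    return uint16_t(vd >> 16);
}
// Rebuild the combined value used in comparisons such as `vd == 0x71138086U`.
constexpr uint32_t pci_id(uint16_t vendor, uint16_t device) {
    return uint32_t(vendor) | (uint32_t(device) << 16);
}
```

Both controllers in the search share vendor 0x8086 (Intel) and differ only in the device half.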

Interrupts

Device sends an interrupt to the CPU to indicate a change of state
Chickadee handles interrupts in proc::exception

Direct memory access

Device can access the actual primary memory independent of the CPU
Software sets up structures for controlling the device, based on device specifications
Software configures the device with the addresses of these structures, using MMIO and/or PMIO
- Often using physical addresses
- Allowed addresses may be constrained
Software writes commands to structures, may alert device
Device reads commands from structures and writes results back, may alert CPU
Example: AHCI for disks

AHCI overall layout

AHCI reference

AHCI initialization

Search configuration state for an ATA disk
Read address of “memory registers” (MMIO) from configuration state (PMIO)
- Memory registers use reserved physical addresses, live on device
Configure memory registers with DMA structure addresses
- DMA structures are kernel memory
- Part of struct ahcistate

    ahcistate* ahcistate::find(int addr, int port) {
        auto& pci = pcistate::get();
        for (; addr >= 0; addr = pci.next(addr), port = 0) {
            if (pci.readw(addr + pci.config_subclass) != 0x0106) {
                continue;
            }
            uint32_t pa = pci.readl(addr + pci.config_bar5);
            if (pa == 0) {
                continue;
            }
            auto dr = pa2ka<volatile regs*>(pa); // memory-mapped I/O
            dr->ghc = ghc_ahci_enable; // indicate host is AHCI aware
            // assume port 0 is available
            assert((dr->port_mask & 1U) && dr->p[0].sstatus);
            return knew<ahcistate>(addr, 0, dr);
        }
        // no AHCI controller found
        return nullptr;
    }

    struct ahcistate { …
        // DMA and memory-mapped I/O state
        dmastate dma_;
        int pci_addr_;
        int sata_port_;
        volatile regs* dr_;
        volatile portregs* pr_;
        …
    };

    ahcistate::ahcistate(...) { …
        // place port in idle state
        pr_->command &= ~uint32_t(pcmd_rfis_enable | pcmd_start);
        while (pr_->command & (pcmd_command_running | pcmd_rfis_running)) {
            pause();
        }

        // set up DMA area
        memset(&dma_, 0, sizeof(dma_));
        for (int i = 0; i != 32; ++i) {
            dma_.ch[i].cmdtable_pa = ka2pa(&dma_.ct[i]);
        }
        pr_->cmdlist_pa = ka2pa(&dma_.ch[0]);
        pr_->rfis_pa = ka2pa(&dma_.rfis);
        …
    }

AHCI port structure (one per drive)

Each drive can process up to 32 commands in parallel
Each command has a header in the command list (dma_.ch[i]) and its own command table (dma_.ct[i])

Why 32 commands in parallel?

Allow disk to schedule and reorder requests using information only it knows
Serious performance benefits
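
A toy model of why reordering helps. It assumes the only cost is head travel measured in sectors, which is a simplification: a real disk also schedules around details such as rotational position that only it knows. The names `total_seek` and `sweep_order` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Total head movement needed to serve `reqs` in the given order.
uint64_t total_seek(uint64_t head, const std::vector<uint64_t>& reqs) {
    uint64_t total = 0;
    for (uint64_t r : reqs) {
        total += r > head ? r - head : head - r;
        head = r;
    }
    return total;
}

// One sweep up, then one sweep down: serve requests at or above the head
// in ascending order, then the remaining ones in descending order.
std::vector<uint64_t> sweep_order(uint64_t head, std::vector<uint64_t> reqs) {
    std::sort(reqs.begin(), reqs.end());
    auto mid = std::lower_bound(reqs.begin(), reqs.end(), head);
    std::vector<uint64_t> out(mid, reqs.end());
    for (auto it = mid; it != reqs.begin(); ) {
        out.push_back(*--it);
    }
    return out;
}
```

With the head at sector 50 and requests arriving as 10, 90, 20, 80, arrival order costs 250 sectors of travel in this model, while the sweep order 80, 90, 20, 10 costs 120. The disk can only make this choice if several commands are outstanding at once.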

AHCI command list structure

AHCI command table structure

Reading or writing a block, abstractly

Read: “Dear disk, please read K bytes from disk starting at sector N and write them to memory at address A, and let me know when you have done so”
Write: “Dear disk, please read K bytes from memory starting at address A and write them to disk starting at sector N, and let me know when you have done so”

Reading or writing a block, concretely

Read and write are commands
CPU must:
- Find a slot for the command
- Prepare the slot
- Issue the command
- Wait for response
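
The first step, finding a slot, can be sketched with a 32-bit mask, one bit per NCQ slot. `slot_tracker` is a hypothetical type for illustration; the handout code shown in these notes always uses slot 0.

```cpp
#include <cassert>
#include <cstdint>

// Tracks which of the 32 command slots hold an outstanding command.
struct slot_tracker {
    uint32_t outstanding_mask = 0;

    // Claim the lowest free slot; returns -1 if all 32 are in use.
    int acquire() {
        for (int slot = 0; slot != 32; ++slot) {
            if ((outstanding_mask & (1U << slot)) == 0) {
                outstanding_mask |= 1U << slot;
                return slot;
            }
        }
        return -1;
    }

    // Mark `slot` as available again; it must currently be outstanding.
    void release(int slot) {
        assert(slot >= 0 && slot < 32);
        assert(outstanding_mask & (1U << slot));
        outstanding_mask &= ~(1U << slot);
    }
};
```

In a kernel, a caller that gets -1 would block until the interrupt handler releases a slot, and every access would happen under the driver's lock.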

Reading or writing a block, in code

    clear(0);
    push_buffer(0, buf, sz);
    issue_ncq(0, command, off / sectorsize);

Requires ahcistate::lock_
Expanded (for command cmd_read_fpdma_queued, i.e., read):

    int slot = 0;
    assert(/* `slot` is not currently an active command */);

    // clear
    dma_.ch[slot].nbuf = 0;
    dma_.ch[slot].buf_byte_pos = 0;

    // push_buffer
    dma_.ct[slot].buf[0].pa = ka2pa(buf);
    dma_.ct[slot].buf[0].maxbyte = sz - 1;
    dma_.ch[slot].nbuf = 1;
    dma_.ch[slot].buf_byte_pos = sz;

    // issue_ncq
    size_t first_sector = off / sectorsize;
    size_t nsectors = sz / sectorsize;

    dma_.ct[slot].cfis[0] = cfis_command
        | (unsigned(cmd_read_fpdma_queued) << 16)
        | ((nsectors & 0xFF) << 24);
    dma_.ct[slot].cfis[1] = (first_sector & 0xFFFFFF)
        | (uint32_t(fua) << 31) | 0x40000000U;
    dma_.ct[slot].cfis[2] = (first_sector >> 24)
        | ((nsectors & 0xFF00) << 16);
    dma_.ct[slot].cfis[3] = (slot << 3);

    dma_.ch[slot].flags = 4 /* # words in `cfis` */
        | ch_clear_flag; // would also contain `ch_write_flag` for writes
    dma_.ch[slot].buf_byte_pos = 0;

    // ensure all previous writes have made it out to memory
    std::atomic_thread_fence(std::memory_order_release);

    // tell interface NCQ slot used
    pr_->ncq_active_mask = 1U << slot; // NB!
    // tell interface command available
    pr_->command_mask = 1U << slot;
    // The write to `command_mask` wakes up the device.
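
The shifts and masks in `issue_ncq` spread the starting sector and the sector count across several `cfis` words. A sketch of just that packing, with a matching unpack to check it; `ncq_words`, `pack_ncq`, and the two unpack helpers are hypothetical, and the command and flag bits from the code above are deliberately left out.

```cpp
#include <cassert>
#include <cstdint>

struct ncq_words {
    uint32_t w0_count;  // bits contributed to cfis[0]: low byte of the count
    uint32_t w1;        // bits contributed to cfis[1]: low 24 bits of the sector
    uint32_t w2;        // cfis[2]: high 24 bits of the sector, high count byte
};

// Split a 48-bit sector number and a 16-bit sector count the same way
// the expanded `issue_ncq` code does.
ncq_words pack_ncq(uint64_t first_sector, uint32_t nsectors) {
    assert(first_sector < (uint64_t(1) << 48));
    assert(nsectors <= 0xFFFF);
    ncq_words w;
    w.w0_count = (nsectors & 0xFF) << 24;
    w.w1 = uint32_t(first_sector & 0xFFFFFF);
    w.w2 = uint32_t(first_sector >> 24) | ((nsectors & 0xFF00) << 16);
    return w;
}

// Inverse of pack_ncq, useful for checking the layout.
uint64_t unpack_sector(const ncq_words& w) {
    return uint64_t(w.w1 & 0xFFFFFF) | (uint64_t(w.w2 & 0xFFFFFF) << 24);
}
uint32_t unpack_count(const ncq_words& w) {
    return (w.w0_count >> 24) | ((w.w2 >> 24) << 8);
}
```

The first assertion matters: a sector number of 2^48 or more would spill into the byte of `cfis[2]` that holds the high half of the count.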

I/O parallelism and synchronization

So the command has been issued to the disk. What happens next?
At some point in the future, the disk will perform the command.
When it is done, it signals completion by modifying the MMIO area and alerting the CPU with an interrupt.
While the command is pending, the operating system had better not mess up the command memory!
- Don’t reuse the slot
- Don’t modify the command structures in DMA memory
- Don’t modify the data buffer (buf)

Interrupt processing

Acknowledge to the disk that the interrupt has been received
Mark commands as completed, and therefore available for reuse
Inform kernel thread waiting for result of command

    // obtain lock, read data
    auto irqs = lock_.lock();

    // check interrupt reason, clear interrupt
    assert(/* not error */);
    pr_->interrupt_status = ~0U;
    dr_->interrupt_status = ~0U;

    // acknowledge completed commands
    uint32_t still_active = pr_->ncq_active_mask;
    uint32_t acks = slots_outstanding_mask_ & ~still_active;
    for (int slot = 0; acks != 0; ++slot, acks >>= 1) {
        if (acks & 1) {
            // mark `slot` as successfully completed and available for reuse
            assert(slots_outstanding_mask_ & (1U << slot));
            slots_outstanding_mask_ &= ~(1U << slot);
            ++nslots_available_;
            inform_thread_waiting_for_slot();
        }
    }

    // acknowledge errored commands
    if (is_error) {
        handle_error_interrupt();
    }

    lock_.unlock(irqs); …
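
The acknowledgment loop relies on one piece of mask arithmetic: a slot has completed if the driver thinks it is outstanding but the device no longer reports it active. A standalone sketch of that step, with plain integers standing in for `slots_outstanding_mask_` and `pr_->ncq_active_mask`; `collect_completed` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returns the slots that finished, lowest first, and clears them from
// `outstanding`. `still_active` is what the device currently reports.
std::vector<int> collect_completed(uint32_t& outstanding, uint32_t still_active) {
    std::vector<int> done;
    uint32_t acks = outstanding & ~still_active;
    for (int slot = 0; acks != 0; ++slot, acks >>= 1) {
        if (acks & 1) {
            assert(outstanding & (1U << slot));
            outstanding &= ~(1U << slot);
            done.push_back(slot);
        }
    }
    return done;
}
```

For example, with slots 0, 1, and 3 outstanding (mask 0xB) and the device still working on slot 1 (mask 0x2), slots 0 and 3 are reported complete and only slot 1 remains outstanding.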

Informing the kernel thread: handout code

A volatile int per command communicates command completion
- volatile because the interrupt handler can change it at any time
Threads block on ahcistate::wq_ until some command completes

Enqueuing command:

    // send command, record buffer and status storage
    int slot = 0;
    ...clear, push_buffer, issue_ncq...
    volatile int r = E_AGAIN;
    slot_status_[slot] = &r; // NB kernel-only, not DMA or MMIO
    …
    waiter(p).block_until(wq_, [&] () {
        return r != E_AGAIN;
    });

Processing interrupt:

…
        slots_outstanding_mask_ &= ~(1U << slot);
        ++nslots_available_;
        if (slot_status_[slot]) {
            *slot_status_[slot] = 0; // means success
            slot_status_[slot] = nullptr;
        }

What does prefetching mean?

Blocking read: A kernel thread needs data and cannot continue until it is available
Prefetching read: Read in advance of the data being required
- No kernel thread is blocking on the data!

Separating concerns in prefetching

Prefetching policy
- What to prefetch and when?
- Often file system- and workload-specific
- See posix_fadvise, madvise
Prefetching implementation
- How to prefetch?

Prefetching thread

Add a new prefetching kernel thread to the system
It prefetches blocks using blocking calls
Prefetching policy
- Stores block numbers to prefetch somewhere (lock-protected)
- Wakes up the prefetching thread
Prefetching thread
- Blocks waiting for block numbers
- Prefetches blocks as they arrive
Advantages?
Disadvantages?
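
The hand-off between the policy and the prefetching thread can be sketched as a lock-protected queue of block numbers. This is a user-level illustration only: `prefetch_queue` is hypothetical, `std::mutex` stands in for a kernel spinlock, and the `blocknum_t` typedef is assumed. The actual wakeup and blocking would go through a kernel wait queue.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <mutex>

using blocknum_t = uint32_t;  // assumed width, for illustration

class prefetch_queue {
    std::mutex lock_;
    std::deque<blocknum_t> pending_;

public:
    // Policy side: record a block to prefetch. Returns true if the queue
    // was empty, meaning the prefetching thread may be asleep and should
    // be woken.
    bool request(blocknum_t bn) {
        std::lock_guard<std::mutex> guard(lock_);
        bool was_empty = pending_.empty();
        pending_.push_back(bn);
        return was_empty;
    }

    // Thread side: take the oldest pending block number. Returns false if
    // there is nothing to do, in which case the thread would block.
    bool take(blocknum_t& bn) {
        std::lock_guard<std::mutex> guard(lock_);
        if (pending_.empty()) {
            return false;
        }
        bn = pending_.front();
        pending_.pop_front();
        return true;
    }
};
```

The prefetching thread's loop would call `take`, perform an ordinary blocking read for each block it gets, and block when `take` returns false.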

Multiple prefetching threads

Modify ahcistate::read_or_write so it can use any NCQ slot
Add multiple prefetching kernel threads to the system
Advantages?
Disadvantages?

Nonblocking prefetching

Add a new function ahcistate::read_or_write_nonblocking that can use any NCQ slot
Prefetching policy
- Fires off new nonblocking prefetching requests
Must ensure a mechanism to determine prefetching completion
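
One possible completion mechanism is the same status-word idea the blocking path uses: the caller passes a pointer that the interrupt handler later overwrites. A single-threaded simulation, where `fake_disk`, `read_nonblocking`, and `complete_one` are hypothetical and the numeric value of `E_AGAIN` is a placeholder:

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

constexpr int E_AGAIN = -11;  // placeholder value for this sketch

struct fake_disk {
    struct request {
        uint64_t off;
        volatile int* status;
    };
    std::deque<request> inflight;

    // Queue a read and return immediately; `*status` stays E_AGAIN until
    // the request completes.
    void read_nonblocking(uint64_t off, volatile int* status) {
        assert(status != nullptr);
        *status = E_AGAIN;
        inflight.push_back(request{off, status});
    }

    // Stands in for the interrupt handler: complete the oldest request.
    // Returns false if nothing was in flight.
    bool complete_one() {
        if (inflight.empty()) {
            return false;
        }
        *inflight.front().status = 0;  // 0 means success
        inflight.pop_front();
        return true;
    }
};
```

The caller never waits, so the status word must outlive the request. A local variable like the `r` in the blocking code would not work here, which is one reason to store the status in a longer-lived structure.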

Example: Extend `bcentry`

    struct bcentry {
        std::atomic<int> state_ = state_empty;
        spinlock lock_;
        blocknum_t bn_;
        std::atomic<unsigned> ref_ = 0;

        volatile int ready_;  // new

Why volatile int?
What code is writing this value?
What code is reading this value?

Handout code: Waiting for a `bcentry` to read

Problem: Multiple kernel threads might want to read the same block
Solution: Synchronization within bcentry::load

    // load block, or wait for concurrent reader to load it
    while (true) {
        assert(state_ != state_empty);
        if (state_ == state_allocated) { …
            state_ = state_loading;
            lock_.unlock(irqs);

            sata_disk->read(buf_, chkfs::blocksize,
                            bn_ * chkfs::blocksize);
            // which means:
            sata_disk->read_or_write
                (ahcistate::cmd_read_fpdma_queued,
                 buf_, chkfs::blocksize, bn_ * chkfs::blocksize);

            irqs = lock_.lock();
            …
        } else if (state_ == state_loading) {
            waiter(current()).block_until(bc.read_wq_, [&] () {
                    return state_ != state_loading;
                }, lock_, irqs);
        } else …
    }

Nonblocking `bcentry::load`?

    while (true) {
        if (state_ == state_allocated) { …
            state_ = state_loading;
            ready_ = E_AGAIN;

            sata_disk->read_or_write_nonblocking
                (ahcistate::cmd_read_fpdma_queued,
                 buf_, chkfs::blocksize, bn_ * chkfs::blocksize,
                 &ready_);

            // ????????
            …
