
CS 235 Advanced Operating Systems, Fall 2010

Lab 5: File System and Shell

Handed out Saturday, November 20, 2010
Due Friday, December 10, 2010

Introduction

In this lab, you will implement an exokernel-style file system library, client-side file descriptors, and a Unix-like command shell! The file system uses a shared buffer cache server implemented, in microkernel fashion, using a user-space environment. Other environments access disk blocks by making IPC requests to this special file system environment.

Lab Requirements

You will need to do all of the regular exercises described in the lab. If you complete a challenge problem, provide a short (e.g., one or two paragraph) description of what you did. Place the write-up in a file called answers.txt (plain text) or answers.html (HTML format) in the top level of your lab5 directory before handing in your work to CourseWeb.

Alternately, you may skip some or all of Lab 5 in favor of working on a substantial challenge problem you define on your own. I am excited to see what you come up with! Please contact me with your challenge problem idea so I can help you push it in interesting directions. Challenge problems may be completed by teams of two students. Let me know if you plan to work in a team, and what challenge problem(s) you are interested in.

Merging Lab 5

To fetch the new source, use Git to commit your Lab 4 changes, saving that code in the lab4 branch you created in the last lab. Then fetch the latest version of the course repository and create a local lab5 branch based on our lab5 branch, origin/lab5:

% git branch
... Check that your current branch is 'lab4'.
% git commit -am 'my solution to lab4'
% git fetch
% git checkout -b lab5 origin/lab5
Branch lab5 set up to track remote branch refs/remotes/origin/lab5.
Switched to a new branch "lab5"
% git merge lab4

Or you can download the tarball if you'd like.

Unfortunately, you will need to change code from prior labs. Here is how to resolve some of these changes.

inc/memlayout.h
A user-level definition of struct Page has been introduced to inc/memlayout.h. Your kernel's struct Page in kern/pmap.h, which you may have changed for a previous challenge, must have the same definition. Your kernel will fail to compile if the structures have different sizes. It is safe to move your kern/pmap.h version into inc/memlayout.h. If you do, delete the #define Page UserPage line in inc/memlayout.h and the new static_assert in kern/pmap.c.

Page fault handlers

User-level page fault handlers now support chaining: more than one handler can be installed. The _pgfault_handler function in lib/pgfault.c tries them in order. The first handler that returns non-zero has handled the fault; later handlers are skipped, and the environment is restarted. If all handlers return 0, the environment panics.

You will need to change your lib/fork.c handler to return 0, rather than panic, if the faulting access was not a write to a copy-on-write page. The handler should return 1 on success.
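
In lib/fork.c, the updated handler might look roughly like this (a sketch based on the Lab 4 copy-on-write logic; your names and error handling may differ):

// Sketch: the lib/fork.c handler under the chaining convention.
// Return 0 to decline the fault (another handler may claim it);
// return 1 once the fault has been handled.
static int
pgfault(struct UTrapframe *utf)
{
        void *addr = (void *) utf->utf_fault_va;

        // Decline unless this is a write to a copy-on-write page.
        if (!(utf->utf_err & FEC_WR) || !(vpt[VPN(addr)] & PTE_COW))
                return 0;

        // Copy the page and remap it writable, as in Lab 4.
        addr = ROUNDDOWN(addr, PGSIZE);
        if (sys_page_alloc(0, (void *) PFTEMP, PTE_P|PTE_U|PTE_W) < 0)
                panic("pgfault: sys_page_alloc failed");
        memmove((void *) PFTEMP, addr, PGSIZE);
        if (sys_page_map(0, (void *) PFTEMP, 0, addr, PTE_P|PTE_U|PTE_W) < 0)
                panic("pgfault: sys_page_map failed");
        sys_page_unmap(0, (void *) PFTEMP);
        return 1;       // handled; the environment will be restarted
}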

The git merge command will tell you which files are conflicted, and you should first resolve the conflict (by editing the relevant files) and then commit the resulting files with git commit -a.

Before you start lab 5, make sure that your lab 4 code is still working. make grade-lab4 should give you full credit.

File system code

The JOS file system is implemented in exokernel fashion. A complete file system implementation is linked into each user environment. The disk's contents are mapped into memory that can be shared by any environment. A special buffer cache environment is given the I/O privilege necessary to read and write the disk hardware. The buffer cache accesses the disk on demand and uses JOS's IPC mechanism to share the resulting blocks with other environments. The buffer cache can also lock individual disk blocks, so environments can avoid race conditions.

This exokernel style is unusual. Modern file systems are implemented either inside monolithic kernels or as separate processes (a more microkernel-like style). JOS's design has many issues. Every environment has effective write access to the whole disk (XN solved this problem, but you aren't implementing XN, thank gosh). An unexpected environment crash or infinite loop can leave a block in the locked state indefinitely. However, the exokernel style fits well with the rest of JOS, and lets us focus more on file system internals than (for example) the details of a client-server IPC mechanism.

This file system design is totally new this quarter.

The code for the file system is arranged as follows:

inc/fs.h
Structure and constant definitions for the file system layout (which is shared by both memory and disk), and macros for communication with the buffer cache.
inc/fd.h
The JOS file descriptor interface. File descriptors use an object-oriented design. The struct Dev structure defines operations relevant for a class of file descriptors. File descriptor operations like read() call out to struct Dev class-specific methods to do most of the work. We provide three descriptor classes: file system descriptors, console descriptors, and pipe descriptors.
inc/lib.h
Declares new file descriptor functions like read().
lib/fd.c
File descriptor implementation.
lib/file.c
File system implementation, including the struct Dev methods for file system descriptors.
fs/
Code for the buffer cache.
fs/ide.h, fs/ide.c (no exercises)
Code for accessing an IDE hard drive using programmed I/O instructions.
fs/bufcache.c (no exercises)
The buffer cache environment.

Sectors and blocks

Disks perform reads and writes in units of sectors, which today are almost universally 512 bytes each. File systems, though, allocate and use disk storage in units of blocks. Be wary of the distinction between the two terms: sector size is a property of the disk hardware, whereas block size is an aspect of the operating system using the disk. A file system's block size must be at least the sector size of the underlying disk, but could be greater.

The original UNIX file system used a block size of 512 bytes, the same as the sector size of the underlying disk. Most modern file systems use a larger block size, however, because storage space has gotten much cheaper and it is more efficient to manage storage at larger granularities. Our file system will use a block size of 4096 bytes to match the processor's page size.

The JOS buffer cache

JOS user environments access the file system through a combination buffer cache and lock server called bufcache. You have no exercises to complete in bufcache, but you need to understand its interface (and, of course, you may be interested in it anyway).

The buffer cache responds to IPC requests from other environments. Each request contains a block number and a request type. For most requests, the buffer cache responds by sending back a shared-memory page with the corresponding disk block. All pages are sent with PTE_P|PTE_U|PTE_W|PTE_SHARE permission (PTE_SHARE is described later in the lab). The accompanying IPC value is ≥ 0 on success and an error code < 0 on error.

The simplest requests simply read and write disk blocks.

BCREQ_MAP
Return the block's contents. This request always succeeds, even if the block is locked.
BCREQ_FLUSH
Write the current contents of the block out to disk (because another environment has changed the corresponding memory page). Does not return a page.

File system users also coordinate with one another through the buffer cache's per-block advisory locks. The locks are called advisory because any environment can always get read-write access to any page with BCREQ_MAP. However, file system implementations coordinate their updates by explicitly locking the corresponding blocks. For instance, reads from and writes to a given file should only happen while the corresponding inode block is locked. The requests are as follows.

BCREQ_MAP_WLOCK
Return the block's contents and obtain an exclusive lock. If the block is currently locked by some other environment, the buffer cache will delay its response until that environment unlocks. (The lock queue can hold up to 8 environments per block; the 9th and later environments are rejected immediately with an -E_AGAIN error.)
BCREQ_UNLOCK
Unlock a block previously locked by BCREQ_MAP_WLOCK. Does not return the block's contents. The buffer cache checks to make sure that the unlocking environment actually held a lock.
BCREQ_UNLOCK_FLUSH
Combines the effects of BCREQ_UNLOCK and BCREQ_FLUSH.
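
Putting the pieces together, a locked update of block blockno looks roughly like this in terms of Lab 4's IPC primitives. (A sketch: BCREQ_MAKE is a hypothetical macro standing in for however inc/fs.h actually encodes a request type and block number into the IPC value.)

// Sketch of the locking protocol. BCREQ_MAKE(type, blockno) is a
// made-up encoding macro; see inc/fs.h for the real request macros.
void *va = (void *) (FSMAP + blockno * BLKSIZE);
int32_t r;

// Map the block and take its exclusive lock; the buffer cache
// delays its reply while another environment holds the lock.
ipc_send(ENVID_BUFCACHE, BCREQ_MAKE(BCREQ_MAP_WLOCK, blockno), 0, 0);
if ((r = ipc_recv(0, va, 0)) < 0)
        panic("BCREQ_MAP_WLOCK: %e", r);

// ... modify the block through va ...

// Flush the block to disk and release the lock.
ipc_send(ENVID_BUFCACHE, BCREQ_MAKE(BCREQ_UNLOCK_FLUSH, blockno), 0, 0);
if ((r = ipc_recv(0, 0, 0)) < 0)
        panic("BCREQ_UNLOCK_FLUSH: %e", r);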

Other IPCs: You probably won't need to use these.

The buffer cache supports shared locks with BCREQ_MAP_RLOCK. This could be useful for some operations (for instance, letting multiple environments read the same file at once), but we recommend you rely on BCREQ_MAP_WLOCK at first.

The buffer cache also tracks whether blocks have been initialized. Each block has an initialization state that starts at 0. A BCREQ_MAP_[RW]LOCK IPC returns the corresponding block's initialization state. The BCREQ_INITIALIZE IPC sets a block's initialization state to 1. The buffer cache remembers initialization states for as long as it runs. You won't need to manipulate initialization states in the regular exercises.

File system data structures

The file system you will work with is much simpler than most "real" file systems, but it is powerful enough to provide the basics: creating, reading, writing, and deleting files organized in a hierarchical directory structure. Since JOS is a "single-user" operating system, our file system doesn't support the UNIX notions of file ownership or permissions. It also currently does not support hard links, symbolic links, time stamps, or special device files.

Most UNIX file systems divide available disk space into two main types of regions: inode regions and data regions. Each file corresponds to one inode, which holds critical metadata about the file such as its stat attributes and pointers to its data blocks. The data regions are divided into much larger (typically 8KB or more) data blocks, within which the file system stores file data and directory metadata. Directory entries contain file names and pointers to inodes; a file is said to be hard-linked if multiple directory entries in the file system refer to that file's inode.

Both files and directories logically consist of a series of data blocks, which may be scattered throughout the disk much like the pages of an environment's virtual address space can be scattered throughout physical memory. User processes can read and write the contents of files directly, but the file system handles all modifications to directories itself as a part of actions such as file creation and deletion. Our file system does, however, allow user environments to read directory metadata directly (e.g., with read), so user environments can perform directory scanning operations themselves (e.g., to implement the ls program). The disadvantage of this approach to directory scanning, and the reason most modern UNIX variants discourage it, is that it makes application programs dependent on the format of directory metadata, making it difficult to change the file system's internal layout without changing or at least recompiling application programs as well.

Superblocks

[Figure: layout of a JOS file system with N blocks and I inodes. N must be at least 1+⌈N/4096⌉+I; only I-1 inode blocks are required because 0 is an invalid inode number, so inode 0 isn't stored.]

File systems typically reserve certain disk blocks, at "easy-to-find" locations on the disk such as the very start or the very end, to hold metadata describing properties of the file system as a whole, such as the block size, disk size, any metadata required to find the root directory, the time the file system was last mounted, the time the file system was last checked for errors, and so on. These special blocks are called superblocks.

Our file system's superblock layout is defined by struct Super in inc/fs.h. The file system superblock will always occupy block 1 on the disk; boot loaders and partition tables use block 0, so most file systems don't use the very first disk block. Many "real" file systems maintain multiple superblocks, replicated throughout several widely-spaced regions of the disk, so that if one of them is corrupted or the disk develops a media error in that region, the other superblocks can still be found and used to access the file system.

Freemap: Managing block allocation

Just as the kernel manages physical memory allocation so that physical pages aren't inappropriately reused, a file system must manage disk blocks to ensure that a given block is used for only one purpose at a time. Many file systems keep track of free disk blocks using a bitmap rather than a linked list of free blocks. A bitmap simplifies block placement (finding a free block in a particular disk region), is simple to manage and keep consistent, and can be loaded into memory with few seeks. Though some operations are slow with a bitmap—it can take O(N) time to find a free block—they can be sped up using auxiliary memory data structures.

The JOS file system tracks whether each block is allocated using an array of bytes, not bits. The Ith byte in the freemap data structure is 1 iff block I is free. Using bytes rather than bits wastes space, but makes freemap operations much easier to code (no bit swizzling). To set up a freemap, we reserve a contiguous region of blocks large enough to hold one byte for each disk block, starting at block 2 (just after the superblock). Thus, we must reserve one block for the freemap for every 4096 blocks in the file system. Note that the freemap includes bytes for all blocks, including the superblock and the freemap itself. The bytes for these special blocks are set to 0, indicating that the corresponding blocks are in use.
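
Because the whole disk is mapped contiguously in the file system region (FSMAP, described under "Demand-paged buffer cache" below), finding a block's freemap byte is one array index (a sketch; the helper name is made up):

// Sketch: locating block b's freemap byte. The freemap starts at
// disk block 2, so block b's byte is freemap byte b, stored in
// disk block 2 + b/4096 at offset b%4096.
static inline uint8_t *
freemap_byte(uint32_t b)
{
        uint8_t *freemap = (uint8_t *) (FSMAP + 2 * BLKSIZE);
        return &freemap[b];     // *freemap_byte(b) == 1 iff block b is free
}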

Inodes

The layout of a JOS inode is described by struct Inode in inc/fs.h. The inode includes the file's size, type (regular file or directory), reference count, and pointers to the blocks comprising the file. For simplicity we will use this one Inode structure to represent file metadata as it appears both on disk and in memory. Some of its fields are only meaningful in memory, and might have garbage values on disk; we must initialize these fields whenever we read an Inode structure into memory for the first time. (That's what BCREQ_INITIALIZE is for.)

Each Inode contains two reference counts. First, i_refcount is the "true" reference count; it measures the number of hard links (directory entries) pointing to the inode. (For the root directory, it is 1.) In contrast, the i_opencount value is only valid in memory. It counts the number of references to the inode from any currently running process. A file's data blocks are not reclaimed until its inode is unreferenced from the file system and no process has the file open.

A single Inode structure is 4096 bytes big. This is much larger than for most file systems, and wastes a lot of space for small files. However, since the buffer cache's unit of locking is a single block, it is extremely convenient to have a separate block for each inode.

The i_direct array in struct Inode contains space to store the block numbers for the file's data blocks. There are 1018 direct pointers, limiting files to at most 4169728 bytes. Most Unix-like file systems also use indirect pointers (and doubly-indirect, triply-indirect, and so on) to support larger files. You may implement these pointers for a challenge, but our inodes are big enough to support pretty large files without indirect pointers.
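
Pulling these fields together, the structure has roughly this shape (a paraphrase, not the real definition; inc/fs.h is authoritative for names, types, and any fields omitted here):

// Rough shape of struct Inode. i_size is an assumed name for the
// size field, and other fields may exist; see inc/fs.h.
struct Inode {
        uint32_t i_ftype;               // regular file or directory
        uint32_t i_refcount;            // hard link count (valid on disk)
        uint32_t i_opencount;           // open references (memory only)
        off_t i_size;                   // file size in bytes
        uint32_t i_direct[1018];        // direct data block pointers
        // ... possibly more fields; the whole structure is 4096 bytes
};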

If an inode is unreferenced (i_refcount == 0 && i_opencount == 0), then the rest of its contents are ignored. In particular, any nonzero i_direct entries may refer to blocks that are free or in use by other inodes.

Inode 1 corresponds to the file system's root directory. All other inodes have numbers 2 or higher.

Directories and regular files

An Inode in our file system can represent either a regular file or a directory; these two types of "files" are distinguished by the i_ftype field. The file system manages regular files and directory-files in exactly the same way, except that it does not interpret the contents of the data blocks associated with regular files at all, whereas the file system interprets the contents of a directory-file as a series of Direntry structures describing the files and subdirectories within the directory.

Each Direntry contains a file name (de_name + de_namelen) and an inode number (de_inum). The file name is only valid if de_inum is nonzero.
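
So a directory entry has roughly this shape (a paraphrase; see inc/fs.h for the real definition and the actual name-length constant):

// Rough shape of struct Direntry; MAXNAMELEN is an assumed constant.
struct Direntry {
        uint32_t de_inum;               // inode number; 0 means entry unused
        uint32_t de_namelen;            // length of de_name
        char de_name[MAXNAMELEN];       // file name
};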

Part A: The File System

Disk Access

The file system server in our operating system needs to be able to access the disk, but we have not yet implemented any disk access functionality in our kernel. Instead of taking the conventional "monolithic" operating system strategy of adding an IDE disk driver to the kernel along with the necessary system calls to allow the file system to access it, we will instead implement the IDE disk driver as part of the user-level buffer cache environment. We will still need to modify the kernel slightly, in order to set things up so that the buffer cache has the privileges it needs to implement disk access itself.

It is easy to implement disk access in user space this way as long as we rely on polling, "programmed I/O" (PIO)-based disk access and do not use disk interrupts. It is possible to implement interrupt-driven device drivers in user mode as well (the L3 and L4 kernels do this, for example), but it is more difficult since the kernel must field device interrupts and dispatch them to the correct user-mode environment.

The x86 processor uses the IOPL bits in the EFLAGS register to determine whether protected-mode code is allowed to perform special device I/O instructions, such as IN and OUT. The IOPL bits equal the minimum (i.e. numerically highest) privilege level allowed to perform IN and OUT instructions, so if those bits are 0, only the kernel can execute INs and OUTs. All of the IDE disk registers we need to access are located in the x86's I/O space (rather than memory-mapped I/O space), so to let the file system environment access the disk, all we need to do is manipulate the IOPL bits. But no other environment should be able to access I/O space.

To keep things simple, from now on we will arrange things so that the buffer cache always has ID ENVID_BUFCACHE.

Exercise 0. Did you resolve struct Page and the new chaining page fault handlers (see "Merging Lab 5" above)? Just checking!

Exercise 1. Modify your kernel's environment initialization function, env_alloc in env.c, so that it gives environment ENVID_BUFCACHE I/O privilege, but never gives that privilege to any other environment.

After this exercise, make run-testfile should print a message "bufcache can do I/O".

Do you have to do anything else to ensure that this I/O privilege setting is saved and restored properly when you subsequently switch from one environment to another? Make sure you understand how this environment state is handled.
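
The change itself is small; a minimal sketch, assuming JOS's usual FL_IOPL_3 constant from inc/mmu.h:

// In env_alloc (kern/env.c), after the trap frame is set up:
// grant I/O privilege to the buffer cache environment only.
if (e->env_id == ENVID_BUFCACHE)
        e->env_tf.tf_eflags |= FL_IOPL_3;       // allow IN/OUT from CPL 3

Since tf_eflags is saved and restored with the rest of the trap frame on every environment switch, a setting made here follows the environment automatically.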

This lab uses the file obj/kernel.img as the image for disk 0 (typically "Drive C" under DOS/Windows), as before, and the (new) file obj/fs.img as the image for disk 1 ("Drive D"). In this lab your file system should only ever touch disk 1; disk 0 is used only to boot the kernel. If you manage to corrupt either disk image in some way, you can reset both of them to their original, "pristine" versions simply by typing:

$ rm obj/kernel.img obj/fs.img
$ make
Challenge! Implement interrupt-driven IDE disk access, with or without DMA. You can decide whether to move the device driver into the kernel, keep it in user space along with the file system, or even (if you really want to get into the microkernel spirit) move it into a separate environment of its own.

Demand-paged buffer cache

The main JOS buffer cache is stored, of course, in the buffer cache environment. The 2GB region of virtual address space from 0x50000000 (DISKMAP) up to 0xD0000000 (DISKMAP + DISKSIZE) is reserved to map disk pages. These pages are read on demand based on IPC requests.

For simplicity, other user environments use the same virtual memory region to map buffer cache blocks, although they use different names (FSMAP and FSMAP + DISKSIZE). These blocks are demand paged. If a page fault happens in the file system region, a page fault handler will load the corresponding page from the buffer cache by IPC.
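
One possible shape for that handler, following the chaining convention from "Merging Lab 5" above (a sketch; BCREQ_MAKE is again a stand-in for the real request-encoding macros in inc/fs.h):

// Sketch of bcache_pgfault_handler (lib/file.c).
static int
bcache_pgfault_handler(struct UTrapframe *utf)
{
        uint32_t va = utf->utf_fault_va;
        uint32_t blockno;

        if (va < FSMAP || va >= FSMAP + DISKSIZE)
                return 0;       // not our region; let another handler try

        blockno = (va - FSMAP) / BLKSIZE;
        // Ask the buffer cache for this block; map the reply page
        // directly at the faulting page's address.
        ipc_send(ENVID_BUFCACHE, BCREQ_MAKE(BCREQ_MAP, blockno), 0, 0);
        if (ipc_recv(0, (void *) ROUNDDOWN(va, PGSIZE), 0) < 0)
                return 0;       // IPC failed; decline the fault
        return 1;               // handled
}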

Exercise 2. Implement the bcache_pgfault_handler function in lib/file.c.

testfile should print "initial fsck is good" when you get this right.

File descriptors

Unix file descriptors are a general notion that encompasses file I/O, pipes, console I/O, etc. In JOS, each of these device types has a corresponding struct Dev, with pointers to the functions that implement read/write/etc. for that device type. (Thus, struct Dev is like an object-oriented class.) lib/fd.c implements the general Unix-like file descriptor interface on top of this. Each struct Fd indicates its device type, and most of the functions in lib/fd.c simply dispatch operations to functions in the appropriate struct Dev.
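
Concretely, struct Dev is a table of function pointers, roughly like this (a paraphrase; see inc/fd.h for the real member list and signatures):

// Rough shape of struct Dev; member names and signatures approximate.
struct Dev {
        int dev_id;             // device class identifier
        const char *dev_name;   // "file", "cons", "pipe"
        ssize_t (*dev_read)(struct Fd *fd, void *buf, size_t n, off_t off);
        ssize_t (*dev_write)(struct Fd *fd, const void *buf, size_t n, off_t off);
        int (*dev_close)(struct Fd *fd);
        int (*dev_stat)(struct Fd *fd, struct Stat *st);
};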

lib/fd.c also maintains the file descriptor table region in each environment's address space, starting at FDTABLE. This area reserves a page's worth (4KB) of address space for each of the up to NFD (currently 32) file descriptors the application can have open at once. At any given time, a particular file descriptor table page is mapped if and only if the corresponding file descriptor is in use. Each file descriptor also has an optional "data page" in the region starting at FDDATA, which we will use for pipes.
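
The address arithmetic behind this layout is simple (a sketch; the real conversion macros live in inc/fd.h, and the FDDATA spacing is an assumption):

// Hypothetical fd-to-address helpers; the real macros are in inc/fd.h.
#define FD_ADDR(i)       ((struct Fd *) (FDTABLE + (i) * PGSIZE))
#define FD_DATA_ADDR(i)  ((void *) (FDDATA + (i) * PGSIZE))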

For nearly all interactions with files, user code will go through the functions in lib/fd.c.

Exercise 3 (no code). Look over and analyze the code in inc/fd.h and lib/fd.c. To check your understanding, see if you can answer some questions (no need to write up the answers): When is memory mapped at a location in FDTABLE? What virtual memory features does fd_find_unused rely on? If file descriptors were implemented as C++ or Java classes, what would be their virtual functions?

File system interface

Each device type has at least one user-visible function that cannot be implemented generically: the function for opening a new file descriptor. Now you must implement this, and the rest of the incomplete functions in lib/file.c. When you're done, you'll have a working file system!

You will use the buffer cache's locking primitives to prevent race conditions between environments. To minimize the risk of deadlock, you should ensure that locks are held only during the execution of lib/file.c functions. In other words, no locks should be held when one of these interface functions returns to its caller.

The locking protocol makes sure that important file system metadata doesn't change during an operation. For example, it ensures that the Direntry pointer returned by dir_walk is actually for the right file name. (Without locking, another environment could potentially run, unlink the name, and create another file at the same position in the directory, creating a race.) It also ensures that the file size does not change during a read operation, and that write operations do not conflict.

Exercise 4. Start implementing open in lib/file.c. It must find an unused file descriptor using fd_find_unused(), walk the path hierarchy, open the corresponding inode, and create a new file descriptor on success. Be sure your code fails gracefully if the maximum number of files is already open, or if any of the IPC requests to the file server fail.

For hints on style, consider the unlink implementation.

testfile should now pass the open and file_stat tests.
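
Once open works, you can exercise the interface end to end with code like this (/motd is a file on the lab's fs.img; see the icode output in Part B):

// Exercising the new interface: print /motd to the console.
char buf[128];
int fd, n;

if ((fd = open("/motd", O_RDONLY)) < 0)
        panic("open /motd: %e", fd);
while ((n = read(fd, buf, sizeof buf)) > 0)
        sys_cputs(buf, n);      // raw console output; see also fprintf, below
close(fd);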

You will now fill out the rest of the missing pieces of the file implementation. Feel free to work in any order, but the exercises guide you to pass the testfile tests in order.

Exercise 5. Implement devfile_read in lib/file.c.

testfile should now pass the file_read and file_read across a block boundary tests.

Exercise 6. Implement devfile_write in lib/file.c.

testfile should now pass the file_write and file_read after file_write tests.

Exercise 7. Implement block allocation and inode initialization. Write block_alloc in lib/file.c and add O_CREAT support to open.

testfile should now pass the file_write create and file_read after file_write create tests.

Exercise 8. Implement block freeing. Complete inode_close, devfile_close, and inode_set_size in lib/file.c and add O_TRUNC support to open.

testfile should now pass the final fsck test.

Challenge! The JOS file system locking protocol depends on the fact that different inodes are given different locks. If a lock covered more than one struct Inode, we would risk deadlock: two processes executing path_walk concurrently could create a circular wait. Fix this, and support smaller inodes, by writing a more generic lock server.

Challenge! The buffer cache cannot recover if an environment dies while holding a lock. Implement a revocation protocol that allows it to reclaim locks. The simplest protocol would simply check whether an environment had died while holding a lock, and revoke the lock if so. A more complex protocol might explicitly revoke locks from environments after some amount of time (a technique related to leases).

Challenge! Change the lib/file.c locking protocol to use read locks when possible. Make sure that your locking protocol prevents race conditions and avoids deadlock. Some operations will still require write locks---maybe more than you'd first expect. For example, the exclusive locks obtained during read operations also protect the file descriptor f_offset field from race conditions on concurrent updates---each read call updates the f_offset independently. If you switch read to use read locks, you'll need to protect f_offset a different way.

Challenge! Change the lib/file.c locking protocol to use read-copy-update when possible.

Challenge! The block cache has no eviction policy. Once a block is read, it never gets removed and will remain in memory forever. Add eviction to the buffer cache and lib/file.c. The buffer cache cannot evict a page that is still mapped by another environment, but it should be able to evict any other page. You may need to add additional IPC calls to allow environments to suggest pages to evict. In those environments, page table "accessed" bits, which the hardware sets on any access to a page, can track approximate usage of disk blocks without the need to modify every place in the code that accesses the disk map region. Be careful with dirty blocks.

Challenge! The file system code uses synchronous writes to keep the file system fairly consistent in the event of a crash. Implement soft updates or journaling instead.

Challenge! Implement an XN-like system to protect disk blocks from inappropriate updates.

Challenge! Add support to the file server and the client-side code for files greater than 4MB in size.

Challenge! Add file system interface functions to create hard links and subdirectories.

Challenge! Change the file system design to support more than one file descriptor per page.

Challenge! Implement the file system in a microkernel-like design. This will require major IPC changes as well as buffer cache changes.

Part B: Spawning Processes from the File System

In this exercise, you'll extend spawn from Lab 4 to load program images from the file system as well as from kernel binary images. If spawn is passed a binary name like "/ls" that begins with a slash, it will read the program data from disk; otherwise, it will read the program data from the kernel. Luckily, this requires just a couple of changes.

Exercise 9. Change your spawn in lib/spawn.c to open and read from a file, rather than looking up a kernel binary with sys_program_lookup, if the first character of progname is a slash '/'. Also close the file descriptor before exiting. See the new "LAB 5 EXERCISE" comment.
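
The change has this rough shape (a sketch; the elided parts are your existing Lab 4 ELF-loading code):

// In spawn (lib/spawn.c): choose the program source.
int fd = -1;

if (progname[0] == '/') {
        // Load the ELF image from the file system.
        if ((fd = open(progname, O_RDONLY)) < 0)
                return fd;
        // ... read the ELF header and segments via read()/seek() on fd ...
} else {
        // ... look up a kernel binary with sys_program_lookup, as in Lab 4 ...
}
// ... set up the child, then close(fd) before returning ...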

Use make run-icode to test your code. This program spawns off init, which should print some messages like this:

icode: close /motd
icode: spawn /init
[00001001] new env 00001002
init: running
init: data seems okay
init: bss seems okay
init: args: 'init' 'initarg1' 'initarg2'
init: exiting
[00001002] exiting gracefully
[00001002] free env 00001002
icode: exiting
[00001001] exiting gracefully
[00001001] free env 00001001

Sharing pages between environments

We would like to share file descriptor state across fork and spawn, but file descriptor state is kept in user-space memory. Right now, on fork, the memory will be marked copy-on-write, so the state will be duplicated rather than shared. (This means that running "(date; ls) >file" will not work properly, because even though date updates its own file offset, ls will not see the change.) On spawn, the memory will be left behind, not copied at all. (Effectively, the spawned environment starts with no open file descriptors.)

We will change both fork and spawn to know that certain regions of memory are used by the "library operating system" and should always be shared. Rather than hard-code a list of regions somewhere, we will set an otherwise-unused bit in the page table entries (just like we did with the PTE_COW bit in fork).

We have defined a new PTE_SHARE bit in inc/lib.h. If a page table entry has this bit set, then by convention, the PTE should be copied directly from parent to child in both fork and spawn. Note that this is different from marking it copy-on-write: as described in the first paragraph, we want to make sure to share updates to the page.

Exercise 10. Change your duppage code in lib/fork.c to follow the new convention. If the page table entry has the PTE_SHARE bit set, just copy the mapping directly, regardless of whether it is marked writable or copy-on-write. (This could be a one-line change, depending on your current code!)
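
A sketch of the new duppage case (the PTE_SHARE test should come before the copy-on-write logic, since PTE_SHARE wins regardless of the other bits):

// In duppage (lib/fork.c): share PTE_SHARE pages outright.
void *addr = (void *) (pn << PGSHIFT);
pte_t pte = vpt[pn];

if (pte & PTE_SHARE) {
        // Copy the mapping verbatim, keeping the write and share bits.
        return sys_page_map(0, addr, envid, addr,
                            pte & (PTE_P | PTE_U | PTE_W | PTE_SHARE));
}
// ... otherwise continue with your Lab 4 copy-on-write logic ...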

Exercise 11. Change spawn in lib/spawn.c to propagate PTE_SHARE pages. After it finishes setting up the child virtual address space but before it marks the child runnable, it should call copy_shared_pages, which loops through all the page table entries in the current process, copying any mappings that have the PTE_SHARE bit set. You'll just need to modify spawn so that it calls copy_shared_pages (a one-line change). Make sure that you copy the shared pages very near the end of the function, after closing the file descriptor corresponding to the ELF binary! (Why?)
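
For reference, the provided copy_shared_pages behaves roughly like this (a paraphrase of its loop, not the exact code):

// Paraphrase of copy_shared_pages: copy every PTE_SHARE mapping
// below UTOP into the child, verbatim.
static int
copy_shared_pages(envid_t child)
{
        uintptr_t addr;
        pte_t pte;
        int r;

        for (addr = 0; addr < UTOP; addr += PGSIZE) {
                if (!(vpd[PDX(addr)] & PTE_P))
                        continue;       // no page table maps this range
                pte = vpt[VPN(addr)];
                if ((pte & PTE_P) && (pte & PTE_SHARE))
                        if ((r = sys_page_map(0, (void *) addr, child, (void *) addr,
                                              pte & (PTE_P|PTE_U|PTE_W|PTE_SHARE))) < 0)
                                return r;
        }
        return 0;
}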

Use make run-testpteshare to check that your code is behaving properly. You should see lines that say "fork handles PTE_SHARE right" and "spawn handles PTE_SHARE right".

Use make run-testfdsharing to check that file descriptors are shared properly. You should see lines that say "read in child succeeded" and "read in parent succeeded".

Part C: A Shell

In this part of the lab, you'll extend JOS to handle everything necessary to support a shell. We've done a lot of the work for you, but you must (1) make it possible to share file descriptors across environments, (2) clean up a couple loose ends, and (3) implement file redirection in the shell.

Before going further, enable keyboard interrupt handling in your kernel.

Exercise 12. Change trap in kern/trap.c to call kbd_intr() every time interrupt number IRQ_OFFSET+1 occurs. (This should be a three-line change.)
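
Roughly (a sketch; put it wherever your trap code dispatches the other interrupts):

// In trap (kern/trap.c), with the other interrupt cases:
if (tf->tf_trapno == IRQ_OFFSET + 1) {
        kbd_intr();
        return;
}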

At this point, you can use make run-initsh to boot into the current version of the shell, which can already do simple commands like "ls". As you progress through the lab, the shell will become more functional, and you will be able to do things like add redirections.

Pipes

Pipes and the console are both I/O stream interfaces. This means that they support reading and/or writing, but not file positions. Like Unix, JOS represents these streams using file descriptors. To support this, the file descriptor subsystem uses a simple virtual file system layer, implemented by struct Dev, so that disk files, console files, and pipes all implement the same file descriptor functions.

A pipe is a shared data buffer accessed via two file descriptors, one for writing data into the pipe and one for reading data out of it. Unix command lines like "ls | sort" use pipes. The shell creates a pipe, hooks up ls's standard output to the write end of the pipe, and hooks up sort's standard input to the read end of the pipe. As a result, ls's output is processed by sort. You may want to read the pipe manual page for background, and the pipe section of Dennis Ritchie's UNIX history paper for interesting history.

In Unix-like designs, each pipe's shared data buffer is stored in the kernel. Of course, this is not how we implement pipes on an exokernel! Your library operating system represents a pipe, including its shared buffer, by a single struct Pipe. The struct Pipe is stored on its own page to make sharing easier, and mapped into the file mapping area of both the reading and the writing file descriptor. Here's the structure:

#define PIPEBUFSIZ 32
struct Pipe {
        off_t p_rpos;                    // read position
        off_t p_wpos;                    // write position
        uint8_t p_buf[PIPEBUFSIZ];       // shared buffer
};

This is a simple lock-free queue structure. The pipe starts in this state:

p_rpos = 0 ---+
p_wpos = 0 ---|+
              VV
            +---+---+---+---+---+---+---+- ... -+---+---+---+---+
    p_buf:  |   |   |   |   |   |   |   |       |   |   |   |   |
            +---+---+---+---+---+---+---+- ... -+---+---+---+---+
              0   1   2   3   4   5   6           28  29  30  31

The bytes written to the pipe can be thought of as numbered starting from 0. The write position p_wpos gives the number of the next byte that will be written, and the read position p_rpos gives the number of the next byte to be read. After a writer writes "abc" to the pipe, it will enter this state:

p_rpos = 0 ---+
p_wpos = 3 ---|-----------+
              V           V
            +---+---+---+---+---+---+---+- ... -+---+---+---+---+
    p_buf:  | a | b | c |   |   |   |   |       |   |   |   |   |
            +---+---+---+---+---+---+---+- ... -+---+---+---+---+
              0   1   2   3   4   5   6           28  29  30  31

Since p_rpos != p_wpos, the pipe contains data; subsequent reads will return those 3 characters. For example, after a read() of one byte:

p_rpos = 1 -------+
p_wpos = 3 -------|-------+
                  V       V
            +---+---+---+---+---+---+---+- ... -+---+---+---+---+
    p_buf:  |   | b | c |   |   |   |   |       |   |   |   |   |
            +---+---+---+---+---+---+---+- ... -+---+---+---+---+
              0   1   2   3   4   5   6           28  29  30  31

This data structure is safe for concurrent updates as long as there is a single reader and a single writer, since only the reader updates p_rpos and only the writer updates p_wpos.

Since the pipe buffer is not infinite, byte i is stored in pipe buffer index i % PIPEBUFSIZ. Thus, after a couple reads and writes, the pipe might enter this state:

p_rpos = 30 ----------------------------------------------+
p_wpos = 33 ------+                                       |
                  V                                       V
            +---+---+---+---+---+---+---+- ... -+---+---+---+---+
    p_buf:  | $ |   |   |   |   |   |   |       |   |   | ! | @ |
            +---+---+---+---+---+---+---+- ... -+---+---+---+---+
              0   1   2   3   4   5   6           28  29  30  31

Note that byte 32 was stored in slot 0.

If p_rpos == p_wpos, the pipe is empty. Any read call should yield until a writer adds information to the pipe. Similarly, if p_wpos - p_rpos == PIPEBUFSIZ, the pipe is full. Any write call should yield until a reader opens up some space in the pipe.
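
The read code we provide follows these rules directly; it behaves roughly like this (a lightly condensed paraphrase of lib/pipe.c, not the exact code):

// Paraphrase of the provided pipe read code.
static ssize_t
devpipe_read(struct Fd *fd, void *vbuf, size_t n, off_t offset)
{
        struct Pipe *p = (struct Pipe *) fd2data(fd);
        uint8_t *buf = vbuf;
        size_t i;

        for (i = 0; i < n; i++) {
                while (p->p_rpos == p->p_wpos) {
                        // Empty: return what we have; report EOF (0) if no
                        // writers remain; otherwise wait for a writer.
                        if (i > 0 || _pipeisclosed(fd, p))
                                return i;
                        sys_yield();
                }
                buf[i] = p->p_buf[p->p_rpos % PIPEBUFSIZ];
                p->p_rpos++;
        }
        return i;
}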

Closed Pipes

There is a catch -- maybe we are trying to read from an empty pipe but all the writers have exited. Then there is no chance that there will ever be more data in the pipe, so waiting is futile. In such a case, Unix signals end-of-file by returning 0. So will we. To detect that there are no writers left, we could put reader and writer counts into the pipe structure and update them every time we fork or spawn and every time an environment exits. This is fragile -- what if the environment doesn't exit cleanly? Instead we can use the kernel's page reference counts, which are guaranteed to be accurate.

Recall that the kernel page structures are mapped read-only in user environments. The library function pageref(void *ptr) returns the number of page table references to the page containing the virtual address ptr. It works by first examining vpt[] to find ptr's physical address, then looking up the relevant struct Page in the UPAGES array and returning its pp_ref field. So, for example, if fd is a pointer to a particular struct Fd, pageref(fd) will tell us how many different references there are to that structure.
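
In code, pageref amounts to a couple of lookups in those read-only structures (a paraphrase; we assume the user-visible struct Page array is called pages and that PPN extracts a physical page number from a PTE, as in the kernel):

// Paraphrase of the provided pageref library function.
int
pageref(void *v)
{
        pte_t pte;

        if (!(vpd[PDX(v)] & PTE_P))
                return 0;               // no page table maps v
        pte = vpt[VPN(v)];
        if (!(pte & PTE_P))
                return 0;               // page not present
        return pages[PPN(pte)].pp_ref;  // the kernel's reference count
}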

Three pages are allocated for each pipe: the struct Fd for the reading file descriptor rfd, the struct Fd for the writing file descriptor wfd, and the struct Pipe p shared by both. The struct Pipe page is mapped once per file descriptor reference. Thus, the following equation holds: pageref(rfd) + pageref(wfd) = pageref(p). A reader can check whether there are any writers left by examining these counts. If pageref(p) == pageref(rfd), then pageref(wfd) == 0, and there are no more writers. A writer can check for readers in the same manner.

Exercise 13. Implement pipes in lib/pipe.c. We've included the code for reading from a pipe for you. You must write the code for writing to a pipe, and the code for testing whether a pipe is closed. Run make run-testpipe to check your work; you should see a line "pipe tests passed".

Pipe Races

File descriptor structures use shared memory that is written concurrently by multiple processes. That creepy shiver that just ran up your back is justified: this kind of situation is rife with race conditions. We've made one race condition, concerning pipes, particularly easy to run into.

The race is that the two calls to pageref() in _pipeisclosed might not happen atomically. If another process duplicates or closes the file descriptor page between the two calls, the comparison will be meaningless. To make it concrete, suppose that we run:

	pipe(p);
	if (fork() == 0) {
		close(p[1]);
		read(p[0], buf, sizeof(buf));
	} else {
		close(p[0]);
		write(p[1], msg, strlen(msg));
	}

The following might happen:

  1. The child runs first after the fork. It closes p[1] and then tries to read from p[0]. The pipe is empty, so read checks to see whether the pipe is closed before yielding. Inside _pipeisclosed, pageref(fd) returns 2 (both the parent and the child have p[0] open), but then a clock interrupt happens.
  2. Now the kernel chooses to run the parent for a little while. The parent closes p[0] and writes msg into the pipe. msg is very long, so the write yields halfway through to let a reader (the child) empty the pipe.
  3. Back in the child, _pipeisclosed continues. It calls pageref(p), which returns 2 (the child has a reference associated with p[0], and the parent has a reference associated with p[1]). The counts match, so _pipeisclosed reports that the pipe is closed. Oops.

Run "make run-testpiperace2" to see this race in action. You should see "RACE: pipe appears closed" when the race occurs.

This race isn't that hard to fix. Comparing the counts can only be incorrect if another environment ran between when we looked up the first count and when we looked up the second count. In other words, we need to make sure that _pipeisclosed executes atomically. Since it doesn't change any variables, we can simply rerun it until it runs without being interrupted; the code is so short that it will usually not be interrupted.

But how can we tell whether our environment has been interrupted? In the uniprocessor JOS kernel, this can be simple: just check the env_runs variable in our environment structure. Each time the kernel runs an environment, it increments that environment's env_runs. Thus, user code can record env->env_runs, do its computation, and then look at env->env_runs again. If env_runs didn't change, then the environment was not interrupted. Conversely, if env_runs did change, then the environment was interrupted.
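
That recipe translates almost directly into code. A sketch of the repaired check (see the exercise below):

// Sketch: retry until both pageref calls complete without an
// intervening reschedule.
static int
_pipeisclosed(struct Fd *fd, struct Pipe *p)
{
        int runs, closed;

        for (;;) {
                runs = env->env_runs;
                closed = (pageref(fd) == pageref(p));
                if (runs == env->env_runs)
                        return closed;  // not interrupted: result is valid
                if (closed)
                        cprintf("pipe race avoided\n");
        }
}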

Exercise 14. Change _pipeisclosed to repeat the check until it completes without interruption. Print "pipe race avoided\n" when you notice an interrupt and the check would have returned 1 (erroneously indicating that the pipe was closed).

Run "make run-testpiperace2" to check whether the race still happens. If it's gone, you should not see "RACE: pipe appears closed", and you should see "race didn't happen". You should also see plenty of your "avoided" messages, indicating places where the race would have happened if you weren't being so careful. (The number of "avoided" messages depends on the ips value in your .bochsrc.)

Challenge! Write a test program that demonstrates one of the other races, such as a race between multiple readers of a single pipe.

Challenge! Fix all these races!

The shell itself

Run make run-initsh. This will run your kernel starting user/initsh, which sets up the console as file descriptors 0 and 1 (standard input and standard output), then spawns sh, the shell. Run ls and cat lorem.

Exercise 15. The shell can only run simple commands. It has no redirection or pipes. It is your job to add these. Flesh out user/sh.c.
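
For example, output redirection might take this shape inside the shell's token-processing switch (a sketch; 't' stands for the file-name token just parsed, and the exact structure depends on the provided user/sh.c skeleton):

// Sketch of output redirection in the shell's token loop.
case '>':       // redirect standard output to file t
        if ((fd = open(t, O_WRONLY | O_CREAT | O_TRUNC)) < 0) {
                cprintf("open %s for write: %e\n", t, fd);
                exit();
        }
        if (fd != 1) {
                dup(fd, 1);     // make fd 1 refer to the file
                close(fd);
        }
        break;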

Once your shell is working, you should be able to run the following commands:

echo hello world | cat
cat lorem >out
cat out
cat lorem |num
cat lorem |num |num |num |num |num
lsfd
cat script
sh <script

Note that the user library routine printf prints straight to the console, without using the file descriptor code. This is great for debugging but not great for piping into other programs. To print output to a particular file descriptor (for example, 1, standard output), use fprintf(1, "...", ...). See user/ls.c for examples.

Run make run-testshell to test your shell. Testshell simply feeds the above commands (also found in fs/testshell.sh) into the shell and then checks that the output matches fs/testshell.key.

Challenge! Add more features to the shell. Some possibilities include:
  • backgrounding commands (ls &)
  • multiple commands per line (ls; echo hi)
  • command grouping ((ls; echo hi) | cat > out)
  • environment variable expansion (echo $hello)
  • quoting (echo "a | b")
  • command-line history and/or editing
  • tab completion
  • directories, cd, and a PATH for command lookup
  • file creation
  • ctl-c to kill the running environment
but feel free to do something not on this list. Be creative.

Challenge! There is a bug in our disk file implementation related to multiple programs writing to the same file descriptor. Suppose they are properly sequenced to avoid simultaneous writes (for example, running "(ls; ls; ls; ls) >file" would be properly sequenced since there's only one writer at a time). Even then, this is likely to cause a page fault in one of the ls instances during a write. Identify the reason and fix this.

This completes the course!
