Problem set 5 – CS 1610

A. Threads
B. Journaling (Optional)
C. Project

Turnin

Fill out psets/pset5answers.md and psets/pset5collab.md and push to GitHub. Then submit on the grading server.

You must complete all parts by Wednesday, May 7 at 11:59pm. There will be no further extensions beyond May 7.

A. Threads

Implement multithreaded processes: processes that can have multiple independent threads of control, but that share the same virtual memory space and file descriptor table. Use kernel threads, the design where every user-visible “thread” corresponds to a struct proc.

You’ll add three system calls:

The new sys_clone system call starts a new thread in the current process. See below for more on sys_clone.
The new sys_gettid system call returns the current thread’s thread ID. Different threads in a process have different thread IDs (and therefore get different return values from sys_gettid), but the same process ID (sys_getpid). Note that the process ID returned by sys_getpid need not equal any of the sys_gettids for that process’s threads!
The new sys_texit system call exits the current thread: the current process continues, just with one fewer thread. If all threads in a process call sys_texit, then the process as a whole exits with status 0.

You’ll also update some existing system calls:

sys_exit exits all threads in the current process.
sys_waitpid waits for processes, not threads (in other words, sys_waitpid(p) waits for all the threads of p to exit).
sys_fork clones only the calling thread.

You may assume that sys_execv is called only by single-threaded processes.

Run make run-testthread to check your work.

Process IDs and thread IDs

Kernel threading introduces a distinction between process and thread IDs: every user task has one of each, and they may not be equal. fork must allocate one of each.

We recommend storing the thread ID in proc::id_. This may feel backwards but it’s just easier to keep ptable indexed by id_. Add a new member, such as proc::pid_, to store the true process ID. You might add a table, such as proc* pidtable[NPROC], to store processes by true PID, but it’s also possible to rely entirely on ptable.

(For what it’s worth, Linux’s task_struct::pid member is a thread ID, while the true process ID is called a “thread group ID” and stored in task_struct::tgid. Its getpid system call returns tgid and its gettid returns pid.)

`sys_clone` interface

Your goal is for sys_clone(function, arg, stack_top) to set up a new thread running function(arg) on the given stack, where when that function returns, the thread exits via sys_texit. This means the initial thread stack frame should contain a return address pointing to sys_texit. Multithreading packages typically divide responsibility for this between the kernel’s system call and user-level code in the system call wrapper; most typically the stack setup is left to wrapper code.

Some examples from real operating systems:

In Linux, the clone system call takes several arguments, but none of those arguments correspond to function or arg, and the stack_top argument doesn’t actually change the active stack pointer (it is recorded in the kernel for later use). Linux clone’s return value resembles that of fork: it is 0 in the new thread and the new thread’s ID in the original thread. User-level wrapper code examines the return value and modifies the stack as appropriate. In a Linux-like Chickadee design, the sys_clone system call would take no arguments.

References: clone manual page; uClibc’s clone implementation (documented); musl libc’s clone implementation (undocumented, but more compact).
In FreeBSD, the thr_new system call does take arguments resembling function, arg, and stack_top. However, when used in a threading library, the wrapper code still adds a layer of indirection so that the sys_texit equivalent is called when the thread function returns.

In-depth suggestions

The key hurdles with implementing the sys_clone wrapper function are saving the arguments until they can be used—which is generally after the clone system call returns—and wrangling assembly.

One way forward is to write the sys_clone wrapper function entirely in assembly. Create a new u-clone.S file and add u-clone.uo to PROCESS_LIB_OBJS. You’ll want to use the C++ mangled name for sys_clone, which is likely _Z9sys_clonePFiPvES_Pc (“function named sys_clone, taking arguments with types int (*)(void*), void*, and char*”). Your assembly file might look like this:
#include "obj/u-asm.h"

.text
.globl _Z9sys_clonePFiPvES_Pc
_Z9sys_clonePFiPvES_Pc:
    # ... your code here ...
Another way is to use GCC inline assembly (reference, useful overview), the GCC extension that lets you integrate assembly with C++ code. This is how the existing wrappers are written. Some tips on wrangling inline assembly:

Use syntax like callq *%%rax to call a function whose address is stored in %rax. (In inline assembly, as in printf, the double % means a literal percent character.)

An input register constraint like "i" (SYSCALL_TEXIT) may be required to pass a numeric constant as a parameter, since the normal constraints like "r" do not always work for constants.

Either method will need to store arguments across the syscall instruction. It’s possible to store the arguments in places including the calling thread’s stack and the system’s callee-saved registers (%rbx and %r12-%r15). If you choose registers and inline assembly, you may need this style of declaration to force the compiler to use specific registers:
register uintptr_t x asm("r12") = reinterpret_cast<uintptr_t>(arg);
    // i.e., C++ variable `x` is stored in register `%r12`

Synchronization

In psets/pset5answers.md, explain your synchronization plan for the process state components that can now be shared by multiple tasks. Pay special attention to pagetable_ and the file descriptor table.

Exiting

Process exit for multithreaded processes introduces complications. The sys_exit system call should exit all of the process’s threads, not just the calling thread. But the process’s other threads might be blocked in the kernel, or running on other CPUs! This is different from prior psets, where only a sys_exit system call could cause task deletion.

Exiting a multithreaded process is complicated because:

A task cannot be deleted while its kernel task stack is in use (for reasons explained in problem set 2 and the synchronization invariants).
A task cannot be deleted while its user context is running.
A kernel task cannot be deleted while it’s blocked. For instance, you can’t delete a kernel task that is blocked waiting for a timer (in kernel.cc) or a disk event (in k-chkfs.cc). This is because a blocked kernel task might implicitly hold resources that should be released before exit. (For instance, a kernel task might block while holding a reference to a bcentry; deleting the task without unblocking it would leak the reference.)

Putting this together, the only safe time to delete a task is when the corresponding kernel task is no longer running and its kernel task stack is effectively empty. Your code in pset 2 obeyed this constraint—assuming you got it right—but that’s mostly because process exit was simple in pset 2. Now, with multithreading, exit isn’t as simple. (Other features, such as kill, complicate exiting too.)

So how can the kernel detect whether it’s safe to exit a task? One idea might be to exit a thread immediately before resuming its user context. At that time, (1) the kernel task stack is not in use, (2) the user context is not running, and (3) the kernel task is not blocked. You can decide how to implement this—while obeying the invariants, of course. But there’s already some code that validates process state before resuming user context, in two places in k-exception.S. Maybe you can add some new state, and some code in k-exception.S that checks that state! (See synchronization invariant PSI5.)

Our solutions also modify wait_queue::block_until to check for exiting and interrupted tasks.

A separate issue with exiting is that when a child process exits, it might need to wake up multiple threads in the parent process (because any of the parent process’s threads can wait for a child process). But this issue is simpler to fix, thank goodness.

The later parts of p-testthread.cc check interactions between sys_exit and sys_texit.

B. Journaling (Optional)

Add journal support to your file system. This not only shows you can manage a complex on-disk structure, it also gives you practice in orchestrating a multi-part, complex parallel system to achieve a goal.

Your VFS should ensure that groups of disk writes that should be atomic are committed with a single journal transaction. For example, a free-block-bitmap write that allocates a block should commit in the same transaction as an inode write that references the block.

You may implement any journaling mode. Writeback mode, which does not write data blocks to the journal, is a fine first choice, though it has worse recovery properties than the other modes.

You may assume that the journal is empty at boot, and journal replay support is not required. (Our tests will use obj/chickdeefsck -s IMAGE, which implements offline replay.) But you may implement journal replay if you like. We recommend using the chkfs::journalreplayer base class, which implements the (complicated!) replay analysis procedure.

It can be fun to design your own journal subsystem. The critical questions are:

Transaction allocation: How should kernel code mark the start and end of a journal transaction?
Write matching: How can the buffer cache tell whether a write is part of a journal transaction, and if so, which transaction?
Ordering constraints: How can the buffer cache ensure it delays writeback for journaled blocks until the relevant transactions commit?

Give some thought to these issues. What is the simplest solution you can think of that supports allocation, matching, and ordering? But after some thought, most people should look at the in-depth suggestions.

In-depth design suggestions

One natural implementation divides responsibility between application tasks, which add data to the journal, and a background journal task, which commits and completes outstanding transactions. These tasks synchronize using a journal state structure.

Following these instructions will produce a working journal, but one that’s relatively slow: writes to transaction \(\textit{tid}+1\) are delayed until transaction \(\textit{tid}\) completes. Optimize if you want, but not until you are confident you have a working version.

Journal state

Your journal state structure—you might call it chkjstate, or whatever else—will maintain information about currently-outstanding transactions. There is exactly one chkjstate (like chkfsstate, it’s a singleton).

The journal state will require at least:

A spinlock for synchronization.

The current journal metablock sequence number (seq). Initially arbitrary, this increases by one with every metablock written.

The current transaction ID (tid). Initially arbitrary, this increases by one with every new transaction.

A reference count for the current tid. This will equal the number of application threads that are currently adding data to the transaction. The transaction cannot be committed until its reference count reaches 0.

And you might need:

A boolean indicating whether the current tid is empty.

The commit boundary (the earliest tid that has not committed). Initially equals the current tid (i.e., no tids have committed).

The completion boundary (the earliest tid that has not completed). Initially equals the current tid (i.e., no tids have completed).

A wait queue for efficient blocking.

Transaction references

Add functions that obtain and release references to an active journal transaction, and modify your VFS to call these functions in the appropriate places.

First, modify the proc structure to include information about whether a journal transaction is active for this task, and if so, which tid is active. Initially, procs have no active transaction.

Then write the reference functions. The “get” function (perhaps chkfsstate::get_tid) obtains a transaction reference. This should increment the current tid’s reference count (which is stored in the journal state), and record the current tid as the active tid for current(), the current proc. The “put” function releases the transaction reference. This should decrement the reference count, clear the active tid, and wake up the wait queue.

Journal task

Add a journal task. This is a kernel task whose role is committing and completing journal transactions.

Eventually, the journal task will work like this:

Wait until there is a non-empty transaction that has no active references. Call this transaction the committing transaction.

Advance the current tid and mark it as empty. The committing transaction is now closed: future writes will go to different transactions.

Examine the buffer cache’s dirty list to find writes that are part of the committing transaction.

Construct a journal metablock (chkfs::jmetablock) describing those writes. Mark the metablock as committed.

Write the metablock and data blocks to the journal area.

Wait for the journal writes to complete, then update the journal state’s commit boundary.

Perform writeback for the committing transaction’s data blocks, updating the file system’s primary data area. With some care, you can use bufcache::sync for this!

Write a journal metablock to the journal area (the first journal-area block is fine) marking the transaction as having completed.

Update the journal state’s completion boundary.

Repeat.

It might be painful to get all this working at once. So start out simpler!

Wait until there is a non-empty transaction (as above).

Advance the current tid (as above).

(skip)

(skip)

(skip)

(skip)

Perform writeback for the committing transaction’s data blocks, for instance using a modified bufcache::sync.

(skip)

Update the journal state’s commit and completion boundaries.

Repeat.

This loop essentially calls bufcache::sync one transaction at a time. You can use it (and log_printfs, etc.) to debug many aspects of journal orchestration before getting to the actual journal format.

Write matching

Next, mark bufentries with their corresponding transactions as appropriate. To do this, you will add a tid to each bufentry, plus a flag indicating whether the tid is valid.

Journaled blocks must remain consistent. This requires something like the writeback logic from problem set 4. If bufcache::get_write(bufentry* e) is called and e was dirtied as part of a transaction, then get_write(e) must block until e’s transaction completes. There is an exception: if the calling process (current()) has the same tid as e, then obtaining a write reference is OK.

The writeback process (your bufcache::sync variant) should clear bufentry tids as appropriate. However, bufcache::sync must not write back bufentries that are part of uncommitted transactions.

VFS

Now add calls to your transaction-reference functions to your VFS. For example, use transactions to ensure block allocations and block-pointer updates happen atomically.

If you have done the prior stages properly, you should be able to run make cleanfs run-testwritefs2 fsck as before, even though most writeback is happening through journal orchestration.

Journal format

A lot of the hard part is over: now you just need to write the journal itself. Update your journal task to write to the journal area, as described above.

Test your work by crashing the journal task at some point after commit (for instance, by introducing an infinite loop) and running obj/chickadeefsck -V chickadeefs.img. This should print out information about what journal records were replayed, and report any errors.

We recommend using all the journal cheats.

Niceties

An industrial-strength journal subsystem would implement additional features. Yours isn’t industrial strength, but if you have time:

The handout file system has a 64-block journal area, making it impossible to commit any transaction with more than 63 data blocks. You could modify the reference functions (e.g., chkfsstate::get_tid) to automatically start a new transaction if the current transaction is getting too big for comfort.

Journaling overhead is relatively lower when many writes are in the same transaction. You could modify the journal task to hold small transactions open until some timeout period had passed (e.g., using ticks).

C. Project

For this part, you should design and implement a project of your choice which adds a new feature to your operating system!

Describe your project in a file psets/pset5project.md that you add to your repository. Your writeup should answer these questions:

What was your goal?
What’s your design?
What code did you write (what files and functions)?
What challenges did you encounter?
How can we test your work?

The writeup need not be very long; 300 words can do it if you use the words well.

Before you start working on your project in earnest, make a private Ed post which contains answers to the first and second questions above. A member of the course staff will then respond to that post and provide feedback about whether your idea is in scope and isn't too hard or too easy.

Once you finish your greenlit project, you should update psets/pset5project.md to describe what you built!

Problem set 5: Threads, journaling, and project

Turnin

A. Threads

Process IDs and thread IDs

sys_clone interface

In-depth suggestions

Synchronization

Exiting

B. Journaling (Optional)

In-depth design suggestions

Journal state

Transaction references

Journal task

Write matching

VFS

Journal format

Niceties

C. Project

`sys_clone` interface