Rules
The exam is open book, open note, open computer. You may search the Internet and class material arbitrarily, with the following exception: you absolutely may not contact people other than course staff (e.g., by asking questions on Stack Overflow, an exam site, or your classmates). Use private Piazza questions to contact course staff if necessary.
Any violations of the spirit of this policy are breaches of academic honesty and will be treated accordingly.
You have five hours to complete this exam, starting from the time you access this page. The questions are not sorted by difficulty.
Completing your exam
Obtain an exam repository via this URL: https://classroom.github.com/a/z58luZrB

Write your answers in `final.md`, sign your name in `finalpolicy.md`, then commit and push. Configure the grading server with your repository.
1. Resumption state (10 points)
A. At the beginning of `proc::exception(regstate* regs)`, does `this->regs_ == regs`? Explain briefly.

No. `this->regs_` is resumption state, set only for processes that are about to sleep or for processes that are still being set up.
B. Which registers are always preserved across a call to `proc::yield`?

All callee-saved registers, including special-purpose registers like `%cr3`, as well as `%rbx`, `%r12`–`%r15`, `%rsp`, and `%rbp`.
C. Of those registers, which are preserved by the C++ compiler’s implementation of the calling convention and which are saved by Chickadee’s implementation of `proc::yield`?

All of them are preserved by `proc::yield`/`proc::resume`.
In the following questions, we want you to explain how to access the value of some logical registers. Possible answers might include “in CPU 0’s active register set”; “in `p->regs_->reg_rcx`”; “in `regs->reg_rdx`, where `regs` is the argument to `proc::exception`”; “the compiler will have saved this value in some stack frame, if needed”; and “nowhere, that register is meaningless in that context.” Explain briefly if unsure.
D. The `%rsp` for a kernel task that is currently blocked within a call to `proc::yield`.

The task’s `yields_`.
E. The `%rax` for a user thread that is currently running on CPU 1.

In CPU 1’s active register set.

F. The `%rax` for a kernel task that is currently running on CPU 1.

In CPU 1’s active register set.
G. The `%rdi` for a user thread that is currently blocked due to a timer interrupt.

In the task’s `regs_->reg_rdi`.

H. The `%rdi` for a kernel task that is currently blocked due to a timer interrupt.

In the task’s `regs_->reg_rdi`.
I. The `%rdi` for a kernel task that is currently blocked within a call to `proc::yield`.

Not saved. Also OK: saved on the stack, before the `proc::yield` stack frame, by the compiler.
J. The `%cr3` for a user thread that is blocked.

The associated task’s `pagetable_`. (The actual value in `%cr3` will be `ka2pa(p->pagetable_)`.)
2. Interrupt handling (10 points)
A commodity OS interrupt handler typically behaves this way:
- The interrupt handler begins execution on a CPU stack when the CPU delivers a hardware interrupt.
- The handler communicates with hardware.
- The handler acknowledges the interrupt and the OS returns to normal processing on some kernel task stack.
Typical operating systems ensure that Step 2 takes as little time as possible, because Step 2 takes priority over all other processing, including new interrupts. But this can make interrupt handlers hard to reason about; in particular, interrupt handlers cannot block.
Describe how you would modify an OS to allow interrupt handlers to block. Be explicit about the new types of state that you would introduce to support sleeping interrupt handlers; describe when the state is created, how it is initialized, and when it is destroyed. To make your explanation concrete, explain how your design would support an interrupt handler which is triggered when a network packet arrives; the handler should write the packet data to disk using a blocking IO, and wait for the disk write to complete before terminating. Your design should still acknowledge interrupts as soon as possible.
The simplest approach is for the OS to create a new kernel thread for every interrupt. The kernel thread would pull information from the CPU stack to determine which interrupt had arrived; the thread would then do whatever is necessary to service the interrupt (e.g., issuing a disk write), sleeping as necessary. Once all of the work is done, the thread would exit like any other thread, and have its resources deallocated by the kernel.
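A user-space analogue of this per-interrupt-thread design can be sketched with `std::thread` standing in for a kernel thread (the names `fake_disk` and `handle_packet_irq` are invented for illustration, not Chickadee APIs): the “handler” acknowledges immediately and hands the potentially blocking disk write to a freshly created service thread with its own stack.

```cpp
#include <string>
#include <thread>

// Stand-in for a blocking disk driver: a real one would sleep the
// calling thread until the write completes; here the "write" simply
// records the data synchronously.
struct fake_disk {
    std::string contents;
    void blocking_write(const std::string& data) {
        contents = data;   // a real driver would block here until the I/O finishes
    }
};

// The "interrupt handler": acknowledge the interrupt right away, then
// delegate all real work to a dedicated thread that is allowed to block.
std::thread handle_packet_irq(fake_disk& d, std::string packet) {
    // ...acknowledge the interrupt controller here, as fast as possible...
    return std::thread([&d, packet = std::move(packet)] {
        d.blocking_write(packet);   // safe to block: this thread can sleep
    });
}
```

Once the service thread finishes, it exits and is reaped like any other thread, which is exactly the lifecycle described above.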
3. System calls (10 points)
The return path from the `syscall` instruction is currently very different from the return path from an interrupt:
```asm
## the exception_entry return path:
restore_and_iret:
        // restore `regstate` registers
        popq %rax
        popq %rcx
        popq %rdx
        popq %rbx
        popq %rbp
        popq %rsi
        popq %rdi
        popq %r8
        popq %r9
        popq %r10
        popq %r11
        popq %r12
        popq %r13
        popq %r14
        popq %r15
        popq %fs
        // must `swapgs` if required
        testb $1, 12(%rsp)      // `reg_swapgs & 1`
        jnz 1f
        // returning to kernel mode
        addq $24, %rsp
        iretq
1:      // returning to user mode
        swapgs
        popq %gs
        addq $16, %rsp
        iretq
```

```asm
## the syscall_entry return path:
        addq $(8 * 19), %rsp
        swapgs
        iretq
```
A. Why does the `exception_entry` return path execute the `swapgs` instruction only sometimes? Explain briefly.

Because exceptions and interrupts can be received in the kernel, and `swapgs` should be executed only when crossing the user/kernel boundary.
B. What would go wrong, if anything, if the `syscall_entry` return path were also used for `exception_entry`? Explain briefly.

(1) Returning from an interrupt of any kind would not restore registers properly. (2) `swapgs` might be called the wrong number of times, causing the whole OS to fail (the current process wouldn’t be locatable). One good answer gets full credit.
C. What would go wrong, if anything, if the `exception_entry` return path were also used for `syscall_entry`? Explain briefly.

This will work fine—except that the system call return value will be replaced by the system-call-time `%rax`, so that system call return values will all be lost!
D. The current design has a serious flaw: it can lead to sensitive kernel information being exposed to a user process. Explain how this could occur; be specific.
The current design does not restore or clear callee-saved registers to safe values. This may expose sensitive kernel values that happen to be left in callee-saved registers at the time of system call return.
4. Futexes (15 points)
Recall that `futex` is a system call intended to be used by user-level locking libraries.
A. Suppose that a user-level threading library wants to support a lock abstraction. Why is it a bad idea to implement the lock abstraction using spin waiting?
This has poor utilization, because spin waiting uses CPU time uselessly.
B. Is the primary purpose of `futex` to improve correctness (e.g., preventing errors and race conditions, ensuring mutual exclusion) or efficiency (e.g., raising utilization)? Explain your answer.
Efficiency. Blocking improves utilization by not wasting CPU; and the futex design also improves efficiency in uncontended operation, by avoiding expensive system calls in that case.
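To make the uncontended fast path concrete, here is a sketch with `fake_futex` (an invented stand-in for the real system call) instrumented to count invocations: an uncontended lock/unlock pair resolves entirely with atomic instructions and never reaches the futex call.

```cpp
#include <atomic>

// Counts how many times the "system call" stand-in is reached.
static int futex_calls = 0;
static void fake_futex(std::atomic<int>*, int /*op*/, int /*val*/) {
    ++futex_calls;
}

// Simplified lock (states 0 = unlocked, 1 = locked); the point is only
// that the uncontended path is a single compare-and-swap, no syscall.
struct counting_lock {
    std::atomic<int> val_{0};
    void lock() {
        int expected = 0;
        // Fast path: one compare-and-swap, no system call.
        while (!val_.compare_exchange_strong(expected, 1)) {
            // Contended path sketch: the real lock's Phase 2 belongs here.
            fake_futex(&val_, /*FUTEX_WAIT*/ 0, 2);
            expected = 0;
        }
    }
    void unlock() {
        if (--val_ != 0) {   // atomic decrement
            val_ = 0;
            fake_futex(&val_, /*FUTEX_WAKE*/ 1, 1);  // only if waiters possible
        }
    }
};
```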
As Ulrich Drepper says, “futexes are tricky.” Here is a (we believe) correct futex-based implementation of a mutual-exclusion lock. In the next several questions, you will consider several modifications of this code.
```cpp
struct futex_lock {
    std::atomic<int> val_;
    // 0 = unlocked; 1 = locked, no futex waiters;
    // 2 = locked, maybe futex waiters

    void lock() {
        // Phase 1
        for (unsigned i = 0; i < 40; i++) {
            int expected = 0;
            if (val_.compare_exchange_weak(expected, 1)) {
                return;
            }
            sched_yield();
        }
        // Phase 2
        int previous = val_.load(std::memory_order_relaxed);
        if (previous != 2) {
            previous = val_.exchange(2);
        }
        while (previous != 0) {
            futex(&val_, FUTEX_WAIT, 2);
            previous = val_.exchange(2);
        }
    }

    void unlock() {
        if (--val_ != 0) {   // atomic decrement
            val_ = 0;
            futex(&val_, FUTEX_WAKE, 1);
        }
    }
};
```
Consider replacing the Phase 1 part of `lock` with the following code (leaving Phase 2 as is).

```cpp
int expected = 0;
if (val_.compare_exchange_weak(expected, 1)) {
    return;
}
```
C. Is this modified lock correct (does it provide mutual exclusion without starvation)? If not, explain why.
Yes, it is correct.
D. Describe a workload where this lock will likely perform essentially the same as the original. (Describe workloads in terms of their parameters, such as contention level.) Explain briefly.
In an uncontended or very-low-contention case (`val_ == 0` when `lock` is called), this will execute basically the same instructions as before: a single compare-and-swap.
E. Describe a workload where this lock will likely perform better than the original. Explain briefly.
In a high-contention case (`val_ > 0` most times `lock` is called, and the lock is not unlocked quickly), spinning wastes CPU; it is more efficient to block immediately.
F. Describe a workload where this lock will likely perform worse than the original. Explain briefly.
In a medium-contention case (the lock is sometimes locked when `lock` is called, but it is usually unlocked quickly), this version blocks (using a system call) rather than briefly spinning (which is cheaper).
Now consider deleting `lock`’s Phase 1 (going straight to Phase 2).
G. Is this modified lock correct? If not, explain why.
Yes, it is correct.
H. Explain how this lock’s performance will differ from the original (in cases when it works correctly).
This lock always blocks (unless the lock is unlocked), and the `unlock` function always makes a system call to wake the futex (even in a low-contention scenario). It will be less efficient.
Now consider replacing `lock`’s Phase 1 with the following code.

```cpp
for (unsigned i = 0; i < 40; i++) {
    if (val_.exchange(1) == 0) {
        return;
    }
    sched_yield();
}
```
I. Is this modified lock correct? If not, explain why.
No, it is not correct. This version can lose a wakeup, leaving a thread blocked forever.
```
val_ = 0
THREAD 1: successfully acquires the lock                       val_ = 1
THREAD 2: tries to acquire the lock, eventually blocks         val_ = 2
THREAD 3: tries to acquire the lock once, calls sched_yield()  val_ = 1 (was exchanged for 2)
THREAD 1: releases the lock; thread 1 thinks there are no waiters!
                                                               val_ = 0
THREAD 3: successfully acquires the lock                       val_ = 1
THREAD 3: releases the lock                                    val_ = 0
THREAD 2 blocks forever!
```
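The critical step—THREAD 3’s `exchange(1)` clobbering state 2—can be replayed in isolation (a minimal sketch; `lost_wakeup_demo` is an invented name):

```cpp
#include <atomic>
#include <utility>

// Replays the key transitions from the trace above: the spinner's
// exchange(1) sees 2 ("locked, maybe waiters") but leaves behind
// 1 ("locked, no waiters"), so the holder's unlock skips FUTEX_WAKE.
std::pair<int, int> lost_wakeup_demo() {
    std::atomic<int> val{2};      // THREAD 2 is blocked in FUTEX_WAIT
    int seen = val.exchange(1);   // THREAD 3's spin attempt: 2 -> 1
    int after = --val;            // THREAD 1's unlock: decrement yields 0,
                                  // so it takes the "no waiters" path
    return {seen, after};
}
```

Because `after` is 0, `unlock` never calls `futex(FUTEX_WAKE)`, and the blocked waiter is never woken.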
J. Explain how this lock’s performance will differ from the original (in cases when it works correctly).
In the cases that this lock works correctly, it’ll perform about the same as the original.
ALSO ACCEPTABLE: This lock is incorrect.
5. Barrelfish (12 points)
Section 4.6 of the Barrelfish paper describes the implementation of a lightweight URPC (user-level remote procedure call) system for inter-core communication among user-level processes. Describe how this would be implemented in Chickadee, following the Barrelfish design. Include at minimum:
A. The layout of the `struct urpcregion` structure, which defines the layout of a URPC shared-memory region.
```cpp
struct alignas(64) urpcmessage {
    char data_[60];
    int signal_;    // 0 means empty, 1 means present
};  // size: 1 cache line

struct urpcregion {
    static constexpr size_t nmsg = 4096 / sizeof(urpcmessage);  // == 64
    urpcmessage m_[nmsg];
};
```
More advanced design:

```cpp
struct alignas(64) urpcmessage {
    char data_[60];
    std::atomic<int> signal_;   // 0 means empty, 1 means present, -1 means waiting
};  // size: 1 cache line

struct urpcbuffer {
    static constexpr size_t nmsg = 4096 / sizeof(urpcmessage);  // == 64
    urpcmessage m_[nmsg];
};

struct urpcregion {
    urpcbuffer* buf_;
    size_t wpos_;
    size_t rpos_;
};
```
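The size claims in the comments can be checked at compile time (a sketch assuming the 60-byte payload and 4-byte signal word of the design above, on a platform where `std::atomic<int>` is 4 bytes):

```cpp
#include <atomic>
#include <cstddef>

struct alignas(64) urpcmessage {
    char data_[60];
    std::atomic<int> signal_;   // 0 empty, 1 present, -1 waiting
};
static_assert(sizeof(urpcmessage) == 64, "one cache line per message");

struct urpcbuffer {
    static constexpr std::size_t nmsg = 4096 / sizeof(urpcmessage);
    urpcmessage m_[nmsg];
};
static_assert(urpcbuffer::nmsg == 64, "64 messages per page");
static_assert(sizeof(urpcbuffer) == 4096, "one page per buffer");
```

The cache-line alignment matters for URPC performance: each message occupies exactly one line, so sender and receiver never false-share adjacent slots.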
B. Pseudocode for the `char* receive(urpcregion* r)` function, which waits until a new message is sent, then returns a pointer to the location of that message’s data in the shared-memory region. (You’ll obviously also need a design for `send`, but you don’t need to write it if you don’t want to.)

This code waits for a message to be received in slot `rpos`. We assume `rpos` is a local variable, not shared among processes. The more advanced design above stores `rpos` in non-shared memory.

Polling-only implementation:
```cpp
char* receive(urpcregion* r) {
    int mypos = rpos;
    while (r->m_[mypos].signal_ == 0) {
        pause();
    }
    rpos = (rpos + 1) % r->nmsg;
    return r->m_[mypos].data_;
}
```
A full-credit implementation would block after a while.

```cpp
char* receive(urpcregion* r) {
    int mypos = rpos;
    for (int tries = 0; tries != 40; ++tries) {
        if (r->m_[mypos].signal_ == 1) {
            goto found;
        }
        sched_yield();
    }
    if (r->m_[mypos].signal_.exchange(-1) == 1) {
        goto found;
    }
    futex(&r->m_[mypos].signal_, FUTEX_WAIT, -1);
found:
    rpos = (rpos + 1) % r->nmsg;
    return r->m_[mypos].data_;
}
```
This implementation would require some atomics on the send side too.
C. New system calls required (you will need at least one!).
We need a system call to map the shared memory region in two (or more) processes! We also need a system call for futex, or more generally for blocking (“blocking and sending a request to its local monitor to be notified when messages arrive”).
D. Anything else you think is relevant. Make sure you consider the initialization process as well as normal operation.

N/A
6. NFS (13 points)
Suppose that you want to layer a user-mode journaling file system atop the NFS file `/nfs/diskfile`. The user-mode journaling file system, called ujfs, will preallocate 10 GB of space for `/nfs/diskfile`, and then treat `/nfs/diskfile` as a 10 GB disk; for example, the first part of `/nfs/diskfile` will contain a superblock, and the latter part of `/nfs/diskfile` will contain data blocks and a journal. When ujfs wants to update a ujfs-level file, ujfs will first write to the journal portion of `/nfs/diskfile`, and then write to the data block portion of `/nfs/diskfile`.
Assume that (1) the NFS server and the NFS client may crash, but (2) the network connection between the NFS client and the NFS server is always available and has good performance. Describe any additional changes that would be required—to the network, to NFS, and/or to the ujfs code—for ujfs to provide equivalent consistency guarantees as a traditional journaling file system like ext4.
The basic problem is that NFS does not specify a write-back order for dirty blocks. So, assume that the ujfs client opens `/nfs/diskfile` once and does not close it until the client powers off. The ujfs client will carefully order writes to the journal part of `/nfs/diskfile` and the data part of `/nfs/diskfile`; the careful ordering is necessary to provide crash consistency. However, since NFS does not specify a write-back order for dirty blocks, the NFS server may receive writes in a different order than the ujfs client generated them!

The simplest fix is for ujfs to explicitly invoke `fsync(int fd)` to force the NFS client to flush writes to the NFS server in the order needed to ensure post-crash consistency.
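The fix can be sketched as follows, with an ordinary local file standing in for `/nfs/diskfile` and invented offsets (journal at 0, data at 4096); the point is the `fsync` between the journal write and the data write, which forces the journal record to the server before any in-place update can reach it.

```cpp
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

// Write-ahead update with explicit ordering: journal record first,
// fsync to make it durable, then the in-place data write, then fsync
// again. Path and offsets are illustrative, not ujfs's real layout.
bool journaled_update(const char* path, const char* record,
                      off_t journal_off, off_t data_off) {
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) {
        return false;
    }
    size_t len = strlen(record);
    bool ok = pwrite(fd, record, len, journal_off) == (ssize_t) len
        && fsync(fd) == 0                       // journal reaches the server first
        && pwrite(fd, record, len, data_off) == (ssize_t) len
        && fsync(fd) == 0;                      // then the in-place data
    close(fd);
    return ok;
}
```

Without the intermediate `fsync`, the NFS client could write back the data blocks before the journal blocks, exactly the reordering the answer above warns about.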
7. Virtualization (10 points)
Suppose that a hypervisor uses hosted interpretation to execute a guest application. Further suppose that the guest application contains a JIT compiler, such that the JIT compiler (1) allocates a new memory page, (2) marks that page as writable and executable, (3) writes binary code to the page, and then (4) jumps to the first instruction in the page. Can the hypervisor detect the jump and interpose on the dynamically-generated instructions? Why or why not?
Yes, the hypervisor can detect the jump and interpose on the dynamically-generated instructions. Activities (1) through (4) all involve emulated hardware state that is completely controlled by the hypervisor. For example, step (2) is performed by a VM-level invocation of a system call like mprotect(); the hypervisor sees all VM attempts to invoke the syscall instruction, and can emulate the requested changes to page protections. When (4) occurs, the jump instruction itself will be interpreted, and will result in the hypervisor setting the emulated PC to the PC of the jump target. The hypervisor will then start decoding instructions and emulating them one by one.
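The answer can be illustrated with a toy interpreter (opcodes and the `toy_vm` name are invented for this sketch): the guest “JIT” writes an instruction into emulated memory, jumps to it, and the interpreter decodes the freshly written instruction exactly like statically present code, interposing on every instruction it executes.

```cpp
#include <vector>

// Toy hosted interpreter. Invented opcodes: 0 = halt;
// 1 a addr = store immediate `a` at `addr` (the "JIT" write);
// 2 addr   = jump to `addr` (possibly into freshly written "code").
struct toy_vm {
    std::vector<int> mem;   // all guest state, including its "code"
    int pc = 0;
    int interposed = 0;     // every decoded instruction is visible to us

    void run() {
        for (;;) {
            ++interposed;                     // the hypervisor sees each one
            int op = mem[pc];
            if (op == 0) {                    // halt
                return;
            } else if (op == 1) {             // mem[arg2] = arg1
                mem[mem[pc + 2]] = mem[pc + 1];
                pc += 3;
            } else if (op == 2) {             // jump
                pc = mem[pc + 1];
            }
        }
    }
};
```

A program that stores a halt opcode at address 10 and then jumps there runs three interposed instructions (store, jump, halt)—the dynamically generated instruction is interpreted like any other.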
8. Affinity (15 points)
Multiprocessor operating systems support notions of processor affinity and
CPU migration, in which tasks (including process threads and kernel tasks)
switch from processor to processor as they run. This can be important for load
balancing—maybe all the threads on processor 0 happen to exit at about the
same time, leaving processor 0 idle and the other processors oversubscribed—so
a good kernel scheduler will proactively migrate tasks to mitigate this
imbalance. There are also system calls that manage CPU placement directly,
such as `sched_setaffinity`.
A. Design a system-call-initiated CPU migration for Chickadee. Specifically, describe how you might implement a `sys_sched_setcpu(pid_t p, int cpu)` system call, which should work as follows:

- If `cpu < 0 || cpu >= ncpu`, then return an error.
- Otherwise, if `p != 0 && p != current()->id_`, then return an error (you’ll fix this in the next exercise).
- Otherwise, the system call returns 0 and the calling thread (process) next executes on CPU `cpu`. That is, when the system call returns 0, the unprivileged process code is executing on CPU `cpu`.

If you need one or more `struct proc` members, describe them. Think about Chickadee synchronization invariants—your design should obey them!—and describe any new invariants you might require.
B. Extend your `sys_sched_setcpu` design so processes can change other processes’ CPU placements (that is, support `p != 0 && p != current()->id_`).
This is a tough one. Full credit for a solution that works for both A and B.

One good solution involves a `proc::preferred_cpu_` member which the system call sets, and which `cpustate::schedule()` checks. If `preferred_cpu_` differs from the current CPU’s `index_`, then `cpustate::schedule()` will move the process to its desired home by locking both run queues (avoiding deadlock) and calling `cpustate::enqueue()` on the new CPU.

Many students may already have a `cpu_` member, but it’s risky to just modify that member without involving `cpustate::schedule()`. This is because several functions (such as `proc::wake()`) may schedule the `proc` using code like this:

```cpp
void proc::wake() {
    spinlock_guard guard(cpus[cpu_].runq_lock_);
    if (state_ == blocked) {
        state_ = runnable;
        cpus[cpu_].enqueue(this);
    }
}
```
A Chickadee invariant says that a process is scheduled or running on at most one CPU runqueue at a time. What would happen if, just as a process’s home CPU was changed, another process woke it up and scheduled it on its old CPU’s runqueue? Worse, the `proc::cpu_` member can change now. What protects it? Do we need another layer of locking?

Consider an OS with these features:

- Each `proc` has a `proc::cpu_` member indicating its current CPU. If it is running (in user or kernel mode), it is running on that CPU. `proc::state_` may be modified only by the corresponding kernel task, except that `blocked` processes may be set to `runnable` (and scheduled) by `proc::wake()`. `proc::wake()` synchronizes using `cpus[cpu_].runq_lock_`.
- The only function used to schedule an already-existing process is `proc::wake()`. (No other function calls `cpustate::schedule()`, except for functions like `fork` that create new processes.)

This OS could use the following `sys_sched_setcpu` implementation:

```cpp
case SYSCALL_SCHED_SETCPU: {
    uintptr_t pid = regs->reg_rdi;
    uintptr_t cpu = regs->reg_rsi;
    if (cpu >= uintptr_t(ncpu) || pid >= uintptr_t(NPROC)) {
        return E_INVAL;
    }
    {
        spinlock_guard guard1(ptable_lock);
        proc* p = pid ? ptable[pid] : this;
        if (!p || (p->state_ != blocked && p->state_ != runnable)) {
            return E_SRCH;
        }
        p->preferred_cpu_ = cpu;
    }
    // ensure we return on the new CPU (only necessary if `p == this`)
    regs_ = regs;
    regs_->reg_rax = 0;   // return value
    yield_noreturn();
}
```
For tasks changing their own CPU, it’s important not to call `return 0;` directly, because that would inappropriately execute user code on the old CPU.

This would be coupled with code like this in `cpustate::schedule()`:

```cpp
// try to run `current`
if (current_ && current_->state_ == proc::runnable
    && current_ != yielding_from) {
    // NEW: place on correct CPU
    int prefcpu = current_->preferred_cpu_;
    if (prefcpu != index_) {
        spinlock_guard guard1(cpus[min(prefcpu, index_)].runq_lock_);
        spinlock_guard guard2(cpus[max(prefcpu, index_)].runq_lock_);
        current_->cpu_ = prefcpu;
        cpus[prefcpu].enqueue(current_);
        // switch to a safe page table
        current_ = nullptr;
        wrcr3(ktext2pa(early_pagetable));
    } else {
        set_pagetable(current_->pagetable_);
        current_->resume();
    }
}
```
And to avoid undefined behavior, `cpu_` and `preferred_cpu_` would be atomics (`std::atomic<int>`). A new invariant would be introduced saying that `proc::cpu_` may only be modified by the `cpustate::schedule()` on the `proc`’s old `cpu_`.

Part B requires that the transfer process involve both run queue locks, rather than just the new CPU’s run queue lock. This is to avoid `proc::wake()` and the `cpustate::schedule()` migration code enqueuing the task on different CPUs simultaneously.
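The min/max lock-ordering trick in the migration code can be demonstrated in user space (a sketch with `std::mutex` standing in for run-queue spinlocks; all names are invented): because every thread acquires the two locks in one global order, lower index first, opposite-direction “migrations” cannot deadlock.

```cpp
#include <algorithm>
#include <mutex>
#include <thread>

std::mutex runq_lock[2];   // stand-ins for two CPUs' runq_lock_
int shared_counter = 0;    // modified only while holding both locks

// Acquire both locks in a single global order: lower index first.
void lock_both(int a, int b) {
    runq_lock[std::min(a, b)].lock();
    runq_lock[std::max(a, b)].lock();
}
void unlock_both(int a, int b) {
    runq_lock[std::max(a, b)].unlock();
    runq_lock[std::min(a, b)].unlock();
}

// Two threads "migrate" in opposite directions. If lock_both naively
// locked `a` before `b`, this interleaving could deadlock; with the
// ordering above, it always completes.
int stress() {
    std::thread t1([] {
        for (int i = 0; i < 1000; ++i) {
            lock_both(0, 1); ++shared_counter; unlock_both(0, 1);
        }
    });
    std::thread t2([] {
        for (int i = 0; i < 1000; ++i) {
            lock_both(1, 0); ++shared_counter; unlock_both(1, 0);
        }
    });
    t1.join();
    t2.join();
    return shared_counter;
}
```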
proc::state_
value could indicate migrating tasks.It is tempting to simply schedule the
proc
on the relevant run queue withinproc::syscall()
, without delegating any work tocpustate::schedule
. For example,sys_sched_setcpu
might (1) acquire both CPUs’ run queue locks, (2) change thecpu_
member, (3) schedule theproc
on the new CPU’s run queue, and (4) yield. But this doesn’t work for part A, becausecpustate::schedule()
will try to re-enqueue theproc
on its own run queue (after all, theproc
is still runnable); and it doesn’t work for part B, because the relevant task might already be running on the wrong CPU!
9. Hack (5 points)
A. Describe one feature of the x86-64 architecture you’ve encountered in Chickadee that you think is a hack, and explain why.
B. Describe one feature of your Chickadee that you think is a hack, and explain why.
Any answer (other than leaving it blank) should get full credit.
Finally
Did you remember to sign your name in `finalpolicy.md`?
Thank you for taking the course!