Problem/project
The fourth problem set is open-ended: Build a distributed system—most likely using or extending your Paxos—that can survive a failure.
Your work is due at the end of reading period, May 6. However, as long as you check in with me in person on or before that day and demonstrate good progress on a project we agree is worthwhile, you can have additional time (until May 13).
Requirements
- Many of you report spending many dozens of hours on Pancydb. You learned a lot! Treat this problem set in a more relaxed way. That said, I want more work than pset 2, and of course if you get excited I’d love to see an epic project. Calibrate on 15 hours of work.
- Testing is mandatory. Testing is an important class topic; distributed systems are notoriously full of edge cases, the Cotamer library is designed to facilitate testing, we’ve spent lots of time on failure models, and one vision of AI-assisted coding involves humans building test harnesses that specify desired software behavior. So I want to see you planning for testing and implementing automated tests of your functionality.
- Failure handling is mandatory. Your tests must involve failure of at least one system component. Projects that are less ambitious in terms of feature development should test for more interesting kinds of failure.
- Your turnin will comprise your code, a lab notebook describing the progress of your work, and a polished Markdown or PDF writeup analogous to a short academic paper (4–6 pages).
- As always in this class, you may use AI assistants, but your code and writing should be edited by you. I won’t tolerate hallucinated references or prose that seems totally generic and unconnected to your work.
- Students may work in small groups on sufficiently ambitious projects (see below), but each student will turn in their own writeup. Group projects must be approved by me in advance. Max group size is 3.
Example project directions
Some of these projects can be executed entirely in simulation, whereas others involve real-world measurement and deployment. Real-world deployment is inherently time consuming, so plan ahead for that. I will expect simulation-only projects to be tested stringently (e.g., many seeds, interesting failure models); deployed projects can be tested in simpler ways (e.g., kill a process).
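As one illustration of what "many seeds" could look like, here is a minimal Python sketch of a seed sweep with failure injection. It is a toy model, not Cotamer's API: `run_scenario`, `sweep`, and the crash probability are hypothetical names and parameters chosen for this example. The idea is that all randomness derives from the seed, so any failing seed can be replayed exactly.

```python
import random

def run_scenario(seed, n_replicas=3, n_writes=20, crash_prob=0.1):
    """Toy simulation (hypothetical): a write commits only with a live majority."""
    rng = random.Random(seed)            # all randomness derives from the seed
    alive = [True] * n_replicas
    logs = [[] for _ in range(n_replicas)]
    committed = []
    for value in range(n_writes):
        # Failure injection: possibly crash one replica before this write.
        if rng.random() < crash_prob:
            alive[rng.randrange(n_replicas)] = False
        live = [i for i in range(n_replicas) if alive[i]]
        if len(live) * 2 <= n_replicas:
            break                        # no majority left: stop committing
        for i in live:
            logs[i].append(value)
        committed.append(value)
    # Invariant: every committed value is stored on a majority of replicas.
    for value in committed:
        holders = sum(value in log for log in logs)
        assert holders * 2 > n_replicas, f"seed {seed}: value {value} under-replicated"
    return len(committed)

def sweep(seeds):
    """Run the scenario under each seed and collect (seed, message) for failures."""
    failures = []
    for seed in seeds:
        try:
            run_scenario(seed)
        except AssertionError as err:
            failures.append((seed, str(err)))
    return failures

if __name__ == "__main__":
    print("failing seeds:", sweep(range(1000)))
```

A real project would replace the toy loop with its own simulated cluster and richer failure models (message loss, partitions, restarts), but the shape of the harness, a deterministic scenario plus an invariant check swept over seeds, could stay the same.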
Each project is tagged with an ambition score from 1 to 5. A score of 3 corresponds roughly to the 15-hour calibration above. 1 and 2 are lighter-weight; 4 and 5 are more ambitious than the baseline and are good candidates for group projects or for students who want to go deeper.
- (Ambition 2/5) Connect your Pancydb to an RPC system and measure and optimize its real-world performance on CloudLab, the Harvard cluster, or CS 2620’s own mini-cloud (three widely distributed computers). You can push up the ambition score by optimizing more thoroughly, for example with leader leases or read quorum optimizations.
- (Ambition 3/5) Connect your Pancydb to a suite of Web protocols: HTTP requests, EventSource for streaming events, WebSockets for two-way communication. Some of these may require work on the underlying Cotamer library.
- Implement one or more Spanner features. For instance:
  - (Ambition 3/5) Client request log. Spanner client requests have unique, client-provided IDs. Spanner servers log the results of each request and check the request log to deduplicate requests, allowing Spanner to provide exactly-once execution. Implement these features for your Paxos and use them to simplify the lockseq client model.
  - (Ambition 2/5) Multi-version concurrency control, allowing your system to GET values at selected past timestamps. Develop a framework showing that reading in the past allows the system to commit more operations per second.
  - (Ambition 2/5) Multi-key operations, such as MultiOp or full transactions.
  - (Ambition 4/5) Cross-group transactions, where different partitions are handled by different Paxos “cells”, and transactions that involve data in multiple partitions are handled via two-phase commit.
  Note that these features do not require TrueTime! TrueTime is an optimization strategy. You can, of course, implement multi-version CC, multi-key operations, and cross-group transactions using a TrueTime concept, but you don’t need to.
  A simulation-world implementation of Spanner features would suffice for this problem set, given solid testing.
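The client request log idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Spanner's or Pancydb's actual design: the class `DedupStateMachine`, its `apply` signature, and the `put`/`append`/`get` operations are hypothetical. The state machine records the result of each `(client_id, request_id)` pair and returns the logged result when the same request is applied again.

```python
class DedupStateMachine:
    """Hypothetical key-value state machine with a client request log."""

    def __init__(self):
        self.store = {}          # key -> value
        self.request_log = {}    # (client_id, request_id) -> logged result

    def apply(self, client_id, request_id, op, key, value=None):
        rid = (client_id, request_id)
        # A retried request is answered from the log and not re-executed.
        if rid in self.request_log:
            return self.request_log[rid]
        if op == "put":
            self.store[key] = value
            result = "ok"
        elif op == "append":
            self.store[key] = self.store.get(key, "") + value
            result = self.store[key]
        elif op == "get":
            result = self.store.get(key)
        else:
            raise ValueError(f"unknown operation: {op}")
        self.request_log[rid] = result
        return result


sm = DedupStateMachine()
first = sm.apply("c1", 1, "append", "k", "a")
retry = sm.apply("c1", 1, "append", "k", "a")   # e.g. client timed out and resent
print(first, retry, sm.store["k"])
```

In a replicated setting the request log would presumably need to be part of the replicated state, updated as each chosen command is applied, so that every replica deduplicates identically. A real implementation would also need some policy for trimming old log entries, which this sketch omits.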
- (Ambition 3/5) Implement a Spanner-like TrueTime in simulation. Add tests to show that your TrueTime correctly handles (simulated) frequency deviations.
- (Ambition 4/5) Implement EPaxos or another Paxos variant. (EPaxos itself sits at the upper end of this range; a simpler variant such as Flexible Paxos is closer to 3/5.)
- (Ambition 5/5) Implement a Byzantine fault-tolerant Pancydb: Extend your Paxos to handle a BFT protocol, and show that it can handle some kinds of Byzantine faults.
- (Ambition 4/5) Implement an application using your Paxos as a back end, and show that your application can handle failures. One example application: collaborative editing à la Google Docs. This is an important problem that’s been subject to academic study, and that has libraries you can potentially use or mine for ideas. Some references:
  - Collaborative Text Editing with Eg-walker: Better, Faster, Smaller
  - Differential Synchronization
  - An older citation: High-Latency, Low-Bandwidth Windowing in the Jupiter Collaboration System
  - A JavaScript library: Yjs
  - Some critiques of Yjs: Lies I Was Told About Collaborative Editing, Part 1 and Part 2
- (Ambition 3/5) We have focused on replicated key-value stores and databases, but other forms of application can also benefit from fault-tolerance and replication, such as publish-subscribe systems. Design a set of operations that support such a communication pattern, build them on top of Paxos replication, and develop a client model that shows your pub/sub system works and that can detect problems with the underlying Paxos layer.
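For the pub/sub direction, one possible shape for the operations and the client model is sketched below in Python. Everything here is an assumption made for illustration: the command tuples, `PubSubStateMachine`, and `check_inbox` are hypothetical, and unsubscribe is deliberately left out to keep the client model simple. Commands are assumed to arrive in the order chosen by the replicated log, and each topic assigns consecutive sequence numbers so that a client can detect gaps or duplicates.

```python
from collections import defaultdict

class PubSubStateMachine:
    """Hypothetical deterministic state machine fed by a replicated log."""

    def __init__(self):
        self.subscribers = defaultdict(set)   # topic -> subscriber ids
        self.inboxes = defaultdict(list)      # subscriber -> [(topic, seq, payload)]
        self.next_seq = defaultdict(int)      # topic -> next sequence number

    def apply(self, command):
        kind = command[0]
        if kind == "subscribe":
            _, subscriber, topic = command
            self.subscribers[topic].add(subscriber)
            return None
        if kind == "publish":
            _, topic, payload = command
            seq = self.next_seq[topic]
            self.next_seq[topic] += 1
            # Sorted iteration keeps replicas byte-for-byte deterministic.
            for subscriber in sorted(self.subscribers[topic]):
                self.inboxes[subscriber].append((topic, seq, payload))
            return seq
        raise ValueError(f"unknown command: {kind}")

def check_inbox(inbox):
    """Client model: after the first delivery on a topic, sequence numbers
    must increase by exactly one (no gaps, duplicates, or reordering)."""
    last = {}
    for topic, seq, _payload in inbox:
        if topic in last and seq != last[topic] + 1:
            return False
        last[topic] = seq
    return True

log = [("subscribe", "alice", "news"), ("publish", "news", "a"),
       ("subscribe", "bob", "news"), ("publish", "news", "b")]
replica = PubSubStateMachine()
for command in log:
    replica.apply(command)
print(replica.inboxes["alice"], check_inbox(replica.inboxes["alice"]))
```

A checker like `check_inbox` is what lets the client model notice trouble in the layer below: if the Paxos layer dropped, duplicated, or reordered a chosen command, some subscriber's sequence numbers would no longer be consecutive.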
Infrastructure
Many of you will want to build on top of your existing code, and I’ll release updates to the Cotamer library to support your projects. Already the handout code has been extended with HTTP support, allowing you to build a Web client or server. In the next few days I’ll add HTTPS support. If something else would be useful, let me know!
Intermediate deadline
Come to class Monday 4/27 having turned in a one-paragraph description of your proposed project on Canvas. We’ll discuss the proposals in class.