Pancy: Paxos replication
In your third problem set, we give you a key-value database called
Pancydb (Paxos + Nancy)
to replicate. We give you a client simulator designed to
expose replication bugs. And we give you two server
simulators with different replication strategies: in
pt-single.cc, the database is not replicated, while in pt-backup.cc, there
are two servers implementing a simple primary-backup
scheme.
You will fill in the third server simulator, pt-paxos.cc, with an
implementation of Multi-Paxos replication. You’ll
define the Paxos RPCs. Your Paxos implementation should tolerate message
failure, leader failure, and server recovery, and it should be relatively
efficient.
Your work will exercise your understanding of Paxos, your RPC design skills, and your testing intuitions.
Deadline: Monday March 30. Note the turnin description.
Setting
Build and run
Build with make (which delegates to cmake). Use make SAN=1 to build with
sanitizers, and make BUILD=builddir to set the build directory.
Run the single-server driver like this:
build/pt-single
You should see output like 373 lock, 371 write, 371 clear, 371 unlock, which
indicates that the client model ran successfully and the
database passed its consistency check. As with pset 2, you can
supply -S SEED to set a seed and -R COUNT to run COUNT distinct seeds. Or
try -V to print messages sent and received, -p to print the contents of the
database at the end of the test, and -l 0.02 to apply a 2% loss rate to the
simulated channels. The pt-paxos program also understands -n NREPLICAS.
Pancydb key-value store
Many real-world Paxos implementations (Chubby, ZooKeeper, etcd)
provide state machine replication for an underlying
key-value store. Pancydb
is our key-value store for this pset; it is implemented in pset3/pancydb.{cc,hh},
with message formats in pset3/pancy_msgs.hh.
Key-value stores are databases with a limited data model and limited
transaction support. Pancydb keys and values are opaque strings, and Pancydb
supports just four operations, get, put, cas (compare-and-swap), and
remove. Each operation takes a key argument and affects only that key.
Even though your Paxos implementation will replicate Pancydb, it need not understand Pancydb messages or semantics. At least at first, it will treat Pancydb requests as opaque blobs! Paxos is an abstract replication protocol—it replicates a state machine without understanding that state machine.
Nevertheless, we still think it’s useful to understand the Pancydb context before you start working on Paxos. And later phases may blur the Paxos/Pancydb separation. Some Paxos optimizations check operation semantics; for instance, leader leases let Chubby implement a safe fast path for
gets.
Pancydb exposes value version numbers as well as values. Each value in the
database has a version number, and modifications update that version number,
allowing clients to detect concurrent changes and construct idempotent
operations. The special version
number 0, or pancy::nonexistent_version, represents absent keys; versions
associated with actual values are positive. (Value versions are never reused;
if a key is removed and then reinserted, the new value version will be greater
than any previous value version for that key.)
All responses contain value versions. Furthermore, put, cas, and remove
RPCs have a version_match parameter that supports conditional updates. A put
RPC with version_match set to -1 (pancy::any_version) will overwrite any
previous value, and will insert a new value if the key was absent. But a put
with version_match == pancy::nonexistent_version will only insert a new
value—the put will fail with error code no_match if the key existed. A put
with a positive version_match will succeed only if the version number matches
exactly.
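The version-match rule can be sketched as a few lines of C++. This is our toy illustration of the semantics described above, not the handout implementation; the type names (`ToyDb`, `Entry`) and the `put` signature are ours, though the constants mirror `pancy::nonexistent_version` and `pancy::any_version`.

```cpp
#include <map>
#include <optional>
#include <string>

// Illustrative constants mirroring pancy::nonexistent_version and
// pancy::any_version; the real definitions live in pset3/pancy_msgs.hh.
constexpr long nonexistent_version = 0;
constexpr long any_version = -1;

struct Entry { std::string value; long version; };

// A toy database: key -> (value, version). next_version only grows,
// so versions are never reused even after a remove and reinsert.
struct ToyDb {
    std::map<std::string, Entry> map;
    long next_version = 0;

    long current_version(const std::string& key) const {
        auto it = map.find(key);
        return it == map.end() ? nonexistent_version : it->second.version;
    }

    // Returns the new version on success, std::nullopt for no_match.
    std::optional<long> put(const std::string& key, const std::string& value,
                            long version_match) {
        long cur = current_version(key);
        if (version_match != any_version && version_match != cur)
            return std::nullopt;        // would fail with error code no_match
        map[key] = Entry{value, ++next_version};
        return next_version;
    }
};
```

Note how `version_match == nonexistent_version` can only insert, while a positive `version_match` requires an exact match on the current version.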
Conditional updates are critically important for any protocol that’s meant to be
robust in the face of failures. This diagram shows how a client can ensure a
put takes effect at most once, even if it needs to retry; note how version
matching can distinguish a lost request from a lost response. (Pancydb message
printing uses _A for responses. Versions are shown as V, version match
parameters as VM.)
client server
| |
| ------ GET("key") -------------------------> | db: { "key": <"hi", V1> }
| <----------------- GET_A(✓, "hi", V1) ------ |
| |
| |
| ------ PUT("key", "ho", VM1) -> (lost) |
| |
| |
| retry timer: |
| ------ PUT("key", "ho", VM1) --------------> | version match succeeds;
| (lost) <-------- PUT_A(✓, V1→2) ------ | previous PUT must have failed.
| | db: { "key": <"ho", V2> }
| |
| |
| retry timer: |
| ------ PUT("key", "ho", VM1) --------------> | version match fails
| <---------- PUT_A(ENOMATCH, "ho", V2) ------ |
| |
Client models and test protocols
Replicated key-value stores should keep working despite a wide range of failures, and Paxos survives network failures (lost messages) as well as server failures. But how can a test distinguish a network failure—a lost request—from a replication failure? Our answer involves clients that run specific test protocols and check final database states for errors.
A test protocol is a series of orchestrated RPCs whose combined results, in the
form of a database state, can be checked for validity. You can, and should,
write your own test protocols. But we have provided one, lockseq_model, that
should catch many kinds of error given enough time.
A lockseq_model starts up 32 simulated clients. Each client repeatedly
locks a randomly-chosen group of keys, writes a value to a sequential
subset of keys, clears any leftover values from previous locks of the same
group, and then unlocks the group and tries again. For example, a client
might try:
- Lock: CAS("g8/lock", "", "c1 5b64bab") → version 10 (the lock value is the client ID plus a random string)
- Write: PUT("g8/v000", "c1 5b64bab"), PUT("g8/v001", "c1 5b64bab"), PUT("g8/v002", "c1 5b64bab"), PUT("g8/v003", "c1 5b64bab")
- Clear: REMOVE("g8/v004"), REMOVE("g8/v005") → ENOTFOUND (which indicates the clear is done)
- Unlock: REMOVE("g8/lock", VM10) (using version_match for safe retry)
The lockseq_model will retry each RPC on timeout; the CAS, the random string
used as a lock token, and the version match ensure that even in the face of lost
messages, the lock and unlock enforce a critical section.
The lockseq_model::check function scans a pancydb database to look for
violations of its test protocol: unexpected sequence gaps, leftover values, and
stray garbage. For example, this deterministic execution of a bogus pt-backup
demonstrates a sequence gap. (Try it yourself! The -B argument means bogus.)
kohler@unshare pset3 % build/pt-backup -B -l0.01 -S11662036828332471761
117 lock, 107 write, 104 clear, 104 unlock
*** FAILURE on seed 11662036828332471761 at key g79/v020
...
g79/v014 [V1997] c14 2014a8c
g79/v015 [V2020] c14 2014a8c
g79/v016 [V2041] c14 2014a8c
g79/v017 [V2063] c14 2014a8c
g79/v018 [V2085] c14 2014a8c
* g79/v020 [V2438] c14 2014a8c
g79/v021 [V2440] c14 2014a8c
g79/v022 [V2442] c14 2014a8c
g79/v023 [V2444] c14 2014a8c
g79/v024 [V2446] c14 2014a8c
g79/v025 [V2448] c14 2014a8c
...
What’s happened here is that the primary failed after responding to client
14’s PUT("g79/v019") request, but before replicating that request to the
backup. The client eventually redirects itself to the backup and continues
with the sequence, but the unreplicated PUT is not retried, causing a gap.
Server simulators
Most of your problem set code will go in pset3/pt-paxos.cc. This is a server
simulator—code that emulates a collection of replicas and responds to client
requests. Pancy server simulators connect channels and ports to a client model
and receive and process client requests. The simplest server simulator is simple
indeed:
cot::task<> pt_single_instance::run() {
    while (true) {
        // Obtain request
        auto req = co_await in_.receive();
        // Apply request to database and respond
        co_await out_.send(db_.process_req(req));
    }
}
But your Paxos server simulator will be more complex, since it must handle
inter-server communication, the Paxos protocol, and failover. The architecture
looks like this. Each pt_paxos_replica has its own ports for receiving
messages (from_clients_ and from_replicas_), its own channels for sending
messages (to_clients_ for client responses and to_replica_[i] for messages
to replica i), and its own pancydb.
┌─────────────────┐
│ lockseq_model │ Client model: 32 simulated clients
└─────────────────┘
▲ ▲ ▲ request and response channels
│ │ │ (pancy::request and pancy::response)
▼ ▼ ▼
Server simulator:
pt_paxos_replica objects
┌─────┐ ┌─────┐ ┌─────┐
│ R0 │◄─►│ R1 │◄─►│ R2 │ inter-replica channels
│ │◄─►│ db │◄─►│ │ (paxos_message)
│ db │◄───────────►│ db │
└─────┘ └─────┘ └─────┘
When a client contacts a non-leader replica, that non-leader should respond with
a pancy::redirection_response containing the index of the (purported) current
leader. The client will redirect itself automatically and retry. pt-paxos.cc
already contains code for this.
Approach
Here is one reasonable plan of attack. You don’t have to follow it exactly.
Phase 1: Failure-free Multi-Paxos
In the first phase, implement inter-replica messages so that the Paxos non-leader servers keep their databases up to date. This will involve defining your own RPC protocol corresponding to Paxos, or, more specifically, the Multi-Paxos common case where there is a stable leader. Since Paxos descriptions—including ours—often focus on the uncommon case where a leader must be elected (PROBE → PREPARE → PROPOSE → ACK → DECIDE) rather than the common case of a stable leader (PROPOSE → ACK → DECIDE or EXTEND → EXTENDED → DECIDE), draw inspiration from Viewstamped Replication Revisited or Raft.
As part of this phase, you will change the paxos_message type (currently
int) to support whatever messages you define (e.g., PROPOSE, ACK, and DECIDE
messages, or messages that support a combination of these functions).
Debugging will be easier if your Paxos messages are printable. See
pancy_msgs.hh for example formatters, or consider adding a netsim::message_traits&lt;paxos_message&gt; specialization that transforms your message type into a simple std::tuple or something else that’s printable by default.
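One possible shape for the message type, with a simple printer in the spirit of the formatters in pancy_msgs.hh. This is a sketch of one design, not the expected answer; the message kinds and field names (round, slot, op) are our guesses.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <type_traits>
#include <variant>

// A variant over the common-case message kinds (hypothetical fields).
struct Propose { uint64_t round; uint64_t slot; std::string op; };
struct Ack     { uint64_t round; uint64_t slot; };
struct Decide  { uint64_t round; uint64_t slot; };
using paxos_message = std::variant<Propose, Ack, Decide>;

// Render a message for -V-style debugging output.
std::string to_string(const paxos_message& m) {
    std::ostringstream s;
    std::visit([&](auto&& msg) {
        using T = std::decay_t<decltype(msg)>;
        if constexpr (std::is_same_v<T, Propose>)
            s << "PROPOSE(r" << msg.round << ", s" << msg.slot
              << ", " << msg.op << ")";
        else if constexpr (std::is_same_v<T, Ack>)
            s << "ACK(r" << msg.round << ", s" << msg.slot << ")";
        else
            s << "DECIDE(r" << msg.round << ", s" << msg.slot << ")";
    }, m);
    return s.str();
}
```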
Broaden your testing by comparing working replicas’ databases with
pancydb::diff. This function compares two pancydbs for differences within a
given “version skew” (which is required because a non-leader might be
several versions behind a leader). If the databases are compatible,
pancydb::diff returns std::nullopt; if there is a problematic difference,
pancydb::diff returns the relevant key.
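To make the “version skew” contract concrete, here is a toy reimplementation of the described behavior. This is one plausible interpretation, not the handout’s pancydb::diff: a key may legitimately differ only when the newer entry postdates everything the lagging replica could have applied.

```cpp
#include <algorithm>
#include <map>
#include <optional>
#include <string>

struct Val { std::string value; long version; };
using Db = std::map<std::string, Val>;

long max_version(const Db& db) {
    long m = 0;
    for (auto& [k, v] : db) m = std::max(m, v.version);
    return m;
}

// Toy diff: nullopt if the databases are compatible within `skew`
// versions, otherwise an offending key (our semantics, for illustration).
std::optional<std::string> toy_diff(const Db& a, const Db& b, long skew) {
    long amax = max_version(a), bmax = max_version(b);
    if (amax > bmax + skew || bmax > amax + skew)
        return std::string("<version skew exceeded>");
    long horizon = std::min(amax, bmax);   // both replicas applied up to here
    Db all = a;
    all.insert(b.begin(), b.end());        // union of keys
    for (auto& [key, unused] : all) {
        auto ia = a.find(key), ib = b.find(key);
        long va = ia == a.end() ? 0 : ia->second.version;
        long vb = ib == b.end() ? 0 : ib->second.version;
        bool same = ia != a.end() && ib != b.end()
                    && ia->second.value == ib->second.value && va == vb;
        if (!same && va <= horizon && vb <= horizon)
            return key;                    // settled on both sides yet different
    }
    return std::nullopt;
}
```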
Your code can live entirely in pt-paxos.cc, but add whatever you’d like. Test
with 1 replica (-n 1) as a sanity check: single-replica Paxos should trivially
work.
You can add new members to the pt_paxos types, but don’t change the names or types of the ones that are there already. Your Paxos implementation should work for any client_model and for the handout channel&lt;T&gt; and link&lt;T&gt;. This is because we want to allow people to swap parts of their implementations.
Read pt-backup.cc, especially the backup() function, for some useful Cotamer patterns, such as cot::first for receiving a message from any of several channels (with timeout). Refer to the Cotamer manual for full documentation.
Phase 2: Link failures
Make your Paxos work with lossy channels (e.g., -l 0.01).
The client model works fine with lossy channels—it retransmits as appropriate. But your Paxos implementation will need to retransmit inter-replica messages too, and servers will need to keep a log for resending dropped operations. You may need to change your RPCs.
However, your implementation must not grow its operation logs without bound. After all replicas have applied an operation, your implementation should eventually truncate that operation from all replica logs.
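One common way to bound the log, sketched below under our own names (OpLog, safe_truncation_point are not handout types): each replica reports its last applied slot, and the prefix below the minimum such slot can be discarded everywhere.

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>
#include <string>
#include <vector>

// Sketch of a bounded operation log: holds operations from first_slot
// onward; earlier slots have been applied everywhere and truncated.
struct OpLog {
    uint64_t first_slot = 0;        // slot number of log.front()
    std::deque<std::string> log;    // pending/committed operations

    void append(std::string op) { log.push_back(std::move(op)); }

    // Drop all operations in slots < keep_from.
    void truncate_below(uint64_t keep_from) {
        while (first_slot < keep_from && !log.empty()) {
            log.pop_front();
            ++first_slot;
        }
    }
};

// Given each replica's last applied slot, every slot at or below the
// minimum has been applied everywhere, so the log can start just after it.
uint64_t safe_truncation_point(const std::vector<uint64_t>& last_applied) {
    return *std::min_element(last_applied.begin(), last_applied.end()) + 1;
}
```

In a real implementation the last-applied reports could piggyback on ACK or heartbeat messages rather than requiring their own RPC.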
You may want to introduce varied channel loss rates. For instance, channels between consensus servers are usually more reliable than channels from clients to servers.
pset3/netsim.hh is closely related to the pset 2 version, so you may be able to port some of your changes over.
Phase 3: Leader failures
Introduce optional failure schedules that fail—and potentially recover—Paxos replicas, including Paxos leaders. Add failure detectors to non-leaders so that they claim leadership if a leader is out of commission for long enough.
A failure schedule is a coroutine that manipulates link loss rates to simulate
failures and recoveries. pt-backup.cc has a simple failure schedule; you will
need to go further. Read, and potentially change, pset3/netsim.hh and use
pset3/random_source.hh. Make sure you include failure schedules that,
explicitly or through randomness, can test multiple scenarios, including failed
leaders, temporarily failed leaders, and “split brain” phenomena (where clients
can contact all replicas, but some replicas cannot contact each other—which can
lead to multiple servers claiming leadership simultaneously). Note, though, that
failed replicas should not lose state. This means that a failed-and-recovered
replica essentially loses a series of messages and then returns. (This differs
from “Paxos Made Simple” and some other descriptions. We may add state loss and
recovery in a later pset, or you can add them as optional work.)
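The effect of a failure schedule can be pictured without coroutines. The real thing is a Cotamer coroutine driving netsim links; the structure below is ours, modeling the network as a matrix of per-link loss rates. Note that a “failed” replica keeps its state; failure just means its links drop everything for a while.

```cpp
#include <cstddef>
#include <vector>

// Coroutine-free sketch of failure-schedule effects (hypothetical types).
struct Net {
    size_t n;                                  // number of replicas
    std::vector<std::vector<double>> loss;     // loss[i][j]: i -> j loss rate
    explicit Net(size_t n_, double base)
        : n(n_), loss(n_, std::vector<double>(n_, base)) {}

    void fail_replica(size_t r) {              // r loses a window of messages
        for (size_t i = 0; i != n; ++i)
            loss[r][i] = loss[i][r] = 1.0;
    }
    void recover_replica(size_t r, double base) {  // state intact on recovery
        for (size_t i = 0; i != n; ++i)
            loss[r][i] = loss[i][r] = base;
    }
    void split_brain(size_t a, size_t b) {     // a and b can't reach each
        loss[a][b] = loss[b][a] = 1.0;         // other; clients reach both
    }
};
```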
In this phase, test, test, test. Extensive testing should shake out bugs in your
earlier phases. Definitely call pancydb::diff to compare server states.
Your failure schedules should live in independent coroutines, and should work by modifying the channels accessible via the pt_paxos_replica and pt_paxos_instance types. This is to facilitate code exchange.
Phase 4: Optimization
In the last, open-ended phase, we want you to optimize your Multi-Paxos RPCs to improve some aspect of network performance—client latency, inter-replica bandwidth, or something else. Here are some possibilities:
- Most likely, for every client request received by your implementation, your code goes through at least 2 message rounds (PROPOSE and ACK), and possibly more. This delay is inherent in Paxos, but there are still ways to optimize overall system performance. Our lockseq_model models 32 concurrent clients; perhaps your Paxos could batch a set of roughly-concurrent client operations into its PROPOSE, ACK, and DECIDE messages, thereby reducing bandwidth and potentially even latency.
- Consider adding Chubby-style leader leases to enable optimization of get operations (but if you do, you’ll need a new client model, because our client model only makes modifying operations).
- Raft and VR implement an optimization called client deduplication that can reduce consensus traffic related to retrying requests. This optimization remembers the latest request ID (the request.serial) committed for each client, and the corresponding response. When a client retries a request, the leader can just forward the previous response without re-executing consensus for the request. This could become important if client-to-server channels have reasonably high loss rates.
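The deduplication table itself is small. Here is a sketch under our own names (DedupTable is not a handout type); it assumes, as in Raft and VR, that each client has at most one outstanding request, so one entry per client suffices.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Per-client record of the highest committed serial and its response.
struct DedupTable {
    struct Last { uint64_t serial; std::string response; };
    std::map<uint64_t, Last> by_client;   // client id -> last committed

    // Called when a request commits; remembers its response.
    void record(uint64_t client, uint64_t serial, std::string response) {
        by_client[client] = Last{serial, std::move(response)};
    }

    // If this (client, serial) already committed, return the cached
    // response; otherwise the request must go through consensus.
    std::optional<std::string> lookup(uint64_t client, uint64_t serial) const {
        auto it = by_client.find(client);
        if (it != by_client.end() && it->second.serial == serial)
            return it->second.response;
        return std::nullopt;
    }
};
```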
Optimization requires measurement, so you’ll first have to add code to measure
the performance of your RPCs. This might involve a script that parses -V
output, but it’s probably better in the end to add performance metrics, such
as average latency between request and response, to your client model and/or
Paxos implementation. Read lockseq_model.cc to see how the client model works.
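A latency metric might look something like the sketch below (the structure and names are ours, not part of lockseq_model): record the first send time per request serial and average the resulting round trips.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Minimal average-latency tracker one might bolt onto a client model.
struct LatencyTracker {
    std::map<uint64_t, double> sent;   // request serial -> first send time
    std::vector<double> samples;       // completed round-trip times

    void on_send(uint64_t serial, double now) {
        sent.emplace(serial, now);     // retransmits keep the original
    }                                  // send time (first-send latency)

    void on_response(uint64_t serial, double now) {
        auto it = sent.find(serial);
        if (it != sent.end()) {
            samples.push_back(now - it->second);
            sent.erase(it);            // ignore duplicate responses
        }
    }

    double average() const {
        double sum = 0;
        for (double s : samples) sum += s;
        return samples.empty() ? 0 : sum / samples.size();
    }
};
```

Measuring from the first send charges retransmission delay to the protocol, which is usually what you want when comparing replication strategies.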
Extra phases
There’s further to go if you’d like. Consider designing and writing your own
client models; can you find one better than lockseq_model at catching
bugs? Or implement a view change protocol that supports adding and removing
servers from a running replica group. Or implement a state transfer
mechanism. This would allow a new server to copy the underlying database state
from existing servers in bulk, which is usually faster than replaying an
operation log (and has the additional advantage that it allows servers to
truncate their logs).
Hints
First read our Multi-Paxos notes.
Your first design challenge involves the concept of a slot number. Multi-Paxos agrees on successive versions of an append-only log data structure comprising a sequence of slots. Logically, each slot could be handled by a separate instance of the Paxos protocol. But slots are not Paxos rounds. When there’s a stable leader, that leader stays in a single round while committing successive operations to increasing slots. How will you represent a slot number and a round? Maybe you can combine them—e.g., put the slot number in the lower bits of a “slot + round” parameter. Or maybe it’s cleaner to keep them separate. How do Raft and VR handle this?
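If you take the combined-parameter route the hint mentions, the encoding is a few lines. The bit split here (40 slot bits, 24 round bits) is our arbitrary choice, not a requirement:

```cpp
#include <cstdint>

// Pack slot into the low 40 bits and round into the high 24, so a later
// round always compares greater, and within a round, later slots do.
constexpr int slot_bits = 40;
constexpr uint64_t slot_mask = (uint64_t{1} << slot_bits) - 1;

constexpr uint64_t pack(uint64_t round, uint64_t slot) {
    return (round << slot_bits) | (slot & slot_mask);
}
constexpr uint64_t round_of(uint64_t packed) { return packed >> slot_bits; }
constexpr uint64_t slot_of(uint64_t packed) { return packed & slot_mask; }
```

Keeping round and slot as separate fields is just as defensible and easier to print; the packed form mainly buys you a single comparable value.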
As in pset 2, the simulation is fully deterministic for a given seed. If you
find a failing seed, you can reproduce and debug it exactly. Tag your code with
git tag when you find interesting failures so you can come back to them.
The handout code is configured with loss rate 0, but after phase 1, it’s probably better to change the default loss rate to something like 0.01. With no loss, even a broken protocol may appear to work.
Turnin
You will commit your code and a writeup in pset3/TURNIN.md that describes your
work.
You are amazing, but at some point during your work, you will have a bug in your
implementation. Your TURNIN.md must describe this bug, give a commit hash
that demonstrates the bug, and describe how you found it and how you fixed it.
Give specific pt-paxos arguments that demonstrate the error.