Pancy: Paxos replication

In your third problem set, we give you a key-value database called Pancydb (Paxos + Nancy) to replicate. We give you a client simulator designed to expose replication bugs. And we give you two server simulators with different replication strategies: in pt-single.cc, the database is not replicated, while in pt-backup.cc, there are two servers implementing a simple primary-backup scheme.

You will fill in the third server simulator, pt-paxos.cc, with an implementation of Multi-Paxos replication. You’ll define the Paxos RPCs. Your Paxos implementation should tolerate message failure, leader failure, and server recovery, and it should be relatively efficient.

Your work will exercise your understanding of Paxos, your RPC design skills, and your testing intuitions.

Deadline: Monday March 30. Note the turnin requirements below.

Setting

Build and run

Build with make (which delegates to cmake). Use make SAN=1 to build with sanitizers, and make BUILD=builddir to set the build directory.

Run the single-server driver like this:

build/pt-single

You should see output like 373 lock, 371 write, 371 clear, 371 unlock, which indicates that the client model ran successfully and the database passed its consistency check. As with pset 2, you can supply -S SEED to set a seed and -R COUNT to run COUNT distinct seeds. Or try -V to print messages sent and received, -p to print the contents of the database at the end of the test, and -l 0.02 to apply a 2% loss rate to the simulated channels. The pt-paxos program also understands -n NREPLICAS.

Pancydb key-value store

Many real-world Paxos implementations (Chubby, ZooKeeper, etcd) provide state machine replication for an underlying key-value store. Pancydb is our key-value store for this pset; it is implemented in pset3/pancydb.{cc,hh}, with message formats in pset3/pancy_msgs.hh.

Key-value stores are databases with a limited data model and limited transaction support. Pancydb keys and values are opaque strings, and Pancydb supports just four operations: get, put, cas (compare-and-swap), and remove. Each operation takes a key argument and affects only that key.

Even though your Paxos implementation will replicate Pancydb, it need not understand Pancydb messages or semantics. At least at first, it will treat Pancydb requests as opaque blobs! Paxos is an abstract replication protocol—it replicates a state machine without understanding that state machine.

Nevertheless, we still think it’s useful to understand the Pancydb context before you start working on Paxos. And later phases may blur the Paxos/Pancydb separation. Some Paxos optimizations check operation semantics; for instance, leader leases let Chubby implement a safe fast path for gets.

Pancydb exposes value version numbers as well as values. Each value in the database has a version number, and modifications update that version number, allowing clients to detect concurrent changes and construct idempotent operations. The special version number 0, or pancy::nonexistent_version, represents absent keys; versions associated with actual values are positive. (Value versions are never reused; if a key is removed and then reinserted, the new value version will be greater than any previous value version for that key.)

All responses contain value versions. Furthermore, put, cas, and remove RPCs have a version_match parameter that supports conditional updates. A put RPC with version_match set to -1 (pancy::any_version) will overwrite any previous value, and will insert a new value if the key was absent. But a put with version_match == pancy::nonexistent_version will only insert a new value—the put will fail with error code no_match if the key existed. A put with a positive version_match will succeed only if the version number matches exactly.

Conditional updates are critically important for any protocol that’s meant to be robust in the face of failures. This diagram shows how a client can ensure a put takes effect at most once, even if it needs to retry; note how version matching can distinguish a lost request from a lost response. (Pancydb message printing uses _A for responses. Versions are shown as V, version match parameters as VM.)

client                                          server
  |                                               |
  |  ------ GET("key") -------------------------> |  db: { "key": <"hi", V1> }
  |  <----------------- GET_A(✓, "hi", V1) ------ |
  |                                               |
  |                                               |
  |  ------ PUT("key", "ho", VM1) -> (lost)       |
  |                                               |
  |                                               |
  |  retry timer:                                 |
  |  ------ PUT("key", "ho", VM1) --------------> |  version match succeeds;
  |        (lost) <-------- PUT_A(✓, V1→2) ------ |  previous PUT must have failed.
  |                                               |  db: { "key": <"ho", V2> }
  |                                               |
  |                                               |
  |  retry timer:                                 |
  |  ------ PUT("key", "ho", VM1) --------------> |  version match fails
  |  <---------- PUT_A(ENOMATCH, "ho", V2) ------ |
  |                                               |

Client models and test protocols

Replicated key-value stores should keep working despite a wide range of failures, and Paxos survives network failures (lost messages) as well as server failures. But how can a test distinguish a network failure—a lost request—from a replication failure? Our answer involves clients that run specific test protocols and check final database states for errors.

A test protocol is a series of orchestrated RPCs whose combined results, in the form of a database state, can be checked for validity. You can, and should, write your own test protocols. But we have provided one, lockseq_model, that should catch many kinds of error given enough time.

A lockseq_model starts up 32 simulated clients. Each client repeatedly locks a randomly-chosen group of keys, writes a value to a sequential subset of keys, clears any leftover values from previous locks of the same group, and then unlocks the group and tries again. For example, a client might try:

  1. Lock: CAS("g8/lock", "", "c1 5b64bab") → version 10 (lock value is client ID + random string)
  2. Write: PUT("g8/v000", "c1 5b64bab")
  3. PUT("g8/v001", "c1 5b64bab")
  4. PUT("g8/v002", "c1 5b64bab")
  5. PUT("g8/v003", "c1 5b64bab")
  6. Clear: REMOVE("g8/v004")
  7. REMOVE("g8/v005") → ENOTFOUND (which indicates the clear is done)
  8. Unlock: REMOVE("g8/lock", VM10) (using version_match for safe retry)

The lockseq_model will retry each RPC on timeout; the CAS, the random string used as a lock token, and the version match ensure that even in the face of lost messages, the lock and unlock enforce a critical section.

The lockseq_model::check function scans a pancydb database to look for violations of its test protocol: unexpected sequence gaps, leftover values, and stray garbage. For example, this deterministic execution of a bogus pt-backup demonstrates a sequence gap. (Try it yourself! The -B argument means bogus.)

kohler@unshare pset3 % build/pt-backup -B -l0.01 -S11662036828332471761
117 lock, 107 write, 104 clear, 104 unlock
*** FAILURE on seed 11662036828332471761 at key g79/v020
   ...
   g79/v014 [V1997] c14 2014a8c
   g79/v015 [V2020] c14 2014a8c
   g79/v016 [V2041] c14 2014a8c
   g79/v017 [V2063] c14 2014a8c
   g79/v018 [V2085] c14 2014a8c
 * g79/v020 [V2438] c14 2014a8c
   g79/v021 [V2440] c14 2014a8c
   g79/v022 [V2442] c14 2014a8c
   g79/v023 [V2444] c14 2014a8c
   g79/v024 [V2446] c14 2014a8c
   g79/v025 [V2448] c14 2014a8c
   ...

What’s happened here is that the primary failed after responding to client 14’s PUT("g79/v019") request, but before replicating that request to the backup. The client eventually redirects itself to the backup and continues with the sequence, but the unreplicated PUT is not retried, causing a gap.

Server simulators

Most of your problem set code will go in pset3/pt-paxos.cc. This is a server simulator—code that emulates a collection of replicas and responds to client requests. Pancy server simulators connect channels and ports to a client model and receive and process client requests. The simplest server simulator is simple indeed:

cot::task<> pt_single_instance::run() {
    while (true) {
        // Obtain request
        auto req = co_await in_.receive();

        // Apply request to database and respond
        co_await out_.send(db_.process_req(req));
    }
}

But your Paxos server simulator will be more complex, since it must handle inter-server communication, the Paxos protocol, and failover. The architecture looks like this. Each pt_paxos_replica has its own ports for receiving messages (from_clients_ and from_replicas_), its own channels for sending messages (to_clients_ for client responses and to_replica_[i] for messages to replica i), and its own pancydb.

              ┌─────────────────┐
              │  lockseq_model  │   Client model: 32 simulated clients
              └─────────────────┘
                  ▲    ▲    ▲       request and response channels
                  │    │    │       (pancy::request and pancy::response)
                  ▼    ▼    ▼
                Server simulator:
            pt_paxos_replica objects
          ┌─────┐   ┌─────┐   ┌─────┐
          │ R0  │◄─►│ R1  │◄─►│ R2  │  inter-replica channels
          │     │◄─►│ db  │◄─►│     │  (paxos_message)
          │ db  │◄───────────►│ db  │
          └─────┘   └─────┘   └─────┘

When a client contacts a non-leader replica, that non-leader should respond with a pancy::redirection_response containing the index of the (purported) current leader. The client will redirect itself automatically and retry. pt-paxos.cc already contains code for this.

Approach

Here is one reasonable plan of attack. You don’t have to follow it exactly.

Phase 1: Failure-free Multi-Paxos

In the first phase, implement inter-replica messages so that the Paxos non-leader servers keep their databases up to date. This will involve defining your own RPC protocol corresponding to Paxos, or, more specifically, the Multi-Paxos common case where there is a stable leader. Since Paxos descriptions—including ours—often focus on the uncommon case where a leader must be elected (PROBE → PREPARE → PROPOSE → ACK → DECIDE) rather than the common case of a stable leader (PROPOSE → ACK → DECIDE or EXTEND → EXTENDED → DECIDE), draw inspiration from Viewstamped Replication Revisited or Raft.

As part of this phase, you will change the paxos_message type (currently int) to support whatever messages you define (e.g., PROPOSE, ACK, and DECIDE messages, or messages that support a combination of these functions).

Debugging will be easier if your Paxos messages are printable. See pancy_msgs.hh for example formatters, or consider adding a netsim::message_traits<paxos_message> specialization that transforms your message type into a simple std::tuple or something else that’s printable by default.

Broaden your testing by comparing working replicas’ databases with pancydb::diff. This function compares two pancydbs for differences within a given “version skew” (which is required because a non-leader might be several versions behind a leader). If the databases are compatible, pancydb::diff returns std::nullopt; if there is a problematic difference, pancydb::diff returns the relevant key.

Your code can live entirely in pt-paxos.cc, but you may add whatever files you like. Test with 1 replica (-n 1) as a sanity check: single-replica Paxos should trivially work.

You can add new members to the pt_paxos types, but don’t change the names or types of the ones that are there already. Your Paxos implementation should work for any client_model and for the handout channel<T> and link<T>. This is because we want to allow people to swap parts of their implementations.

Read pt-backup.cc, especially the backup() function, for some useful Cotamer patterns, such as cot::first for receiving a message from any of several channels (with timeout). Refer to the Cotamer manual for full documentation.

Phase 2: Message loss

Make your Paxos work with lossy channels (e.g., -l 0.01).

The client model works fine with lossy channels—it retransmits as appropriate. But your Paxos implementation will need to retransmit inter-replica messages too, and servers will need to keep a log for resending dropped operations. You may need to change your RPCs.

However, your implementation must not grow its operation logs without bound. After all replicas have applied an operation, your implementation should eventually truncate that operation from all replica logs.

You may want to introduce varied channel loss rates. For instance, channels between consensus servers are usually more reliable than channels from clients to servers.

pset3/netsim.hh is closely related to the pset 2 version, so you may be able to port some of your changes over.

Phase 3: Leader failures

Introduce optional failure schedules that fail—and potentially recover—Paxos replicas, including Paxos leaders. Add failure detectors to non-leaders so that they claim leadership if a leader is out of commission for long enough.

A failure schedule is a coroutine that manipulates link loss rates to simulate failures and recoveries. pt-backup.cc has a simple failure schedule; you will need to go further. Read, and potentially change, pset3/netsim.hh and use pset3/random_source.hh. Make sure you include failure schedules that, explicitly or through randomness, can test multiple scenarios, including failed leaders, temporarily failed leaders, and “split brain” phenomena (where clients can contact all replicas, but some replicas cannot contact each other—which can lead to multiple servers claiming leadership simultaneously).

Note, though, that failed replicas should not lose state. This means that a failed-and-recovered replica essentially loses a series of messages and then returns. (This differs from “Paxos Made Simple” and some other descriptions. We may add state loss and recovery in a later pset, or you can add them as optional work.)

In this phase, test, test, test. Extensive testing should shake out bugs in your earlier phases. Definitely call pancydb::diff to compare server states.

Your failure schedules should live in independent coroutines, and should work by modifying the channels accessible via the pt_paxos_replica and pt_paxos_instance types. This is to facilitate code exchange.

Phase 4: Optimization

In the last, open-ended phase, we want you to optimize your Multi-Paxos RPCs to improve some aspect of network performance—client latency, inter-replica bandwidth, or something else.

Optimization requires measurement, so you’ll first have to add code to measure the performance of your RPCs. This might involve a script that parses -V output, but it’s probably better in the end to add performance metrics, such as average latency between request and response, to your client model and/or Paxos implementation. Read lockseq_model.cc to see how the client model works.

Extra phases

There’s further to go if you’d like. Consider designing and writing your own client models; can you find one better than lockseq_model at catching bugs? Or implement a view change protocol that supports adding and removing servers from a running replica group. Or implement a state transfer mechanism. This would allow a new server to copy the underlying database state from existing servers in bulk, which is usually faster than replaying an operation log (and has the additional advantage that it allows servers to truncate their logs).

Hints

First read our Multi-Paxos notes.

Your first design challenge involves the concept of a slot number. Multi-Paxos agrees on successive versions of an append-only log data structure comprising a sequence of slots. Logically, each slot could be handled by a separate instance of the Paxos protocol. But slots are not Paxos rounds. When there’s a stable leader, that leader stays in a single round while committing successive operations to increasing slots. How will you represent a slot number and a round? Maybe you can combine them—e.g., put the slot number in the lower bits of a “slot + round” parameter. Or maybe it’s cleaner to keep them separate. How do Raft and VR handle this?

As in pset 2, the simulation is fully deterministic for a given seed. If you find a failing seed, you can reproduce and debug it exactly. Tag your code with git tag when you find interesting failures so you can come back to them.

The handout code is configured with loss rate 0, but after phase 1, it’s probably better to change the default loss rate to something like 0.01. With no loss, even a broken protocol may appear to work.

Turnin

You will commit your code and a writeup in pset3/TURNIN.md that describes your work.

You are amazing, but at some point during your work, you will have a bug in your implementation. Your TURNIN.md must describe this bug, give a commit hash that demonstrates the bug, and describe how you found it and how you fixed it. Give specific pt-paxos arguments that demonstrate the error.