Paxos

Paxos is arguably the most important distributed systems algorithm, central to many of our most critical distributed systems, and yet the problem it solves is just consensus. Sure, that’s the core distributed systems problem, but that problem had already been solved in many different ways (for instance, Chandra-Toueg). So what makes Paxos special?

Together, these properties make Paxos practical-ish for real world use.

Complexity

Unfortunately, Paxos is also complex.

The core of the Paxos algorithm is not especially complicated. Its author, Leslie Lamport, wrote a paper called “Paxos Made Simple” whose abstract, in its entirety, is:

The Paxos algorithm, when presented in plain English, is very simple.

But here are some other things Leslie Lamport has written about Paxos.

“In 2015, Michael Deardeuff of Amazon informed me that one sentence in [‘Paxos Made Simple’] is ambiguous, and interpreting it the wrong way leads to an incorrect algorithm. … I am not going to remove this ambiguity or reveal where it is.” [link]

“People reading the paper apparently got so distracted by the Greek parable that they didn’t understand the algorithm. Among the people I sent the paper to, and who claimed to have read it, were Nancy Lynch, Vassos Hadzilacos, and Phil Bernstein. A couple of months later I emailed them the following question: ‘Can you implement a distributed database that can tolerate the failure of any number of its processes (possibly all of them) without losing consistency, and that will resume normal behavior when more than half the processes are again working properly?’ None of them noticed any connection between this question and the Paxos algorithm.” [link]

And here are some things other people have written about Paxos.

“Unfortunately, Paxos is quite difficult to understand, in spite of numerous attempts to make it more approachable.” [link]

“We have described our implementation of a fault-tolerant database, based on the Paxos consensus algorithm. … [I]t was significantly harder to build this system [than] originally anticipated. … There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system.” [link]

Some have claimed that Paxos’s complexity arises from specific design choices, and that a set of different choices can make this complexity disappear. But others disagree:

“The Raft paper claims that Raft is significantly more understandable than Paxos … On the contrary, we find that the two algorithms are not significantly different in understandability[.]” [link]

I am reminded of the famous first paragraph of David L. Goodstein’s textbook States of Matter:

“Ludwig Boltzmann, who spent much of his life studying statistical mechanics, died in 1906, by his own hand. Paul Ehrenfest, carrying on the work, died similarly in 1933. Now it is our turn to study statistical mechanics.”

Now it is our turn to study Paxos.

Aside: On viewstamped replication

Paxos was developed by Leslie Lamport at about the same time as Viewstamped Replication, a protocol by Brian Oki and Barbara Liskov whose core is effectively identical. In fact, because Lamport’s paper describing Paxos was so odd, VR was published 10 years before Paxos—and VR covers view changes, which Lamport leaves underspecified! So why is Paxos so much more famous? Because “Paxos” is a great word; because Lamport talked about it widely before publication, and important systems implemented it; because despite its in-jokes, Lamport’s “Part-Time Parliament” paper is written less drily than the original Viewstamped Replication paper; and maybe because Lamport was right—the setting of the “Part-Time Parliament” was another “success at popularizing the consensus problem.” Nevertheless, “Viewstamped Replication Revisited” is one of the best papers to learn Paxos from, and I’m thinking VR every time I say Paxos.

Sub-protocols

Paxos’s complexity comes from its combination of four sub-protocols. They are:

  1. A fast core common-case protocol, where a stable leader forces consensus by proposing its value to all servers, who acknowledge it.

  2. A recovery protocol, where non-leader servers that miss a consensus determination (because of message drops or delays) can catch up to the consensus value.

  3. A leader change protocol, where if a leader fails, a backup takes over.

  4. A view change protocol, where a consensus group is healed after a failure (or proactively modified before a failure) by adding or removing servers.

Though sub-protocols 1–3 can use a single set of messages, these messages are serving different purposes at different times. Most presentations of Paxos or Paxos-like protocols initially skip the view change protocol (we will too, though it is critical in real-world deployments). Furthermore, some papers choose to present Paxos in the context of a single consensus decision, while others focus on the real-world context of state machine replication and describe messages that cover a sequence of consensus decisions. Nevertheless, the underlying protocol is the same, relying on majority quorums for safety, leaders for progress, and stable leaders for efficiency.

Core Paxos

We now turn to a description of the core common case of the Paxos protocol, which assumes that a distinguished working leader $$\ell$$ has been preassigned. After a description of the message processing steps, we turn to recovery and leader changes. We’ll delay view changes to later.

A summary of the core Paxos messages:

Leader $$\ell$$:

  • $$\forall j,\ \SEND{\ell}{j}{\text{PROBE}, r}$$
  • $$\forall j,\ \RECV{\ell}{j}{\text{PREPARE}, r[, \textit{ar}_j, \textit{av}_j]}$$
  • $$\forall j,\ \SEND{\ell}{j}{\text{PROPOSE}, r, \textit{av}_\ell}$$
  • $$\forall j,\ \RECV{\ell}{j}{\text{ACK}, r}$$
  • $$\forall j,\ \SEND{\ell}{j}{\text{DECIDE}, r}$$

Non-leader with no previous proposal (servers maintain $$\textit{pr}_s$$, $$\textit{ar}_s$$, and $$\textit{av}_s$$, and ignore stale PROBE and PROPOSE messages):

  • $$\RECV{s}{\ell}{\text{PROBE}, r}$$
  • $$\SEND{s}{\ell}{\text{PREPARE}, r}$$
  • $$\RECV{s}{\ell}{\text{PROPOSE}, r, v}$$
  • $$\SEND{s}{\ell}{\text{ACK}, r}$$

Non-leader with previous proposal:

  • $$\RECV{s}{\ell}{\text{PROBE}, r}$$
  • $$\SEND{s}{\ell}{\text{PREPARE}, r, \textit{ar}_s, \textit{av}_s}$$
  • $$\RECV{s}{\ell}{\text{PROPOSE}, r, v}$$
  • $$\SEND{s}{\ell}{\text{ACK}, r}$$

Decider (servers ignore DECIDE messages from future rounds):

  • $$\RECV{s}{j}{\text{DECIDE}, r}$$
  • $$\SEND{s}{\text{Nancy}}{\text{DECIDE}, \textit{av}_s}$$

In detail, the algorithm works as follows.

Servers

As before, we have consensus servers $$1 \leq s \leq N$$, plus a distinguished initializer, 0, and a distinguished recorder, Nancy. Paxos allows up to $$f = \lfloor (N-1)/2\rfloor$$ of the consensus servers to fail. If $$N\leq 2$$ none of them can fail; if $$3 \leq N \leq 4$$, one can fail; if $$5 \leq N \leq 6$$, two can fail; and so forth. 0 and Nancy never fail.
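The failure threshold can be checked in a few lines (a quick illustration, not part of the protocol):

```python
def max_failures(n: int) -> int:
    """Maximum number of the N consensus servers that may fail: floor((N-1)/2)."""
    return (n - 1) // 2

# Matches the text: N <= 2 tolerates no failures, N in {3, 4} tolerates one,
# N in {5, 6} tolerates two, and so forth.
assert max_failures(1) == 0 and max_failures(2) == 0
assert max_failures(3) == 1 and max_failures(4) == 1
assert max_failures(5) == 2 and max_failures(6) == 2
```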

Network model

The network model for Paxos is asynchronous unreliable messages (unlike CT consensus, which requires asynchronous reliable messages). In this model, messages not involving Nancy or the initializer may be arbitrarily delayed, reordered, duplicated, or dropped. Messages are not corrupted, however—every message that arrives was intentionally sent by a Paxos server.

Paxos achieves safety under all network conditions, but liveness (actually achieving consensus) requires that the network function “well enough.” For instance, in a 3-server network, a 100%-loss-rate partition between server 1 and servers 2 and 3 will always frustrate consensus, but a 90% loss rate on the same links will not (if you wait long enough, enough messages in some round will get through to achieve consensus).

Failure model

Paxos servers may fail, but in addition to fail-stop failures, they may fail and restart. There are two ways to think about restarted servers.

  1. A failed-and-restarted server might be equivalent to one that left the protocol for a while because of a selective network failure. The asynchronous unreliable network model already models this, so there is no need to describe restarting as part of the protocol.

  2. A failed-and-restarted server might be allowed to forget some of its internal state, so it requires separate explanation in the protocol.

The partial-forgetting model is theoretically pleasing, but practical implementations face difficulties. Only some server state can be forgotten, and any remembered state must be stored durably. But writing information durably to stable storage can be slower than sending a bunch of network messages! So this presentation follows the first model. For a more realistic treatment of partially-forgotten state, see “Viewstamped Replication Revisited”.

(For your information, in the partial-forgetting model, failed-and-restarted servers would have to remember their $$\textit{pr}_s$$, $$\textit{ar}_s$$, $$\textit{av}_s$$, and $$d_s$$ variables.)

Protocol phases

The core protocol is described in terms of phases, but the phases are meaningful only for the distinguished leader. Non-leaders just process incoming messages as they arrive, jumping between PROBE, PROPOSE, and DECIDE messages at will. (Contrast this with CT consensus, where non-leaders must stash and reprocess out-of-order messages, and non-leaders block in the propose phase.) Furthermore, any server can try to become leader and jump to the probe phase, as described below.

Initialization phase

  1. The initializer sends an INIT message to each consensus server including $$N$$, the leader $$\ell$$, and, in the leader’s case, some initial value $$\textit{iv}_\ell$$. Each consensus server also maintains four more variables: a probe round $$\textit{pr}_s$$; an acknowledgement round $$\textit{ar}_s$$; an acknowledged value $$\textit{av}_s$$; and a decision flag $$d_s$$. The probe round and the acknowledgement round increase monotonically; they record the latest round number heard from in a PROBE and the latest round number whose proposed value was acknowledged. Both $$\textit{pr}_s$$ and $$\textit{ar}_s$$ are initially zero, and $$\textit{av}_s$$ is initially empty. The decision flag says whether this server has decided (sent a DECIDE message to Nancy); it’s initially false.

  2. Each server waits to receive that INIT message.
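As a sketch, the per-server state set up during initialization might look like this in Python (field names follow the text’s variables; this is an illustration, not a real implementation):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class ServerState:
    pr: int = 0               # probe round: latest round seen in a PROBE or PROPOSE
    ar: int = 0               # acknowledgement round: latest round whose value we ACKed
    av: Optional[Any] = None  # acknowledged value from round ar (initially empty)
    decided: bool = False     # d_s: true once this server has sent DECIDE to Nancy

s = ServerState()
assert s.pr == 0 and s.ar == 0 and s.av is None and not s.decided
```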

Probe phase

In the probe phase, a new leader picks a new round number larger than any it has heard before and broadcasts that round number to the consensus servers. This phase solicits information about what the other servers know. The probe round number must be unique to this leader, meaning that any other leader would never select it for PROBE messages. A simple way to achieve this is for leader $$\ell$$ to choose round numbers that satisfy $$r \bmod N = \ell - 1$$.
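For instance, a leader could compute its next unique round number like this (a sketch of the $$r \bmod N = \ell - 1$$ convention; the function name is invented for illustration):

```python
def next_round(pr: int, leader: int, n: int) -> int:
    """Smallest round r > pr with r mod n == leader - 1 (leaders numbered 1..n)."""
    r = pr + 1
    while r % n != leader - 1:
        r += 1
    return r

# With N = 3, leader 2 owns rounds 1, 4, 7, ...: exactly those with r mod 3 == 1,
# which no other leader will ever pick.
assert next_round(0, 2, 3) == 1
assert next_round(1, 2, 3) == 4
assert next_round(5, 2, 3) == 7
```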

  1. The leader $$\ell$$ selects a unique round number $$r > \textit{pr}_\ell$$.

  2. The leader sends a PROBE message to all servers including itself: $$\forall j : 1 \leq j \leq N$$, $$\SEND{\ell}{j}{\text{PROBE}, r}$$.

  3. When server $$s$$ receives a $$\langle \text{PROBE}, r\rangle$$ message from some leader $$\ell$$, it compares that message’s $$r$$ to its local probe round $$\textit{pr}_s$$.

    • If $$r < \textit{pr}_s$$, this is a message from a previous round. It is ignored.

    • If $$r \geq \textit{pr}_s$$, this is a message from a new round (or a duplicate). Server $$s$$ sets $$\textit{pr}_s \gets r$$ and responds with a PREPARE message containing its latest acknowledged value, if any. This message has the form:

      $$ \begin{cases} \SEND{s}{\ell}{\text{PREPARE}, \textit{pr}_s} & \text{when\ } \textit{ar}_s = 0, \\ \SEND{s}{\ell}{\text{PREPARE}, \textit{pr}_s, \textit{ar}_s, \textit{av}_s} & \text{when\ } \textit{ar}_s > 0. \end{cases} $$
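The PROBE handler above can be sketched as follows (a toy model: the dictionary keys mirror the text’s $$\textit{pr}_s$$, $$\textit{ar}_s$$, $$\textit{av}_s$$ variables, and the tuple message format is invented for illustration):

```python
def on_probe(state: dict, r: int):
    """Handle <PROBE, r>: return the PREPARE reply, or None if the message is stale."""
    if r < state["pr"]:
        return None                      # message from a previous round: ignore
    state["pr"] = r                      # advance the probe round
    if state["ar"] == 0:
        return ("PREPARE", state["pr"])  # no previously acknowledged value
    return ("PREPARE", state["pr"], state["ar"], state["av"])

s = {"pr": 0, "ar": 0, "av": None}
assert on_probe(s, 5) == ("PREPARE", 5)
assert on_probe(s, 3) is None            # stale round ignored
s["ar"], s["av"] = 5, "x"
assert on_probe(s, 7) == ("PREPARE", 7, 5, "x")
```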

Propose phase

The leader waits for a quorum of PREPARE messages, selects a value, and proposes it to the consensus servers.

  1. The leader $$\ell$$ waits to receive $$\langle \text{PREPARE}, r[, \textit{ar}_j, \textit{av}_j] \rangle$$ messages from a quorum of more than $$N/2$$ servers that includes itself.

    $$\ell$$ first checks whether its leadership has been usurped. If in the meantime another server sent $$\ell$$ a PROBE or PROPOSE message for a later round, $$\ell$$ might have $$\textit{pr}_\ell > r$$; if this happens, its propose phase is stale and it gives up on it.

    Otherwise $$r = \textit{pr}_\ell$$, and it picks a proposal value $$v$$ from the quorum of responses.

    • If none of the received PREPARE messages have an acknowledged value $$\textit{av}_i$$, $$\ell$$ selects its initial value: $$v \gets \textit{iv}_\ell$$.

    • Otherwise, $$\ell$$ selects some message $$M_j = \langle \text{PREPARE}, r, \textit{ar}_j, \textit{av}_j\rangle$$ with the maximum $$\textit{ar}_j$$, and sets $$v \gets \textit{av}_j$$.

  2. The leader $$\ell$$ sends a PROPOSE message with this proposal value to all servers including itself: $$\forall j : 1 \leq j \leq N$$, $$\SEND{\ell}{j}{\text{PROPOSE}, \textit{pr}_\ell, v}$$.

  3. When server $$s$$ receives a $$\langle \text{PROPOSE}, r, v\rangle$$ message from some leader $$\ell$$, it again compares that message’s $$r$$ to its local probe round $$\textit{pr}_s$$.

    • If $$r < \textit{pr}_s$$, this is a message from a previous round. It is ignored.

    • If $$r \geq \textit{pr}_s$$, this message has a new value. Server $$s$$ stores and acknowledges that value, remembering the round in which it was found: $$\textit{pr}_s \gets r$$; $$\textit{ar}_s \gets r$$; $$\textit{av}_s \gets v$$; $$\SEND{s}{\ell}{\text{ACK}, \textit{ar}_s}$$.
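The leader’s value-selection rule from step 1 can be sketched as follows (a toy illustration; PREPARE replies are modeled as tuples, `(r,)` with no acknowledged value or `(r, ar_j, av_j)` with one):

```python
def choose_value(prepares, initial_value):
    """Pick the proposal value from a quorum of PREPARE replies.

    If any reply carries an acknowledged value, take the one with the
    maximum acknowledgement round ar_j; otherwise use the leader's
    initial value.
    """
    acked = [p for p in prepares if len(p) == 3]
    if not acked:
        return initial_value
    _, _, av = max(acked, key=lambda p: p[1])
    return av

assert choose_value([(7,), (7,)], "mine") == "mine"
assert choose_value([(7,), (7, 4, "a"), (7, 6, "b")], "mine") == "b"
```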

Acknowledge phase

The leader collects acknowledgements and potentially decides.

  1. The leader $$\ell$$ waits to receive $$\langle \text{ACK}, r\rangle$$ messages from a quorum of more than $$N/2$$ servers that includes itself, then decides and sends the decision round to all servers: $$\forall j : 1 \leq j \leq N$$, $$\SEND{\ell}{j}{\text{DECIDE}, r}$$.

Note that, unlike in the propose phase, there is no need for $$\ell$$ to check whether its leadership has been usurped! The correctness argument shows that if $$\ell$$ gets a quorum of ACKs for round $$r$$, then any later decision must agree with $$\ell$$’s proposal from that round.

Update, 3/15. An earlier draft said that the leader sent its acknowledgement round $$\textit{ar}_\ell$$ in its DECIDE message, not the decision round $$r$$. Both choices work, but it’s better to send the decision round—recipients of DECIDE messages privilege earlier rounds rather than later ones.

Decide phase

Finally, servers that hear a DECIDE message for their current acknowledgement round, or any previous round, can send a DECIDE to Nancy.

  1. When a server receives a $$\langle \text{DECIDE}, r\rangle$$ message, it checks its decide flag $$d_s$$ and acknowledgement round $$\textit{ar}_s$$.

    • If $$d_s$$ is true, this server already decided. The message is ignored.

    • Otherwise, if $$r > \textit{ar}_s$$, this message is from a future round—and there’s a chance that the value decided in that round differs from the one we have acknowledged. The message is ignored (but see below).

    • Otherwise, $$r \leq \textit{ar}_s$$: this message is from the current round or a previous round. Paxos ensures that if any server ever decides, then that decision is carried forward to all future rounds, so this server’s current acknowledged value must be the decided value. The server sends that value to Nancy, $$\SEND{s}{\text{Nancy}}{\text{DECIDE}, \textit{av}_s}$$, and sets $$d_s \gets \text{true}$$.
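The DECIDE handler can be sketched similarly (a toy model; dictionary keys mirror the text’s variables, and the message format is invented for illustration):

```python
def on_decide(state: dict, r: int):
    """Handle <DECIDE, r>: return the message for Nancy, or None if ignored."""
    if state["d"]:
        return None                  # already decided: ignore
    if r > state["ar"]:
        return None                  # future round: its value might differ from av
    state["d"] = True
    return ("DECIDE", state["av"])   # r <= ar: our acknowledged value was decided

s = {"ar": 5, "av": "x", "d": False}
assert on_decide(s, 9) is None       # future round ignored
assert on_decide(s, 5) == ("DECIDE", "x")
assert on_decide(s, 5) is None       # duplicates ignored once decided
```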

Cleanup phase

In CT consensus, a server effectively turns itself off after deciding. Not in Paxos! Every working server must remain alive and processing messages. A decided leader $$\ell$$ will not actively send further updates for this consensus instance, but other working servers may have missed the decision because of a burst of network loss. The consensus problem definition requires such stragglers to decide eventually, so decided servers must help bring them up to date. In this presentation, that happens through failure recovery. The consensus instance can terminate once all servers decide or a view change occurs.

Failure recovery

This protocol works as long as the designated leader doesn’t fail. But of course the designated leader can fail, or a network problem can isolate it so that it can no longer contact a quorum and make forward progress. And messages can be dropped, causing non-leaders to stall out.

An aspect of Paxos’s elegance is that the messages and message handlers above can handle all of these failure situations with one additional rule.

  1. Failure recovery. Every server involved in a consensus instance sets a failure detector when it begins the protocol. When a server’s failure detector goes off, it assumes the leadership role, resets its failure detector, and executes the protocol starting at the probe phase.

And that’s it. Taking over leadership restarts the protocol chatter, and PREPARE messages will eventually surface any already-decided value (or allow the new leader to choose its own value). If the presumptive leader’s initial round is too old to elicit responses, its failure detector will fire again and it will restart with a higher round; eventually it will hear back.

However, failure detectors for Paxos must satisfy some constraints. If the servers’ failure detectors are too eager, then no leader will get a chance to achieve consensus. Another bad scenario is when multiple servers’ failure detectors go off simultaneously, which can cause the servers to repeatedly interrupt one another’s majorities. A failure detector that does work is one with a reasonable minimum delay (at least 10 round-trip times) plus significant random jitter (for a total delay ranging from 10–20 round-trip times or more). The minimum delay gives each leader a chance to make progress, and the random jitter makes repeated leader conflicts unlikely.
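A timeout in the 10–20 RTT range described above might be computed like this (a sketch; the function name and parameters are invented for illustration):

```python
import random

def failure_timeout(rtt: float, rng: random.Random) -> float:
    """Randomized failure-detector delay: a 10-RTT minimum so the current
    leader gets a chance to make progress, plus up to 10 RTTs of jitter so
    that two servers rarely time out (and claim leadership) together."""
    return 10 * rtt + rng.uniform(0, 10 * rtt)

t = failure_timeout(0.01, random.Random(0))
assert 0.1 <= t <= 0.2   # always within the 10-20 RTT window
```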

Fast recovery

Although leader change can handle all Paxos failures, that doesn’t mean it should. Efficient versions of Paxos add more messages, or repurpose existing messages, to help servers recover faster without claiming leadership.

Mappings

The version of Paxos presented here uses our own terminology; the terms in other presentations usually differ. Here are mappings for several important ones.

“The Part-Time Parliament”

| This document | “Part-Time Parliament”, Appendix A |
| --- | --- |
| Concepts and roles | |
| Leader | President |
| Server | Priest |
| Round | Ballot number |
| Value | Decree |
| Variables | |
| $$\textit{pr}_i$$ | $$\textit{nextBal}[i]$$ |
| $$\textit{ar}_i$$ | $$\textit{prevBal}[i]$$ |
| $$\textit{av}_i$$ | $$\textit{prevDec}[i]$$; $$\textit{decree}[i]$$ (at leaders); and, if $$d_i$$ is true, $$\textit{outcome}[i]$$ |
| $$d_i$$ | $$\textit{outcome}[i] \neq \text{BLANK}$$ |
| Leader’s $$r$$ (also $$\textit{pr}_\ell$$) | $$\textit{lastTried}[i]$$ |
| $$i = \ell$$ | $$\textit{status}[i] \neq \textit{idle}$$ |
| Leader in probe phase | $$\textit{status}[i] = \textit{trying}$$ |
| Leader in prepare phase | $$\textit{status}[i] = \textit{polling}$$ |
| PREPARE messages received | $$\textit{prevVotes}[i]$$ |
| ACK messages received | $$\textit{voters}[i]$$ |
| (Not used) | $$\textit{quorum}[i]$$ |
| Sender of a PROBE or PROPOSE message | $$\textit{owner}(\textit{ballotNumber})$$ |
| Messages | |
| $$\langle \text{PROBE}, r\rangle$$ | $$\textit{NextBallot}(\textit{lastTried}[i])$$ |
| $$\langle \text{PREPARE}, \textit{pr}_i\rangle$$ | $$\textit{LastVote}(\textit{nextBal}[i], v)$$ when $$\textit{prevBal}[i] = -\infty $$ |
| $$\langle \text{PREPARE}, \textit{pr}_i, \textit{ar}_i, \textit{av}_i\rangle$$ | $$\textit{LastVote}(\textit{nextBal}[i], v)$$ when $$\textit{prevBal}[i] \neq -\infty $$ |
| $$\langle \text{PROPOSE}, \textit{pr}_i, v\rangle$$ | $$\textit{BeginBallot}(\textit{lastTried}[i], v)$$ |
| $$\langle \text{ACK}, \textit{ar}_i\rangle$$ | $$\textit{Voted}(\textit{prevBal}[i], i)$$ |
| $$\langle \text{DECIDE}, \textit{ar}_i\rangle$$ | $$\textit{Success}(\textit{outcome}[i])$$ |
Note that the $$\textit{Success}$$ message contains a decree (value), whereas the DECIDE message contains a round number. Our choice, which follows VR and Raft, makes the protocol more streamlined in the common case.

“Paxos Made Simple”

| This document | “Paxos Made Simple” |
| --- | --- |
| Concepts and roles | |
| Leader | Leader/distinguished proposer/distinguished learner |
| Server | Proposer/acceptor/learner |
| Round | Proposal number |
| Value | Value |
| Variables | |
| $$\textit{pr}_i$$ | “A promise never again to accept a proposal numbered less than $$n$$” |
| $$\textit{ar}_i$$, $$\textit{av}_i$$ | “The proposal with the highest number less than $$n$$ that it has accepted” |
| Messages | |
| $$\langle \text{PROBE}, r\rangle$$ | prepare request |
| $$\langle \text{PREPARE}, \textit{pr}_i[, \textit{ar}_i, \textit{av}_i]\rangle$$ | Response to prepare request |
| $$\langle \text{PROPOSE}, \textit{pr}_i, v\rangle$$ | accept request |
| $$\langle \text{ACK}, \textit{ar}_i\rangle$$ | Response to accept request |
| $$\langle \text{DECIDE}, \textit{ar}_i\rangle$$ | “…each acceptor, whenever it accepts a proposal, respond[s] to all learners” |

“Viewstamped Replication Revisited”

Unlike our presentation, viewstamped replication is described in terms of a replicated state machine: it decides a sequence of operations, rather than coming to consensus on one value. Furthermore, viewstamped replication uses its view change protocol to handle failures as well as view changes (additions or subtractions from the set of servers), and has a separate recovery protocol that lets stragglers catch up without claiming leadership (an optimization relative to our protocol). So the mapping isn’t clean. Nevertheless, there are correspondences.

| This document | “Viewstamped Replication Revisited” |
| --- | --- |
| Concepts and roles | |
| Leader | Primary |
| Server | Replica |
| Round | View number |
| Value | Log entry |
| Variables | |
| $$\textit{pr}_i$$ | The current view-number |
| $$d_i$$ | True for an operation iff commit-number ≥ that operation’s op-number |
| Messages | |
| $$\langle \text{INIT}, \textit{iv}_\ell\rangle$$ to leader | $$\langle \text{REQUEST}\ \textit{op}, c, s\rangle$$, where VR $$\langle\textit{op},c,s\rangle \approx$$ Paxos $$\textit{iv}_\ell$$ |
| $$\langle \text{PROPOSE}, \textit{pr}_i, v\rangle$$ | $$\langle \text{PREPARE}\ v, m, n, k\rangle$$, where VR $$v \approx$$ Paxos $$\textit{pr}_i$$ and VR $$m = \langle\textit{op},c,s\rangle \approx$$ Paxos $$v$$ |
| $$\langle \text{ACK}, \textit{ar}_i\rangle$$ | $$\langle \text{PREPAREOK}\ v, n, i\rangle$$, where VR $$v \approx$$ Paxos $$\textit{ar}_i$$ |
| $$\langle \text{DECIDE}, \textit{ar}_i\rangle$$ | $$\langle \text{COMMIT}\ v, k\rangle$$, where VR $$v \approx$$ Paxos $$\textit{ar}_i$$ (the VR PREPARE message also decides, via its $$k$$ component) |
| $$\langle \text{PROBE}, r\rangle$$ | $$\langle \text{STARTVIEWCHANGE}\ v, i\rangle$$ (VR makes view changes an explicitly exceptional path; in the absence of loss, the VR leader’s PREPARE messages contain enough information to subsume PROBE) |
| $$\langle \text{PREPARE}, \textit{pr}_i\rangle$$ | $$\langle \text{DOVIEWCHANGE}\ v, l, v', n, k, i\rangle$$ |

Correctness sketch: Safety

The correctness of Paxos follows from majority quorum overlapping and from monotonically increasing rounds—the “promise never again to accept” messages from previous rounds.

Say that some server receives a $$\langle \text{DECIDE}, r \rangle$$ message from the leader $$\ell_r$$ of round $$r$$. Well, that message could have been sent only after a majority quorum of servers sent $$\ell_r$$ an $$\langle \text{ACK}, r\rangle$$ message acknowledging the value $$V$$ that $$\ell_r$$ proposed. That value $$V$$ may have been sent to Nancy (which happens right after DECIDE is received), so safety requires that all future decisions equal $$V$$.

If no later round has proposed a value, then no conflicting decision could have occurred. But some later round might have proposed a value. So consider $$r'$$, the minimum round $$> r$$ that successfully proposes a value (i.e., sends at least one PROPOSE message).

Before proposing, round $$r'$$’s leader collected PREPARE messages from a majority quorum, and any two majority quorums overlap: some server $$s$$ both sent $$\langle \text{ACK}, r\rangle$$ in round $$r$$ and sent a PREPARE in round $$r'$$. Server $$s$$ must have ACKed before PREPARing, since after PREPARing it has $$\textit{pr}_s \geq r' > r$$ and ignores round $$r$$’s PROPOSE. Its PREPARE therefore reported $$\textit{ar}_s \geq r$$ with $$\textit{av}_s = V$$; and because no round strictly between $$r$$ and $$r'$$ proposed a value, no PREPARE in the quorum reported a different value with a higher acknowledgement round. The maximum-$$\textit{ar}$$ selection rule in the propose phase thus forces round $$r'$$ to propose $$V$$.

Applying the same argument inductively to each later proposing round shows that all future rounds agree with the decision in round $$r$$.

References

The Part-Time Parliament. Leslie Lamport. ACM TOCS 16(2), May 1998. (ACM link; author’s personal history)

Paxos made simple. Leslie Lamport. ACM SIGACT News 32(4), Dec. 2001. (author’s personal history)

Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems. Brian M. Oki and Barbara H. Liskov. Proc. PODC ’88, Jan. 1988.

Viewstamped replication revisited. Barbara Liskov and James Cowling. MIT-CSAIL-TR-2012-021, July 2012.

In search of an understandable consensus algorithm [i.e., Raft]. Diego Ongaro and John Ousterhout. Proc. USENIX ATC ’14, June 2014. (authors’ extended version)

[Paxos code handout.] Robert Morris, 2014.