Multi-Paxos

Multi-Paxos, or iterated Paxos, brings Paxos to state machine replication by coming to consensus on a sequence of values, rather than one value. Viewstamped replication and Raft are Multi-Paxos algorithms. But how does Multi-Paxos work, and how is it more optimizable than single-decree Paxos (the “Synod protocol”)?

Sequence consensus

We start with a new problem definition, sequence consensus. The consensus problem decides on one value; sequence consensus decides on a sequence of values. This is an obvious extension, but it’s good to describe it precisely.

Each server $$s$$ is given an infinite sequence of initial values $$\textit{IV}_s = [\textit{iv}_s[0], \textit{iv}_s[1], \dots]$$. Positions in this sequence are called slots.
Each working server sends Nancy an infinite sequence of decision messages. Call these Nancy-messages. Each Nancy-message contains a sequence of values, and each Nancy-message sent by a server must be longer than any previous Nancy-message sent by that server.
For any slot $$i \geq 0$$:
1. Each working server must eventually send a Nancy-message that is longer than $$i$$ (termination).
2. Every Nancy-message longer than $$i$$ must have the same value at slot $$i$$ (agreement).
3. If all servers have the same initial value for slot $$i$$, then the value in Nancy-messages at slot $$i$$ must equal that initial value (unanimity).

Repeated Paxos

Now that we have a problem definition, we can solve it. This section describes a Paxos-based solution to the sequence consensus problem. Differences from single-decree Paxos are highlighted.

This protocol is not yet optimized and it is not yet Multi-Paxos: it has no stable leader across rounds and makes no effort to minimize message lengths. Our next description introduces the stable-leader optimization.

Initialization phase

Each consensus server maintains five variables: $$\textit{IV}_s$$, its initial value sequence; $$\textit{pr}_s$$, its probe round (initially 0); $$\textit{ar}_s$$, its acknowledgement round (initially 0); $$\textit{AV}_s$$, its acknowledged value sequence (initially empty); and $$\textit{DV}_s$$, its decided value sequence (initially empty).

Initialization is otherwise the same as single-decree Paxos.

Probe phase

In the probe phase, a new leader picks a new round number larger than any it has heard before and broadcasts that round number to the consensus servers. The probe round number must be unique to this leader.

The leader $$\ell$$ selects a unique round number $$r > \textit{pr}_\ell$$.
The leader sends PROBE to all servers including itself: $$\forall j : 1 \leq j \leq N$$, $$\SEND{\ell}{j}{\text{PROBE}, r}$$.
When server $$s$$ receives $$\langle \text{PROBE}, r\rangle$$ from some leader $$\ell$$:
- If $$r < \textit{pr}_s$$, this is a message from a previous round. It is ignored.
- If $$r \geq \textit{pr}_s$$, server $$s$$ sets $$\textit{pr}_s \gets r$$ and responds with its acknowledged value sequence: $$\SEND{s}{\ell}{\text{PREPARE}, \textit{pr}_s, \textit{ar}_s, \textit{AV}_s}$$.

Propose phase

The leader waits for a quorum of PREPARE messages, selects a value sequence, and proposes it to the consensus servers.

The leader $$\ell$$ waits to receive $$\langle \text{PREPARE}, r, \textit{ar}_j, \textit{AV}_j\rangle$$ messages from a quorum of more than $$N/2$$ servers that includes itself.

If $$r < \textit{pr}_\ell$$, then $$\ell$$’s leadership has been usurped; it gives up on its propose phase.

Otherwise it finds some message $$M_j = \langle \text{PREPARE}, r, \textit{ar}_j, \textit{AV}_j\rangle$$ with the maximum $$\textit{ar}_j$$. It then selects $$V$$, the value sequence to propose. $$V$$ must begin with $$\textit{AV}_j$$, the value sequence in $$M_j$$, but the leader can append zero or more elements of its initial value sequence $$\textit{IV}_\ell$$ at the end:

|V| \geq |\textit{AV}_j| \text{\ and\ } V[i] = \begin{cases} \textit{AV}_j[i] & \text{if\ } i < |\textit{AV}_j|, \\ \textit{IV}_\ell[i] & \text{otherwise.} \end{cases}

(Liveness requires that leaders continually extend the sequence, so $$V$$ will generally be longer than $$\textit{AV}_j$$.)
The leader $$\ell$$ sends a PROPOSE message with this proposal value sequence to all servers including itself: $$\forall j : 1 \leq j \leq N$$, $$\SEND{\ell}{j}{\text{PROPOSE}, \textit{pr}_\ell, V}$$.
When server $$s$$ receives a $$\langle \text{PROPOSE}, r, V\rangle$$ message from some leader $$\ell$$, it again compares that message’s $$r$$ to its local probe round $$\textit{pr}_s$$.
- If $$r < \textit{pr}_s$$, this is a message from a previous round. It is ignored.
- If $$r \geq \textit{pr}_s$$, this message has a new value sequence. Server $$s$$ stores and acknowledges it, remembering the round in which it was found: $$\textit{pr}_s \gets r$$; $$\textit{ar}_s \gets r$$; $$\textit{AV}_s \gets V$$; $$\SEND{s}{\ell}{\text{ACK}, \textit{ar}_s}$$.

Note that step 3 might shorten an acknowledged value sequence! This can happen if a leader in a previous round failed to replicate its value sequence to a majority quorum. For instance, say that 9 slots have been decided, but server 3’s failure detector goes off because it has been temporarily partitioned. Server 3 enters round 3 and proposes a sequence with 12 slots, but only server 3 itself processes that PROPOSE. In the meantime, some other leader enters round 4 and proposes a sequence with 10 slots. When the partition heals, server 3 receives the round-4 PROPOSE; this is larger than its round, so it replaces its own 12-slot sequence with the 10-slot sequence. This is safe because slots 10 and 11 were never acknowledged by a majority, so they could not have been decided.

Acknowledge phase

The leader collects acknowledgements and potentially decides.

The leader $$\ell$$ waits to receive $$\langle \text{ACK}, r\rangle$$ messages from a quorum of more than $$N/2$$ servers that includes itself. It then decides, and sends the decision round to all servers.¹ It also includes $$|V|$$, the length of the proposal value sequence it sent in $$r$$’s propose phase. $$\forall j : 1 \leq j \leq N$$, $$\SEND{\ell}{j}{\text{DECIDE}, r, \htmlClass{hl}{|V|} }$$.

As in single-decree Paxos, a leader may respond to ACKs even if its leadership has been usurped ($$\textit{pr}_\ell > r$$).

Decide phase

Finally, servers that hear a DECIDE message for their current acknowledgement round can send a DECIDE to Nancy.

When a server receives a $$\langle \text{DECIDE}, r, \htmlClass{hl}{|V|}\rangle$$ message, it checks its acknowledgement round $$\textit{ar}_s$$ and decided value sequence $$\textit{DV}_s$$.
- If $$ |V| \leq |\textit{DV}_s| $$, this server has already decided all slots covered by this DECIDE message. The message is ignored.
- If $$r > \textit{ar}_s$$, the message is from a future round; it is ignored.
- Otherwise, this message is from the current round or a previous round, and it contains at least one previously-undecided slot. The server sets $$\textit{DV}_s$$ to the decided prefix of $$\textit{AV}_s$$, which is its first $$|V|$$ elements: $$\textit{DV}_s \gets \textit{AV}_s[0,\dots,|V|-1]$$. Then it sends that value sequence in a Nancy-message: $$\SEND{s}{\text{Nancy}}{\text{DECIDE}, \htmlClass{hl}{\textit{DV}_s}}$$.

Failure recovery and extension

The failure recovery rule in repeated Paxos is familiar:

Failure recovery. Every server involved in a consensus instance sets a failure detector when it begins the protocol. When a server’s failure detector goes off, it assumes the leadership role, resets its failure detector, and executes the protocol starting at the probe phase.

In single-decree Paxos, a server turns off its failure detector after deciding on a value. In repeated Paxos, however, each server resets its failure detector whenever a decision is reached (at the end of the decide phase). When decided servers restart the protocol, their propose phases will lengthen the previously-decided value sequence and induce longer Nancy-messages, as the sequence consensus problem requires.

Correctness sketch: Safety

Say that some server $$s$$ receives a $$\langle \text{DECIDE}, r, |V| \rangle$$ message from the leader $$\ell_r$$ of round $$r$$, which had proposal value sequence $$V$$. Safety requires that both $$s$$’s acknowledged value sequence $$\textit{AV}_s$$ and all future Nancy-messages agree with $$V$$ in their first $$|V|$$ positions.

Consider $$r'$$, the minimum round $$> r$$ that successfully proposes a value (i.e., sends at least one PROPOSE message).

The members of the majority quorum that ACKed round $$r$$ set $$\textit{ar}_i = r$$ and $$\textit{AV}_i = V$$ (though, because of message delays, it’s possible that there was no instant when the whole quorum had the same $$\textit{ar}_i$$).
The leader of round $$r'$$ could have sent a PROPOSE message only after receiving PREPARE messages from a majority quorum of servers.
That majority must overlap with the ACK majority from round $$r$$. And furthermore, all servers in that overlap must have ACKed round $$r$$ before sending a PREPARE for round $$r'$$. (Any server that PREPAREs for $$r'$$ must ignore later PROPOSEs for smaller rounds.)
And that means that the leader of round $$r'$$ must have heard about $$V$$ in one of the PREPARE messages, and it must therefore have selected value sequence $$V$$ as the prefix of its own proposal.

Apply induction and you can show that all future rounds will agree with the decision sequence in round $$r$$. As for $$\textit{AV}_s$$, there are two possibilities. If $$s$$ assigned $$\textit{AV}_s$$ in round $$r$$, it will exactly equal $$V$$. Otherwise, $$s$$ assigned it in a later round; but as we showed, all future leaders’ proposal value sequences will have prefixes that match $$V$$.

Multi-Paxos

Repeated Paxos shows how repeated executions of Paxos can solve the sequence consensus problem. But assuming a relatively stable leader lets us optimize the protocol further. If a stable leader can remain in one round while extending the decision sequence, many PROBE-PREPARE message exchanges can be avoided. This gives us the classic Multi-Paxos algorithm.

Differences from repeated Paxos are highlighted. They all concern the central change, which is that in Multi-Paxos, two messages with the same round and same message type might concern different consensus decisions on different slots. One way to think about it: in repeated Paxos, each round number involves one wave of messages that decides on one set of slots; but in Multi-Paxos, many waves of messages with the same round number decide different sets of slots. This means the protocol must in more places make decisions based on the pair of round number and sequence length, rather than just round number; and more messages must contain sequence lengths to expose this information.

Consensus servers maintain the same variables as in repeated Paxos.

The initialization phase is the same as repeated Paxos.

Probe phase

The probe phase is the same as repeated Paxos.

Propose phase

Both PROPOSE computation and PROPOSE handling must be updated to consider sequence lengths.

The leader $$\ell$$ waits to receive $$\langle \text{PREPARE}, r, \textit{ar}_j, \textit{AV}_j\rangle$$ messages from a quorum of more than $$N/2$$ servers that includes itself.

If $$r < \textit{pr}_\ell$$, then $$\ell$$’s leadership has been usurped; it gives up on its propose phase.

Otherwise it finds the subset of received $$\langle \text{PREPARE}, r, \textit{ar}_j, \textit{AV}_j\rangle$$ messages with maximal $$\textit{ar}_j$$; and from those, it picks some message $$M_j$$ whose $$\textit{AV}_j$$ is longest. (This is necessary because round $$\textit{ar}_j$$ might have been extended multiple times; it is safe because the same leader proposed all value sequences with the same $$\textit{ar}$$ and ensured that all such sequences agree on every common slot.) $$\ell$$ then selects $$V$$ by appending zero or more elements of its initial value sequence to the chosen $$\textit{AV}_j$$:

|V| \geq |\textit{AV}_j| \text{\ and\ } V[i] = \begin{cases} \textit{AV}_j[i] & \text{if\ } i < |\textit{AV}_j|, \\ \textit{IV}_\ell[i] & \text{otherwise.} \end{cases}
The leader $$\ell$$ sends a PROPOSE message to all servers as in repeated Paxos.
When server $$s$$ receives a $$\langle \text{PROPOSE}, r, V\rangle$$ message from some leader $$\ell$$:
- If $$r < \textit{pr}_s$$, or $$r = \textit{ar}_s$$ and $$|V| < |\textit{AV}_s|$$, this is an old message. It is ignored.
- Otherwise, it is stored and acknowledged. The acknowledgement includes the length of the acknowledged value sequence. $$\textit{pr}_s \gets r$$; $$\textit{ar}_s \gets r$$; $$\textit{AV}_s \gets V$$; $$\SEND{s}{\ell}{\text{ACK}, \textit{ar}_s, \htmlClass{hl2}{|\textit{AV}_s|}}$$.

Acknowledge phase

The leader collects acknowledgements and potentially decides.

The leader $$\ell$$ waits to receive $$\langle \text{ACK}, r, \htmlClass{hl2}{w_k}\rangle$$ messages with possibly-different acknowledged lengths $$w_k$$ from a quorum of more than $$N/2$$ servers that includes itself. It then decides on the prefix understood by the quorum: $$\forall j : 1 \leq j \leq N$$, $$\SEND{\ell}{j}{\text{DECIDE}, r, \htmlClass{hl2}{\min w_k} }$$.

Note that ACKs from different PROPOSE messages within the same round may be combined into a single quorum, as long as all ACKs in the quorum are from different servers.

Extension phase

A leader can keep proposing new value sequences until it is usurped.

A leader who has sent at least one PROPOSE in round $$r$$ and has not been usurped can at any time extend its previous proposal by appending more elements of $$\textit{IV}_\ell$$, its initial value sequence. The resulting new proposal value sequence $$V'$$ must contain any previously-proposed value sequence from the same round as a prefix. The messages for an extension PROPOSE are $$\forall j : 1 \leq j \leq N$$, $$\SEND{\ell}{j}{\text{PROPOSE}, \textit{pr}_\ell, V'}$$. Servers respond to this PROPOSE just as they would to a normal PROPOSE, modulo the length check introduced above.
After sending an extension PROPOSE, the leader can reset its failure detector and re-enter the acknowledge phase. In fact, leaders’ extension, acknowledge, and decide phases can interleave arbitrarily: a leader can send a new extension PROPOSE before collecting ACKs from the previous one, and the $$\min w_k$$ rule in the acknowledge phase correctly handles quorums that mix ACKs from different extensions.

Decide phase

When a server receives a $$\langle \text{DECIDE}, r, \htmlClass{hl2}{w}\rangle$$ message, it checks its acknowledgement round $$\textit{ar}_s$$, acknowledged value sequence $$\textit{AV}_s$$, and decided value sequence $$\textit{DV}_s$$.
- It first sets $$w' \gets \min\ \{w, |\textit{AV}_s|\}$$. $$w'$$ represents decidable slots for which this server has values; message loss during an extension phase might mean this server hasn’t yet heard about all $$w$$ decided values, even though it’s in the same round as the leader. The server can only decide on the slots for which it has a value. Future PROPOSE and DECIDE messages (or failure recovery) will eventually cover the remaining slots.
- If $$ \htmlClass{hl2}{w'} \leq |\textit{DV}_s| $$ or $$r > \textit{ar}_s$$, the message is ignored.
- Otherwise, the server sets $$\textit{DV}_s \gets \textit{AV}_s[0,\dots,\htmlClass{hl2}{w'}-1]$$ and sends that value sequence in a Nancy-message: $$\SEND{s}{\text{Nancy}}{\text{DECIDE}, \textit{DV}_s}$$.

Failure recovery

Failure recovery is the same as in repeated Paxos.

Correctness sketch: Safety

The correctness argument for Multi-Paxos is the same as for repeated Paxos, with one addition. In repeated Paxos, all servers that ACK in a given round receive the same value sequence $$V$$, so all messages with the maximum $$\textit{ar}_j$$ contain the same value sequence. In Multi-Paxos, extension PROPOSEs mean that two servers with the same $$\textit{ar}_j$$ may have different-length acknowledged value sequences; we must ensure that all values that had been decided in round $$\textit{ar}_j$$ are included in the sequence we pick. The answer is simply to pick the longest sequence of those with maximum $$\textit{ar}_j$$. No slot beyond this sequence could have been decided, because of majority quorums; and there can be no conflict with other sequences in the same round, because subsequent PROPOSEs sent by the same leader within the same round must be extensions, never overwrites.

Message summary

Paxos messages	Repeated Paxos messages	Multi-Paxos messages
$$\text{SEND}_{\ell \to j}(\text{PROBE}, r)$$ $$\text{RECV}_{\ell \gets j}(\text{PREPARE}, r[, \textit{ar}_j, \textit{av}_j])$$ $$\text{SEND}_{\ell \to j}(\text{PROPOSE}, r, \textit{av}_\ell)$$ $$\text{RECV}_{\ell \gets j}(\text{ACK}, r)$$ $$\SEND{\ell}{j}{\text{DECIDE}, r}$$ $$\SEND{j}{\text{Nancy}}{\text{DECIDE}, \textit{av}_j}$$	$$\text{SEND}_{\ell \to j}(\text{PROBE}, r)$$ $$\text{RECV}_{\ell \gets j}(\text{PREPARE}, r, \textit{ar}_j, \htmlClass{hl}{\textit{AV}_j})$$ $$\text{SEND}_{\ell \to j}(\text{PROPOSE}, r, \htmlClass{hl}{\textit{AV}_\ell})$$ $$\text{RECV}_{\ell \gets j}(\text{ACK}, r)$$ $$\SEND{\ell}{j}{\text{DECIDE}, r, \htmlClass{hl}{\|V\|}}$$ $$\SEND{j}{\text{Nancy}}{\text{DECIDE}, \htmlClass{hl}{\textit{DV}_j}}$$*	$$\text{SEND}_{\ell \to j}(\text{PROBE}, r)$$ $$\text{RECV}_{\ell \gets j}(\text{PREPARE}, r, \textit{ar}_j, \htmlClass{hl}{\textit{AV}_j})$$ $$\text{SEND}_{\ell \to j}(\text{PROPOSE}, r, \htmlClass{hl}{\textit{AV}_\ell})$$* $$\text{RECV}_{\ell \gets j}(\text{ACK}, r, \htmlClass{hl2}{\|\textit{AV}_j\|})$$* $$\SEND{\ell}{j}{\text{DECIDE}, r, \htmlClass{hl2}{\min w_k}}$$* $$\SEND{j}{\text{Nancy}}{\text{DECIDE}, \htmlClass{hl}{\textit{DV}_j}}$$*

Stars indicate repetition: repeated and Multi-Paxos send multiple Nancy-messages per server, not just one; and Multi-Paxos sends multiple PROPOSE, ACK, and DECIDE messages per round, not one per round.

Truncation

As we move from Paxos to Multi-Paxos, messages include more information about sequence lengths—first DECIDE, then ACK. This pattern continues with further optimizations. For instance, Multi-Paxos is seriously redundant: all slot values are sent with every PREPARE and PROPOSE message. Instead, servers could track each other’s knowledge of the decisions made so far. The leader’s PROBE messages could contain $$|\textit{DV}_\ell|$$, the length of its decided value sequence; a PREPARE message sent in response need only include the segment of $$\textit{AV}_j$$ beyond $$|\textit{DV}_\ell|$$. If non-leaders include their $$|\textit{DV}_j|$$ values in PREPARE and ACK messages, leaders can send tailored PROPOSE messages that include just the relevant suffix of the proposal value sequence $$V$$. Extension PROPOSE messages can be tailored even further: there’s no need to repeat a slot that a given non-leader has acknowledged in the same round via $$|\textit{AV}_j|$$ in some ACK. This information also supports log truncation. Once a server knows all servers have $$|\textit{DV}_k| \geq W$$, it can truncate its acknowledged value sequence anywhere before slot $$W$$; since all servers agree that the first $$W$$ values have been decided, the modified protocol will never include those values in a message. Leaders can compute $$W$$ from the $$|\textit{DV}_j|$$ values in PREPARE and ACK messages, and then share $$W$$ via PROPOSE and/or DECIDE messages.

Invariants

Here are some invariants about Multi-Paxos that a careful implementation might check.

$$|\textit{DV}_s|$$ increases monotonically.
The length of the value sequence in any non-ignored PREPARE or PROPOSE message must be $$\geq |\textit{DV}_s|$$.
Within a round, $$|\textit{AV}_s|$$ increases monotonically.
$$|\textit{AV}_s| \geq |\textit{DV}_s|$$.

Some others are inherited from single-decree Paxos, such as that $$\textit{pr}_s \geq \textit{ar}_s$$.

It is better to send the decision round rather than the current acknowledgement round, which might be later. In the decide phase, servers ignore DECIDE messages from unknown future rounds. ↩︎