Failure detectors: Chandra-Toueg consensus
The impossibility proof
Fischer-Lynch-Paterson impossibility is both true and wildly unintuitive. Given that consensus is impossible with even one failure, how does Google even work?? Maybe Google can’t work and this is all a dream!
Well, Google can work, because the FLP result is founded on computation and communication models: deterministic computation and asynchronous reliable communication. Neither of these models seems aggressive or limiting at first glance, but in fact deterministic computation is quite strict. A deterministic server’s behavior can only depend on the messages it has received. It has no access to any other source of data or ground truth: no clock, no randomness, no nothing. And that limitation is the core of the impossibility proof.
If we allow other sources of data into the computation model, we can build consensus algorithms that survive failures. Two famous algorithms attain consensus with stop failures by using different oracles, meaning sources of data other than server-to-server communication. Chandra-Toueg consensus (link) introduces a kind of oracle called a failure detector; though unspecified in the paper, in most systems failure detectors are an abstraction of time. Ben-Or consensus (link) uses a different oracle, namely randomness. We’ll discuss Chandra-Toueg consensus first.
A cool thing about CT consensus, and Ben-Or consensus, is that although they aren’t practical—they send lots of messages and are designed for proofs, rather than for implementation gracefulness—they exhibit many of the features on which practical protocols like Paxos and Raft rely.
Problem statement
Chandra-Toueg consensus achieves consensus in a system of $$N$$ servers, communicating over asynchronous reliable point-to-point channels, where at most $$f = \lfloor (N-1)/2 \rfloor$$ are allowed to fail. Servers fail by stopping: they send no further messages. Consensus is achieved when all working (non-failed) servers send Nancy a color (termination). This color must be the same (agreement), and it must equal the initial color of at least one of the system servers, which includes servers that have since failed (validity). Each server has a failure detector component with properties described below.
In addition to $$f = \lfloor (N-1)/2 \rfloor$$, several steps of the consensus algorithm use a strict majority quorum $$Q = \lceil (N+1)/2 \rceil$$. (This is the smallest integer that is larger than $$N/2$$, so “more than $$N/2$$ messages” is equivalent to “at least $$Q$$ messages.”) For odd numbers of servers, $$Q = f + 1$$, but for even numbers of servers, $$Q = f + 2$$. (This is why real consensus systems typically deploy with odd numbers of servers; even numbers pay for an extra server but get no additional failure resilience.) In both cases, though, $$Q = N - f$$. This means it’s always possible to form a strict majority quorum with working servers.
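These formulas are easy to check mechanically. Here’s a small Python sketch (the function names are mine, not part of the algorithm):

```python
def failure_bound(n):
    """Maximum stop failures tolerated with n servers: f = floor((n - 1) / 2)."""
    return (n - 1) // 2

def quorum(n):
    """Strict majority quorum: the smallest integer greater than n/2,
    i.e. Q = ceil((n + 1) / 2)."""
    return n // 2 + 1

# Q = N - f holds for every N, so a quorum of working servers always exists.
assert all(quorum(n) == n - failure_bound(n) for n in range(1, 100))
```

For $$N = 5$$, $$f = 2$$ and $$Q = 3 = f + 1$$; for $$N = 6$$, $$f$$ is still $$2$$ but $$Q$$ rises to $$4 = f + 2$$.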
Failure detectors
A failure detector is a per-server oracle that informs its server which other servers may have failed. Server $$s$$’s failure detector is modeled as a set of servers $$\mathscr{F}_s$$ that varies as the trace progresses. If $$j \in \mathscr{F}_s$$ at some point in the trace, that means that at that point, $$s$$ thinks $$j$$ has failed.
It’s not far off to think failure detectors are a fancy name for timeouts: server $$s$$ decides server $$j$$ has failed after a long time goes by with no messages from $$j$$. Real-world failure detectors are built on timeouts. There are often other components, such as general observations of network health, heartbeat messages (periodic messages that are sent even if nothing else is happening), and probabilistic reasoning; but timeouts are the foundation, because many different kinds of failure have timeouts as one observable effect.
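A bare-bones timeout-based detector might look like this sketch (the class and method names are my own invention; real detectors layer heartbeats, network-health observations, and adaptive timeouts on top):

```python
import time

class TimeoutFailureDetector:
    """A minimal timeout-based failure detector: server j is suspected
    once `timeout` seconds pass with no message from j."""
    def __init__(self, peers, timeout):
        self.timeout = timeout
        # Record the last time we heard from each peer.
        self.last_heard = {j: time.monotonic() for j in peers}

    def heard_from(self, j):
        # Any message from j refutes the suspicion and resets its timer.
        self.last_heard[j] = time.monotonic()

    def suspects(self):
        """The set F_s: peers this server currently believes have failed."""
        now = time.monotonic()
        return {j for j, t in self.last_heard.items() if now - t > self.timeout}
```

Note that `suspects()` can shrink as well as grow: a late message from a suspected peer removes it from the set, which is exactly the “changes as the trace progresses” behavior modeled above.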
Failure detectors break the deterministic computation model because they operate independently of message delivery. In deterministic computation, a server’s next action depends only on its previous SEND and RECV actions. We used this in the FLP proof when we showed that arbitrarily long sequences of actions on different servers could be reordered while keeping behavior indistinguishable. But a failure detector can change whenever it wants. (Intuitively, it’s based on a timeout, but in theoretical terms, it can change its mind for any reason.) This means we can’t make arguments based on trace reordering, because we can’t guarantee that reordered traces are indistinguishable: the failure detectors might produce different results.
The work by Chandra and Toueg that introduced failure detectors also introduced a taxonomy distinguishing detectors based on their completeness (how many failed servers are detected?) and accuracy (how many working servers are mistaken for failed?). The consensus algorithm below uses a specific kind of failure detector called the eventually strong failure detector, $$\OR{\Diamond \mathscr{S}}$$. This failure detector has strong completeness and eventual weak accuracy, which means:
-
(Strong completeness) There is a point in time after which every failed server is always $$\in \OR{\Diamond\mathscr{S}_s}$$ for every working server $$s$$.
-
(Eventual weak accuracy) There is a point in time after which some working server $$W$$ is never $$\in \OR{\Diamond \mathscr{S}_s}$$ for any working server $$s$$.
Other algorithms might need failure detectors with other properties, such as strong accuracy (every working server $$j$$ is never $$\in \OR{\Diamond \mathscr{S}_s}$$ for any working server $$s$$), but given how nondeterministic networked systems actually behave, weaker failure detectors are generally considered more realistic.
Aside: Chandra, Hadzilacos, and Toueg showed that any failure detector that solves consensus with sub-majority failures (i.e., $$f = \lfloor (N-1)/2 \rfloor$$) can be modeled by a weaker detector, $$\OR{\Diamond \mathscr{W}}$$, which has weaker completeness. In $$\OR{\Diamond \mathscr{S}}$$, all working servers are eventually aware of all failures; but in $$\OR{\Diamond \mathscr{W}}$$, that knowledge might be split up: there is a point in time after which every failed server $$s$$ has at least one working server $$j$$ that always has $$s \in \OR{\Diamond \mathscr{W}_j}$$. They show equivalence by building stronger failure detectors from $$\OR{\Diamond\mathscr{W}}$$ using distributed algorithms. But the CT consensus algorithm, as specified, requires $$\OR{\Diamond \mathscr{S}}$$.
CT consensus
CT consensus operates in rounds. In each round, a predetermined server is distinguished as leader (or coordinator). Leaders and non-leaders communicate differently; non-leaders send two messages and receive one, whereas leaders receive $$\approx 2N$$ messages and send $$\approx N$$.
In detail, the algorithm works as follows.
Servers
As before, we have consensus servers $$1 \leq s \leq N$$, plus a distinguished initializer, 0, and a distinguished recorder, Nancy. CT consensus allows up to $$f = \lfloor (N-1)/2\rfloor$$ of the consensus servers to fail. If $$N \leq 2$$, none of them can fail; if $$3 \leq N \leq 4$$, one can fail; if $$5 \leq N \leq 6$$, two can fail; and so forth. 0 and Nancy never fail.
Unexpected messages
In some phases of the protocol, servers wait for specific kinds of message. For example, in step 3, the round leader waits to receive a quorum of PREPARE messages. As it waits, it may receive unexpected messages, including old messages from previous rounds, or even messages from future rounds. Here’s how unexpected messages are handled:
-
Messages from previous rounds, or previous phases of the current round (e.g., PREPARE messages when the server is waiting for ACKs), are dropped.
-
Messages from future rounds, or future phases of the current round, are saved and processed at the relevant time.
-
$$\langle \text{DECIDE}, k \rangle$$ messages cause the receiving server to immediately decide on color $$k$$. This DECIDE shortcut follows these steps:
-
The server sends a DECIDE message to Nancy: $$\text{SEND}_{s \to \text{Nancy}}(\text{DECIDE}, k)$$.
-
It forwards the DECIDE message to all other consensus servers: $$\forall j \neq s : 1 \leq j \leq N$$, $$\text{SEND}_{s \to j}(\text{DECIDE}, k)$$.
-
It terminates, responding to no future messages.
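The shortcut’s three steps can be sketched as a function (the `send` callback and the function name are assumptions of this sketch, not part of the paper):

```python
def run_decide_shortcut(server_id, n, k, send):
    """DECIDE shortcut for server `server_id` in a system of servers 1..n.
    `send(src, dst, msg)` is an assumed transport hook."""
    # Step 1: report the decided color to Nancy.
    send(server_id, "Nancy", ("DECIDE", k))
    # Step 2: forward the decision to every other consensus server.
    for j in range(1, n + 1):
        if j != server_id:
            send(server_id, j, ("DECIDE", k))
    # Step 3: the caller marks this server terminated; it responds
    # to no future messages.
```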
-
Initialization phase
-
The initializer sends an INIT message to each consensus server including $$N$$ and the server’s initial color $$k_s$$. Each consensus server also maintains two more variables, a round number $$r_s$$ and a color round $$\textit{kr}_s$$, which are initially zero.
-
Each server waits to receive that INIT message.
Prepare phase
The remaining parts of the protocol execute in rounds. In a round’s prepare phase, servers select the current leader and send that leader their current color.
-
Each server $$s$$ increments its round, $$r_s \gets r_s + 1$$, and selects a leader for this round, $$\ell = (r_s \bmod N) + 1$$. Note that all servers in a round select the same leader.
-
Each server $$s$$ sends a PREPARE message to the current leader including the round, its color, and its color round: $$\text{SEND}_{s \to \ell}(\text{PREPARE}, r_s, k_s, \textit{kr}_s)$$. (The leader does this too, sending a message to itself. This simplifies the algorithm description, but a real implementation could optimize it.)
-
The leader $$\ell$$ (that is, the server $$s$$ whose ID satisfies $$s = \ell = (r_s \bmod N) + 1$$) waits to receive more than $$N/2$$ PREPARE messages for round $$r_\ell$$.
-
Of these, it selects some message $$M_j = \langle \text{PREPARE}, r_\ell, k_j, \textit{kr}_j \rangle$$ with the maximum $$\textit{kr}_j$$.
-
It then sets $$k_\ell \gets k_j$$ and $$\textit{kr}_\ell \gets \textit{kr}_j$$.
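The leader’s selection rule is a one-line maximum over the quorum it collected. A sketch, with PREPARE payloads encoded as `(round, color, color_round)` tuples (the encoding and function name are mine):

```python
def adopt_from_prepares(prepares):
    """Given a quorum of PREPARE payloads (round, color, color_round),
    adopt the color carried by the maximum color round."""
    _, k, kr = max(prepares, key=lambda msg: msg[2])
    return k, kr
```

If any PREPARE reflects an earlier successful proposal, its $$\textit{kr}$$ is maximal, so the leader is forced to carry that color forward; the majority argument later in the document relies on exactly this hook.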
-
Propose phase
Next comes the propose phase, where the leader forwards their proposed consensus to the other servers.
-
The leader $$\ell$$ sends a PROPOSE message to all servers including itself: $$\forall j : 1 \leq j \leq N$$, $$\text{SEND}_{\ell \to j}(\text{PROPOSE}, r_\ell, k_\ell)$$.
-
All servers wait to receive a PROPOSE message from the leader, or for their failure detector to report that the leader has failed, whichever comes first.
-
If server $$s$$ sees $$\text{RECV}_s(\text{PROPOSE}, r_s, k_\ell)$$ first, then $$s$$ sets $$k_s \gets k_\ell$$ and $$\textit{kr}_s \gets r_s$$ and acknowledges the leader: $$\text{SEND}_{s \to \ell}(\text{ACK}, r_s, \text{true})$$.
-
On the other hand, if $$\ell \in \OR{\Diamond \mathscr{S}_s}$$ becomes true first, then server $$s$$ chastises the leader: $$\text{SEND}_{s \to \ell}(\text{ACK}, r_s, \text{false})$$.
Either way, server $$s$$ moves on to the next phase.
-
This phase is the only place the failure detector appears, to detect a potential leader failure. It’s also the only place where a failure detector is needed. In the other phases, only the distinguished round leader calls RECV, and it waits for messages from an arbitrary majority of servers. Since CT consensus requires a majority of servers to keep working, and since all sent messages are eventually delivered, some majority of messages will always eventually arrive. But in the propose phase, servers wait to receive a message from one specific server, namely the round leader. Servers without a failure detector would get stuck here forever if the round leader happened to fail.
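From a non-leader’s point of view, the propose phase is a race between two events: the leader’s PROPOSE arriving, or the failure detector suspecting the leader. As a pure step function (the state dictionary and the event encoding are assumptions of this sketch):

```python
def propose_phase_step(state, event):
    """Non-leader behavior in the propose phase. `event` is either
    ("PROPOSE", r, k) from the leader or ("SUSPECT",) from the failure
    detector, whichever fires first. Returns the ACK to send."""
    r = state["r"]
    if event[0] == "PROPOSE":
        # Adopt the leader's color and remember the round we adopted it in.
        state["k"] = event[2]
        state["kr"] = r
        return ("ACK", r, True)
    # The failure detector suspected the leader: chastise it.
    return ("ACK", r, False)
```

Either branch produces an ACK, which is what lets a working leader always collect a quorum in the next phase.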
Acknowledge phase (leader only)
In the acknowledge phase, the leader sees if they have established consensus by successfully proposing to a majority of servers.
-
The leader $$\ell$$ waits to receive more than $$N/2$$ ACK messages for round $$r_\ell$$.
-
If more than $$N/2$$ of these messages are positive acknowledgements ($$\text{RECV}_\ell(\text{ACK}, r_\ell, \text{true})$$), then the server decides, using the steps from the DECIDE shortcut.
-
Otherwise, it cannot decide; it gives up leadership and moves on to the next round.
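The leader’s decision rule is a simple count over the ACKs it gathered. A sketch, with ACKs encoded as `(round, positive)` pairs (names mine):

```python
def leader_decides(acks, n):
    """Return True if the leader should run the DECIDE shortcut:
    more than n/2 of the collected ACKs are positive."""
    q = n // 2 + 1  # strict majority
    assert len(acks) >= q, "the leader first waits for a quorum of ACKs"
    positives = sum(1 for _, positive in acks if positive)
    return positives >= q
```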
-
Next round
Finally, all servers (except for ones that have decided) go back to the prepare phase and move to the next round, choosing a new leader. The algorithm continues until all working servers have decided.
Correctness: Convergence
Why does this algorithm work? There are some odd things about it—timeouts aren’t explicitly mentioned, and it seems like servers’ rounds could get spread arbitrarily far apart. But it does work, thanks to properties of the failure detector, and to properties of majorities. We first focus on convergence—the fact that at some point, no matter what weird failures occur or are imagined, all working servers will communicate in the same round, resulting in a decision.
Again, an eventually strong failure detector gives us:
-
(Strong completeness) There is a point in time after which every failed server is always $$\in \OR{\Diamond \mathscr{S}_s}$$ for every working server $$s$$.
-
(Eventual weak accuracy) There is a point in time after which some working server $$W$$ is never $$\in \OR{\Diamond \mathscr{S}_s}$$ for any working server $$s$$.
So let’s pick that point in time. At that point, the working servers might initially be in different rounds, but we can show all of them will eventually reach the same propose phase and wait to receive a PROPOSE message from some working leader $$W$$. This follows from the failure detector requirements.
The argument goes like this:
-
Thanks to strong completeness, no server gets stuck waiting for a failed leader. Every failed server shows up in every working server’s $$\OR{\Diamond \mathscr{S}_s}$$, so every working server will eventually give up on receiving that PROPOSE and will move on to the next round.
-
Servers never get stuck waiting for working leaders, thanks to reliable delivery: a working leader eventually sends its PROPOSE, and that message is always delivered.
-
Also thanks to reliable delivery, no working leader gets stuck in a round. Leaders wait for messages, not failure; messages are always delivered eventually; and every working server always sends each round’s leader a message, either a positive or negative acknowledgement, so at least $$Q$$ ACK messages will always arrive.
(The negative acknowledgements are critical, since they let a working leader move on to the next round even if other servers mistook it for failed. Would you have sent them, if you didn’t know the protocol? They do seem a little weird—the algorithm thinks the relevant leader has failed, so why bother sending them a message?)
Thus, once the critical time point has passed, working non-leaders and working leaders will keep advancing rounds until they decide and terminate the algorithm. (Before that point, some working servers might get stuck waiting for a server that they don’t yet know has failed.)
The cycle of leadership, which traverses all the servers, will soon reach a server $$W$$ that all the working servers know is working. This server must exist thanks to eventual weak accuracy. When leadership arrives there:
-
Every PREPARE message sent to that leader will be received (asynchronous reliable communication).
-
Every PROPOSE message it sends to the non-leaders will also be delivered (asynchronous reliable communication).
-
The non-leaders will wait to receive that PROPOSE message rather than moving on (eventual weak accuracy), so they will respond with a positive acknowledgement.
-
Every positive acknowledgement will be received (asynchronous reliable communication), and no negative acknowledgements will be received (fail-stop behavior).1
-
This makes at least $$Q$$ positive acknowledgements, so leader $$W$$ will decide!
Correctness: Majority
OK, so servers eventually decide. But what do they decide? Can we show that, despite failures, any two servers that do decide end up deciding the same thing?
The key property that enforces a single decision is overlapping quorums. Any server that makes a decision does so based on input from a strict majority of servers (that is, more than $$N/2$$ servers). But an important property of majorities is that any two majorities of a set must overlap: if $$A \subseteq X$$, $$B \subseteq X$$, $$\lvert A\rvert > \lvert X\rvert/2$$, and $$\lvert B\rvert > \lvert X\rvert/2$$, then $$A \cap B \neq \emptyset$$. (This is a counting argument: if the two subsets were disjoint, $$X$$ would have to contain $$\lvert A \rvert + \lvert B \rvert > \lvert X \rvert$$ elements.) Any future decision, including a decision involving servers that didn’t participate in the first decision, must take information from the first decision into account, because at least one server was involved in both.
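The counting argument is small enough to confirm exhaustively for small $$N$$ (a brute-force sanity check, not part of the proof; names mine):

```python
from itertools import combinations

def strict_majorities(n):
    """All subsets of {1, ..., n} with more than n/2 elements."""
    q = n // 2 + 1
    for size in range(q, n + 1):
        for subset in combinations(range(1, n + 1), size):
            yield set(subset)

# Any two strict majorities share at least one server.
for n in range(1, 7):
    assert all(a & b for a in strict_majorities(n) for b in strict_majorities(n))
```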
So let’s say in round $$R$$ some decision was made: at least $$Q$$ servers (a majority) positively acknowledged, adopting the leader’s color and setting $$\textit{kr}_s \gets R$$. (That’s the “ACK true” path.) That means the color proposed in round $$R$$ will always win. Consider the next round. Its leader waits for at least $$Q$$ PREPAREs before continuing, so by quorum overlap it will see at least one PREPARE with $$\textit{kr} = R$$, the maximum possible value, and will adopt that PREPARE’s color as its own. Every server that positively acknowledges in that round records the same color with a higher $$\textit{kr}$$. By induction, every later round’s PROPOSE carries the round-$$R$$ color, so every future decision equals the decision made in round $$R$$.
Correctness: Distribution
Finally, correctness requires that all working servers eventually decide. Question for the curious: How does the re-forwarding of DECIDE messages in the DECIDE shortcut achieve this? This document is already pretty long :)
-
Fail-stop behavior—Eventual weak accuracy only says that the working servers agree that $$W$$ is working. In a different failure model (for instance, imagine a network partition where only one direction of messages was dropped), failed servers might still be around sending messages, and they might mistakenly think $$W$$ had failed; if one of their misguided negative acknowledgements arrived first, $$W$$ might give up without deciding. But we are in the fail-stop failure model, so failed servers send nothing. ↩︎