Consensus

The consensus problem

Let’s play a game of consensus (or agreement).

Everyone in the front row is a server in a distributed system. They can send messages to each other and to a special distinguished node called Nancy. Here is their task:

  1. Each server is given an initial color, either red or blue.
  2. Each server must eventually send a message to Nancy with a chosen color. We call this deciding on a color. (This is the termination property.)
  3. All servers must decide the same color. (This is the agreement property.)
  4. If all servers have the same initial color, they must decide on that color. Otherwise they can choose either color—but they all must agree. (This is the unanimity property.)

Can you develop an algorithm that solves this consensus problem?
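Before trying, it helps to make the three properties concrete. Here is a minimal sketch of a checker for one finished run; the representations (colors as strings, one decision per server in a list) are my own illustrative assumptions, not part of the problem statement.

```python
def check_consensus(initial, decisions):
    """Check the consensus properties for one finished run.

    `initial`: each server's starting color; `decisions`: the color each
    server sent to Nancy. Termination is implicit: a finished run has one
    decision per server. Representations here are illustrative assumptions.
    """
    assert len(decisions) == len(initial)   # termination: every server decided
    assert len(set(decisions)) == 1         # agreement: all decided one color
    if len(set(initial)) == 1:              # unanimity: same inputs force
        assert decisions[0] == initial[0]   #   that color as the output
    return True

# A mixed-color run may decide either color, as long as all servers agree.
check_consensus(["red", "blue"], ["blue", "blue"])
```

Running the checker on a run where all servers started red but decided blue would trip the unanimity assertion.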

Incorrect algorithms

What’s wrong with these trivial algorithms for this problem?

  1. Every server sends Nancy the color red.

    This algorithm violates the unanimity property. It will decide the wrong color in one specific case: when every server is initially blue.

  2. Every server sends Nancy their initial color.

    If the servers have different initial colors, then Nancy will receive more than one decision—there is no consensus. This violates the agreement property.

  3. Every server twiddles its thumbs and never sends a message.

    The servers must eventually decide; this algorithm does not terminate. But at least it never decides incorrectly!

Computation and communication models

Before we develop an algorithm that does work, we need to know what model that algorithm will use.

Distributed algorithms are written for specific models of computation and of the network. For instance, an algorithm might assume the network always delivers all messages. It would almost certainly fail on an unreliable network, which is allowed to drop messages!

Today we’ll use a deterministic computation model and an asynchronous, reliable, point-to-point communication model. Here’s what that means.

Deterministic: Every action taken by a server depends only on the prior messages it received. Servers get information from the outside world only in the form of messages. For instance, servers have no access to a notion of time, and they have no access to sources of randomness. The only source of nondeterminism in the system is scheduling order (the order in which servers send messages, and the order in which sent messages are delivered to receive requests).

Point-to-point: Each message is sent from one server to another specific server, and can only be received by its intended recipient.

Asynchronous: Messages can be delivered out of order and may be delayed. This means that a server’s receive request might fail, returning “no message available”, even if a message for that server is waiting to be received. However, messages are never delayed indefinitely: if a server calls receive enough times, every waiting message will eventually be delivered.

Reliable: Sent messages are never dropped; every sent message can eventually be received exactly once.
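These four properties can be captured in a toy message layer. The class and method names below are illustrative assumptions, not from the text: a minimal Python sketch of an asynchronous, reliable, point-to-point network whose `recv` may return `None` even when a message is pending.

```python
import random

class Network:
    """Toy asynchronous, reliable, point-to-point message layer."""

    def __init__(self, seed=0):
        self.pending = {}               # recipient -> list of waiting messages
        self.rng = random.Random(seed)  # models nondeterministic scheduling

    def send(self, sender, recipient, message):
        # Point-to-point: only `recipient` can ever receive this message.
        self.pending.setdefault(recipient, []).append((sender, message))

    def recv(self, recipient):
        """May return None (asynchrony), but never drops a message (reliability)."""
        waiting = self.pending.get(recipient, [])
        if not waiting or self.rng.random() < 0.5:
            return None                 # "no message available"
        # Deliver an arbitrary waiting message: out-of-order delivery.
        return waiting.pop(self.rng.randrange(len(waiting)))
```

Because `recv` removes a delivered message from the queue, each message is received at most once; and since a waiting message is delivered with fixed probability on each call, repeated calls eventually deliver everything, matching the "not indefinitely delayed" requirement.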

How realistic is this model? Only somewhat. It is both easier and harder to deal with than real-world networks: easier because communication is reliable, but harder because server operation is totally deterministic and messages may be arbitrarily (though not indefinitely) delayed.

Traces

For reasoning purposes, we can identify the state of a distributed computation with a trace of its operations: a history of the execution steps of its servers.

A trace $$T$$ is an ordered sequence of operations $$T[0]\dots T[n]$$, each of which indicates a server either sending a message ($$\text{SEND}_{i \to j}(M)$$: server $$i$$ sends message $$M$$ to server $$j$$), receiving a message previously sent ($$\text{RECV}_i(M)$$: server $$i$$ receives message $$M$$), or attempting to receive and getting nothing back (a failed receive).

A trace may represent a full algorithm execution (ending when the last server sends its decision to Nancy) or a prefix of such an execution. We assume that each message is unique. Each server may start with special initial state; we model this with a distinguished initialization server, server 0, which begins every trace by sending an initialization message to each computation server. (We require that the network deliver a server’s initialization message before any other message to that server.)

Some notation is useful. $$T \cdot \textit{op}$$ and $$T \cdot \sigma$$ signify trace concatenation (adding an operation, or a sequence of operations, to the end of the trace). $$T \mid i$$ represents restriction to a specific server: $$T \mid i$$ contains, in order, those operations of $$T$$ that were performed by server $$i$$ (so all $$\text{SEND}_{i \to j}$$ and $$\text{RECV}_i$$ operations). Furthermore, if $$\mathscr{T}$$ is the set of all valid traces for some distributed algorithm, then $$\mathscr{T}$$ is prefix-closed: $$T \cdot \sigma \in \mathscr{T}$$ implies that $$T \in \mathscr{T}$$.
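Traces and restriction are easy to model concretely. In this sketch, the tuple encodings of operations are my own assumptions (the second element always names the acting server), and $$T \mid i$$ becomes a simple filter:

```python
# An operation is a tuple whose second element names the acting server:
# ("SEND", sender, recipient, message) or ("RECV", receiver, message).
# These tuple shapes are illustrative assumptions, not from the text.

def restrict(trace, i):
    """T | i: the operations of trace T performed by server i, in order."""
    return [op for op in trace if op[1] == i]

# A short trace: the initializer (server 0) sends INIT messages, which the
# two computation servers then receive.
T = [("SEND", 0, 1, "INIT-1"), ("SEND", 0, 2, "INIT-2"),
     ("RECV", 1, "INIT-1"), ("RECV", 2, "INIT-2")]
```

Restriction commutes with taking prefixes, which is one way to see why prefix-closure of $$\mathscr{T}$$ is natural: cutting a trace short only cuts each server's history short.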

Trace validity can be analyzed based both on algorithm and computation model and on network model. For instance, in the reliable point-to-point communication model, whenever a trace contains a message receipt, then that message was previously sent and hasn’t yet been received. (So if $$T \cdot \text{RECV}_i(M) \in \mathscr{T}$$, then $$\text{SEND}_{j \to i}(M) \in T$$ for some $$j$$, and furthermore $$\text{RECV}_i(M) \not\in T$$.1) Furthermore, we model asynchronous communication, and the distributed nature of the system, by allowing servers to execute in any order, subject to message restrictions. Whenever $$\textit{op}$$ and $$\textit{op}'$$ are operations on different servers that involve different messages, and $$T \cdot \textit{op} \cdot \textit{op}' \in \mathscr{T}$$, then $$T \cdot \textit{op}' \cdot \textit{op} \in \mathscr{T}$$ as well.

Deterministic computation has strong effects on trace validity. Deterministic algorithms only interact with the outside world through messages, so a server that has received a given set of messages always takes the same next step—the server’s execution can diverge only after receiving a different message. In trace terms, say $$T, T' \in \mathscr{T}$$ have identical restricted prefixes for some server: $$T \mid i = \sigma \cdot \textit{op}$$ and $$T' \mid i = \sigma \cdot \textit{op}'$$. Then $$\textit{op}$$ and $$\textit{op}'$$ must be compatible: either $$\textit{op} = \textit{op}'$$, or $$\textit{op}$$ and $$\textit{op}'$$ are $$\text{RECV}_i$$ operations that receive different messages (or fail differently).

Furthermore, when server behavior is deterministic, traces obtained by reordering operations involving different messages are indistinguishable in that no server can tell the difference between them. Technically, we say that traces $$T$$ and $$T'$$ are indistinguishable if, for any suffix $$\sigma$$,

\[ T \cdot \sigma \in \mathscr{T} \iff T' \cdot \sigma \in \mathscr{T}. \]

Consider operations $$\textit{op}$$ and $$\textit{op}'$$ on different servers that involve different messages (i.e., they aren’t a SEND and a RECV for the same message). Then for any trace $$T$$ of a deterministic algorithm, $$T \cdot \textit{op} \cdot \textit{op}'$$ and $$T \cdot \textit{op}' \cdot \textit{op}$$ are indistinguishable. (The two traces look identical from any server’s point of view: at any server $$i$$, $$(T \cdot \textit{op} \cdot \textit{op}') \mid i = (T \cdot \textit{op}' \cdot \textit{op}) \mid i$$. And in deterministic computation, a server’s next operation is always totally dependent on its history.) This swap operation can be repeated to indistinguishably reorder whole segments of operations.
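The per-server claim in parentheses can be checked mechanically. A small sketch, reusing an assumed tuple encoding of operations (the second element names the acting server):

```python
def restrict(trace, i):
    """T | i: the operations of trace T performed by server i, in order."""
    return [op for op in trace if op[1] == i]

# op and op2 act on different servers and involve different messages.
T = [("SEND", 0, 1, "INIT")]
op = ("RECV", 1, "INIT")
op2 = ("SEND", 2, 1, "PROPOSE")

T1 = T + [op, op2]   # T · op · op'
T2 = T + [op2, op]   # T · op' · op

# The swap is invisible to every server: identical restrictions everywhere.
for server in (0, 1, 2):
    assert restrict(T1, server) == restrict(T2, server)
```

The check would fail if `op` and `op2` were a SEND and a RECV of the same message: swapping them would put the receive before the send, which the reliable network model forbids.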

By-name consensus

We suggested this algorithm: choose the color of the server whose name is first in alphabetical order (assuming distinct names).

Servers

For an instance of consensus with $$N$$ servers, we reserve the following IDs:

  1. Server 0 is the initializer, which sends each consensus server its initial state.
  2. Servers $$1,\dots,N$$ are the consensus servers, which run the algorithm.
  3. Nancy is the decision recorder, to whom each consensus server sends its DECIDE message.

Modeling initialization and decision recording as explicit servers lets us simplify some descriptions. For instance, consensus servers don’t have any initial state outside of the message stream: the INIT messages give them their initial state. Nancy and the initializer never fail, and we assume, without loss of generality, that consensus servers always receive their INIT messages first. (The technical requirement is that traces differing only in INIT receive order are indistinguishable. We could enforce this in the algorithm by enqueuing messages received before INIT and processing them later.)

Initialization phase

  1. The initializer sends an INIT message to each consensus server including $$N$$, the number of servers; the receiver’s name $$\text{ID}_i$$; and its initial color $$k_i$$.

  2. Each consensus server waits until it receives that INIT message.

Broadcast phase

  1. Each server $$i$$ sends a PROPOSE message containing its ID and initial color to every server, including itself: $$\forall j \in \{1,\dots,N\}$$, $$\text{SEND}_{i \to j}(\text{PROPOSE}, \text{ID}_i, k_i)$$.

Resolution phase

  1. Each server $$i$$ receives $$N$$ PROPOSE messages.

  2. Of these, it selects the message $$M_j = \langle \text{PROPOSE}, \text{ID}_j, k_j\rangle$$ with the minimum ID.

  3. It decides on the corresponding color and sends it to Nancy: $$\text{SEND}_{i \to \text{Nancy}}(\text{DECIDE}, k_j)$$.
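The three phases can be run end-to-end in a toy simulation. Everything below (data representation, scheduling) is an illustrative sketch of the algorithm just described, not a faithful asynchronous execution: messages are delivered in a fixed order for brevity, though any delivery order yields the same decisions.

```python
def by_name_consensus(servers):
    """Run the by-name algorithm; `servers` maps server name -> initial color.

    Returns the DECIDE message each server sends to Nancy. Representations
    are illustrative assumptions, not from the text.
    """
    n = len(servers)
    # Initialization phase: INIT gives each server N, its name, and its color.
    inboxes = {name: [("INIT", n, name, color)]
               for name, color in servers.items()}
    # Broadcast phase: every server sends PROPOSE(name, color) to every
    # server, including itself.
    for name, color in servers.items():
        for recipient in servers:
            inboxes[recipient].append(("PROPOSE", name, color))
    # Resolution phase: after receiving N PROPOSE messages, each server
    # decides the color proposed by the alphabetically first name.
    decisions = {}
    for name, inbox in inboxes.items():
        proposals = [m for m in inbox if m[0] == "PROPOSE"]
        assert len(proposals) == n
        _, winner, color = min(proposals, key=lambda m: m[1])
        decisions[name] = ("DECIDE", color)   # sent to Nancy
    return decisions
```

On the example from the next section, where Alice starts red and Arathi starts blue, Alice's name sorts first, so both servers decide red.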

Example

Here’s an example trace of this algorithm with $$N = 2$$.

  1. $$\text{SEND}_{0 \to 1}(\text{INIT}, 2, \text{Alice}, \RD{\text{red}})$$
  2. $$\text{SEND}_{0 \to 2}(\text{INIT}, 2, \text{Arathi}, \BU{\text{blue}})$$
  3. $$\text{RECV}_1(\text{INIT}, 2, \text{Alice}, \RD{\text{red}})$$
  4. $$\text{RECV}_2(\text{INIT}, 2, \text{Arathi}, \BU{\text{blue}})$$
  5. $$\text{SEND}_{1 \to 1}(\text{PROPOSE}, \text{Alice}, \RD{\text{red}})$$
  6. $$\text{SEND}_{1 \to 2}(\text{PROPOSE}, \text{Alice}, \RD{\text{red}})$$
  7. $$\text{SEND}_{2 \to 1}(\text{PROPOSE}, \text{Arathi}, \BU{\text{blue}})$$
  8. $$\text{SEND}_{2 \to 2}(\text{PROPOSE}, \text{Arathi}, \BU{\text{blue}})$$
  9. $$\text{RECV}_1(\text{PROPOSE}, \text{Alice}, \RD{\text{red}})$$
  10. $$\text{RECV}_1(\text{PROPOSE}, \text{Arathi}, \BU{\text{blue}})$$
  11. $$\text{RECV}_2(\text{PROPOSE}, \text{Alice}, \RD{\text{red}})$$
  12. $$\text{RECV}_2(\text{PROPOSE}, \text{Arathi}, \BU{\text{blue}})$$
  13. $$\text{SEND}_{1 \to \text{Nancy}}(\text{DECIDE}, \RD{\text{red}})$$
  14. $$\text{SEND}_{2 \to \text{Nancy}}(\text{DECIDE}, \RD{\text{red}})$$

And here is an indistinguishable trace, obtained from the first by successive swaps.

  1. $$\text{SEND}_{0 \to 1}(\text{INIT}, 2, \text{Alice}, \RD{\text{red}})$$
  2. $$\text{RECV}_1(\text{INIT}, 2, \text{Alice}, \RD{\text{red}})$$
  3. $$\text{SEND}_{1 \to 1}(\text{PROPOSE}, \text{Alice}, \RD{\text{red}})$$
  4. $$\text{SEND}_{1 \to 2}(\text{PROPOSE}, \text{Alice}, \RD{\text{red}})$$
  5. $$\text{SEND}_{0 \to 2}(\text{INIT}, 2, \text{Arathi}, \BU{\text{blue}})$$
  6. $$\text{RECV}_2(\text{INIT}, 2, \text{Arathi}, \BU{\text{blue}})$$
  7. $$\text{RECV}_1(\text{PROPOSE}, \text{Alice}, \RD{\text{red}})$$
  8. $$\text{SEND}_{2 \to 1}(\text{PROPOSE}, \text{Arathi}, \BU{\text{blue}})$$
  9. $$\text{RECV}_1(\text{PROPOSE}, \text{Arathi}, \BU{\text{blue}})$$
  10. $$\text{SEND}_{1 \to \text{Nancy}}(\text{DECIDE}, \RD{\text{red}})$$
  11. $$\text{SEND}_{2 \to 2}(\text{PROPOSE}, \text{Arathi}, \BU{\text{blue}})$$
  12. $$\text{RECV}_2(\text{PROPOSE}, \text{Alice}, \RD{\text{red}})$$
  13. $$\text{RECV}_2(\text{PROPOSE}, \text{Arathi}, \BU{\text{blue}})$$
  14. $$\text{SEND}_{2 \to \text{Nancy}}(\text{DECIDE}, \RD{\text{red}})$$

Analyzing the by-name algorithm

This algorithm works! Because all messages are eventually received and servers don’t fail, all servers will eventually receive the message from the server with the alphabetically first name. They then perform the same computation (sorting by name) and decide upon the same color.

But what about failures? That’s another story.


  1. We assume without loss of generality that messages are always distinguishable—maybe they have sequence numbers used only for uniqueness. ↩︎