PBFT

This page describes a version of the Practical Byzantine Fault Tolerance protocol for Byzantine-fault-tolerant sequence consensus. Compared to the published version, it is more explicit, less optimized, and uses message terminology familiar to you from multi-Paxos.

Byzantine split-brain

Real-world failure behaviors go far beyond fail-stop. Byzantine fault tolerance is attractive because it handles any kind of failure, as long as $$\lceil (2N+1)/3\rceil$$ servers remain working. Nevertheless, many specific failure classes can be fixed ad hoc within a fail-stop framework, and the expense of full Byzantine fault tolerance (e.g., at least $$3f+1$$ servers, higher message authentication requirements) makes ad hoc solutions preferable when they suffice. For instance, a cosmic ray can flip a bit in a message in transmission; though this can cause a leader to propose different values to different servers—Byzantine behavior—checksums can catch unintentional bit flips with very high probability. If you’re only worried about cosmic rays, use checksums, not BFT.

But some failures require full BFT. The worst is Byzantine split-brain, where faulty servers intentionally send different messages to different participants to induce agreement violations. Considering how multi-Paxos might behave under Byzantine split-brain demonstrates this problem and motivates the design of PBFT.

Commitments must be shared

Paxos followers simply accept the leader’s proposals, so a Byzantine-split-brained leader could send different proposals to different participants, then tell them all to decide. Followers must confer to prevent this attack: no server can decide on a value until it has evidence of commitment from $$2f$$ other servers for the same value.

But network delays make a single commitment exchange insufficient. Let’s consider a seven-server network ($$f=2$$) with a faulty leader (server 1); we’ll first focus on what non-faulty servers 2–5 can see.

  1. The faulty leader proposes red to servers 2 and 3, and blue to servers 4 and 5.
  2. Servers 1–5 receive red witnesses from servers 2 and 3. (The messages to servers 6 and 7 are delayed.)
  3. Servers 1–5 receive blue witnesses from servers 4 and 5.

Servers 2 and 3 thus have seen three red and two blue messages, while servers 4 and 5 have seen three blue and two red. Servers 2–5 have no reason to prefer either color.

Now consider two scenarios. In one, server 6 fails by stopping; in the other, server 7 is faulty, but messages from server 6 are dropped (it may be temporarily partitioned).

Server 6 fails: A new leader is elected by servers 2, 3, 4, 5, and 7 (server 1 can be excluded because after conferring, the others can tell it is faulty). Since no color is a majority, this working quorum could decide either color.

Server 7 is faulty, but server 6’s outgoing messages are dropped: Faulty servers 1 and 7 send server 6 a red message. Once the other witnesses arrive, server 6 has received a total of five red messages (from 1, 2, 3, itself, and 7), but no other server has heard from it.

The kicker is that servers 2–5 cannot distinguish these cases! In the first case, 2–5+7 form a quorum of working servers; to satisfy liveness, they must come to some decision on the slot in question—and nothing prevents them from choosing blue. But in the other case, server 6 has received $$2f+1$$ commitments on a red value. If a single set of messages sufficed for a decision, server 6 could then decide red—while in the meantime, the other servers, thinking server 6 had failed (and encouraged by faulty server 7), decided blue, causing disaster.

Since one message exchange among a quorum is insufficient to decide a value, PBFT uses two exchanges. The first of these exchanges uses a new message type called CONFER1; the second exchange uses the existing ACK messages.

New rounds must be validated

Round changes introduce a related problem. Within a multi-Paxos round, servers ensure that their acknowledged value sequences only grow; but on entering a new round, the new leader replaces its followers’ previous logs, possibly changing the values acknowledged for some slots or shrinking the number of slots. Multi-Paxos needs these features for normal split-brain scenarios, but a malicious leader could use them to undo slots that at least one server had already decided, violating agreement.

Solving this problem requires special care around round changes. Servers must validate the proposed value sequences themselves, and then confer with peers: $$2f+1$$ servers must agree that nothing is missing. This could be done by exchanging messages; PBFT reduces the message load by using signatures.

Signatures

The expression $$H(M)$$ denotes a cryptographic hash (e.g., SHA-384) of message $$M$$. Anyone can compute $$H(M)$$ given $$M$$, but the message $$M$$ cannot be recovered from $$H(M)$$; in fact, it is computationally hard to find any hash collision (which is a value $$X$$ with a matching hash $$H(X) = H(M)$$).

The expression $$\text{Sign}_j(M)$$ denotes a digital signature of message $$M$$ by server $$j$$. A digital signature verifies the authenticity of a message: only server $$j$$ can compute $$\text{Sign}_j(M)$$, but other servers can validate this signature using public-key cryptography. Digital signatures also offer non-repudiation: the existence of $$\text{Sign}_j(M)$$ indicates that $$M$$ was signed by $$j$$. Note that the message $$M$$ cannot be recovered from $$\text{Sign}_j(M)$$.

The PBFT protocol features three distinct kinds of signature.

PBFT also tracks tagged signatures, which annotate a signature with its signing server and round. A tagged signature is a tuple $$\langle j, r, \sigma\rangle$$, where $$j$$ is the signing server, $$r$$ is the associated round, and $$\sigma$$ is the signature.

The PBFT protocol

Initialization phase

Each consensus server $$s$$ maintains the five variables familiar from multi-Paxos:

It also maintains two new variables called signature logs. Each signature log is a sequence of entries, one per acknowledged value slot, where each entry is a set of tagged signatures. Each signature in slot $$i$$ of a signature log on server $$s$$ signs the slot number $$i$$ and the acknowledged value $$\textit{AV}_s[i]$$.

Given a set of tagged signatures, we define a cohort as a subset that is tagged with the same round number $$r$$, and a quorum cohort as a cohort that contains signatures from at least $$2f+1$$ distinct servers.

The notation $$A \sqsubseteq B$$ means sequence $$A$$ is a prefix of sequence $$B$$: $$|A| \leq |B|$$, and for all $$0 \leq i < |A|$$, $$A[i] = B[i]$$.

Probe phase

In the probe phase, a new leader picks a new round number larger than any it has heard before and broadcasts that round number to the consensus servers. The probe round number must be unique to this leader.

  1. The leader $$\ell$$ selects a unique round number $$r > \textit{pr}_\ell$$.

  2. The leader sends PROBE to all servers including itself: $$\forall j : 1 \leq j \leq N$$, $$\SEND{\ell}{j}{\text{PROBE}, r}$$.

  3. When server $$s$$ receives $$\langle \text{PROBE}, r\rangle$$ from some leader $$\ell$$, it responds with a PREPARE.

    If $$r < \textit{pr}_s$$ or $$r = \textit{ar}_s$$, this message is old and should be ignored.

    Otherwise, server $$s$$ sets $$\textit{pr}_s \gets r$$ and responds with its acknowledgement round, acknowledged value sequence, and proposal signature log. It also signs the proposal signature log, allowing other servers to validate that the log is intact even after it is forwarded by the leader.

    \[ \SEND{s}{\ell}{\text{PREPARE}, \textit{pr}_s, \textit{ar}_s, \textit{AV}_s, \textit{PSL}_s, \text{Sign}_s(\text{PSL}, \textit{pr}_s, H(\textit{PSL}_s))} \]

Additional care is required to prevent a faulty server from preventing progress by sending escalating PROBEs; see below for fixes.

Responsible server algorithm

Multi-Paxos picks a new value sequence from among its received PREPAREs by looking for the largest round and then the longest sequence. PBFT complicates this in two ways. First, a faulty server could suggest values that conflicted with already-agreed slots; and second, followers must validate any work the leader does. To address these challenges, PBFT constructs and validates new-round value sequences from a quorum of server proposal signature logs. Signatures ensure that faulty leaders cannot hide information about decided slots from servers in the next round.

To handle these proposal signature logs, PBFT uses a subroutine called the responsible server algorithm. This algorithm takes as input a set of at least $$2f+1$$ distinct proposal signature logs $$\textit{PSL}_j$$, each tagged with the server who sent it, and a slot $$i$$. It then determines the responsible server for that slot, which is the server whose value should be used for that slot. When constructing a sequence, the leader chooses the responsible server’s value; when validating a sequence, a follower verifies the leader’s proposed value using the signatures in the responsible server’s PSL slot.

To find the server responsible for slot $$i$$:

  1. Collect the quorum cohorts from slot $$i$$ of the $$\textit{PSL}$$ logs. If any exist, pick some server $$j$$ so that the quorum cohort in $$\textit{PSL}_j$$ has the maximal acknowledgement round. Server $$j$$ is responsible for slot $$i$$. (The choice doesn’t matter; all quorum cohorts with the same acknowledgement round overlap on at least one non-faulty server, so must have the same value.)

  2. Otherwise, no value has been decided for slot $$i$$. (A slot can be decided only after $$2f+1$$ servers acknowledge it, and an acknowledgement requires $$2f+1$$ proposal signatures, so any decided slot will have a quorum cohort in at least one of our $$\textit{PSL}$$s.) It would be safe to null the slot2, but for symmetry with multi-Paxos, we prefer to choose some already-proposed value when one exists. (A faulty server could make up a proposed value, but this is possible anyway for faulty leaders.) That value must be unambiguously determined to allow validation; any deterministic rule would work.

    Let $$\textit{mr}$$ be the maximal acknowledgement round found in slot $$i$$ of any $$\textit{PSL}$$ log, and let $$j$$ be the minimal server ID having an $$\textit{mr}$$ cohort in slot $$i$$ of $$\textit{PSL}_j$$. If $$j$$ exists, then server $$j$$ is responsible for slot $$i$$.

  3. Otherwise, no proposal signatures exist for slot $$i$$, and no server is responsible for slot $$i$$.

Propose phase

  1. The leader $$\ell$$ waits to receive valid $$\langle \text{PREPARE}, r, \textit{ar}_j, \textit{AV}_j, \textit{PSL}_j, \sigma^{\text{PSL}}_j\rangle$$ messages from a quorum of at least $$2f+1$$ servers that includes itself.

    A valid PREPARE message has the following properties:

    • $$|\textit{PSL}_j| = |\textit{AV}_j|$$.

    • $$\textit{ar}_j < r$$.

    • $$\sigma^{\text{PSL}}_j$$ is a valid $$j$$-signature for $$r$$ and $$\textit{PSL}_j$$.

    • Each tagged signature in $$\textit{PSL}_j$$ is valid and compatible with $$\textit{ar}_j$$ and $$\textit{AV}_j$$. (Any tagged signature $$\langle j', r', \sigma^p \rangle \in \textit{PSL}_j[i]$$ must have $$r' \leq \textit{ar}_j$$ and $$\sigma^p$$ a valid $$j'$$-signature for $$\langle \text{PROPOSE}, r', i, H(\textit{AV}_j[i]) \rangle$$.)

    If $$r < \textit{pr}_\ell$$, then $$\ell$$’s leadership has been usurped; it gives up on its propose phase.

    Otherwise, $$\ell$$ chooses some length $$L$$ at least as large as the longest acknowledged value sequence in the PREPAREs. It then constructs a new value sequence $$V$$ using the responsible server algorithm. For each slot $$0 \leq i < L$$:

    • If server $$j$$ is responsible for slot $$i$$ according to the responsible server algorithm, the leader sets $$V[i] \gets \textit{AV}_j[i]$$.

    • Otherwise, no proposal signatures exist for slot $$i$$. It sets $$V[i] \gets \textit{IV}_\ell[i]$$.

  2. The leader $$\ell$$ sends a PROPOSE message for $$V$$ to all servers including itself.

    Since this message is for a new round, it must also include the commitments that justify its new value sequence. That is, it includes $$\textit{PSLS}$$, the set of tagged proposal signature logs from all PREPAREs with signatures:

    \[ \textit{PS} = [\text{Sign}_\ell(\text{PROPOSE}, \textit{pr}_\ell, i, H(V[i])) \mid 0 \leq i < |V|]; \\ \textit{PSLS} = \{\langle j, \textit{PSL}_j, \sigma^{\text{PSL}}_j\rangle \mid j \text{\ sent a PREPARE for this round}\}; \\ \forall j : 1 \leq j \leq N,\ \SEND{\ell}{j}{\text{PROPOSE}, \textit{pr}_\ell, V, \textit{PS}, \textit{PSLS}} \]

    Note that a non-faulty leader sends the same message to all servers, but a faulty leader can equivocate. It might send different subsets of $$\textit{PSLS}$$ to different followers; faulty servers might also publish multiple mutually-inconsistent PSLs under a single signing identity for the faulty leader to choose among. This can delay consensus, but cannot prevent it: length checks and witness exchanges ensure that no decided slot will ever be overwritten.

  3. When server $$s$$ receives a $$\langle \text{PROPOSE}, r, V, \textit{PS} [, \textit{PSLS}]\rangle$$ message from some leader $$\ell$$, it checks it for validity, applies its values, and responds by broadcasting a CONFER message.

    A valid PROPOSE message has a current or future round, a reasonable length, and valid signatures, and it agrees with previous messages.

    • If $$r < \textit{pr}_s$$, the message is old. It is ignored.

    • If $$r = \textit{ar}_s$$ and $$|V| < |\textit{AV}_s|$$, the message is old. It is ignored.

    • If $$r = \textit{ar}_s$$ and $$\textit{AV}_s \not\sqsubseteq V$$, then the leader is malingering and changing its proposals. The message is ignored.

    • If $$r > \textit{ar}_s$$ and the message does not contain $$\textit{PSLS}$$, then this server cannot validate a new leader’s first proposed value sequence. The message is ignored.

    • If $$r > \textit{ar}_s$$ and $$|V| < |\textit{DV}_s|$$, then a new leader is attempting to undo some decided slots. It must be faulty; the message is ignored.

    • If $$r > \textit{ar}_s$$ and $$\textit{DV}_s \not\sqsubseteq V$$, then the leader has faultily changed some decided slots. The message is ignored.

    • If any of the signatures in $$\textit{PS}$$ are invalid, the message is corrupted. It is ignored.

    • If the message contains $$\textit{PSLS}$$ and $$\textit{PSLS}$$ is invalid (it has fewer than $$2f+1$$ entries from distinct servers, or any of the $$\sigma^{\text{PSL}}$$ signatures are wrong for the given logs and servers and the message’s round $$r$$), the message is corrupted. It is ignored.

    • If the message contains $$\textit{PSLS}$$, the receiver uses it to validate the proposed value sequence $$V$$. For each slot $$i$$, run the responsible server algorithm on that slot using $$\textit{PSLS}$$. If the algorithm says no server is responsible for that slot, then any value $$V[i]$$ is OK. Otherwise, server $$j$$ is responsible for slot $$i$$. Validate every tagged proposal signature in $$\textit{PSL}_j[i]$$ against its server tag, its round tag (which must be $$< r$$), the slot $$i$$, and the value sequence element $$V[i]$$. Any validation failure indicates corruption or a faulty leader: the message is ignored.

    After validation, server $$s$$ then applies the PROPOSE’s values and signatures. It goes by slot. For $$0 \leq i < |V|$$:

    1. If $$V[i] \neq \textit{AV}_s[i]$$, then the leader is adding a new slot or replacing a slot assigned in a previous round. Any old signatures are invalid; the server sets $$\textit{PSL}_s[i] \gets \textit{ASL}_s[i] \gets \{\}$$.

    2. Either way, the server then sets $$\textit{AV}_s[i] \gets V[i]$$ and inserts $$\textit{PS}[i]$$ (tagged with $$r$$ and $$\ell$$) to $$\textit{PSL}_s[i]$$.

    It also truncates $$\textit{AV}_s$$, $$\textit{PSL}_s$$, and $$\textit{ASL}_s$$ to length at most $$|V|$$. (Inappropriately short PROPOSEs were rejected already.)

    It sets $$\textit{pr}_s \gets r$$ and $$\textit{ar}_s \gets r$$.

    Finally, it witnesses the values it has received. It constructs a witness sequence $$\textit{WS}$$ and broadcasts this sequence to all servers in a CONFER:

    \[ \textit{WS}[i] = [\text{Sign}_s(\text{PROPOSE}, \textit{ar}_s, i, H(\textit{AV}_s[i])) \mid 0 \leq i < L]; \\ \forall k, \SEND{s}{k}{\text{CONFER}, \textit{ar}_s, \textit{WS}} \]

Confer phase

All servers collect CONFER messages and use them to grow their proposal signature cohorts. When $$2f+1$$ servers sign the same proposal for a slot, that slot moves into the acknowledgement phase.

  1. When server $$s$$ receives a $$\langle \text{CONFER}, r, \textit{WS}\rangle$$ message from some server $$j$$, it checks it for validity, applies its proposal signatures, and acknowledges slots as possible.

    A valid CONFER message is for the current round and has valid signatures.

    • If $$r \neq \textit{ar}_s$$ or $$\textit{pr}_s \neq \textit{ar}_s$$, the message is from a different round or the server is transitioning between rounds. The message is ignored.

    • $$\textit{WS}$$ might have surplus witnesses. Truncate $$\textit{WS}$$ to length at most $$|\textit{AV}_s|$$ to drop them.

    • Then, if any of the signatures in $$\textit{WS}$$ are invalid based on $$r$$, $$j$$, and the values in $$\textit{AV}_s$$, the message is corrupted. It is ignored.

    On receiving a valid CONFER, for each slot $$0 \leq i < |\textit{WS}|$$, server $$s$$ adds $$\textit{WS}[i]$$ to $$\textit{PSL}_s[i]$$.

    The server then searches the proposal signature log for quorum cohorts. Let $$L$$ be the minimum slot number so that $$\textit{PSL}_s[L]$$ does not contain a quorum cohort. This server believes that all slots up to $$L$$ have been decided, though the decision is not yet final because we’ve had only one message exchange. The group finalizes the decisions by exchanging ACKs.

    Given $$L$$, server $$s$$ constructs an ack sequence $$\textit{AS}$$ and broadcasts this sequence to all servers in an ACK:

    \[ \textit{AS} = [\text{Sign}_s(\text{ACK}, \textit{ar}_s, i, H(\textit{AV}_s[i])) \mid 0 \leq i < L]; \\ \forall k, \SEND{s}{k}{\text{ACK}, \textit{ar}_s, \textit{AS}} \]

Acknowledgement/decision phase

In Multi-Paxos, the acknowledgement phase is distinct from the decision phase because only leaders collect quorums—a separate DECIDE message informs replicas of the decision point. In PBFT, all servers collect quorums, so all servers send DECIDE messages to Nancy without waiting for leader confirmation.

  1. When server $$s$$ receives an $$\langle \text{ACK}, r, \textit{AS}\rangle$$ message from some server $$j$$, it checks it for validity, applies its acknowledgement signatures, and marks any newly-reached decisions.

    A valid ACK message is for the current round and has valid signatures.

    • If $$r \neq \textit{ar}_s$$ or $$\textit{pr}_s \neq \textit{ar}_s$$, the message is from a different round or the server is transitioning between rounds. The message is ignored.

    • $$\textit{AS}$$ might have surplus acknowledgements. Truncate $$\textit{AS}$$ to length at most $$|\textit{AV}_s|$$ to drop them.

    • Then, if any of the signatures in $$\textit{AS}$$ are invalid based on $$r$$, $$j$$, and the values in $$\textit{AV}_s$$, the message is corrupted. It is ignored.

    On receiving a valid ACK, for each slot $$0 \leq i < |\textit{AS}|$$, server $$s$$ adds $$\textit{AS}[i]$$ to $$\textit{ASL}_s[i]$$.

    The server then searches the acknowledgement signature log for quorum cohorts. Let $$L$$ be the minimum slot number so that $$\textit{ASL}_s[L]$$ does not contain a quorum cohort. All slots up to $$L$$ have been decided, and no slots after $$L$$ have been.

    If $$L > |\textit{DV}_s|$$, new decisions have been made. The server sets $$\textit{DV}_s$$ to the $$L$$-length prefix of $$\textit{AV}_s$$ and sends a DECIDE message to Nancy:

    \[ \SEND{s}{\text{Nancy}}{\text{DECIDE}, \textit{DV}_s} \]

Extension phase

As in multi-Paxos, a stable leader can send newly-extended PROPOSE messages with new elements of $$\textit{IV}$$. The network will send CONFER, ACK, and DECIDE messages in response as appropriate. Extension PROPOSEs do not need to contain the $$\textit{PSLS}$$ proposal signature log set as their prefixes have already been validated.

Denial of service, performance

The algorithm as written above accepts PROBE messages unconditionally. This allows any faulty server to prevent a group from making progress; it can just continually propose higher and higher rounds.

Escalating PROBE rounds are just one form of denial-of-service attack possible in BFT protocols. True Byzantine fault tolerance requires addressing every DoS opportunity. (Network denial-of-service is also possible; Aardvark shows one way to respond.)

The published PBFT protocol prevents PROBE DoS by entering new rounds only after quorum agreement. Each server sets a failure detector, and sends a PREPARE-equivalent to enter the next round only when its failure detector goes off.3 But other mitigations are possible. A server could ignore PROBEs unless it is experiencing network problems; it could ignore repeated PROBEs from the same server; and so forth.

The multi-Paxos optimizations matter even more here, since signing messages is expensive. It’s critical to send truncated logs in PROPOSE, CONFER, and ACK messages. That leaves PREPARE messages. To optimize them, the PBFT paper also discusses a checkpoint system by which a quorum of servers prove they agree on all log entries before some checkpoint slot $$I$$; checkpoint signatures can replace prefixes of the proposal signature logs in PREPARE and new-view PROPOSE messages.

Correctness sketch: Safety

Say that $$r$$ is the earliest round in which some non-faulty server $$j$$ sends a DECIDE for slot $$i$$ with value $$V$$.

But what about a future round? Could a faulty leader change the value in slot $$i$$, allowing a conflicting decision? No!


  1. I called this message CONFER in class, then thought WITNESS was a better name because the message contains “witnesses” of specific proposals, and now CONFER seems better after all, who knows. ↩︎

  2. This is what PBFT does. ↩︎

  3. PBFT calls PREPARE messages VIEW-CHANGE. Once a quorum of VIEW-CHANGEs are received, the leader of the next round sends NEW-VIEW, which corresponds to our PROPOSE message with a $$\textit{PSLS}$$ component. ↩︎