[Paper Link](https://www.scs.stanford.edu/nyu/03sp/sched/bfs.pdf)

Algorithm: [[PBFT]]

# Abstract

Describes a new replication algorithm that is able to tolerate Byzantine faults.

Practical: works in asynchronous environments and improves response time

# Introduction

Malicious attacks and software errors can cause faulty nodes to exhibit Byzantine (i.e., arbitrary) behavior, so BFT algorithms are increasingly important.

**Paper presents a new practical algorithm for state machine replication that tolerates Byzantine faults**

Offers **Liveness and Safety** as long as fewer than 1/3 of the replicas are faulty

Does not rely on synchrony for safety

One message round trip to execute read-only ops and two for read-write ops

Authentication scheme based on message authentication codes during normal operation, and pubkey signatures when there are faults

**The paper makes the following contributions:**
- Describes the first state-machine replication protocol that correctly survives Byzantine faults in asynchronous networks
- Describes a number of important optimizations that allow the algorithm to perform well so that it can be used in real systems
- Describes the implementation of a BFT distributed file system
- Provides experimental results that quantify the cost of the replication technique

# System Model

Asynchronous distributed system where nodes are connected by a network

Network may fail to deliver messages, delay them, duplicate them, or deliver them out of order

Byzantine failure model - faulty nodes may behave arbitrarily, assuming independent node failures

Pubkey signatures are used to prevent spoofing and replays

Denote the message m signed by node i as $\langle m \rangle_{\sigma_{i}}$ and the digest of m by D(m)

Allow for a strong adversary that can coordinate faulty nodes, delay communication, or delay correct nodes. Assume the adversary cannot delay correct nodes indefinitely

# Service Properties

The safety guarantee means that the service behaves like a centralized implementation that executes operations atomically one at a time

Relies on synchrony to provide liveness; otherwise, it could be used to implement consensus in an asynchronous system, which is not possible

Liveness guarantee means that clients eventually receive replies to their requests, provided that at most one third of the replicas are faulty and delay(t), the time between the moment t when a message is sent for the first time and the moment when it is received by its destination, does not grow indefinitely.

3f + 1 is the minimum number of replicas that allows an asynchronous system to provide safety and liveness when up to f are faulty. It must be possible to proceed after communicating with n - f replicas, since f might be faulty and not responding. It's possible that the f that did not respond are not faulty, and therefore f of those that did respond might be faulty. Even so, there must be enough responses from non-faulty replicas to outnumber those from faulty ones: $n - 2f > f$, i.e., $n > 3f$

# The Algorithm

Form of state machine replication: the service is modeled as a state machine that is replicated across different nodes in a distributed system. Each state machine replica maintains service state and implements the service operations.

Replicas move through a succession of configurations called *views*. In a view, one replica is the primary and the others are backups.

The primary of a view is the replica p such that $p = v \bmod |R|$, where v is the view number and |R| is the number of replicas.
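As a quick illustration of the replica-count arithmetic and primary selection above, here is a minimal Python sketch (my own illustration, not code from the paper; the function names are made up):

```python
def min_replicas(f: int) -> int:
    """Minimum number of replicas needed to tolerate f Byzantine faults."""
    return 3 * f + 1

def responses_to_wait_for(n: int, f: int) -> int:
    """A replica can only wait for n - f responses, since f replicas may never answer."""
    return n - f

def primary(v: int, n: int) -> int:
    """Primary of view v is the replica with id p = v mod |R|."""
    return v % n

f = 1
n = min_replicas(f)                 # 4 replicas tolerate 1 Byzantine fault
assert n - 2 * f > f                # non-faulty responses still outnumber faulty ones
print(responses_to_wait_for(n, f))  # 3
print(primary(5, n))                # 1
```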
View changes are carried out when it appears that the primary has failed

*Think of replicas as nodes, where the primary is a block proposer and the backups are the voters*

Roughly, the algorithm works like:
1. A client sends a request to invoke a service operation to the primary
2. The primary multicasts the request to the backups
3. Replicas execute the request and send a reply to the client
4. Client waits for f + 1 replies from different replicas with the same result - this is the result of the operation

Like all state machine replication techniques, two requirements are imposed on replicas:
1. They must be *deterministic* - the execution of an operation in a given state and with a given set of arguments must always produce the same result
2. They must all start in the same state

All non-faulty replicas agree on a total order for the execution of requests despite failures

Remainder of section describes a simplified version of the algorithm:

## The Client

A client c requests the execution of state machine operation o by sending a $\langle \text{REQUEST}, o, t, c \rangle_{\sigma_{c}}$ message to the primary, where t is a timestamp

The primary atomically multicasts the request to all the backups

Each replica sends its reply to the request directly to the client, in the form $\langle \text{REPLY}, v, t, c, i, r \rangle_{\sigma_{i}}$, where r is the result of executing the operation

Client waits for f + 1 replies with valid sigs and the same r before accepting r as the result

If the client does not receive replies soon enough, it broadcasts the request to all replicas; replicas that have already executed it resend their reply, and the others relay the request to the primary. If the primary keeps failing to order the request, it is eventually suspected of being faulty, in which case a view change occurs.

## Normal-Case Operation

The state of each replica includes the state of the service, a message log containing messages the replica has accepted, and an integer denoting the replica's current view.

When the primary p receives a client request m, it starts a three-phase protocol, unless the number of messages in progress is too large, in which case the request is added to a buffer.

The three phases are *pre-prepare, prepare, and commit*
- The pre-prepare and prepare phases are used to totally order requests sent in the same view even when the primary, which proposes the ordering of requests, is faulty.
- The prepare and commit phases are used to ensure that requests that commit are totally ordered across views

In pre-prepare, the primary assigns a sequence number n to the request and multicasts a pre-prepare message with parameters v, n, D(m), and m piggybacked.

A backup accepts a pre-prepare message if the signatures are valid, it's in the same view, it hasn't already accepted a pre-prepare message for v and n containing a different digest, and the sequence number is between bounds h and H (prevents a faulty primary from exhausting the space of sequence numbers)
- h is equal to the sequence number of the last stable checkpoint
- H = h + k, where k is big enough that replicas do not stall waiting for a checkpoint to become stable (e.g., twice the checkpoint interval)

If backup i accepts it, it enters the *prepare* phase, multicasts $\langle \text{PREPARE}, v, n, d, i \rangle_{\sigma_{i}}$ to all other replicas, and adds both messages to its log. Otherwise, it does nothing.

Replicas then accept and add prepare messages to their log if the above conditions hold.

Define prepared(m, v, n, i) to be true iff replica i has the request, a pre-prepare, and 2f prepares from other backups that match it in its log.
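To make the prepared predicate concrete, here is a minimal Python sketch of the check over a replica's message log (my own illustration, not the paper's code; the MessageLog structure is an assumption):

```python
from dataclasses import dataclass, field

@dataclass
class MessageLog:
    requests: set = field(default_factory=set)      # digests of accepted requests
    pre_prepares: set = field(default_factory=set)  # tuples (view, seq, digest)
    prepares: set = field(default_factory=set)      # tuples (view, seq, digest, sender)

def prepared(log: MessageLog, d: str, v: int, n: int, i: int, f: int) -> bool:
    """True iff replica i has logged the request, a matching pre-prepare,
    and 2f prepare messages from other backups that agree on (v, n, d)."""
    if d not in log.requests:
        return False
    if (v, n, d) not in log.pre_prepares:
        return False
    senders = {s for (pv, pn, pd, s) in log.prepares
               if (pv, pn, pd) == (v, n, d) and s != i}
    return len(senders) >= 2 * f
```

The committed-local predicate described below would be checked the same way, but over 2f + 1 matching commit messages.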
These two phases guarantee that non-faulty replicas agree on a total order for the requests within a view

Replica i multicasts $\langle \text{COMMIT}, v, n, D(m), i \rangle_{\sigma_{i}}$ when prepared(m, v, n, i) becomes true, starting the *commit* phase. Others accept commit messages if the previous conditions hold.

*committed(m, v, n)* is true iff prepared(m, v, n, i) is true for all i in some set of f + 1 non-faulty replicas

*committed-local(m, v, n, i)* is true iff prepared(m, v, n, i) is true and i has accepted 2f + 1 commits from replicas that match the pre-prepare for m. A commit matches a pre-prepare if they have the same view, sequence number, and digest.

If committed-local is true for some non-faulty i, then committed is true. This, along with the view-change protocol, ensures that non-faulty replicas agree on the sequence numbers of requests that commit locally even if they commit in different views at each replica. It also ensures that any request that commits locally at a non-faulty replica will eventually commit at f + 1 or more non-faulty replicas.

After committed-local becomes true, each replica executes the operation once its turn comes in the sequential execution order, and sends a reply to the client

### Garbage Collection

Mechanism to discard messages from the log

A replica must keep messages in its log until their requests have been executed by at least f + 1 non-faulty replicas and it can prove this to others in view changes. If a replica misses messages that were discarded, it needs some way to be brought up to date.

Proofs that the state is correct are generated periodically at a constant interval
- *Checkpoint* - state produced by the execution of these requests
- *Stable checkpoint* - checkpoint with a proof

A replica maintains several copies of the service state: the last stable checkpoint, checkpoints that are not yet stable, and the current state

The proof of a checkpoint is made by each replica multicasting a checkpoint message and collecting messages from other replicas until it has 2f + 1 of them for sequence number n with the same digest d.

### View Changes

The view-change protocol provides liveness by allowing the system to make progress when the primary fails

Triggered by timeouts that prevent backups from waiting indefinitely for requests to execute. A backup is *waiting* for a request if it received a valid request and has not executed it. It starts a timer when it receives a request.

If the timer of backup i expires in view v, the backup starts a view change to move the system to view v + 1. It stops accepting messages unrelated to the view change and multicasts $\langle \text{VIEW-CHANGE}, v+1, n, C, P, i \rangle_{\sigma_{i}}$

n is the sequence number of the last stable checkpoint s known to i

C is a set of 2f + 1 valid checkpoint messages proving the correctness of s

P is a set containing a set $P_{m}$ for each request m that prepared at i with a sequence number higher than n. Each $P_{m}$ contains a valid pre-prepare message with 2f matching prepare messages

When the primary p of the new view v + 1 receives 2f view-change messages from other replicas, it multicasts $\langle \text{NEW-VIEW}, v+1, V, O \rangle_{\sigma_{p}}$

V is a set containing the valid view-change messages received by the primary plus the view-change message for v + 1 the primary sent

O is a set of pre-prepare messages computed as follows (see the sketch after this list):
1. The primary determines the sequence number of the latest stable checkpoint in V, min-s, and the highest sequence number in a prepare message in V, max-s
2. The primary creates a new pre-prepare message for each sequence number between min-s and max-s. If some view-change message in V contains a prepared request for that sequence number, it creates a new pre-prepare with that request; if not, it still makes a pre-prepare, but one that executes a no-op. This basically carries forward relevant information from the previous view or inserts placeholder requests to maintain a continuous sequence of operations.
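A minimal Python sketch of how the new primary might fill in O, assuming a `prepared_requests` map from sequence numbers to request digests extracted from the P components of the view-change messages in V (the names and the exact range handling are my own illustration, not the paper's code):

```python
NO_OP = "digest-of-null-request"  # placeholder request whose execution is a no-op

def compute_O(new_view: int, min_s: int, max_s: int, prepared_requests: dict) -> list:
    """Create a pre-prepare for every sequence number between min-s and max-s,
    reusing prepared requests where they exist and no-ops where they don't."""
    pre_prepares = []
    for n in range(min_s + 1, max_s + 1):
        digest = prepared_requests.get(n, NO_OP)
        pre_prepares.append(("PRE-PREPARE", new_view, n, digest))
    return pre_prepares

# e.g. compute_O(2, min_s=100, max_s=103, prepared_requests={101: "d1", 103: "d3"})
# yields pre-prepares for 101-103, with a no-op filling the gap at 102
```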
Replicas then redo the protocol for messages between min-s and max-s, but avoid re-executing client requests by using their stored info about the last reply sent to each client

If a replica is missing some message m or a stable checkpoint (these are not sent in new-view messages), it can obtain the missing info from another replica

### Correctness

See paper

# Optimizations

Improve the algorithm in normal-case operation while preserving safety and liveness

## Reducing Communication

Avoid sending most large replies by designating one replica to send the full reply and the rest to send digests

Reduce the number of message delays for an operation invocation from 5 to 4

Improve performance of read-only ops that don't modify service state

## Cryptography

Digital signatures are only used for view-change and new-view messages; all other messages are authenticated with message authentication codes (MACs)

MACs are faster, but unable to prove that a message is authentic to a third party

# Implementation

Not terribly relevant

# Related Work

Most previous work ignored Byzantine faults or assumed a synchronous system

Tolerating Byzantine faults requires a much more complex protocol with cryptographic authentication, an extra pre-prepare phase, and a different technique to trigger view changes and select primaries

# Conclusions

State-machine replication algorithm that is able to tolerate Byzantine faults and can be used in an asynchronous system