Byzantine fault-tolerant (BFT) state machine replication (SMR) protocols form the basis of modern blockchains as they maintain a consistent state across all blockchain nodes while tolerating a bounded number of Byzantine faults. We analyze BFT SMR in the excessive fault setting where the actual number of Byzantine faults surpasses a protocol's tolerance. We start by devising the very first repair algorithm for linearly chained and quorum-based partially synchronous SMR to recover from faulty states caused by excessive faults. Such a procedure can be realized using any commission fault detection module— an algorithm that identifies the faulty replicas without falsely locating any correct replica. We achieve this with a slightly weaker liveness guarantee, as the original security notion is impossible to satisfy given excessive faults. We implement recoverable HotStuff in Rust. The throughput resumes to the normal level (without excessive faults) after recovery routines terminate for 7 replicas and is slightly reduced by ≤ 4.3% for 30 replicas. On average, it increases the latency by 12.87% for 7 replicas and 8.85% for 30 replicas. Aside from adopting existing detection modules, we also establish the sufficient condition for a general BFT SMR protocol to allow for complete and sound fault detection when up to (n - 2) Byzantine replicas (out of n total replicas) attack safety. We start by providing the first closed-box fault detection algorithm for any SMR protocol without any extra rounds of communication. We then describe open-box instantiations of our fault detection routines in Tendermint and Hotstuff, further reducing the overhead, both asymptotically and concretely.
more »
« less
Recover from excessive faults in partially-synchronous BFT SMR
Byzantine fault-tolerant (BFT) state machine replication (SMR) protocols form the basis of modern blockchains as they maintain a consistent state across all blockchain nodes while tolerating a bounded number of Byzantine faults. We analyze BFT SMR in the excessive fault setting where the actual number of Byzantine faults surpasses a protocol's tolerance. We start by devising the very first repair algorithm for linearly chained and quorum-based partially synchronous SMR to recover from faulty states caused by excessive faults. Such a procedure can be realized using any commission fault detection module— an algorithm that identifies the faulty replicas without falsely locating any correct replica. We achieve this with a slightly weaker liveness guarantee, as the original security notion is impossible to satisfy given excessive faults. We implement recoverable HotStuff in Rust. The throughput resumes to the normal level (without excessive faults) after recovery routines terminate for 7 replicas and is slightly reduced by ≤ 4.3% for 30 replicas. On average, it increases the latency by 12.87% for 7 replicas and 8.85% for 30 replicas. Aside from adopting existing detection modules, we also establish the sufficient condition for a general BFT SMR protocol to allow for complete and sound fault detection when up to (n - 2) Byzantine replicas (out of n total replicas) attack safety. We start by providing the first closed-box fault detection algorithm for any SMR protocol without any extra rounds of communication. We then describe open-box instantiations of our fault detection routines in Tendermint and Hotstuff, further reducing the overhead, both asymptotically and concretely.
more »
« less
- Award ID(s):
- 2237814
- PAR ID:
- 10664034
- Publisher / Repository:
- Usenix Security 2025
- Date Published:
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Real-time embedded systems perform many important functions in the modern world. A standard way to tolerate faults in these systems is with Byzantine fault-tolerant (BFT) state machine replication (SMR), in which multiple replicas execute the same software and their outputs are compared by the actuators. Unfortunately, traditional BFT SMR protocols areslow, requiring replicas to exchange sensor data back and forth over multiple rounds in order to reach agreement before each execution. The state of the art in reducing the latency of BFT SMR iseager execution, in which replicas execute on data from different sensors simultaneously on different processor cores. However, this technique results in 3–5× higher computation overheads compared to traditional BFT SMR systems, significantly limiting schedulability. We presentCrossTalk, a new BFT SMR protocol that leverages the prevalence of redundant switched networks in embedded systems to reduce latency without added computation. The key idea is to use specific algorithms to move messages between redundant network planes (which many systems already possess) as the messages travel from the sensors to the replicas. As a result,CrossTalkcan ensure agreementautomaticallyin the network, avoiding the need for any communication between replicas. Our evaluation shows thatCrossTalkimproves schedulability by 2.13–4.24× over the state of the art. Moreover, in a NASA simulation of a real spaceflight mission,CrossTalktolerates more faults than the state of the art while using nearly 3× less processor time.more » « less
-
null (Ed.)Byzantine Fault Tolerant (BFT) protocols are designed to ensure correctness and eventual progress in the face of misbehaving nodes [1]. However, this does not prevent negative effects an adversary may have on performance: a faulty node may significantly affect the latency and throughput of the system without being detected. This is especially true in speculative protocols optimized for the best-case where a single leader can force the protocol into the worst case [3]. Systems like Aardvark [2] that are designed to maximize worst-case performance tolerate byzantine behavior without necessarily detecting who the perpetrator is. By forcing regular view changes, for example, they mitigate the effects of leaders who deliberately delay dissemination of messages, even if this behavior would be difficult to prove to a third party. Byzantine faults, by definition, can be difficult to detect. An error of 'commission', such as a message with a mismatching digest, can be proven. Errors of 'omission', such as delaying or failing to relay a message, as a rule cannot be proven, and the node responsible for these types of omission faults may not appear faulty to all observers. Nevertheless, we observe that they can reliably be detected. Designing protocols that detect and eject nodes is challenging for two reasons. First, some behaviors are observed by a subset of honest nodes and cannot be objectively proven to a third party. Second, any mechanism capable of ejecting nodes could be subverted by Byzantine nodes to eject honest nodes. This paper presents the Protocol for Ejecting All Corrupted Hosts (Peach, a mechanism for detecting and ejecting faulty nodes in Byzantine fault tolerant (BFT) protocols. Nodes submit votes to a trusted configuration manager that replaces faulty nodes once a threshold of votes are received. We implement Peach for two BFT protocol variants, a traditional pbft-style three-phase protocol and a speculative protocol, and evaluate its ability to respond to Byzantine behavior. This work makes the following contributions: (1) We present and prove a necessary and sufficient constraint on cluster membership guaranteeing that any nodes causing performance degradation via acts of omission will be detected. (2) We present an agreement protocol, PEACHes, in which replicas pass votes about their subjective local observations of possible omissions to a TTP. (3) We show how the separation of detection and effectuation allows fine-grained detection of malicious behavior that is compatible and easily integrated with existing systems. (4) We present DecentBFT, an extension of BFT-Smart to which we added a speculative fast path (similar to Zyzzva) and integrated PEACHes. (5) We show DecentBFT rapidly detects and mitigates a variety of performance attacks that would have gone undetected by the state of the art.more » « less
-
Distributed data management systems use state Machine Replication (SMR) to provide fault tolerance. The SMR algorithm enables Byzantine Fault-Tolerant (BFT) protocols to guarantee safety and liveness despite the malicious failure of nodes. However, SMR does not prevent the adversarial manipulation of the order of transactions, where the order assigned by a malicious leader differs from the order in that transactions are received from clients. Whileorder-fairnesshas been recently studied in a few protocols, such protocols rely on synchronized clocks, suffer from liveness issues, or incur significant performance overhead. This paper presentsRashnu, a high-performance fair ordering protocol. Rashnu is motivated by the fact that fair ordering among two transactions is needed only when both transactions access a shared resource. Based on this observation, we define the notion ofdata-dependent order fairnesswhere replicas capture only the order of data-dependent transactions and the leader uses these orders to propose a dependency graph that represents fair ordering among transactions. Replicas then execute transactions using the dependency graph, resulting in the parallel execution of independent transactions. We implemented a prototype of Rashnu where our experimental evaluation reveals the low overhead of providing order-fairness in Rashnu.more » « less
-
This paper presents Pesto, a high-performance Byzantine Fault Tolerant (BFT) database that offers full SQL compatibility. Pesto intentionally forgoes the use of State Machine Replication (SMR); SMR-based designs offer poor performance due to the several round trips required to order transactions. Pesto, instead, allows for replicas to remain inconsistent, and only synchronizes on demand to ensure that the database remain serializable in the presence of concurrent transactions and malicious actors. On TPC-C, Pesto matches the throughput of Peloton and Postgres, two unreplicated SQL database systems, while increasing throughput by 2.3x compared to classic SMR-based BFT-architectures, and reducing latency by 2.7x to 3.9x. Pesto's leaderless design minimizes the impact of replica failures and ensures robust performance.more » « less
An official website of the United States government

