Byzantine Fault Tolerance (BFT) is a classic technique for defending distributed systems against a wide range of faults and attacks. However, existing solutions are designed for systems where nodes can interact only by exchanging messages. They are not directly applicable to systems where nodes have sensors and actuators and can also interact in the physical world – perhaps by blocking each other’s path or by crashing into each other. In this paper, we take a first stab at extending BFT to this larger class of systems. We focus on multi-robot systems (MRS), an emerging technology that is increasingly being deployed for applications such as target tracking, warehouse logistics, and exploration. An MRS can consist of dozens of interacting robots and is thus a bona fide distributed system. The classic masking guarantee is not practical in an MRS, but we propose a variant called bounded-time interaction that can be implemented, and we present an algorithm that achieves it, in combination with a few small hardware tweaks. We built a simulator and prototyped wheeled robots to show that our algorithm is effective and that it has a reasonable overhead.
REBOUND: Defending Distributed Systems Against Attacks with Bounded-Time Recovery
This paper shows how to use bounded-time recovery (BTR) to defend distributed systems against non-crash faults and attacks. Unlike many existing fault-tolerance techniques, BTR does not attempt to completely mask all symptoms of a fault; instead, it ensures that the system returns to the correct behavior within a bounded amount of time. This weaker guarantee is sufficient, e.g., for many cyber-physical systems, where physical properties - such as inertia and thermal capacity - prevent quick state changes and thus limit the damage that can result from a brief period of undefined behavior. We present an algorithm called REBOUND that can provide BTR for the Byzantine fault model. REBOUND works by detecting faults and then reconfiguring the system to exclude the faulty nodes. This supports very fine-grained responses to faults: for instance, the system can move or replace existing tasks, or drop less critical tasks entirely to conserve resources. REBOUND can take useful actions even when a majority of the nodes is compromised, and it requires less redundancy than full fault-tolerance.
- PAR ID: 10282457
- Date Published:
- Journal Name: Proceedings of the 16th European Conference on Computer Systems (EuroSys'21)
- Page Range / eLocation ID: 523 to 539
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
For real-time embedded systems, quality of service (QoS), fault tolerance, and energy budget constraints are among the primary design concerns. In this research, we investigate the problem of energy-constrained standby-sparing for both periodic and aperiodic tasks in a weakly hard real-time environment. Standby-sparing systems adopt a primary processor and a spare processor to provide fault tolerance for both permanent and transient faults. For such systems, we first propose several novel standby-sparing schemes for periodic tasks that can ensure system feasibility under tighter energy budget constraints than traditional ones. Based on these schemes, integrated approaches for both periodic and aperiodic tasks are then proposed to minimize the aperiodic response time while achieving better energy and QoS performance under the given energy budget constraint. The evaluation results demonstrated that the proposed techniques significantly outperformed existing state-of-the-art approaches in terms of feasibility and system performance while ensuring QoS and fault tolerance under the given energy budget constraint.
-
Fault tolerance, energy management, and quality of service (QoS) are essential aspects of the design of real-time embedded systems. In this work, we focus on exploring methods that can simultaneously address these three critical issues under standby-sparing. The standby-sparing mechanism adopts a dual-processor architecture in which each processor dynamically plays the role of backup for the other. In this way, it can provide fault tolerance against both permanent and transient faults. Because real-time jobs/tasks are executed in duplicate, the energy consumption of a standby-sparing system can be quite high. To reduce energy under standby-sparing, we propose three novel scheduling schemes: the first is for (1, 1)-constrained tasks, and the second and third (which can be combined into an integrated approach to maximize the overall energy reduction) are for general (m,k)-constrained tasks, which require that among any k consecutive jobs of a task no more than (k-m) miss their deadlines. Through extensive evaluations and performance analysis, our results demonstrate that, compared with existing research, the proposed techniques can reduce energy by up to 11% for (1, 1)-constrained tasks and 25% for general (m,k)-constrained tasks while assuring (m,k)-constraints and fault tolerance as well as providing better user-perceived QoS levels under standby-sparing.
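The (m,k)-constraint described in this abstract can be illustrated with a small checker. This is only a sketch of the constraint's definition, not the paper's scheduling schemes; the function name `satisfies_mk` and the representation of a job trace as a list of booleans (True meaning the job met its deadline) are assumptions made for illustration.

```python
def satisfies_mk(outcomes, m, k):
    """Check a job trace against an (m,k)-constraint.

    outcomes: list of booleans, one per consecutive job of a task
              (True = deadline met, False = deadline missed).
              This representation is a hypothetical choice for the sketch.
    Returns True if every window of k consecutive jobs contains
    no more than (k - m) deadline misses.
    """
    if not 0 < m <= k:
        raise ValueError("expected 0 < m <= k")
    # Slide a window of length k over the trace; traces shorter than k
    # contain no full window and therefore pass trivially.
    for start in range(len(outcomes) - k + 1):
        misses = outcomes[start:start + k].count(False)
        if misses > k - m:
            return False
    return True
```

Under this reading, a (1, 1)-constrained task tolerates no misses at all, since every window of one job may contain at most zero misses.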
-
As the design space for high-performance computer (HPC) systems grows larger and more complex, modeling and simulation (MODSIM) techniques become more important to better optimize systems. Furthermore, recent extreme-scale systems and newer technologies can lead to higher system fault rates, which negatively affect system performance and other metrics. Therefore, it is important for system designers to consider the effects of faults and fault-tolerance (FT) techniques on system design through MODSIM. BE-SST is an existing MODSIM methodology and workflow that facilitates preliminary exploration and reduction of large design spaces, particularly by highlighting areas of the space for detailed study and pruning less optimal areas. This paper presents the overall methodology for adding fault-tolerance awareness (FT-awareness) into BE-SST. We present the process used to extend BE-SST, enabling the creation of models that predict the time needed to perform a checkpoint instance for the given system configuration. Additionally, this paper presents a case study where a full HPC system is simulated using BE-SST, including application, hardware, and checkpointing. We validate the models and simulation against actual system measurements, finding an average percent error of less than 17% for the instance models and about 20% for system simulation, a level of accuracy acceptable for initial exploration and pruning of the design space. Finally, we show how FT-aware simulation results are used for comparing FT levels in the design space.
-
The need for fail-slow fault tolerance in modern distributed systems is highlighted by the increasingly reported fail-slow hardware/software components that lead to poor performance system-wide. We argue that fail-slow fault tolerance not only needs new distributed protocol designs but also calls for programming support for implementing and verifying fail-slow fault-tolerant code. Our observation is that the inability to tolerate fail-slow faults in existing distributed systems is often rooted in the implementations and is difficult to understand and debug. We designed the Dependably Fast Library (DepFast) for implementing fail-slow tolerant distributed systems. DepFast provides expressive interfaces for taking control of possible fail-slow points in the program to prevent unexpected slowness propagation once and for all. We use DepFast to implement a distributed replicated state machine (RSM) and show that it can tolerate various types of fail-slow faults that affect existing RSM implementations.