This content will become publicly available on April 28, 2026
One-Size-Fits-None: Understanding and Enhancing Slow-Fault Tolerance in Modern Distributed Systems
Recent studies have shown that various hardware components exhibit fail-slow behavior at scale. However, the characteristics of distributed software's tolerance of such slow faults remain ill-understood. This paper presents a comprehensive study that investigates the characteristics and current practices of slow-fault tolerance in modern distributed software. We focus on the fundamentally nuanced nature of slow faults. We develop a testing pipeline to systematically introduce diverse slow faults, measure their impact under different workloads, and identify the patterns. Our study shows that even small changes can lead to dramatically different reactions. While some systems have added slow-fault handling mechanisms, they are mostly controlled by static thresholds, which can hardly accommodate the highly sensitive and dynamic characteristics. To address this gap, we design ADR, a lightweight library to use within system code and make fail-slow handling adaptive. Evaluation shows ADR significantly reduces the impact of slow faults.
- PAR ID: 10586678
- Publisher / Repository: 22nd USENIX Symposium on Networked Systems Design and Implementation
- Date Published:
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- The need for fail-slow fault tolerance in modern distributed systems is highlighted by the increasingly reported fail-slow hardware/software components that lead to poor performance system-wide. We argue that fail-slow fault tolerance not only needs new distributed protocol designs, but also requires programming support for implementing and verifying fail-slow fault-tolerant code. Our observation is that the inability to tolerate fail-slow faults in existing distributed systems is often rooted in the implementations and is difficult to understand and debug. We designed the Dependably Fast Library (DepFast) for implementing fail-slow tolerant distributed systems. DepFast provides expressive interfaces for taking control of possible fail-slow points in the program to prevent unexpected slowness propagation once and for all. We use DepFast to implement a distributed replicated state machine (RSM) and show that it can tolerate various types of fail-slow faults that affect existing RSM implementations.
- This paper shows how to use bounded-time recovery (BTR) to defend distributed systems against non-crash faults and attacks. Unlike many existing fault-tolerance techniques, BTR does not attempt to completely mask all symptoms of a fault; instead, it ensures that the system returns to the correct behavior within a bounded amount of time. This weaker guarantee is sufficient, e.g., for many cyber-physical systems, where physical properties, such as inertia and thermal capacity, prevent quick state changes and thus limit the damage that can result from a brief period of undefined behavior. We present an algorithm called REBOUND that can provide BTR for the Byzantine fault model. REBOUND works by detecting faults and then reconfiguring the system to exclude the faulty nodes. This supports very fine-grained responses to faults: for instance, the system can move or replace existing tasks, or drop less critical tasks entirely to conserve resources. REBOUND can take useful actions even when a majority of the nodes is compromised, and it requires less redundancy than full fault-tolerance.
- Secret sharing is an essential tool for many distributed applications, including distributed key generation and multiparty computation. For many practical applications, we would like to tolerate network churn, meaning participants can dynamically enter and leave the pool of protocol participants as they please. Such protocols, called Dynamic-committee Proactive Secret Sharing (DPSS), have recently been studied; however, existing DPSS protocols do not gracefully handle faults: the presence of even one unexpectedly slow node can often slow down the whole protocol by a factor of O(n). In this work, we explore optimally fault-tolerant asynchronous DPSS that is not slowed down by crash faults and even handles Byzantine faults while maintaining the same performance. We first introduce the first high-threshold DPSS, which offers favorable characteristics relative to prior non-synchronous works in the presence of faults while simultaneously supporting higher privacy thresholds. We then batch-amortize this scheme along with a parallel non-high-threshold scheme which achieves optimal bandwidth characteristics. We implement our schemes and demonstrate that they can compete with prior work in best-case performance while outperforming it in non-optimal settings.
- Slow slip is part of the earthquake cycle, but the processes controlling this phenomenon in space and time are poorly constrained. Hematite, common in continental fault zones, exhibits unique textures and (U-Th)/He thermochronometry data patterns reflecting different slip rates. We investigated networks of small hematite-coated slip surfaces in basement fault damage of exhumed strike-slip faults that connect to the southern San Andreas fault in a flower structure in the Mecca Hills, California, USA. Scanning electron microscopy shows these millimeter-thick surfaces exhibit basal hematite injection veins and layered veinlets comprising nanoscale, high-aspect-ratio hematite plates akin to phyllosilicates. Combined microstructural and hematite (U-Th)/He data (n = 64 new, 24 published individual analyses) record hematite mineralization events ca. 0.8 Ma to 0.4 Ma at <1.5 km depth. We suggest these hematite faults formed via fluid overpressure, and then hematite localized repeated subseismic slip, creating zones of shallow off-fault damage as far as 4 km orthogonal to the trace of the southern San Andreas fault. Distributed hematite slip surfaces develop by, and then accommodate, transient slow slip, potentially dampening or distributing earthquake energy in shallow continental faults.
