This content will become publicly available on April 28, 2026

Title: One-Size-Fits-None: Understanding and Enhancing Slow-Fault Tolerance in Modern Distributed Systems
Recent studies have shown that various hardware components exhibit fail-slow behavior at scale. However, the characteristics of distributed software's tolerance of such slow faults remain ill-understood. This paper presents a comprehensive study that investigates the characteristics and current practices of slow-fault tolerance in modern distributed software. We focus on the fundamentally nuanced nature of slow faults. We develop a testing pipeline to systematically introduce diverse slow faults, measure their impact under different workloads, and identify the patterns. Our study shows that even small changes can lead to dramatically different reactions. While some systems have added slow-fault handling mechanisms, they are mostly controlled by static thresholds, which can hardly accommodate the highly sensitive and dynamic characteristics. To address this gap, we design ADR, a lightweight library to use within system code and make fail-slow handling adaptive. Evaluation shows ADR significantly reduces the impact of slow faults.
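To illustrate the contrast the abstract draws between static thresholds and adaptive handling, the following is a minimal sketch only. It is not ADR's interface or algorithm, which this record does not describe. The class name `AdaptiveSlowThreshold`, its parameters (`window`, `factor`, `min_samples`), and the median-based rule are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import median


class AdaptiveSlowThreshold:
    """Hypothetical helper (not ADR's API): flags a latency as slow
    relative to a sliding window of recently observed latencies."""

    def __init__(self, window=100, factor=3.0, min_samples=5):
        self.samples = deque(maxlen=window)  # recent latencies, oldest dropped first
        self.factor = factor                 # assumed multiplier over the median
        self.min_samples = min_samples       # history needed before judging

    def threshold(self):
        # Return no threshold until enough history has been collected.
        if len(self.samples) < self.min_samples:
            return None
        return self.factor * median(self.samples)

    def observe(self, latency_ms):
        # Judge the new sample against the history seen so far,
        # then record it so the baseline follows the workload.
        limit = self.threshold()
        is_slow = limit is not None and latency_ms > limit
        self.samples.append(latency_ms)
        return is_slow


STATIC_THRESHOLD_MS = 100.0  # assumed fixed limit, for comparison only

detector = AdaptiveSlowThreshold(window=50, factor=3.0, min_samples=5)
for _ in range(10):
    detector.observe(10.0)   # light workload: requests take about 10 ms

slow_request_ms = 60.0       # a sixfold slowdown under the same workload
static_verdict = slow_request_ms > STATIC_THRESHOLD_MS
adaptive_verdict = detector.observe(slow_request_ms)
print(static_verdict, adaptive_verdict)
```

In this sketch the fixed 100 ms limit does not flag the 60 ms request, while the window-based limit does, because it is derived from the recent 10 ms baseline. Recording every sample, including slow ones, means a sustained slowdown eventually becomes the new baseline; whether that is desirable depends on the system.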
Award ID(s):
2317751
PAR ID:
10586678
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
22nd USENIX Symposium on Networked Systems Design and Implementation
Date Published:
Format(s):
Medium: X
Location:
Philadelphia, PA, USA
Sponsoring Org:
National Science Foundation
More Like this
  1. The need for fail-slow fault tolerance in modern distributed systems is highlighted by the increasingly reported fail-slow hardware/software components that lead to poor performance system-wide. We argue that fail-slow fault tolerance not only needs new distributed protocol designs, but also calls for programming support for implementing and verifying fail-slow fault-tolerant code. Our observation is that the inability to tolerate fail-slow faults in existing distributed systems is often rooted in the implementations and is difficult to understand and debug. We designed the Dependably Fast Library (DepFast) for implementing fail-slow tolerant distributed systems. DepFast provides expressive interfaces for taking control of possible fail-slow points in the program to prevent unexpected slowness propagation once and for all. We use DepFast to implement a distributed replicated state machine (RSM) and show that it can tolerate various types of fail-slow faults that affect existing RSM implementations.
  2. This paper shows how to use bounded-time recovery (BTR) to defend distributed systems against non-crash faults and attacks. Unlike many existing fault-tolerance techniques, BTR does not attempt to completely mask all symptoms of a fault; instead, it ensures that the system returns to the correct behavior within a bounded amount of time. This weaker guarantee is sufficient, e.g., for many cyber-physical systems, where physical properties - such as inertia and thermal capacity - prevent quick state changes and thus limit the damage that can result from a brief period of undefined behavior. We present an algorithm called REBOUND that can provide BTR for the Byzantine fault model. REBOUND works by detecting faults and then reconfiguring the system to exclude the faulty nodes. This supports very fine-grained responses to faults: for instance, the system can move or replace existing tasks, or drop less critical tasks entirely to conserve resources. REBOUND can take useful actions even when a majority of the nodes is compromised, and it requires less redundancy than full fault-tolerance.
  3. Secret sharing is an essential tool for many distributed applications, including distributed key generation and multiparty computation. For many practical applications, we would like to tolerate network churn, meaning participants can dynamically enter and leave the pool of protocol participants as they please. Such protocols, called Dynamic-committee Proactive Secret Sharing (DPSS), have recently been studied; however, existing DPSS protocols do not gracefully handle faults: the presence of even one unexpectedly slow node can often slow down the whole protocol by a factor of O(n). In this work, we explore optimally fault-tolerant asynchronous DPSS that is not slowed down by crash faults and even handles Byzantine faults while maintaining the same performance. We first introduce the first high-threshold DPSS, which offers favorable characteristics relative to prior non-synchronous works in the presence of faults while simultaneously supporting higher privacy thresholds. We then batch-amortize this scheme along with a parallel non-high-threshold scheme which achieves optimal bandwidth characteristics. We implement our schemes and demonstrate that they can compete with prior work in best-case performance while outperforming it in non-optimal settings.
  4. Slow slip is part of the earthquake cycle, but the processes controlling this phenomenon in space and time are poorly constrained. Hematite, common in continental fault zones, exhibits unique textures and (U-Th)/He thermochronometry data patterns reflecting different slip rates. We investigated networks of small hematite-coated slip surfaces in basement fault damage of exhumed strike-slip faults that connect to the southern San Andreas fault in a flower structure in the Mecca Hills, California, USA. Scanning electron microscopy shows these millimeter-thick surfaces exhibit basal hematite injection veins and layered veinlets comprising nanoscale, high-aspect-ratio hematite plates akin to phyllosilicates. Combined microstructural and hematite (U-Th)/He data (n = 64 new, 24 published individual analyses) record hematite mineralization events ca. 0.8 Ma to 0.4 Ma at <1.5 km depth. We suggest these hematite faults formed via fluid overpressure, and then hematite localized repeated subseismic slip, creating zones of shallow off-fault damage as far as 4 km orthogonal to the trace of the southern San Andreas fault. Distributed hematite slip surfaces develop by, and then accommodate, transient slow slip, potentially dampening or distributing earthquake energy in shallow continental faults.
  5. The Eastern California shear zone is a complex set of dextral faults that accommodates significant plate motion and has produced large earthquakes. The evolution of this system and why it consists of closely spaced, irregular faults that fail in multi‐fault ruptures are not well understood. Here we analyze the geometry, spatial distribution, and Quaternary slip activity of right‐lateral faults in the southern Mojave block. We find these faults are oriented favorably for accommodating regional dextral plate motion and do not show evidence of replacement following counterclockwise rotation to unfavorable positions, although activity may be migrating westward as previously proposed. We also confirm that the shear zone is transpressive, with widespread restraining bends, distributed convergent deformation, and significant impact on near‐fault topography. Observations also show that faults are geometrically complex, as represented by along‐strike variability in fault strike. We document a correlation between strike variability and fault activity (slip rate or net slip), which is evident within the shear zone as well as for a control group of other faults. We suggest that strike variability represents a form of geometric roughness, which may inhibit fault slip and result in complex ruptures, slip‐strengthening behavior, and a prevalence of off‐fault deformation. Other factors, including preexisting crustal fabric, edge effects, and changes in the stress field, may further complicate kinematics. These results suggest that faults of the shear zone are still juvenile and somewhat unique, yet offer an important window into how broadly distributed shear may evolve into a through‐going continental transform system.