One-Size-Fits-None: Understanding and Enhancing Slow-Fault Tolerance in Modern Distributed Systems

Lu, Ruiming; Lu, Yunchi; Jiang, Yuxuan; Xue, Guangtao; Huang, Peng

This content will become publicly available on April 28, 2026

Recent studies have shown that various hardware components exhibit fail-slow behavior at scale. However, the characteristics of distributed software's tolerance of such slow faults remain ill-understood. This paper presents a comprehensive study that investigates the characteristics and current practices of slow-fault tolerance in modern distributed software. We focus on the fundamentally nuanced nature of slow faults. We develop a testing pipeline to systematically introduce diverse slow faults, measure their impact under different workloads, and identify the patterns. Our study shows that even small changes can lead to dramatically different reactions. While some systems have added slow-fault handling mechanisms, they are mostly controlled by static thresholds, which can hardly accommodate the highly sensitive and dynamic characteristics. To address this gap, we design ADR, a lightweight library to use within system code and make fail-slow handling adaptive. Evaluation shows ADR significantly reduces the impact of slow faults.
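To make the contrast between a static threshold and adaptive handling concrete, here is a minimal Python sketch. It is not ADR's design or API, which this record does not describe; the class name `AdaptiveSlowDetector`, the function `static_is_slow`, the sliding-window median baseline, and all parameter values are assumptions chosen only for illustration.

```python
from collections import deque
from statistics import median


def static_is_slow(latency_ms, limit_ms=1000.0):
    """Static check: the limit is fixed in advance (hypothetical value)."""
    return latency_ms > limit_ms


class AdaptiveSlowDetector:
    """Hypothetical adaptive check: the limit tracks recently observed latencies."""

    def __init__(self, window=100, factor=3.0, min_samples=10):
        self.samples = deque(maxlen=window)  # sliding window of baseline samples
        self.factor = factor                 # how far above the median counts as slow
        self.min_samples = min_samples       # do not judge before a baseline exists

    def threshold(self):
        """Return the current limit, or None while the baseline is too small."""
        if len(self.samples) < self.min_samples:
            return None
        return self.factor * median(self.samples)

    def observe(self, latency_ms):
        """Record one latency sample and report whether it looks slow."""
        limit = self.threshold()
        is_slow = limit is not None and latency_ms > limit
        # Simplification: flagged samples are kept out of the baseline so that a
        # slow episode does not raise the limit. A real system would also need a
        # way to accept a legitimate, lasting shift in workload.
        if not is_slow:
            self.samples.append(latency_ms)
        return is_slow
```

With a baseline of requests around 5 ms, a 40 ms request exceeds the adaptive limit of 15 ms, while the static 1000 ms limit does not flag it. This is one way a fixed threshold can miss a slowdown that is large relative to current behavior.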

Award ID(s): 2317751; 2317698

PAR ID: 10586678

Author(s) / Creator(s): Lu, Ruiming; Lu, Yunchi; Jiang, Yuxuan; Xue, Guangtao; Huang, Peng

Publisher / Repository: 22nd USENIX Symposium on Networked Systems Design and Implementation

Date Published: 2025-04-28

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Conference Paper: The DOI is not currently available.