Efficient Reproduction of Fault-Induced Failures in Distributed Systems with Feedback-Driven Fault Injection

Pan, Jia; Wu, Haoze; Leesatapornwongsa, Tanakorn; Nath, Suman; Huang, Peng

doi:10.1145/3694715.3695979

This content will become publicly available on November 4, 2025

Debugging a failure usually requires reproducing it first. This can be hard for failures in production distributed systems, where bugs are exposed only by unusual faulty events. Although fault injection testing has become popular, existing solutions are designed for bug finding; they are ineffective and inefficient at reproducing a specific failure during debugging. We explore a new type of fault injection technique for quickly reproducing a given fault-induced production failure in distributed systems. We present a tool, Anduril, that uses static causal analysis and a novel feedback-driven algorithm to quickly search the enormous fault space for the root-cause fault and timing. We evaluate Anduril on 22 real-world complex fault-induced failures from five large-scale distributed systems. Anduril reproduced all failures by identifying and injecting the root-cause faults at the right time, in a median of 8 minutes.

Award ID(s): 2317698

PAR ID: 10555767

Author(s) / Creator(s): Pan, Jia; Wu, Haoze; Leesatapornwongsa, Tanakorn; Nath, Suman; Huang, Peng

Publisher / Repository: ACM

Date Published: 2024-11-04

ISBN: 9798400712517

Page Range / eLocation ID: 46 to 62

Format(s): Medium: X

Location: Austin TX USA

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Conference Paper:
https://doi.org/10.1145/3694715.3695979