Fail-slow fault tolerance needs programming support

Yoo, Andrew; Wang, Yuanli; Sinha, Ritesh; Mu, Shuai; Xu, Tianyin

doi:10.1145/3458336.3465299

The need for fail-slow fault tolerance in modern distributed systems is highlighted by the increasing number of reported fail-slow hardware and software components that degrade performance system-wide. We argue that fail-slow fault tolerance needs not only new distributed protocol designs, but also programming support for implementing and verifying fail-slow fault-tolerant code. Our observation is that the inability to tolerate fail-slow faults in existing distributed systems is often rooted in their implementations and is difficult to understand and debug. We designed the Dependably Fast Library (DepFast) for implementing fail-slow tolerant distributed systems. DepFast provides expressive interfaces for taking control of possible fail-slow points in the program to prevent unexpected slowness propagation once and for all. We use DepFast to implement a distributed replicated state machine (RSM) and show that it can tolerate various types of fail-slow faults that affect existing RSM implementations.

Award ID(s): 2029049; 1816615

PAR ID: 10293054

Author(s) / Creator(s): Yoo, Andrew; Wang, Yuanli; Sinha, Ritesh; Mu, Shuai; Xu, Tianyin

Date Published: 2021-06-01

Journal Name: In Proceedings of the 18th Workshop on Hot Topics in Operating Systems (HotOS-XVIII)

Page Range / eLocation ID: 228 to 235

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper:
https://doi.org/10.1145/3458336.3465299
