skip to main content


Title: Fail-slow fault tolerance needs programming support
The need for fail-slow fault tolerance in modern distributed systems is highlighted by the increasingly reported fail-slow hardware/software components that lead to poor performance system-wide. We argue that fail-slow fault tolerance not only needs new distributed protocol designs, but also desires programming support for implementing and verifying fail-slow fault-tolerant code. Our observation is that the inability of tolerating fail-slow faults in existing distributed systems is often rooted in the implementations and is difficult to understand and debug. We designed the Dependably Fast Library (DepFast) for implementing fail-slow tolerant distributed systems. DepFast provides expressive interfaces for taking control of possible fail-slow points in the program to prevent unexpected slowness propagation once and for all. We use DepFast to implement a distributed replicated state machine (RSM) and show that it can tolerate various types of fail-slow faults that affect existing RSM implementations.  more » « less
Award ID(s):
2029049 1816615
PAR ID:
10293054
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
In Proceedings of the 18th Workshop on Hot Topics in Operating Systems (HotOS-XVIII)
Page Range / eLocation ID:
228 to 235
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    This paper shows how to use bounded-time recovery (BTR) to defend distributed systems against non-crash faults and attacks. Unlike many existing fault-tolerance techniques, BTR does not attempt to completely mask all symptoms of a fault; instead, it ensures that the system returns to the correct behavior within a bounded amount of time. This weaker guarantee is sufficient, e.g., for many cyber-physical systems, where physical properties - such as inertia and thermal capacity - prevent quick state changes and thus limit the damage that can result from a brief period of undefined behavior. We present an algorithm called REBOUND that can provide BTR for the Byzantine fault model. REBOUND works by detecting faults and then reconfiguring the system to exclude the faulty nodes. This supports very fine-grained responses to faults: for instance, the system can move or replace existing tasks, or drop less critical tasks entirely to conserve resources. REBOUND can take useful actions even when a majority of the nodes is compromised, and it requires less redundancy than full fault-tolerance. 
    more » « less
  2. This work is an experience with a deployed networked system for digital agriculture (or DA). Digital agriculture is the use of data-driven techniques towards a sustainable increase in farm productivity and efficiency. DA systems are expected to be overlaid on existing rural infrastructures, which are known to be less robust. While existing DA approaches partially address several infrastructure issues, challenges related to data aggregation, data analytics, and fault tolerance remain open. In this work, we present the design of Comosum, an extensible, reconfigurable, and fault-tolerant architecture of hardware, software, and distributed cloud abstractions to sense, analyze, and actuate on different farm types. FarmBIOS is an implementation of the Comosum architecture. We analyze FarmBIOS by leveraging various applications, deployment experiences, and network differences between urban and rural farms. This includes, for instance, an edge analytics application achieving 86% accuracy in vineyard disease detection. An eighteen-month deployment of FarmBIOS highlights Comosum’s fault tolerance. It was fault tolerant to intermittent network outages that lasted for several days during many periods of the deployment. We introduce active digital twins to cope with the unreliability of the underlying base systems. 
    more » « less
  3. Secret sharing is an essential tool for many distributed applications, including distributed key generation and multiparty computation. For many practical applications, we would like to tolerate network churn, meaning participants can dynamically enter and leave the pool of protocol participants as they please. Such protocols, called Dynamic-committee Proactive Secret Sharing (DPSS) have recently been studied; however, existing DPSS protocols do not gracefully handle faults: the presence of even one unexpectedly slow node can often slow down the whole protocol by a factor of O(n). In this work, we explore optimally fault-tolerant asynchronous DPSS that is not slowed down by crash faults and even handles byzantine faults while maintaining the same performance. We first introduce the first high-threshold DPSS, which offers favorable characteristics relative to prior non-synchronous works in the presence of faults while simultaneously supporting higher privacy thresholds. We then batch-amortize this scheme along with a parallel non-high-threshold scheme which achieves optimal bandwidth characteristics. We implement our schemes and demonstrate that they can compete with prior work in best-case performance while outperforming it in non-optimal settings. 
    more » « less
  4. Large-scale distributed systems must be built to anticipate and mitigate a variety of hardware and software failures. In order to build confidence that fault-tolerant systems are correctly implemented, Netflix (and similar enterprises) regularly run failure drills in which faults are deliberately injected in their production system. The combinatorial space of failure scenarios is too large to explore exhaustively. Existing failure testing approaches either randomly explore the space of potential failures randomly or exploit the "hunches" of domain experts to guide the search. Random strategies waste resources testing "uninteresting" faults, while programmer-guided approaches are only as good as human intuition and only scale with human effort. In this paper, we describe how we adapted and implemented a research prototype called lineage-driven fault injection (LDFI) to automate failure testing at Netflix. Along the way, we describe the challenges that arose adapting the LDFI model to the complex and dynamic realities of the Netflix architecture. We show how we implemented the adapted algorithm as a service atop the existing tracing and fault injection infrastructure, and present early results. 
    more » « less
  5. Abstract

    Advances in deep learning have revolutionized cyber‐physical applications, including the development of autonomous vehicles. However, real‐world collisions involving autonomous control of vehicles have raised significant safety concerns regarding the use of deep neural networks (DNNs) in safety‐critical tasks, particularly perception. The inherent unverifiability of DNNs poses a key challenge in ensuring their safe and reliable operation. In this work, we propose perception simplex ( ), a fault‐tolerant application architecture designed for obstacle detection and collision avoidance. We analyse an existing LiDAR‐based classical obstacle detection algorithm to establish strict bounds on its capabilities and limitations. Such analysis and verification have not been possible for deep learning‐based perception systems yet. By employing verifiable obstacle detection algorithms, identifies obstacle existence detection faults in the output of unverifiable DNN‐based object detectors. When faults with potential collision risks are detected, appropriate corrective actions are initiated. Through extensive analysis and software‐in‐the‐loop simulations, we demonstrate that provides deterministic fault tolerance against obstacle existence detection faults, establishing a robust safety guarantee.

     
    more » « less