- Publication Date:
- NSF-PAR ID:
- Journal Name:
- In Proceedings of the 18th Workshop on Hot Topics in Operating Systems (HotOS-XVIII)
- Page Range or eLocation-ID:
- 228 to 235
- Sponsoring Org:
- National Science Foundation
More Like this
This paper shows how to use bounded-time recovery (BTR) to defend distributed systems against non-crash faults and attacks. Unlike many existing fault-tolerance techniques, BTR does not attempt to completely mask all symptoms of a fault; instead, it ensures that the system returns to the correct behavior within a bounded amount of time. This weaker guarantee is sufficient, e.g., for many cyber-physical systems, where physical properties - such as inertia and thermal capacity - prevent quick state changes and thus limit the damage that can result from a brief period of undefined behavior. We present an algorithm called REBOUND that can provide BTR for the Byzantine fault model. REBOUND works by detecting faults and then reconfiguring the system to exclude the faulty nodes. This supports very fine-grained responses to faults: for instance, the system can move or replace existing tasks, or drop less critical tasks entirely to conserve resources. REBOUND can take useful actions even when a majority of the nodes is compromised, and it requires less redundancy than full fault-tolerance.
This work is an experience with a deployed networked system for digital agriculture (or DA). Digital agriculture is the use of data-driven techniques towards a sustainable increase in farm productivity and efficiency. DA systems are expected to be overlaid on existing rural infrastructures, which are known to be less robust. While existing DA approaches partially address several infrastructure issues, challenges related to data aggregation, data analytics, and fault tolerance remain open. In this work, we present the design of Comosum, an extensible, reconfigurable, and fault-tolerant architecture of hardware, software, and distributed cloud abstractions to sense, analyze, and actuate on different farm types. FarmBIOS is an implementation of the Comosum architecture. We analyze FarmBIOS by leveraging various applications, deployment experiences, and network differences between urban and rural farms. This includes, for instance, an edge analytics application achieving 86% accuracy in vineyard disease detection. An eighteen-month deployment of FarmBIOS highlights Comosum’s fault tolerance. It was fault tolerant to intermittent network outages that lasted for several days during many periods of the deployment. We introduce active digital twins to cope with the unreliability of the underlying base systems.
Large-scale distributed systems must be built to anticipate and mitigate a variety of hardware and software failures. In order to build confidence that fault-tolerant systems are correctly implemented, Netflix (and similar enterprises) regularly run failure drills in which faults are deliberately injected in their production system. The combinatorial space of failure scenarios is too large to explore exhaustively. Existing failure testing approaches either randomly explore the space of potential failures randomly or exploit the "hunches" of domain experts to guide the search. Random strategies waste resources testing "uninteresting" faults, while programmer-guided approaches are only as good as human intuition and only scale with human effort. In this paper, we describe how we adapted and implemented a research prototype called lineage-driven fault injection (LDFI) to automate failure testing at Netflix. Along the way, we describe the challenges that arose adapting the LDFI model to the complex and dynamic realities of the Netflix architecture. We show how we implemented the adapted algorithm as a service atop the existing tracing and fault injection infrastructure, and present early results.
We propose the Write-Once Register (WOR) as an abstraction for building and verifying distributed systems. A WOR exposes a simple, data-centric API: clients can capture, write, and read it. Applications can use a sequence or a set of WORs to obtain properties such as durability, concurrency control, and failure atomicity. By hiding the logic for distributed coordination underneath a data-centric API, the WOR abstraction enables easy, incremental, and extensible implementation and verification of applications built above it. We present the design, implementation, and verification of a system called WormSpace that provides developers with an address space of WORs, implementing each WOR via a Paxos instance. We describe three applications built over WormSpace: a flexible, efficient Multi-Paxos implementation; a shared log implementation with lower append latency than the state-of-the-art; and a fault-tolerant transaction coordinator that uses an optimal number of round-trips. We show that these applications are simple, easy to verify, and match the performance of unverified monolithic implementations. We use a modular layered verification approach to link the proofs for WormSpace, its applications, and a verified operating system to produce the first verified distributed system stack from the application to the operating system.
We conceptualize a
decentralizedsoftware application as one constituted from autonomousagents that communicate via asynchronousmessaging. Modern software paradigms such as microservices and settings such as the Internet of Things evidence a growing interest in decentralized applications. Constructing a decentralized application involves designing agents as independent local computations that coordinate successfully to realize the application’s requirements. Moreover, a decentralized application is susceptible to faults manifested as message loss, delay, and reordering. We contribute Mandrake, a programming model for decentralized applications that tackles these challenges without relying on infrastructure guarantees. Specifically, we adopt the construct of an information protocolthat specifies messaging between agents purely in causal terms and can be correctly enacted by agents in a shared-nothing environment over nothing more than unreliable, unordered transport. Mandrake facilitates (1) implementing protocol-compliant agents by introducing a programming model; (2) transforming protocols into fault-tolerant ones with simple annotations; and (3) a declarative policy language that makes it easy to implement fault-tolerance in agents based on the capabilities in protocols. Mandrake’s significance lies in demonstrating a straightforward approach for constructing decentralized applications without relying on coordination mechanisms in the infrastructure, thus achieving some of the goals of the founders of networked computing from the 1970s.