skip to main content


Title: Aragog: Scalable Runtime Verification of Shardable Networked Systems
Network functions like firewalls, proxies, and NATs are instances of distributed systems that lie on the critical path for a substantial fraction of today's cloud applications. Unfortunately, validating these systems remains difficult due to their complex stateful, timed, and distributed behaviors. In this paper, we present the design and implementation of Aragog, a runtime verification system for distributed network functions that achieves high expressiveness, fidelity, and scalability. Given a property of interest, Aragogefficiently checks running systems for violations of the property with a scale-out architecture consisting of a collection of global verifiers and local monitors. To improve performance and reduce communication overhead, Aragog includes an array of optimizations that leverage properties of networked systems to suppress provably unnecessary system events and to shard verification over every available local and global component. We evaluate Aragog over several network functions including a NAT Gateway that powers Azure, identifying both design and implementation bugs in the process.  more » « less
Award ID(s):
1845749
NSF-PAR ID:
10228226
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We propose the Write-Once Register (WOR) as an abstraction for building and verifying distributed systems. A WOR exposes a simple, data-centric API: clients can capture, write, and read it. Applications can use a sequence or a set of WORs to obtain properties such as durability, concurrency control, and failure atomicity. By hiding the logic for distributed coordination underneath a data-centric API, the WOR abstraction enables easy, incremental, and extensible implementation and verification of applications built above it. We present the design, implementation, and verification of a system called WormSpace that provides developers with an address space of WORs, implementing each WOR via a Paxos instance. We describe three applications built over WormSpace: a flexible, efficient Multi-Paxos implementation; a shared log implementation with lower append latency than the state-of-the-art; and a fault-tolerant transaction coordinator that uses an optimal number of round-trips. We show that these applications are simple, easy to verify, and match the performance of unverified monolithic implementations. We use a modular layered verification approach to link the proofs for WormSpace, its applications, and a verified operating system to produce the first verified distributed system stack from the application to the operating system. 
    more » « less
  2. Elsman, Martin (Ed.)
    Protocols to ensure that messages are delivered in causal order are a ubiquitous building block of distributed systems. For instance, distributed data storage systems can use causally ordered message delivery to ensure causal consistency, and CRDTs can rely on the existence of an underlying causally-ordered messaging layer to simplify their implementation. A causal delivery protocol ensures that when a message is delivered to a process, any causally preceding messages sent to the same process have already been delivered to it. While causal delivery protocols are widely used, verification of their correctness is less common, much less machine-checked proofs about executable implementations. We implemented a standard causal broadcast protocol in Haskell and used the Liquid Haskell solver-aided verification system to express and mechanically prove that messages will never be delivered to a process in an order that violates causality. We express this property using refinement types and prove that it holds of our implementation, taking advantage of Liquid Haskell’s underlying SMT solver to automate parts of the proof and using its manual theorem-proving features for the rest. We then put our verified causal broadcast implementation to work as the foundation of a distributed key-value store. 
    more » « less
  3. We propose a new distributed learning-based framework for stability assessment of a class of networked nonlinear systems, where each subsystem is dissipative. The aim is to learn, in a distributed manner, a Lyapunov function and associated region of attraction for the networked system. We begin by using a neural network function approximation to learn a storage function for each subsystem such that the subsystem satisfies a local dissipativity property. We next use a satisfiability modulo theories (SMT) solver based falsifier that verifies the local dissipativity of each subsystem by deter- mining an absence of counterexamples that violate the local dissipativity property, as established by the neural network approximation. Finally, we verify network-level stability by using an alternating direction method of multipliers (ADMM) approach to update the storage function of each subsystem in a distributed manner until a global stability condition for the network of dissipative subsystems is satisfied. This step also leads to a network-level Lyapunov function that we then use to estimate a region of attraction. We illustrate the proposed algorithm and its advantages on a microgrid interconnection with power electronics interfaces. 
    more » « less
  4. N. Matni, M. Morari (Ed.)
    This paper proposes a computationally efficient framework, based on interval analysis, for rigorous verification of nonlinear continuous-time dynamical systems with neural network controllers. Given a neural network, we use an existing verification algorithm to construct inclusion functions for its input-output behavior. Inspired by mixed monotone theory, we embed the closed-loop dynamics into a larger system using an inclusion function of the neural network and a decomposition function of the open-loop system. This embedding provides a scalable approach for safety analysis of the neural control loop while preserving the nonlinear structure of the system. We show that one can efficiently compute hyper-rectangular over-approximations of the reachable sets using a single trajectory of the embedding system. We design an algorithm to leverage this computational advantage through partitioning strategies, improving our reachable set estimates while balancing its runtime with tunable parameters. We demonstrate the performance of this algorithm through two case studies. First, we demonstrate this method’s strength in complex nonlinear environments. Then, we show that our approach matches the performance of the state-of-the art verification algorithm for linear discretized systems. 
    more » « less
  5. Despite recent advances, guaranteeing the correctness of large-scale distributed applications without compromising performance remains a challenging problem. Network and node failures are inevitable and, for some applications, careful control over how they are handled is essential. Unfortunately, existing approaches either completely hide these failures behind an atomic state machine replication (SMR) interface, or expose all of the network-level details, sacrificing atomicity. We propose a novel, compositional, atomic distributed object (ADO) model for strongly consistent distributed systems that combines the best of both options. The object-oriented API abstracts over protocol-specific details and decouples high-level correctness reasoning from implementation choices. At the same time, it intentionally exposes an abstract view of certain key distributed failure cases, thus allowing for more fine-grained control over them than SMR-like models. We demonstrate that proving properties even of composite distributed systems can be straightforward with our Coq verification framework, Advert, thanks to the ADO model. We also show that a variety of common protocols including multi-Paxos and Chain Replication refine the ADO semantics, which allows one to freely choose among them for an application's implementation without modifying ADO-level correctness proofs. 
    more » « less