

Title: Modular Control Plane Verification via Temporal Invariants
Monolithic control plane verification cannot scale to hyperscale network architectures with tens of thousands of nodes, heterogeneous network policies, and thousands of network changes per day. Instead, modular verification offers improved scalability, reasoning over diverse behaviors, and robustness following policy updates. We introduce Timepiece, a new modular control plane verification system. While one class of verifiers, starting with Minesweeper, was based on analysis of stable paths, we show that such models, when deployed naïvely for modular verification, are unsound. To rectify the situation, we adopt a routing model based on a logical notion of time and develop a sound, expressive, and scalable verification engine. Our system requires that a user specify interfaces between module components. We develop methods for defining these interfaces using predicates inspired by temporal logic, and show how to use those interfaces to verify a range of network-wide properties such as reachability or access control. Verifying a prefix-filtering policy using a non-modular verification engine times out on an 80-node fattree network after 2 hours, whereas Timepiece verifies a 2,000-node fattree in 2.37 minutes on a 96-core virtual machine. Modular verification of individual routers is embarrassingly parallel and completes in seconds, which allows verification to scale beyond non-modular engines while still allowing the full power of SMT-based symbolic reasoning.
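To make the time-indexed interfaces concrete, the sketch below checks a toy "route to the destination" property in the spirit of Timepiece's per-router obligations: assume each neighbor's interface at time t, then prove the router's own interface at time t + 1. The topology, interface predicate, and check_node helper are our illustrative assumptions (built on the z3-solver Python bindings), not Timepiece's actual code.

```python
# Minimal sketch of a Timepiece-style per-router check (illustrative; not the
# paper's implementation). Routes are booleans ("do I have a route?"), the
# network is a 3-node path with the destination at node 0, and each node's
# interface is a time-indexed predicate: "after dist(v) steps, v has a route."
from z3 import Bool, BoolVal, Int, Implies, Or, Not, Solver, unsat

neighbors = {0: [1], 1: [0, 2], 2: [1]}   # toy topology: 0 - 1 - 2
dist = {0: 0, 1: 1, 2: 2}                 # hops to the destination (node 0)
initial = {0: True, 1: False, 2: False}   # only the destination starts with a route

def interface(v, t, has_route):
    # Time-indexed interface: once t >= dist[v], node v must have a route.
    return Implies(t >= dist[v], has_route)

def check_node(v):
    # Modular obligation for one router: assuming every neighbor meets its
    # interface at time t, v meets its own interface at time t + 1.
    s = Solver()
    t = Int("t")
    s.add(t >= 0)
    nbr_routes = {u: Bool(f"route_{u}") for u in neighbors[v]}
    for u, r in nbr_routes.items():
        s.add(interface(u, t, r))         # assume neighbors' interfaces
    # One synchronous step: keep the initial route or adopt any neighbor's.
    new_route = Or(BoolVal(initial[v]), *nbr_routes.values())
    # Validity check: the negated guarantee must be unsatisfiable.
    s.add(Not(interface(v, t + 1, new_route)))
    return s.check() == unsat

# Each router's obligation is independent, hence embarrassingly parallel.
assert all(check_node(v) for v in neighbors)
print("all per-router obligations verified")
```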
Award ID(s): 2107138, 1837030
NSF-PAR ID: 10427347
Author(s) / Creator(s):
Date Published:
Journal Name: Proceedings of the ACM on Programming Languages
Volume: 7
Issue: PLDI
ISSN: 2475-1421
Page Range / eLocation ID: 50 to 75
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Satisfiability Modulo Theories (SMT)-based analysis allows exhaustive reasoning over complex distributed control plane routing behaviors, enabling verification of converged routing states under arbitrary conditions. To improve the scalability of SMT solving, we introduce a modular approach to network control plane verification, in which we cut a network into smaller fragments. Users specify an annotated cut describing how to generate these fragments from the monolithic network, and we verify each fragment independently, using the annotations to define assumptions and guarantees over fragments, akin to assume-guarantee reasoning. We prove that any converged state of the fragments is a converged state of the monolithic network, and that there exists an annotated cut whose fragments correspond to any converged state of the monolithic network. We implement this procedure as Kirigami, an extension of the network verification language and tool NV, and evaluate it on industrial topologies with synthesized policies. We observe a 10x improvement in end-to-end NV verification time, with SMT solve time improving by up to 6 orders of magnitude.
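As a concrete illustration of an annotated cut, the following self-contained sketch splits a four-node shortest-path network into two fragments, converges each under the route assumed across the cut, and checks that each fragment guarantees the route the other assumed. The converge helper and the simulation-style check are our simplification; Kirigami itself discharges these obligations with SMT inside NV.

```python
# Illustrative sketch of verifying fragments generated by an annotated cut.
INF = float("inf")

def converge(nodes, edges, initial, assumptions):
    # Bellman-Ford-style convergence inside one fragment. `assumptions` maps a
    # boundary node to the route value assumed on its cut-off neighbor.
    route = dict(initial)
    for v, assumed in assumptions.items():
        route[v] = min(route[v], assumed + 1)   # route learned across the cut
    changed = True
    while changed:
        changed = False
        for u, v in edges:                      # undirected internal edges
            for a, b in ((u, v), (v, u)):
                if route[a] + 1 < route[b]:
                    route[b] = route[a] + 1
                    changed = True
    return route

# Monolithic network d - a - b - c routing to d; cut the a - b edge.
frag1 = converge({"d", "a"}, [("d", "a")],
                 {"d": 0, "a": INF}, {"a": 2})    # assume b offers route 2
frag2 = converge({"b", "c"}, [("b", "c")],
                 {"b": INF, "c": INF}, {"b": 1})  # assume a offers route 1

# Guarantees close the loop: each fragment provides the route the other assumed.
assert frag1["a"] == 1 and frag2["b"] == 2
print(frag1, frag2)   # converged states match the monolithic network's
```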
  2. This paper presents McNetKAT, a scalable tool for verifying probabilistic network programs. McNetKAT is based on a new semantics for the guarded and history-free fragment of Probabilistic NetKAT in terms of finite-state, absorbing Markov chains. This view allows the semantics of all programs to be computed exactly, enabling construction of an automatic verification tool. Domain-specific optimizations and a parallelizing backend enable McNetKAT to analyze networks with thousands of nodes, automatically reasoning about general properties such as probabilistic program equivalence and refinement, as well as networking properties such as resilience to failures. We evaluate McNetKAT's scalability using real-world topologies, compare its performance against state-of-the-art tools, and develop an extended case study on a recently proposed data center network design. 
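The absorbing Markov chains at the heart of this semantics support exact analysis via the standard fundamental-matrix construction. The numpy sketch below works one tiny example; the states, probabilities, and delivered/dropped reading are invented for illustration, and floats stand in for the exact rational arithmetic McNetKAT uses.

```python
# Absorption probabilities of a small absorbing Markov chain, the semantic
# object a McNetKAT-style analysis assigns to a probabilistic program.
import numpy as np

# States 0, 1: packet in flight (transient); 2: delivered; 3: dropped (absorbing).
P = np.array([
    [0.0, 0.5, 0.4, 0.1],
    [0.2, 0.0, 0.7, 0.1],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
Q, R = P[:2, :2], P[:2, 2:]        # transient->transient, transient->absorbing blocks
N = np.linalg.inv(np.eye(2) - Q)   # fundamental matrix: expected visits to transient states
B = N @ R                          # B[i, j] = Pr[absorbed in state j | start in state i]
print(B[0])                        # delivery vs. drop probability starting from state 0
```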
  3. Reasoning about storage systems is challenging because these systems make persistence guarantees even if the system crashes at any point. To achieve these crash-safety guarantees, storage systems include recovery procedures to restore the system to a consistent state after a crash. Moreover, large-scale systems are structured as multiple stacked layers and can require recovery at multiple layers of abstraction. Formal verification can ensure that crash-safety guarantees hold regardless of when the system crashes. To make verification tractable, large-scale systems should be verified in a modular fashion, layer-by-layer in the software stack. Layered recovery makes modularity challenging because the system can crash in the middle of a high-level recovery procedure and must start over from the low-level recovery procedure. We present Argosy, a framework for machine-checked proofs of storage systems that supports layered recovery implementations with modular proofs. The framework is based on combinators for transition relations that are inspired by Kleene algebra, which provides a convenient formalism for specifying and reasoning about crashes and recovery. On top of this framework, we implement Crash Hoare Logic (CHL), the program logic used by FSCQ. Using the logic, we have verified an example of layered recovery featuring a write-ahead log on top of a disk, which itself runs by replicating over two unreliable disks. The metatheory of the framework, the soundness of the program logic, and these examples are all verified in the Coq theorem prover. 
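To give a flavor of the Kleene-algebra-style combinators, the sketch below models transition relations as Python sets of state pairs and builds the crash-then-recover pattern as (crash ; recover)* ; op. The encoding and names are ours; Argosy's combinators are defined and verified in Coq.

```python
# Transition-relation combinators in the spirit of Kleene algebra (illustrative).
def seq(r1, r2):
    """Relational composition: run r1, then r2."""
    return {(s, s2) for (s, s1) in r1 for (t, s2) in r2 if s1 == t}

def choice(r1, r2):
    """Nondeterministic choice: either relation may fire."""
    return r1 | r2

def star(r, states):
    """Kleene star: zero or more steps of r (reflexive-transitive closure)."""
    closure = {(s, s) for s in states}
    while True:
        step = seq(closure, r) - closure
        if not step:
            return closure
        closure |= step

# A crash may interrupt anywhere; recovery then restores a consistent state.
states = {0, 1, 2}
op      = {(0, 2)}            # normal execution: 0 -> 2
crash   = {(0, 1), (2, 1)}    # crashing lands in intermediate state 1
recover = {(1, 0)}            # recovery restores the initial state
retries = star(seq(crash, recover), states)
print(seq(retries, op))       # runs that crash some number of times, then finish
```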
  4. The lack of large-scale, continuously evolving empirical data usually limits the study of networks to the analysis of snapshots in time. This approach has been used to verify network evolution mechanisms such as preferential attachment. However, these studies are mostly restricted to the analysis of the first links established by a new node in the network and typically ignore connections made after each node’s initial introduction. Here, we show that the subsequent actions of individuals, such as their second network link, are not random and can be decoupled from the mechanism behind the first network link. We show that this feature has a strong influence on the network topology. Moreover, snapshots in time can now provide information on the mechanism used to establish the second connection. We interpret these empirical results by introducing the “propinquity model,” in which we control and vary the distance of the second link established by a new node, and find that this can lead to networks with tunable density scaling, as found in real networks. Our work shows that sociologically meaningful mechanisms are influencing network evolution and underscores the importance of measuring the distance between successive connections.
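A minimal generative sketch of the propinquity idea: each new node makes a preferential first link, then places its second link at a tunable graph distance from the first. The parameter names and the fallback choice are our assumptions, not the paper's model specification (requires networkx).

```python
# Toy "propinquity model": tune where the second link of each new node lands.
import random
import networkx as nx

def propinquity_graph(n, second_link_distance=2, seed=0):
    rng = random.Random(seed)
    G = nx.complete_graph(3)                     # small seed network
    for new in range(3, n):
        # First link: preferential attachment (probability proportional to degree).
        targets = list(G.nodes)
        weights = [G.degree(v) for v in targets]
        first = rng.choices(targets, weights=weights)[0]
        # Second link: a node at the chosen distance from the first endpoint.
        dists = nx.single_source_shortest_path_length(G, first)
        ring = [v for v, d in dists.items() if d == second_link_distance]
        second = rng.choice(ring) if ring else rng.choice(targets)
        G.add_node(new)
        G.add_edge(new, first)
        if second != first:
            G.add_edge(new, second)
    return G

G = propinquity_graph(200, second_link_distance=2)
print(G.number_of_nodes(), G.number_of_edges())
```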
  5. Burst-parallel serverless applications invoke thousands of short-lived distributed functions to complete complex jobs such as data analytics, video encoding, or compilation. While these tasks execute in seconds, starting and configuring the virtual network they rely on is a major bottleneck that can consume up to 84% of total startup time. In this paper, we characterize the magnitude of this network cold start problem in three popular overlay networks: Docker Swarm, Weave, and Linux Overlay. We focus on end-to-end startup time, encompassing both the time to boot a group of containers and the time to interconnect them. Our primary observation is that existing overlay approaches for serverless networking scale poorly in short-lived serverless environments. Based on our findings, we develop Particle, a network stack tailored for multi-node serverless overlay networks that optimizes network creation without sacrificing multi-tenancy, generality, or throughput. When integrated into a serverless burst-parallel video processing pipeline, Particle improves application runtime by 2.4 to 3x over existing overlays.