skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: FORTIFY: Software Defined Data Plane Resilience
Given the scale and mission-critical nature of production networks today, it is essential to solidify their resilience to link failures. Building this resilience in each application separately is not scalable. In order to minimize downtime, at least some degree of resilience should be built directly into the data plane. Fast Failover groups in OpenFlow offer a mechanism to achieve this, but programming them introduces additional complexity to the existing arduous task of developing an SDN controller application. In this paper, we discuss how this complexity can be decoupled from the controller implementation. We introduce FORTIFY, a transparent resiliency layer that incorporates data plane fault tolerance into any existing controller application without any modification to it. FORTIFY operates as a shim layer between the controller and the data plane, and dynamically transforms the data plane rules computed by the controller to use Fast Failover groups. FORTIFY can be used off-The-shelf, or customized programmatically to choose specific types of backup paths. Experimental results collected on a production testbed demonstrate that FORTIFY is a practical, high-performance solution to data plane fault tolerance in SDNs.  more » « less
Award ID(s):
2113819
PAR ID:
10427017
Author(s) / Creator(s):
; ; ; ; ; ; ;
Editor(s):
Larry Horner, Kurt Tutschku
Date Published:
Journal Name:
2022 IEEE Conference on Network Function Virtualization and Software Defined Networks, NFV-SDN 2022
Page Range / eLocation ID:
6 to 12
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Integrating recent advancements in resilient algorithms and techniques into existing codes is a singular challenge in fault tolerance - in part due to the underlying complexity of implementing resilience in the first place, but also due to the difficulty introduced when integrating the functionality of a standalone new strategy with the preexisting resilience layers of an application. We propose that the answer is not to build integrated solutions for users, but runtimes designed to integrate into a larger comprehensive resilience system and thereby enable the necessary jump to multi-layered recovery. Our work designs, implements, and verifies one such comprehensive system of runtimes. Utilizing Fenix, a process resilience tool with integration into preexisting resilience systems as a design priority, we update Kokkos Resilience and the use pattern of VeloC to support application-level integration of resilience runtimes. Our work shows that designing integrable systems rather than integrated systems allows for user-designed optimization and upgrading of resilience techniques while maintaining the simplicity and performance of all-in-one resilience solutions. More application-specific choice in resilience strategies allows for better long-term flexibility, performance, and - importantly - simplicity. 
    more » « less
  2. Power grids are undergoing major changes due to rapid growth in renewable energy and improvements in battery technology. Prompted by the increasing complexity of power systems, decentralized IoT solutions are emerging, which arrange local communities into transactive microgrids. The core functionality of these solutions is to provide mechanisms for matching producers with consumers while ensuring system safety. However, there are multiple challenges that these solutions still face: privacy, trust, and resilience. The privacy challenge arises because the time series of production and consumption data for each participant is sensitive and may be used to infer personal information. Trust is an issue because a producer or consumer can renege on the promised energy transfer. Providing resilience is challenging due to the possibility of failures in the infrastructure that is required to support these market based solutions. In this paper, we develop a rigorous solution for transactive microgrids that addresses all three challenges by providing an innovative combination of MILP solvers, smart contracts, and publish-subscribe middleware within a framework of a novel distributed application platform, called Resilient Information Architecture Platform for Smart Grid. Towards this purpose, we describe the key architectural concepts, including fault tolerance, and show the trade-off between market efficiency and resource requirements. 
    more » « less
  3. This work is an experience with a deployed networked system for digital agriculture (or DA). Digital agriculture is the use of data-driven techniques towards a sustainable increase in farm productivity and efficiency. DA systems are expected to be overlaid on existing rural infrastructures, which are known to be less robust. While existing DA approaches partially address several infrastructure issues, challenges related to data aggregation, data analytics, and fault tolerance remain open. In this work, we present the design of Comosum, an extensible, reconfigurable, and fault-tolerant architecture of hardware, software, and distributed cloud abstractions to sense, analyze, and actuate on different farm types. FarmBIOS is an implementation of the Comosum architecture. We analyze FarmBIOS by leveraging various applications, deployment experiences, and network differences between urban and rural farms. This includes, for instance, an edge analytics application achieving 86% accuracy in vineyard disease detection. An eighteen-month deployment of FarmBIOS highlights Comosum’s fault tolerance. It was fault tolerant to intermittent network outages that lasted for several days during many periods of the deployment. We introduce active digital twins to cope with the unreliability of the underlying base systems. 
    more » « less
  4. null (Ed.)
    The importance of fault tolerance continues to increase for HPC applications. The continued growth in size and complexity of HPC systems, and of the applications them- selves, is leading to an increased likelihood of failures during execution. However, most HPC programming models do not have a built-in fault tolerance mechanism. Instead, application developers usually rely on external support such as application- level checkpoint-restart (C/R) libraries to make their codes fault tolerant. However, this increases the burden on the application developer, who must use the libraries carefully to ensure correct behavior and to minimize the overheads. The C/R routines will be employed to save the values of all needed program variables at the places in the code where they are invoked. It is important for correctness that the program data is in a consistent state at these places. It is non-trivial to determine such points in OpenSHMEM, which relies upon single-sided communications to provide high performance. The amount of data to be collected, and the frequency with which this is performed, must also be carefully tuned, as the overheads introduced by C/R calls can be extremely high. There is very little prior work on checkpoint-restart support in the context of the OpenSHMEM programming interface. In this paper, we introduce OpenSHMEM and describe the challenges it poses for checkpointing. We identify the safest places for inserting C/R calls in an OpenSHMEM program and describe a straightforward approach for identifying the data that needs to be checkpointed at these positions in the code. We provide these two functionalities in a tool that exploits compiler analyses to propose checkpoints and the sets of data for saving at them, to the application developer. 
    more » « less
  5. null (Ed.)
    Fault-tolerant virtual machine (VM) placement refers to the process of placing multiple copies of the same VM cloud application inside cloud data centers. The challenge is how to place required number of VM replicas while minimizing the number of physical machines (PMs) that store them, in order to save energy consumption of cloud data centers. We refer to it as fault-tolerant VM placement problem. In our previous work, we have proposed a greedy algorithm to solve this problem. In this paper, we compare it with an existing research that is based on well-known Welsh Powell Graph-Coloring Algorithm to place items into bins while considering the conflicts between items and items and items and bins. Via extensive simulations, we show that our greedy algorithm can turn off 40-50% more PMs than existing work and can place upto four times as many VM replicas as existing work, achieving much stronger fault-tolerance with less energy consumption. We also compare both algorithms with the optimal integer lin- ear programming (ILP)-based algorithm, which serves as the benchmark of the comparison. 
    more » « less