skip to main content


Title: Managing State for Failure Resiliency in Network Function Virtualization
Ensuring high scalability (elastic scale-out and consolidation), as well as high availability (failure resiliency) are critical in encouraging adoption of software-based network functions (NFs). In recent years, two paradigms have evolved in terms of the way the NFs manage their state - namely the Stateful (state is coupled with the NF instance) and a Stateless (state is externalized to a datastore) manner. These two paradigms present unique challenges and opportunities for ensuring high scalability and high availability of NFs and NF chains. In this work, we assess the impact on ensuring the correctness of NF state including the implications of non-determinism in packet processing, and carefully analyze and present the benefits and disadvantages of the two state management paradigms. We leverage OpenNetVM and Redis in-memory datastore to implement both state management paradigms and empirically compare the two. Although the stateless paradigm is desirable for elastic scaling, our experimental results show that, even at line-rate packet processing (10 Gbps), stateful NFs can achieve chain-level failover across servers in a LAN incurring less than 10% performance. The state-of-the-art stateless counterparts incur severe throughput penalties. We observe 30-85% overhead on normal processing, depending on the mode of state updated to the externalized datastore.  more » « less
Award ID(s):
1823236 1763548 1823270
NSF-PAR ID:
10189820
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
2020 IEEE International Symposium on Local and Metropolitan Area Networks (LANMAN
Page Range / eLocation ID:
1 to 6
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Ensuring high availability (HA) for software-based networks is a critical design feature that will help the adoption of software-based network functions (NFs) in production networks. It is important for NFs to avoid outages and maintain mission-critical operations. However, HA support for NFs on the critical data path can result in unacceptable performance degradation. We present REINFORCE, an integrated framework to support efficient resiliency for NFs and NF service chains. REINFORCE includes timely failure detection and consistent failover mechanisms. REINFORCE replicates state to standby NFs (local and remote) while enforcing correctness. It minimizes the number of state transfers by exploiting the concept of external synchrony, and leverages opportunistic batching and multi-buffering to optimize performance. Experimental results show that, even at line-rate packet processing (10 Gbps), REINFORCE achieves chain-level failover across servers in a LAN (or within the same node) within 10ms (100/μs), incurring less than 10% (1%) performance overhead, and adds average latency of only ~400/μs (5/μs), with a worst-case latency of less than 1ms (10/μs). 
    more » « less
  2. Elastic scaling is a central promise of NFV but has been hard to realize in practice. The difficulty arises because most Network Functions (NFs) are stateful and this state need to be shared across NF instances. Implementing state sharing while meeting the throughput and latency requirements placed on NFs is challenging and, to date, no solution exists that meets NFV’s performance goals for the full spectrum of NFs. S6 is a new framework that supports elastic scaling of NFs without compromising performance. Its design builds on the insight that a distributed shared state abstraction is well-suited to the NFV context. We organize state as a distributed shared object (DSO) space and extend the DSO concept with techniques designed to meet the need for elasticity and high-performance in NFV workloads. S6 simplifies development: NF writers program with no awareness of how state is distributed and shared. Instead, S6 transparently migrates state and handles accesses to shared state. In our evaluation, compared to recent solutions for dynamic scaling of NFs, S6 improves performance by 100x during scaling events [25], and by 2-5x under normal operation 
    more » « less
  3. Developers are always on the lookout for simple solutions to manage their applications on cloud platforms. Major cloud providers have already been offering automatic elasticity management solutions (e.g., AWS Lambda, Azure durable function) to users. However, many cloud applications are stateful --- while executing, functions need to share their state with others. Providing elasticity for such stateful functions is much more challenging, as a deployment/elasticity decision for a stateful entity can strongly affect others in ways which are hard to predict without any application knowledge. Existing solutions either only support stateless applications (e.g., AWS Lambda) or only provide limited elasticity management (e.g., Azure durable function) to stateful applications. PLASMA (Programmable Elasticity for Stateful Cloud Computing Applications) is a programming framework for elastic stateful cloud applications. It includes (1) an elasticity programming language as a second "level" of programming (complementing the main application programming language) for describing elasticity behavior, and (2) a novel semantics-aware elasticity management runtime that tracks program execution and acts upon application features as suggested by elasticity behavior. We have implemented 10+ applications with PLASMA. Extensive evaluation on Amazon AWS shows that PLASMA significantly improves their efficiency, e.g., achieving same performance as a vanilla setup with 25% fewer resources, or improving performance by 40% compared to the default setup. 
    more » « less
  4. null (Ed.)
    Core-Stateless Fair Queueing (CSFQ) is a scalable algorithm proposed more than two decades ago to achieve fair queueing without keeping per-flow state in the network. Unfortunately, CSFQ did not take off, in part because it required protocol changes (i.e., adding new fields to the packet header), and hardware support to process packets at line rate. In this paper, we argue that two emerging trends are making CSFQ relevant again: (1) cloud computing which makes it feasible to change the protocol within the same datacenter or across datacenters owned by the same provider, and (2) programmable switches which can implement sophisticated packet processing at line rate. To this end, we present the first realization of CSFQ using programmable switches. In addition, we generalize CSFQ to a multi-level hierarchy, which naturally captures the traffic in today's datacenters, e.g., tenants at the first level and flows of each tenant at the second level of the hierarchy. We call this scheduler Hierarchical Core-Stateless Fair Queueing (HCSFQ), and show that it is able to accurately approximate hierarchical fair queueing. HCSFQ is highly scalable: it uses just a single FIFO queue, does not perform per-packet scheduling, and only needs to maintain state for the interior nodes of the hierarchy. We present analytical results to prove the lower bounds of HCSFQ. Our testbed experiments and large-scale simulations show that CSFQ and HCSFQ can provide fair bandwidth allocation and ensure isolation. 
    more » « less
  5. Software Defined Networking (SDN) and Network Function Virtualization (NFV) are transforming Data Center (DC), Telecom, and enterprise networking. The programmability offered by P4 enables SDN to be more protocol-independent and flexible. Data Centers are increasingly adopting SmartNICs (sNICs) to accelerate packet processing that can be leveraged to support packet processing pipelines and custom Network Functions (NFs). However, there are several challenges in integrating and deploying P4 based SDN control as well as host and sNIC-based programmable NFs. These include configuration and management of the data plane components (Host and sNIC P4 switches) for the SDN control plane and effective utilization of data plane resources. P4NFV addresses these concerns and provides a unified P4 switch abstraction framework to simplify the SDN control plane, reducing management complexities, and leveraging a host-local SDN Agent to improve the overall resource utilization. The SDN agent considers the network-wide, host, and sNIC specific capabilities and constraints. Based on workload and traffic characteristics, P4NFV determines the partitioning of the P4 tables and optimal placement of NFs (P4 actions) to minimize the overall delay and maximize resource utilization. P4NFV uses Mixed Integer Linear Programming (MILP) based optimization formulation and achieves up to 2. 5X increase in system capacity while minimizing the delay experienced by flows. P4NFV considers the number of packet exchanges, flow size, and state dependency to minimize the delay imposed by data transmission over PCI Express interface. 
    more » « less