Title: Scaling beyond packet switch limits with multiple dataplanes
Scale-out datacenter network fabrics enable network operators to translate improved link and switch speeds directly into end-host throughput. Unfortunately, limits in the underlying CMOS packet switch chip manufacturing roadmap mean that NICs, links, and switches are not getting faster fast enough to meet demand. As a result, operators have introduced alternative, parallel fabric designs in the core of the network that deliver N-times the bandwidth by simply forwarding traffic over any of N parallel network fabrics. In this work, we consider extending this parallel network idea all the way to the end host. Our initial impressions found that direct application of existing path selection and forwarding techniques resulted in poor performance. Instead, we show that appropriate path selection and forwarding protocols can not only improve the performance of existing, homogeneous parallel fabrics, but enable the development of heterogeneous parallel network fabrics that can deliver even higher bandwidth, lower latency, and improved resiliency than traditional designs constructed from the same constituent components.
Award ID(s):
1911104
NSF-PAR ID:
10458427
Journal Name:
CoNEXT '22: Proceedings of the 18th International Conference on emerging Networking EXperiments and Technologies
Page Range / eLocation ID:
214 to 231
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. New international academic collaborations are being created at a fast pace, generating data sets on the order of terabytes each day. Often these data sets need to be moved in real time to a central location to be processed and then shared. In the field of astronomy, building data processing facilities in remote locations is not always feasible, creating the need for a high-bandwidth network infrastructure to transport these data sets over very long distances. This network infrastructure normally relies on multiple networks operated by multiple organizations or projects. Creating an end-to-end path involving multiple network operators, technologies, and interconnections often adds conditions that make the real-time movement of big data sets challenging. The Large Synoptic Survey Telescope (LSST) is an example of an astronomical application imposing new challenges on multi-domain network provisioning activities. The network for LSST is challenging for a number of reasons: (1) with the telescope in Chile and the archiving facility in the USA, the network has a high propagation delay, which affects the performance of traditional transport protocols; (2) the path is composed of multiple network operators, which means that the different network operating teams involved must coordinate technologies and protocols to support all parallel data transfers in an efficient way; (3) the large amount of data produced (12.7 GB/image) and the small interval available to transfer this data (5 seconds) to the archiving facility require special Quality of Service (QoS) policies; (4) because network events happen, the network needs to be ready to adjust for rainy days, where some data types will be prioritized over others. To guarantee data transfers will happen within the required interval, each network operator in the path needs to apply QoS policies to each of its network links.
These policies need to be coordinated end-to-end and, in the case where the network is affected by parallel events, all policies might need to be dynamically reconfigured in real time to accommodate specific QoS policies for rainy days. Reconfiguring QoS policies is a very complex activity for current network protocols and technologies, sometimes requiring human intervention. This presentation aims to share the efforts to guarantee an efficient network configuration capable of handling LSST data transfers on sunny and rainy days across multiple network operators from South to North America.
  2. In this paper we propose a novel approach to deliver better delay-jitter performance in dynamic networks. Dynamic networks experience rapid and unpredictable fluctuations and hence, a certain amount of uncertainty about the delay-performance of various network elements is unavoidable. This uncertainty makes it difficult for network operators to guarantee a certain quality of service (in terms of delay and jitter) to users. The uncertainty about the state of the network is often overlooked to simplify problem formulation, but we capture it by modeling the delay on various links as general and potentially correlated random processes. Within this framework, a user will request a certain delay-jitter performance guarantee from the network. After verifying the feasibility of the request, the network will respond to the user by specifying a set of routes as well as the proportion of traffic which should be sent through each one to achieve the desired QoS. We propose to use mean-variance analysis as the basis for traffic distribution and route selection, and show that this technique can significantly reduce the end-to-end jitter because it accounts for the correlated nature of delay across different paths. The resulting traffic distribution is often non-uniform and the fractional flow on each path is the solution to a simple convex optimization problem. We conclude the paper by commenting on the potential application of this method to general transportation networks. 
  3. The 5G user plane function (UPF) is a critical inter-connection point between the data network and cellular network infrastructure. It governs the packet processing performance of the 5G core network. UPFs also need to be flexible to support several key control plane operations. Existing UPFs typically run on general-purpose CPUs, but have limited performance because of the overheads of host-based forwarding. We design Synergy, a novel 5G UPF running on SmartNICs that provides high throughput and low latency. It also supports monitoring functionality to gather critical data on user sessions for the prediction and optimization of handovers during user mobility. The SmartNIC UPF efficiently buffers data packets during handover and paging events by using a two-level flow-state access mechanism. This enables maintaining flow-state for a very large number of flows, thus providing very low latency for control and data planes and high throughput packet forwarding. Mobility prediction can reduce the handover delay by pre-populating state in the UPF and other core NFs. Synergy performs handover predictions based on an existing recurrent neural network model. Synergy's mobility predictor helps us achieve 2.32× lower average handover latency. Buffering in the SmartNIC, rather than the host, during paging and handover events reduces packet loss rate by at least 2.04×. Compared to previous approaches to building programmable switch-based UPFs, Synergy speeds up control plane operations such as handovers because of the low P4-programming latency leveraging tight coupling between SmartNIC and host. 
  4. Operators in multi-tenant cloud datacenters require support for diverse and complex end-to-end policies, such as reachability, middlebox traversals, isolation, traffic engineering, and network resource management. We present Genesis, a datacenter network management system that allows policies to be specified in a declarative manner without explicitly programming the network data plane. Genesis tackles the problem of enforcing policies by synthesizing switch forwarding tables. It uses the formal foundations of constraint solving in combination with fast off-the-shelf SMT solvers. To improve synthesis performance, Genesis incorporates a novel search strategy that uses regular expressions to specify properties that leverage the structure of datacenter networks, and a divide-and-conquer synthesis procedure that exploits the structure of policy relationships. We have prototyped Genesis and conducted experiments with a variety of workloads on real-world topologies to demonstrate its performance.
  5. The Jellyfish network has recently been proposed as an alternative to the fat-tree network for data centers and high-performance computing clusters. Jellyfish uses a random regular graph as its switch-level topology and has been shown to be more cost-effective than fat-trees. Effective routing on Jellyfish is challenging. It is known that shortest path routing and equal cost multi-path routing (ECMP) do not work well on Jellyfish. Existing schemes use variations of k-shortest path routing (KSP). In this work, we study two routing components for Jellyfish: path selection, which decides the paths used to route traffic, and routing mechanisms, which decide which path is used for each packet. We show that the performance of the existing KSP can be significantly improved by incorporating two heuristics, randomization and edge-disjointness. We evaluate a range of routing mechanisms, including traffic oblivious and traffic adaptive schemes, and identify an adaptive routing scheme with noticeably higher performance than others.
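    To make the two path-selection heuristics named in item 5 concrete, the following is a minimal sketch, not the algorithm from that paper. It assumes an undirected switch graph stored as a dictionary of neighbor sets, and it greedily picks up to k paths: each round finds a shortest path by breadth-first search with randomly ordered neighbors (randomization), then removes that path's links so later paths cannot reuse them (edge-disjointness). The names `shortest_path` and `edge_disjoint_paths`, the graph representation, and the greedy strategy are all assumptions made for illustration.

```python
import random
from collections import deque


def shortest_path(adj, src, dst, rng):
    """Return one shortest path from src to dst using BFS, or None.

    Neighbors are visited in shuffled order, so ties between
    equal-length paths are broken randomly.
    """
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk the parent pointers back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        neighbors = list(adj[node])
        rng.shuffle(neighbors)
        for nxt in neighbors:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


def edge_disjoint_paths(adj, src, dst, k, seed=None):
    """Greedily select up to k edge-disjoint paths (hypothetical helper).

    This is a heuristic: removing a shortest path first does not
    always yield the maximum possible number of disjoint paths.
    """
    rng = random.Random(seed)
    # Work on a copy so the caller's graph is left untouched.
    remaining = {u: set(vs) for u, vs in adj.items()}
    paths = []
    while len(paths) < k:
        path = shortest_path(remaining, src, dst, rng)
        if path is None:
            break
        paths.append(path)
        # Remove the links just used, in both directions.
        for u, v in zip(path, path[1:]):
            remaining[u].discard(v)
            remaining[v].discard(u)
    return paths
```

    For example, on a four-switch graph with links 0-1, 0-2, 0-3, 1-3, and 2-3, calling `edge_disjoint_paths(adj, 0, 3, k=3)` returns the direct path first and then the two two-hop paths in a random order. A per-packet or per-flow routing mechanism would then choose among the returned paths.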