skip to main content


Title: Multi-Path Routing in the Jellyfish Network
The Jellyfish network has recently been proposed as an alternative to the fat-tree network for data centers and high-performance computing clusters. Jellyfish uses a random regular graph as its switch-level topology and has shown to be more cost-effective than fat-trees. Effective routing on Jellyfish is challenging. It is known that shortest path routing and equal cost multi-path routing (ECMP) do not work well on Jellyfish. Existing schemes use variations of k-shortest path routing (KSP). In this work, we study two routing components for Jellyfish: path selection that decides the paths to route traffic, and routing mechanisms that decide which path to be used for each packet. We show that the performance of the existing KSP can be significantly improved by incorporating two heuristics, randomization and edge-disjointness. We evaluate a range of routing mechanisms, including traffic oblivious and traffic adaptive schemes, and identify an adaptive routing scheme with noticeably higher performance than others.  more » « less
Award ID(s):
1822737 2007827 1738912
NSF-PAR ID:
10231738
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
IPDPS workshop on Scalable Networks for Advanced Computer Systems
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Oblivious routing of network traffic uses predetermined paths that do not change with changing traffic patterns. It has the benefit of using a fixed network configuration while robustly handling a range of varying and unpredictable traffic. Theoretical advances have shown that the benefits of oblivious routing are achievable without compromising much capacity efficiency. For oblivious routing, we only assume knowledge of the ingress/egress capacities of the edge nodes through which traffic enters or leaves the network. All traffic patterns possible subject to the ingress/egress capacity constraints (also known as the hose constraints) are permissible and are to be handled using oblivious routing. We use the widely deployed segment routing method for route control. Furthermore, for ease of deployment and to not deviate too much from conventional shortest path routing, we restrict paths to be 2-segment paths (the composition of two shortest path routed segments). We solve the 2-segment oblivious routing problem for all permissible traffic matrices (which can be infinitely-many).We develop a new adversarial and machine-learning driven approach that uses an iterative gradient descent method to solve the routing problem with worst-case performance guarantees. Additionally, the parallelism involved in descent methods allows this method to scale well with the network size making it amenable for use in practice. 
    more » « less
  2. Valiant routing, the use of a random intermediate node to distribute network traffic, has been proposed for a number of recent HPC network topologies. It is also commonly used as a bulding block for adaptive routing algorithms, which use shortest path routes when possible, but revert to Valiant routing when necessary to avoid hot spots. We show that the version of Valiant routing proposed for the Slim fly topology can cause messages to follow loops, using an edge in both directions before returning to edges of the original shortest path. Removing these loops in the UGAL-L adaptive routing algorithm is shown to provide slight improvements in average latency and also allow the network to carry up to 12% more traffic before saturation. 
    more » « less
  3. Oblivious routing has a long history in both the theory and practice of networking. In this work we initiate the formal study of oblivious routing in the context of reconfigurable networks, a new architecture that has recently come to the fore in datacenter networking. These networks allow a rapidly changing bounded-degree pattern of interconnections between nodes, but the network topology and the selection of routing paths must both be oblivious to the traffic demand matrix. Our focus is on the trade-off between maximizing throughput and minimizing latency in these networks. For every constant throughput rate, we characterize (up to a constant factor) the minimum latency achievable by an oblivious reconfigurable network design that satisfies the given throughput guarantee. The trade-off between these two objectives turns out to be surprisingly subtle: the curve depicting it has an unexpected scalloped shape reflecting the fact that load-balancing becomes more difficult when the average length of routing paths is not an integer because equalizing all the path lengths is not possible. The proof of our lower bound uses LP duality to verify that Valiant load balancing is the most efficient oblivious routing scheme when used in combination with an optimally-designed reconfigurable network topology. The proof of our upper bound uses an algebraic construction in which the network nodes are identified with vectors over a finite field, the network topology is described by either the elementary basis or a sequence of Vandermonde matrices, and routing paths are constructed by selecting columns of these matrices to yield the appropriate mixture of path lengths within the shortest possible time interval. 
    more » « less
  4. Service function chaining (SFC), consisting of a sequence of virtual network functions (VNFs) (i.e., firewalls and load balancers), is an effective service provision technique in modern data center networks. By requiring cloud user traffic to traverse the VNFs in order, SFC im- proves the security and performance of the cloud user applications. In this paper, we study how to place an SFC inside a data center to mini- mize the network traffic of the virtual machine (VM) communication. We take a cooperative multi-agent reinforcement learning approach, wherein multiple agents collaboratively figure out the traffic-efficient route for the VM communication. Underlying the SFC placement is a fundamental graph-theoretical prob- lem called the k-stroll problem. Given a weighted graph G(V, E), two nodes s, t ∈ V , and an integer k, the k-stroll problem is to find the shortest path from s to t that visits at least k other nodes in the graph. Our work is the first to take a multi-agent learning approach to solve k- stroll problem. We compare our learning algorithm with an optimal and exhaustive algorithm and an existing dynamic programming(DP)-based heuristic algorithm. We show that our learning algorithm, although lack- ing the complete knowledge of the network assumed by existing research, delivers comparable or even better VM communication time while taking two orders of magnitude of less execution time. 
    more » « less
  5. null (Ed.)
    Urban street networks are subject to a variety of random disruptions. The impact of movement restrictions (e.g., one-way or left-turn restrictions) on the ability of a network to overcome these disruptions—that is, its resilience—has not been thoroughly studied. To address this gap, this paper investigates the resilience of one-way and two-way square grid street networks with and without left turns under light traffic conditions. Networks are studied using a simplified routing algorithm that can be examined analytically and a microsimulation that describes detailed vehicle dynamics. In the simplified method, routing choices are enumerated for all possible origin–destination (OD) combinations to identify how the removal of a link affects operations, both when knowledge of the disruption is and is not available at the vehicle’s origin. Disruptions on two-way networks that allow left turns tend to have little impact on travel distances because of the availability of multiple shortest paths between OD pairs and the flexibility in route modification. Two-way networks that restrict left turns at intersections only have a single shortest-distance path between any OD pair and thus experience larger increases in travel distance, even when the disruption is known ahead of time. One-way networks sometimes have multiple shortest-distance routes and thus travel distances increase less than two-way network without left turns when links are disrupted. These results reveal a clear tradeoff between improved efficiency and reduced resilience for networks that have movement restrictions, and can be used as a basis to study network resilience under more congested scenarios and in more realistic network structures. 
    more » « less