Oblivious routing of network traffic uses predetermined
paths that do not change with changing traffic
patterns. It has the benefit of using a fixed network configuration
while robustly handling a range of varying and unpredictable
traffic. Theoretical advances have shown that the benefits of oblivious
routing are achievable without compromising much capacity
efficiency. For oblivious routing, we only assume knowledge of
the ingress/egress capacities of the edge nodes through which
traffic enters or leaves the network. All traffic patterns possible
subject to the ingress/egress capacity constraints (also known as
the hose constraints) are permissible and are to be handled using
oblivious routing. We use the widely deployed segment routing
method for route control. Furthermore, for ease of deployment
and to not deviate too much from conventional shortest path
routing, we restrict paths to be 2-segment paths (the composition
of two shortest path routed segments). We solve the 2-segment
oblivious routing problem for all permissible traffic matrices
(which can be infinitely-many).We develop a new adversarial and
machine-learning driven approach that uses an iterative gradient
descent method to solve the routing problem with worst-case
performance guarantees. Additionally, the parallelism involved
in descent methods allows this method to scale well with the
network size making it amenable for use in practice.
more »
« less
Multi-Path Routing in the Jellyfish Network
The Jellyfish network has recently been proposed as an alternative to the fat-tree network for data centers and high-performance computing clusters. Jellyfish uses a random regular graph as its switch-level topology and has shown to be more cost-effective than fat-trees. Effective routing on Jellyfish
is challenging. It is known that shortest path routing and equal cost multi-path routing (ECMP) do not work well on Jellyfish. Existing schemes use variations of k-shortest path routing (KSP). In this work, we study two routing components for
Jellyfish: path selection that decides the paths to route traffic, and routing mechanisms that decide which path to be used for each packet. We show that the performance of the existing KSP can be significantly improved by incorporating two heuristics, randomization and edge-disjointness. We evaluate a range of routing mechanisms, including traffic oblivious and traffic adaptive schemes, and identify an adaptive routing scheme with noticeably higher performance than others.
more »
« less
- NSF-PAR ID:
- 10231738
- Date Published:
- Journal Name:
- IPDPS workshop on Scalable Networks for Advanced Computer Systems
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Valiant routing, the use of a random intermediate node to distribute network traffic, has been proposed for a number of recent HPC network topologies. It is also commonly used as a bulding block for adaptive routing algorithms, which use shortest path routes when possible, but revert to Valiant routing when necessary to avoid hot spots. We show that the version of Valiant routing proposed for the Slim fly topology can cause messages to follow loops, using an edge in both directions before returning to edges of the original shortest path. Removing these loops in the UGAL-L adaptive routing algorithm is shown to provide slight improvements in average latency and also allow the network to carry up to 12% more traffic before saturation.more » « less
-
Oblivious routing has a long history in both the theory and practice of networking. In this work we initiate the formal study of oblivious routing in the context of reconfigurable networks, a new architecture that has recently come to the fore in datacenter networking. These networks allow a rapidly changing bounded-degree pattern of interconnections between nodes, but the network topology and the selection of routing paths must both be oblivious to the traffic demand matrix. Our focus is on the trade-off between maximizing throughput and minimizing latency in these networks. For every constant throughput rate, we characterize (up to a constant factor) the minimum latency achievable by an oblivious reconfigurable network design that satisfies the given throughput guarantee. The trade-off between these two objectives turns out to be surprisingly subtle: the curve depicting it has an unexpected scalloped shape reflecting the fact that load-balancing becomes more difficult when the average length of routing paths is not an integer because equalizing all the path lengths is not possible. The proof of our lower bound uses LP duality to verify that Valiant load balancing is the most efficient oblivious routing scheme when used in combination with an optimally-designed reconfigurable network topology. The proof of our upper bound uses an algebraic construction in which the network nodes are identified with vectors over a finite field, the network topology is described by either the elementary basis or a sequence of Vandermonde matrices, and routing paths are constructed by selecting columns of these matrices to yield the appropriate mixture of path lengths within the shortest possible time interval.more » « less
-
Service function chaining (SFC), consisting of a sequence of virtual network functions (VNFs) (i.e., firewalls and load balancers), is an effective service provision technique in modern data center networks. By requiring cloud user traffic to traverse the VNFs in order, SFC im- proves the security and performance of the cloud user applications. In this paper, we study how to place an SFC inside a data center to mini- mize the network traffic of the virtual machine (VM) communication. We take a cooperative multi-agent reinforcement learning approach, wherein multiple agents collaboratively figure out the traffic-efficient route for the VM communication. Underlying the SFC placement is a fundamental graph-theoretical prob- lem called the k-stroll problem. Given a weighted graph G(V, E), two nodes s, t ∈ V , and an integer k, the k-stroll problem is to find the shortest path from s to t that visits at least k other nodes in the graph. Our work is the first to take a multi-agent learning approach to solve k- stroll problem. We compare our learning algorithm with an optimal and exhaustive algorithm and an existing dynamic programming(DP)-based heuristic algorithm. We show that our learning algorithm, although lack- ing the complete knowledge of the network assumed by existing research, delivers comparable or even better VM communication time while taking two orders of magnitude of less execution time.more » « less
-
null (Ed.)Urban street networks are subject to a variety of random disruptions. The impact of movement restrictions (e.g., one-way or left-turn restrictions) on the ability of a network to overcome these disruptions—that is, its resilience—has not been thoroughly studied. To address this gap, this paper investigates the resilience of one-way and two-way square grid street networks with and without left turns under light traffic conditions. Networks are studied using a simplified routing algorithm that can be examined analytically and a microsimulation that describes detailed vehicle dynamics. In the simplified method, routing choices are enumerated for all possible origin–destination (OD) combinations to identify how the removal of a link affects operations, both when knowledge of the disruption is and is not available at the vehicle’s origin. Disruptions on two-way networks that allow left turns tend to have little impact on travel distances because of the availability of multiple shortest paths between OD pairs and the flexibility in route modification. Two-way networks that restrict left turns at intersections only have a single shortest-distance path between any OD pair and thus experience larger increases in travel distance, even when the disruption is known ahead of time. One-way networks sometimes have multiple shortest-distance routes and thus travel distances increase less than two-way network without left turns when links are disrupted. These results reveal a clear tradeoff between improved efficiency and reduced resilience for networks that have movement restrictions, and can be used as a basis to study network resilience under more congested scenarios and in more realistic network structures.more » « less