Dragonfly is an indispensable interconnect topology for exascale high-performance computing (HPC) systems. To link tens of thousands of compute nodes at a reasonable cost, Dragonfly shares network resources across the entire system, so network bandwidth is not exclusive to any single application. Because HPC systems are usually shared among multiple co-running applications, network competition between co-existing workloads is inevitable. This network contention manifests as workload interference, in which a job’s network communication can be severely delayed by other jobs. This study presents a comprehensive examination of leveraging intelligent routing and flexible job placement to mitigate workload interference on Dragonfly systems. Specifically, we use the Structural Simulation Toolkit (SST), a parallel discrete event simulation toolkit, to investigate workload interference on Dragonfly, and make three contributions. We first present Q-adaptive routing, a multi-agent reinforcement learning routing scheme, and a flexible job placement strategy that together can mitigate workload interference based on workload communication characteristics. Next, we enhance SST with Q-adaptive routing and develop a module that bridges SST and the HPC job scheduler to configure and launch simulations automatically. Finally, we extensively examine workload interference under various job placement and routing configurations.
When Does Diversity of Agent Preferences Improve Outcomes in Selfish Routing?
We seek to understand when heterogeneity in agent preferences yields improved outcomes in terms of overall cost. The hope that it might rests on the common belief that diversity is advantageous in many multi-agent settings. We investigate this in the context of routing. Our main result is a sharp characterization of the network settings in which diversity always helps, versus those in which it is sometimes harmful. Specifically, we consider routing games, where diversity arises in the way that agents trade off two criteria (such as time and money, or, in the case of stochastic delays, expectation and variance of delay). Our main contributions are: 1) A participant-oriented measure of cost in the presence of agent diversity; 2) A full characterization of those network topologies for which diversity always helps, for all latency functions and demands.
- Award ID(s):
- 1733832
- PAR ID:
- 10076956
- Date Published:
- Journal Name:
- Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI 2018)
- Page Range / eLocation ID:
- 173 to 179
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
New mobility concepts are at the forefront of research and innovation in smart cities. The introduction of connected and autonomous vehicles enables new possibilities in vehicle routing. Specifically, knowing the origin and destination of each agent in the network can allow for real-time routing of the vehicles to optimize network performance. However, this relies on individual vehicles being "altruistic", i.e., being willing to accept an alternative non-preferred route in order to achieve a network-level performance goal. In this work, we conduct a study to compare different levels of agent altruism and the resulting effect on the network-level traffic performance. Specifically, this study compares the effects of different underlying urban structures on the overall network performance, and investigates which characteristics of the network make it possible to realize routing improvements using a decentralized optimization router. The main finding is that, with increased vehicle altruism, it is possible to balance traffic flow among the links of the network. We show evidence that the decentralized optimization router is more effective on networks under high load. We also study the influence of city characteristics: in particular, networks with a higher number of nodes (intersections) or edges (roads) per unit area allow for more possible alternate routes, and thus have higher potential to improve network performance.
-
This paper develops a general framework to study how misinterpreting information impacts learning. Our main result is a simple criterion to characterize long‐run beliefs based on the underlying form of misspecification. We present this characterization in the context of social learning, then highlight how it applies to other learning environments, including individual learning. A key contribution is that our characterization applies to settings with model heterogeneity and provides conditions for entrenched disagreement. Our characterization can be used to determine whether a representative agent approach is valid in the face of heterogeneity, study how differing levels of bias or unawareness of others' biases impact learning, and explore whether the impact of a bias is sensitive to parametric specification or the source of information. This unified framework synthesizes insights gleaned from previously studied forms of misspecification and provides novel insights in specific applications, as we demonstrate in settings with partisan bias, overreaction, naive learning, and level‐k reasoning.
-
Chan, Timothy; Fischer, Johannes; Iacono, John; Herman, Grzegorz (Ed.) We consider two-cost network design models in which edges of the input graph have an associated cost and length. We build upon recent advances in hop-constrained oblivious routing to obtain two sets of results. We address multicommodity buy-at-bulk network design in the nonuniform setting. Existing poly-logarithmic approximations are based on the junction tree approach [Chekuri et al., 2010; Guy Kortsarz and Zeev Nutov, 2011]. We obtain a new polylogarithmic approximation via a natural LP relaxation. This establishes an upper bound on its integrality gap and affirmatively answers an open question raised in [Chekuri et al., 2010]. The rounding is based on recent results in hop-constrained oblivious routing [Ghaffari et al., 2021], and this technique yields a polylogarithmic approximation in more general settings such as set connectivity. Our algorithm for buy-at-bulk network design is based on an LP-based reduction to h-hop constrained network design for which we obtain LP-based bicriteria approximation algorithms. We also consider a fault-tolerant version of h-hop constrained network design where one wants to design a low-cost network to guarantee short paths between a given set of source-sink pairs even when k-1 edges can fail. This model has been considered in network design [Luis Gouveia and Markus Leitner, 2017; Gouveia et al., 2018; Arslan et al., 2020] but no approximation algorithms were known. We obtain polylogarithmic bicriteria approximation algorithms for the single-source setting for any fixed k. We build upon the single-source algorithm and the junction-tree approach to obtain an approximation algorithm for the multicommodity setting when at most one edge can fail.
-
The Jellyfish network has recently been proposed as an alternative to the fat-tree network for data centers and high-performance computing clusters. Jellyfish uses a random regular graph as its switch-level topology and has been shown to be more cost-effective than fat-trees. Effective routing on Jellyfish is challenging. It is known that shortest path routing and equal cost multi-path routing (ECMP) do not work well on Jellyfish. Existing schemes use variations of k-shortest path routing (KSP). In this work, we study two routing components for Jellyfish: path selection, which decides the paths used to route traffic, and routing mechanisms, which decide which path to use for each packet. We show that the performance of the existing KSP can be significantly improved by incorporating two heuristics, randomization and edge-disjointness. We evaluate a range of routing mechanisms, including traffic oblivious and traffic adaptive schemes, and identify an adaptive routing scheme with noticeably higher performance than others.
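As a rough illustration of what randomized, edge-disjoint path selection can look like, the sketch below greedily picks up to k paths between two switches: each round runs a breadth-first search with randomly ordered neighbors to break ties, then removes the chosen path's edges so later paths cannot reuse them. This is a minimal sketch under assumptions, not the scheme evaluated in the abstract above; the function names `shortest_path` and `edge_disjoint_paths`, the undirected unweighted graph model, and the greedy removal strategy are all hypothetical choices made for illustration.

```python
import random
from collections import deque


def shortest_path(adj, src, dst, rng):
    """Return one hop-count shortest path from src to dst, or None.

    Neighbors are shuffled before expansion, so ties between
    equal-length paths are broken at random (illustrative choice).
    """
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk parent pointers back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        # Sort first so the shuffle is reproducible for a given seed.
        neighbors = sorted(adj.get(node, ()))
        rng.shuffle(neighbors)
        for nxt in neighbors:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


def edge_disjoint_paths(edges, src, dst, k, seed=0):
    """Greedily select up to k pairwise edge-disjoint paths (hypothetical helper)."""
    rng = random.Random(seed)
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    paths = []
    while len(paths) < k:
        path = shortest_path(adj, src, dst, rng)
        if path is None:
            break  # no further path exists without reusing an edge
        paths.append(path)
        # Remove the path's edges so the next path shares none of them.
        for u, v in zip(path, path[1:]):
            adj[u].discard(v)
            adj[v].discard(u)
    return paths


if __name__ == "__main__":
    demo_edges = [(0, 1), (1, 3), (0, 2), (2, 3), (0, 3)]
    for p in edge_disjoint_paths(demo_edges, 0, 3, k=3):
        print(p)
```

Greedy removal is simple but can find fewer disjoint paths than a max-flow formulation would; it is used here only to keep the sketch short.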

