Dragonfly networks have been proposed to exploit high-radix routers and optical links for high-performance computing (HPC) systems. Such networks divide the switches into groups, with a local link between each pair of switches in a group and a global link between each pair of groups. Which specific switch serves as the endpoint of each global link is determined by the network’s global link arrangement. We propose two new global link arrangements, each designed using intuition about how to optimize bisection bandwidth when global links have high bandwidth relative to local links. Despite this, the new arrangements generally outperform previously known arrangements for all bandwidth relationships.
Comparing Global Link Arrangements for Dragonfly Networks
High-performance computing systems are shifting away from traditional interconnect topologies to exploit new technologies and to reduce interconnect power consumption. The Dragonfly topology is one promising candidate for new systems, with several variations already in production. It is hierarchical, with local links forming groups and global links joining the groups. At each level, the interconnect is a clique, with a link between each pair of switches in a group and a link between each pair of groups. This paper shows that the intergroup links can be made in meaningfully different ways. We evaluate three previously proposed approaches for link organization (called global link arrangements) in two ways. First, we use bisection bandwidth, an important and commonly used measure of the potential for communication bottlenecks. We show that the global link arrangements often give bisection bandwidths differing by tens of percent, with the specific separation varying based on the relative bandwidths of local and global links. For the link bandwidths used in a current Dragonfly implementation, the difference is 33%. Second, we show that the choice of global link arrangement can greatly impact the regularity of task mappings for nearest-neighbor stencil communication patterns, an important pattern in scientific applications.
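The two-level clique structure described above can be illustrated with a minimal Python sketch. It is not taken from the paper: the function name `build_dragonfly`, the parameters `a` (switches per group) and `h` (global ports per switch), and the "consecutive" rule for choosing global link endpoints are all assumptions made for illustration. The sketch also assumes the maximum-size case of `a * h + 1` groups, so that every global port is used exactly once.

```python
from itertools import combinations


def build_dragonfly(a, h):
    """Build an illustrative Dragonfly with a switches per group and
    h global ports per switch. Switches are (group, index) pairs.

    Assumption: the endpoint rule below is a hypothetical "consecutive"
    assignment, shown only to make the idea of a global link
    arrangement concrete.
    """
    g = a * h + 1  # assumed maximum size: one global link per pair of groups

    # Local level: each group is a clique on its a switches.
    local_links = [
        ((grp, s), (grp, t))
        for grp in range(g)
        for s, t in combinations(range(a), 2)
    ]

    # Global level: the groups form a clique. The arrangement decides
    # which switch in each group terminates each global link.
    global_links = []
    for i, j in combinations(range(g), 2):  # i < j
        port_in_i = j - 1  # rank of group j among the other groups, seen from i
        port_in_j = i      # rank of group i among the other groups, seen from j
        # Ports are handed out consecutively, h per switch.
        global_links.append(((i, port_in_i // h), (j, port_in_j // h)))
    return local_links, global_links


if __name__ == "__main__":
    local_links, global_links = build_dragonfly(a=4, h=2)
    print(len(local_links), "local links,", len(global_links), "global links")
```

Changing only the two `port_in_*` lines yields a different arrangement on the same set of groups, which is the degree of freedom the abstract refers to.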
- Award ID(s):
- 1423413
- PAR ID:
- 10047243
- Date Published:
- Journal Name:
- 2015 IEEE International Conference on Cluster Computing
- Page Range / eLocation ID:
- 361 to 370
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Dragonfly is an indispensable interconnect topology for exascale high-performance computing (HPC) systems. To link tens of thousands of compute nodes at a reasonable cost, Dragonfly shares network resources across the entire system, so network bandwidth is not exclusive to any single application. Since HPC systems are usually shared among multiple co-running applications, network competition between co-existing workloads is inevitable. This network contention manifests as workload interference, in which a job’s network communication can be severely delayed by other jobs. This study presents a comprehensive examination of leveraging intelligent routing and flexible job placement to mitigate workload interference on Dragonfly systems. Specifically, we leverage the Structural Simulation Toolkit (SST), a parallel discrete-event simulation toolkit, to investigate workload interference on Dragonfly with three contributions. We first present Q-adaptive routing, a multi-agent reinforcement learning routing scheme, and a flexible job placement strategy that, together, can mitigate workload interference based on workload communication characteristics. Next, we enhance SST with Q-adaptive routing and develop a module that bridges SST and the HPC job scheduler for automatic simulation configuration and launching. Finally, we extensively examine workload interference under various job placement and routing configurations.
-
In high-performance computing (HPC), modern supercomputers typically provide exclusive computing resources to user applications. Nevertheless, the interconnect network is a resource shared among co-running workloads for both inter-node communication and cross-node I/O access, leading to inevitable network interference. In this study, we develop MFNetSim, a multi-fidelity modeling framework that enables simultaneous simulation of multiple traffic types over the interconnect network, including inter-process communication and I/O traffic. By combining different levels of abstraction, MFNetSim can efficiently co-model the communication and I/O traffic occurring on HPC systems equipped with flash-based storage. We conduct simulation studies of hybrid workloads composed of traditional HPC applications and emerging ML applications on a 1,056-node Dragonfly system with various configurations. Our analysis provides various observations regarding how network interference affects communication and I/O traffic.
-
Caching at the wireless edge has proven to be a promising approach for efficient video distribution, especially when aided by device-to-device communication. A widely explored scheme is to subdivide a cell into clusters and allow one pair of users within each cluster to communicate in each time slot. As more devices raise frequent requests for popular videos, activating multiple links simultaneously can potentially improve the throughput. However, allowing multiple links at the same time requires avoiding request clashes, i.e., multiple users requesting transmission from the same caching node, as well as managing interference. To address these issues, this paper proposes new designs of both the caching policy and the transmission policy (i.e., link scheduling and power control). Furthermore, the duration of each time slot is optimized to improve the throughput. Finally, numerical results demonstrate the performance gain of the proposed designs.
-
Networks provide a powerful formalism for modeling complex systems, by representing the underlying set of pairwise interactions. But much of the structure within these systems involves interactions that take place among more than two nodes at once: for example, communication within a group rather than person-to-person, collaboration among a team rather than a pair of co-authors, or biological interaction between a set of molecules rather than just two. We refer to these types of simultaneous interactions on sets of more than two nodes as higher-order interactions; they are ubiquitous, but the empirical study of them has lacked a general framework for evaluating higher-order models. Here we introduce such a framework, based on link prediction, a fundamental problem in network analysis. The traditional link prediction problem seeks to predict the appearance of new links in a network, and here we adapt it to predict which (larger) sets of elements will have future interactions. We study the temporal evolution of 19 datasets from a variety of domains, and use our higher-order formulation of link prediction to assess the types of structural features that are most predictive of new multi-way interactions. Among our results, we find that different domains vary considerably in their distribution of higher-order structural parameters, and that the higher-order link prediction problem exhibits some fundamental differences from traditional pairwise link prediction, with a greater role for local rather than long-range information in predicting the appearance of new interactions.

