Title: Comparing Global Link Arrangements for Dragonfly Networks
High-performance computing systems are shifting away from traditional interconnect topologies to exploit new technologies and to reduce interconnect power consumption. The Dragonfly topology is one promising candidate for new systems, with several variations already in production. It is hierarchical, with local links forming groups and global links joining the groups. At each level, the interconnect is a clique, with a link between each pair of switches in a group and a link between each pair of groups. This paper shows that the intergroup links can be made in meaningfully different ways. We evaluate three previously proposed approaches for link organization (called global link arrangements) in two ways. First, we use bisection bandwidth, an important and commonly used measure of the potential for communication bottlenecks. We show that the global link arrangements often give bisection bandwidths differing by tens of percent, with the specific separation varying based on the relative bandwidths of local and global links. For the link bandwidths used in a current Dragonfly implementation, the difference is 33%. Second, we show that the choice of global link arrangement can greatly impact the regularity of task mappings for nearest neighbor stencil communication patterns, an important pattern in scientific applications.
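The two-level clique structure described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the function name `build_dragonfly`, the `(group, switch)` labelling, and in particular the rule that picks which switch terminates each global link are assumptions made for illustration, and that rule is not claimed to match any of the three arrangements the paper evaluates.

```python
from itertools import combinations
from math import ceil


def build_dragonfly(num_groups, switches_per_group):
    """Build a small Dragonfly as two lists of links.

    Switches are labelled (group, index). Hypothetical helper for
    illustration only; not the paper's code.
    """
    g, a = num_groups, switches_per_group
    # Global ports needed per switch so that a group can reach g - 1 others.
    h = ceil((g - 1) / a)

    # Local level: a clique among the switches of each group.
    local_links = [
        ((grp, i), (grp, j))
        for grp in range(g)
        for i, j in combinations(range(a), 2)
    ]

    def endpoint(src, dst):
        # ASSUMED arrangement: group `src` numbers its global ports by the
        # destination group in increasing order (skipping itself) and
        # places port p on switch p // h.
        port = dst if dst < src else dst - 1
        return (src, port // h)

    # Global level: a clique among groups, one link per pair of groups.
    global_links = [
        (endpoint(x, y), endpoint(y, x))
        for x, y in combinations(range(g), 2)
    ]
    return local_links, global_links


local_links, global_links = build_dragonfly(num_groups=5, switches_per_group=2)
print(len(local_links), "local links,", len(global_links), "global links")
```

Changing only the body of `endpoint` yields a different global link arrangement over the same groups, which is the degree of freedom the paper compares.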
Award ID(s):
1423413
PAR ID:
10047243
Journal Name:
2015 IEEE International Conference on Cluster Computing
Page Range / eLocation ID:
361 to 370
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Dragonfly networks have been proposed to exploit high-radix routers and optical links for high performance computing (HPC) systems. Such networks divide the switches into groups, with a local link between each pair of switches in a group and a global link between each pair of groups. Which specific switch serves as the endpoint of each global link is determined by the network’s global link arrangement. We propose two new global link arrangements, each designed using intuition about how to optimize bisection bandwidth when global links have high bandwidth relative to local links. Despite this, the new arrangements generally outperform previously known arrangements for all bandwidth relationships.
  2. Dragonfly is an indispensable interconnect topology for exascale high-performance computing (HPC) systems. To link tens of thousands of compute nodes at a reasonable cost, Dragonfly shares network resources with the entire system such that network bandwidth is not exclusive to any single application. Since HPC systems are usually shared among multiple co-running applications at the same time, network competition between co-existing workloads is inevitable. This network contention manifests as workload interference, in which a job’s network communication can be severely delayed by other jobs. This study presents a comprehensive examination of leveraging intelligent routing and flexible job placement to mitigate workload interference on Dragonfly systems. Specifically, we leverage the parallel discrete event simulation toolkit, the Structural Simulation Toolkit (SST), to investigate workload interference on Dragonfly with three contributions. We first present Q-adaptive routing, a multi-agent reinforcement learning routing scheme, and a flexible job placement strategy that, together, can mitigate workload interference based on workload communication characteristics. Next, we enhance SST with Q-adaptive routing and develop an automatic module that serves as the bridge between the SST and HPC job scheduler for automatic simulation configuration and automated simulation launching. Finally, we extensively examine workload interference under various job placement and routing configurations. 
  3. In high-performance computing (HPC), modern supercomputers typically provide exclusive computing resources to user applications. Nevertheless, the interconnect network is a resource shared among co-running workloads for both inter-node communication and cross-node I/O access, leading to inevitable network interference. In this study, we develop MFNetSim, a multi-fidelity modeling framework that enables simultaneous simulation of multiple traffic types over the interconnect network, including inter-process communication and I/O traffic. By combining different levels of abstraction, MFNetSim can efficiently co-model the communication and I/O traffic occurring on HPC systems equipped with flash-based storage. We conduct simulation studies of hybrid workloads composed of traditional HPC applications and emerging ML applications on a 1,056-node Dragonfly system with various configurations. Our analysis provides various observations regarding how network interference affects communication and I/O traffic.
  4. The Universal Globally Adaptive Load-balance Routing (UGAL) with global information, referred to as UGAL-G, represents an ideal form of adaptive routing on Dragonfly. UGAL-G is impractical to implement, however, since the global information cannot be maintained accurately. Practical adaptive routing schemes, such as UGAL with local information (UGAL-L), perform noticeably worse than UGAL-G. In this work, we investigate a machine learning approach for routing on Dragonfly. Specifically, we develop a machine learning-based routing scheme, called UGAL-ML, that is capable of making routing decisions like UGAL-G based only on the information local to each router. Our preliminary evaluation indicates that UGAL-ML can achieve comparable performance to UGAL-G for some traffic patterns.
  5. Proper identification of oriented knots and 2-component links requires a precise link nomenclature. Motivated by questions arising in DNA topology, this study aims to produce a nomenclature unambiguous with respect to link symmetries. For knots, this involves distinguishing a knot type from its mirror image. In the case of 2-component links, there are up to sixteen possible symmetry types for each link type. The study revisits the methods previously used to disambiguate chiral knots and extends them to oriented 2-component links with up to nine crossings. Monte Carlo simulations are used to report on writhe, a geometric indicator of chirality. There are ninety-two prime 2-component links with up to nine crossings. Guided by geometrical data, linking number, and the symmetry groups of 2-component links, canonical link diagrams for all but five link types ($9^2_5$, $9^2_{34}$, $9^2_{35}$, $9^2_{39}$, and $9^2_{41}$) are proposed. We include complete tables for prime knots with up to ten crossings and prime links with up to nine crossings. We also prove a result on the behavior of the writhe under local lattice moves.