skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Achieving the Performance of Global Adaptive Routing using Local Information on Dragonfly through Deep Learning
he Universal Globally Adaptive Load-balance Routing (UGAL) with global information, referred as UGAL-G, represents an ideal form of adaptive routing on Dragonfly. UGAL-G is impractical to implement, however, since the global information cannot be maintained accurately. Practical adaptive routing schemes, such as UGAL with local information (UGAL-L), performs noticeably worse than UGAL-G. In this work, we investigate a machine learning approach for routing on Dragonfly. Specifically, we develop a machine learning-based routing scheme, called UGAL-ML, that is capable of making routing decisions like UGAL-G based only on the information local to each router. Our preliminary evaluation indicates that UGAL-ML can achieve comparable performance to UGAL-G for some traffic patterns.  more » « less
Award ID(s):
1822737
PAR ID:
10231745
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
ACM/IEEE SC tech poster
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The Dragonfly networks have been adopted in the current supercomputers, and will be deployed in future generation supercomputers and data centers. Effective routing on Dragonfly is challenging. Universal Globally Adaptive Load-balanced routing (UGAL) is the state-of-the-art routing algorithm for Dragonfly. For each packet, UGAL selects either a minimal path or a non-minimal path based on their estimated latencies. Practical UGAL makes routing decisions with local information, deriving the estimated latency for each path from the local queue occupancy and path hop count information. In this work, we develop techniques to improve the accuracy of the latency estimation for UGAL with local information, which results in more effective routing decisions. In particular, our schemes are able to proactively mitigate the potential network congestion with imbalanced network traffic. Extensive simulation experiments using synthetic traffic patterns and application workloads demonstrate that our enhanced UGAL schemes significantly improve the routing performance for many common traffic conditions. 
    more » « less
  2. Dragonfly is an indispensable interconnect topology for exascale high-performance computing (HPC) systems. To link tens of thousands of compute nodes at a reasonable cost, Dragonfly shares network resources with the entire system such that network bandwidth is not exclusive to any single application. Since HPC systems are usually shared among multiple co-running applications at the same time, network competition between co-existing workloads is inevitable. This network contention manifests as workload interference, in which a job’s network communication can be severely delayed by other jobs. This study presents a comprehensive examination of leveraging intelligent routing and flexible job placement to mitigate workload interference on Dragonfly systems. Specifically, we leverage the parallel discrete event simulation toolkit, the Structural Simulation Toolkit (SST), to investigate workload interference on Dragonfly with three contributions. We first present Q-adaptive routing, a multi-agent reinforcement learning routing scheme, and a flexible job placement strategy that, together, can mitigate workload interference based on workload communication characteristics. Next, we enhance SST with Q-adaptive routing and develop an automatic module that serves as the bridge between the SST and HPC job scheduler for automatic simulation configuration and automated simulation launching. Finally, we extensively examine workload interference under various job placement and routing configurations. 
    more » « less
  3. The Dragonfly network has been deployed in the current generation supercomputers and will be used in the next generation supercomputers. The Universal Globally Adaptive Load-balance routing (UGAL) is the state-of-the-art routing scheme for Dragonfly. In this work, we show that the performance of the conventional UGAL can be further improved on many practical Dragonfly networks, especially the ones with a small number of groups, by customizing the paths used in UGAL for each topology. We develop a scheme to compute the custom sets of paths for each topology and compare the performance of our topology-custom UGAL routing (T-UGAL) with conventional UGAL. Our evaluation with different UGAL variations and different topologies demonstrates that by customizing the routes, T-UGAL offers significant improvements over UGAL on many practical Dragonfly networks in terms of both latency when the network is under low load and throughput when the network is under high load. 
    more » « less
  4. Network-on-chip (NoC) is widely used to facilitate communication between components in sophisticated system-on-chip (SoC) designs. Security of the on-chip communication is crucial because exploiting any vulnerability in shared NoC would be a goldmine for an attacker that puts the entire computing infrastructure at risk. We investigate the security strength of existing anonymous routing protocols in NoC architectures, making two pivotal contributions. Firstly, we develop and perform a machine learning (ML)-based flow correlation attack on existing anonymous routing techniques in NoC systems, revealing that they provide only packet-level anonymity. Secondly, we propose a novel, lightweight anonymous routing protocol featuring outbound traffic tunneling and traffic obfuscation. This protocol is designed to provide robust defense against ML-based flow correlation attacks, ensuring both packet-level and flow-level anonymity. Experimental evaluation using both real and synthetic traffic demonstrates that our proposed attack successfully deanonymizes state-of-the-art anonymous routing in NoC architectures with high accuracy (up to 99%) for diverse traffic patterns. It also reveals that our lightweight anonymous routing protocol can defend against ML-based attacks with minor hardware and performance overhead. 
    more » « less
  5. This paper is the first to analyze the security of split manufacturing using machine learning (ML), based on data collected from layouts provided by industry, with eight routing metal layers and significant variation in wire size and routing congestion across the layers. Many types of layout features are considered in our ML model, including those obtained from placement, routing, and cell sizes. Since the runtime cost of our basic ML procedure becomes prohibitively large for lower layers, we propose novel techniques to make it scalable with little sacrifice in the effectiveness of the attack. Moreover, we further improve the performance in the top routing layer by making use of higher quality training samples and by exploiting the routing convention. We also propose a validation-based proximity attack procedure, which generally outperforms our recent prior work. In the experiments, we analyze the ranking of the features used in our ML model and show how features vary in importance when moving to the lower layers. We provide comprehensive evaluation and comparison of our model with different configurations and demonstrate dramatically better performance of attacks compared to the prior work. 
    more » « less