NSF PAR Search | NSF Public Access Repository

MFNetSim: A Multi-Fidelity Network Simulation Framework for Multi-Traffic Modeling of Dragonfly Systems

https://doi.org/10.1145/3729424

Wang, Xin; Brown, Kevin A; Ross, Robert B; Carothers, Christopher D; Lan, Zhiling (September 2025, ACM Transactions on Modeling and Computer Simulation)

In high-performance computing (HPC), modern supercomputers typically provide exclusive computing resources to user applications. Nevertheless, the interconnect network is a shared resource for both inter-node communication and across-node I/O access, among co-running workloads, leading to inevitable network interference. In this study, we develop MFNetSim, a multi-fidelity modeling framework that enables simulation of multi-traffic simultaneously over the interconnect network, including inter-process communication and I/O traffic. By combining different levels of abstraction, MFNetSim can efficiently co-model the communication and I/O traffic occurring on HPC systems equipped with flash-based storage. We conduct simulation studies of hybrid workloads composed of traditional HPC applications and emerging ML applications on a 1,056-node Dragonfly system with various configurations. Our analysis provides various observations regarding how network interference affects communication and I/O traffic.
Free, publicly-accessible full text available September 12, 2026
CQSim+: Symbiotic Simulation for Multi-Resource Scheduling in High-Performance Computing

https://doi.org/10.1145/3726301.3728404

Kurkure, Yash; Sharma, Shambhawi; Wang, Xin; Papka, Michael E; Lan, Zhiling (June 2025, ACM)

Free, publicly-accessible full text available June 22, 2026
Directing PDES and Surrogate Models in Loosely Coupled Hybrid Simulations

https://doi.org/10.1145/3726301.3728419

Brown, Kevin A; Cruz-Camacho, Elkin; Yoshii, Kazutomo; Wang, Xin; Lan, Zhiling; Carothers, Christopher D; Ross, Robert B (June 2025, ACM)

Free, publicly-accessible full text available June 22, 2026
Evaluating Energy Efficiency of AI Accelerators Using Two MLPerf Benchmarks

https://doi.org/10.1109/CCGRID64434.2025.00035

Ferdaus, Farah; Wu, Xingfu; Taylor, Valerie; Lan, Zhiling; Shanmugavelu, Sanjif; Vishwanath, Venkatram; Papka, Michael E (May 2025, IEEE)

Free, publicly-accessible full text available May 19, 2026
Preventing Workload Interference with Intelligent Routing and Flexible Job Placement Strategy on Dragonfly System

https://doi.org/10.1145/3706104

Wang, Xin; Kang, Yao; Lan, Zhiling (April 2025, ACM Transactions on Modeling and Computer Simulation)

Dragonfly is an indispensable interconnect topology for exascale high-performance computing (HPC) systems. To link tens of thousands of compute nodes at a reasonable cost, Dragonfly shares network resources with the entire system such that network bandwidth is not exclusive to any single application. Since HPC systems are usually shared among multiple co-running applications at the same time, network competition between co-existing workloads is inevitable. This network contention manifests as workload interference, in which a job’s network communication can be severely delayed by other jobs. This study presents a comprehensive examination of leveraging intelligent routing and flexible job placement to mitigate workload interference on Dragonfly systems. Specifically, we leverage the parallel discrete event simulation toolkit, the Structural Simulation Toolkit (SST), to investigate workload interference on Dragonfly with three contributions. We first present Q-adaptive routing, a multi-agent reinforcement learning routing scheme, and a flexible job placement strategy that, together, can mitigate workload interference based on workload communication characteristics. Next, we enhance SST with Q-adaptive routing and develop an automatic module that serves as the bridge between the SST and HPC job scheduler for automatic simulation configuration and automated simulation launching. Finally, we extensively examine workload interference under various job placement and routing configurations.
Free, publicly-accessible full text available April 30, 2026
Interpretable Modeling of Deep Reinforcement Learning Driven Scheduling

https://doi.org/10.1109/MASCOTS59514.2023.10387651

Li, Boyang; Lan, Zhiling; Papka, Michael E (October 2023, IEEE)

Full Text Available
Hybrid Workload Scheduling on HPC Systems

https://doi.org/10.1109/IPDPS53621.2022.00052

Fan, Yuping; Lan, Zhiling; Rich, Paul; Allcock, William; Papka, Michael E. (May 2022, 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS))

Full Text Available
DRAS-CQSim: A reinforcement learning based framework for HPC cluster scheduling

https://doi.org/10.1016/j.simpa.2021.100077

Fan, Yuping; Lan, Zhiling (April 2021, Software Impacts)
Full Text Available
Performance and power modeling and prediction using MuMMI and 10 machine learning methods

https://doi.org/10.1002/cpe.7254

Wu, Xingfu; Taylor, Valerie; Lan, Zhiling (August 2022, Concurrency and Computation: Practice and Experience)

Energy‐efficient scientific applications require insight into how high performance computing system features impact the applications' power and performance. This insight can result from the development of performance and power models. In this article, we use the modeling and prediction tool MuMMI (Multiple Metrics Modeling Infrastructure) and 10 machine learning methods to model and predict performance and power consumption and compare their prediction error rates. We use an algorithm‐based fault‐tolerant linear algebra code and a multilevel checkpointing fault‐tolerant heat distribution code to conduct our modeling and prediction study on the Cray XC40 Theta and IBM BG/Q Mira at Argonne National Laboratory and the Intel Haswell cluster Shepard at Sandia National Laboratories. Our experimental results show that the prediction error rates in performance and power using MuMMI are less than 10% for most cases. By utilizing the models for runtime, node power, CPU power, and memory power, we identify the most significant performance counters for potential application optimizations, and we predict theoretical outcomes of the optimizations. Based on two collected datasets, we analyze and compare the prediction accuracy in performance and power consumption using MuMMI and 10 machine learning methods.
Union: An Automatic Workload Manager for Accelerating Network Simulation

https://doi.org/10.1109/IPDPS47924.2020.00089

Wang, Xin; Mubarak, Misbah; Kang, Yao; Ross, Robert B.; Lan, Zhiling (May 2020, 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS))

With the rapid growth of the machine learning applications, the workloads of future HPC systems are anticipated to be a mix of scientific simulation, big data analytics, and machine learning applications. Simulation is a great research vehicle to understand the performance implications of co-running scientific applications with big data and machine learning workloads on large-scale systems. In this paper, we present Union, a workload manager that provides an automatic framework to facilitate hybrid workload simulation in CODES. Furthermore, we use Union, along with CODES, to investigate various hybrid workloads composed of traditional simulation applications and emerging learning applications on two dragonfly systems. The experiment results show that both message latency and communication time are important performance metrics to evaluate network interference. Network interference on HPC applications is more reflected by the message latency variation, whereas ML application performance depends more on the communication time.
Full Text Available