skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


This content will become publicly available on May 19, 2026

Title: Scaling Laws for the Workload Throughput of Emerging Heterogeneous Clusters
Not AvailableNext-generation HPC clusters are evolving into highly heterogeneous systems that integrate traditional computing resources with emerging accelerator technologies such as quantum processors, neuromorphic units, dataflow architectures, and specialized AI accelerators within a unified infrastructure. These advanced systems enable workloads to dynamically utilize different accelerators during various computation phases, creating complex execution patterns. The performance of the workloads can therefore be impacted by many factors, including how the accelerators are shared, their utilization, and their placement within the system. Moreover, effects such as the system and network state due to the overall system load can significantly impact the job completion rate. Understanding, identifying, and quantifying the impact of the most critical factors (e.g., the number of allocated accelerators) will help decide the investment decisions for accelerator acquisition and deployment that can improve the overall system throughput. This paper extensively studies these complex interactions among advanced accelerators within an HPC cluster and various workloads. We introduce a novel analytical model which predicts the speedup of a workload given an accelerator/system configuration. This model can be used to quantify the effect of augmenting additional accelerators on job performance running on an HPC cluster. We validate the model using both simulated and real environments.  more » « less
Award ID(s):
2103510 1807563
PAR ID:
10650836
Author(s) / Creator(s):
 ;  ;  ;  ;  ;  ;  ;  ;  
Publisher / Repository:
IEEE
Date Published:
Page Range / eLocation ID:
73 to 82
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Large-scale computing systems are increasingly using accelerators such as GPUs to enable peta- and exa-scale levels of compute to meet the needs of Machine Learning (ML) and scientific computing applications. Given the widespread and growing use of ML, including in some scientific applications, optimizing these clusters for ML workloads is particularly important. However, recent work has demonstrated that accelerators in these clusters can suffer from performance variability and this variability can lead to resource under-utilization and load imbalance. In this work we focus on how clusters schedulers, which are used to share accelerator-rich clusters across many concurrent ML jobs, can embrace performance variability to mitigate its effects. Our key insight to address this challenge is to characterize which applications are more likely to suffer from performance variability and take that into account while placing jobs on the cluster. We design a novel cluster scheduler, PAL, which uses performance variability measurements and application-specific profiles to improve job performance and resource utilization. PAL also balances performance variability with locality to ensure jobs are spread across as few nodes as possible. Overall, PAL significantly improves GPU-rich cluster scheduling: across traces for six ML workload applications spanning image, language, and vision models with a variety of variability profiles, PAL improves geomean job completion time by 42%, cluster utilization by 28%, and makespan by 47% over existing state-of-the-art schedulers. 
    more » « less
  2. Traditionally, HPC workloads have been deployed in bare-metal clusters; but the advances in virtualization have led the pathway for these workloads to be deployed in virtualized clusters. However, HPC cluster administrators/providers still face challenges in terms of resource elasticity and virtual machine (VM) provisioning at large-scale, due to the lack of coordination between a traditional HPC scheduler and the VM hypervisor (resource management layer). This lack of interaction leads to low cluster utilization and job completion throughput. Furthermore, the VM provisioning delays directly impact the overall performance of jobs in the cluster. Hence, there is a need for effectively provisioning virtualized HPC clusters, which can best-utilize the physical hardware with minimal provisioning overheads.Towards this, we propose Multiverse, a VM provisioning framework, which can dynamically spawn VMs for incoming jobs in a virtualized HPC cluster, by integrating the HPC scheduler along with VM resource manager. We have implemented this framework on the Slurm scheduler along with the vSphere VM resource manager. In order to reduce the VM provisioning overheads, we use instant cloning which shares both the disk and memory with the parent VM, when compared to full VM cloning which has to boot-up a new VM from scratch. Measurements with real-world HPC workloads demonstrate that, instant cloning is 2.5× faster than full cloning in terms of VM provisioning time. Further, it improves resource utilization by up to 40%, and cluster throughput by up to 1.5×, when compared to full clone for bursty job arrival scenarios. 
    more » « less
  3. The energy and latency demands of critical workload execution, such as object detection, in embedded systems vary based on the physical system state and other external factors. Many recent mobile and autonomous System-on-Chips (SoC) embed a diverse range of accelerators with unique power and performance characteristics. The execution flow of the critical workloads can be adjusted to span into multiple accelerators so that the trade-off between performance and energy fits to the dynamically changing physical factors. In this study, we propose running neural network (NN) inference on multiple accelerators of an SoC. Our goal is to enable an energy-performance trade-off with an by distributing layers in a NN between a performance- and a power-efficient accelerator. We first provide an empirical modeling methodology to characterize execution and inter-layer transition times. We then find an optimal layers-to-accelerator mapping by representing the trade-off as a linear programming optimization constraint. We evaluate our approach on the NVIDIA Xavier AGX SoC with commonly used NN models. We use the Z3 SMT solver to find schedules for different energy consumption targets, with up to 98% prediction accuracy. 
    more » « less
  4. Today’s high-performance computing (HPC) platforms are still dominated by batch jobs. Accordingly, effective batch job scheduling is crucial to obtain high system efficiency. Existing HPC batch job schedulers typically leverage heuristic priority functions to prioritize and schedule jobs. But, once configured and deployed by the experts, such priority function scan hardly adapt to the changes of job loads, optimization goals, or system settings, potentially leading to degraded system efficiency when changes occur. To address this fundamental issue, we present RLScheduler, an automated HPC batch job scheduler built on reinforcement learning. RLScheduler relies on minimal manual interventions or expert knowledge, but can learn high-quality scheduling policies via its own continuous ‘trial and error’. We introduce a new kernel-based neural network structure and trajectory filtering mechanism in RLScheduler to improve and stabilize the learning process. Through extensive evaluations,we confirm that RLScheduler can learn high-quality scheduling policies towards various workloads and various optimization goals with relatively low computation cost. Moreover, we show that the learned models perform stably even when applied to unseen workloads, making them practical for production use. 
    more » « less
  5. Dragonfly is an indispensable interconnect topology for exascale high-performance computing (HPC) systems. To link tens of thousands of compute nodes at a reasonable cost, Dragonfly shares network resources with the entire system such that network bandwidth is not exclusive to any single application. Since HPC systems are usually shared among multiple co-running applications at the same time, network competition between co-existing workloads is inevitable. This network contention manifests as workload interference, in which a job’s network communication can be severely delayed by other jobs. This study presents a comprehensive examination of leveraging intelligent routing and flexible job placement to mitigate workload interference on Dragonfly systems. Specifically, we leverage the parallel discrete event simulation toolkit, the Structural Simulation Toolkit (SST), to investigate workload interference on Dragonfly with three contributions. We first present Q-adaptive routing, a multi-agent reinforcement learning routing scheme, and a flexible job placement strategy that, together, can mitigate workload interference based on workload communication characteristics. Next, we enhance SST with Q-adaptive routing and develop an automatic module that serves as the bridge between the SST and HPC job scheduler for automatic simulation configuration and automated simulation launching. Finally, we extensively examine workload interference under various job placement and routing configurations. 
    more » « less