This content will become publicly available on July 18, 2026

Title: Continuous HPC Performance Monitoring: Can It Run Without Affecting User Jobs?
High-performance computing (HPC) resources are used in a wide range of scientific and engineering calculations. These resources have high initial and running costs, so their optimal performance is crucial. There are a number of strategies to ensure this optimal state. One of them is continuous performance monitoring, where a set of applications and input parameters is executed regularly to identify performance issues proactively. Some sites hesitate to use such a strategy because it takes CPU cycles away from actual users. The goal of this work is to identify node availability, both size- and time-wise, on busy HPC systems. Such availability windows can be used to tailor test jobs to minimize user impact. Two systems were analyzed: a small one (118 nodes, from the Center for Computational Research at the University at Buffalo) and a large one (1,160 nodes, from the Texas Advanced Computing Center). It was found that even on days with 90% utilization and above, there are plenty of opportunities for test jobs. For example, on the small cluster, 8 nodes were available for 30 minutes for an average of 2.3 hours throughout the day; that is, for 9.6% of the day the scheduler had the opportunity to schedule such a job. On the large system, 32 nodes were available for 30 minutes for an average of 9.2 hours a day (or 38% of the day). Thus, there is space for test jobs, but it is not evident that the scheduler can benefit from it, and a proper strategy must be used, for example, lowering test-job priorities.
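As an illustration of the availability analysis described above, the sketch below scans a per-minute timeline of free-node counts for windows in which a test job of a given size and duration could start. The paper does not publish code, so the function name, the input format, and the toy data are all assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the availability analysis: given a per-minute count
# of free nodes (e.g., reconstructed from scheduler logs), count the minutes
# at which a test job of `nodes` nodes and `minutes` duration could start.

def test_job_opportunity(free_nodes, nodes=8, minutes=30):
    """free_nodes: list of free-node counts, one entry per minute of the day.
    Returns the number of minutes at which the test job could have started."""
    feasible = [f >= nodes for f in free_nodes]
    startable = 0
    run = 0  # length of the current run of feasible minutes, scanned right to left
    for ok in reversed(feasible):
        run = run + 1 if ok else 0
        # the job can *start* at a minute only if that minute and the
        # following `minutes - 1` minutes all have enough free nodes
        if run >= minutes:
            startable += 1
    return startable

# Toy day: 6 free nodes most of the day, 10 free nodes for a two-hour lull.
day = [6] * 600 + [10] * 120 + [6] * 720
avail = test_job_opportunity(day, nodes=8, minutes=30)
print(f"{avail} min/day ({100 * avail / 1440:.1f}% of the day)")  # 91 min (6.3%)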
Award ID(s):
2137603
PAR ID:
10625661
Author(s) / Creator(s):
Publisher / Repository:
ACM
Date Published:
ISBN:
9798400713989
Page Range / eLocation ID:
1 to 3
Format(s):
Medium: X
Location:
Columbus Ohio USA
Sponsoring Org:
National Science Foundation
More Like this
1. High-performance computing (HPC) resources are used for compute-demanding calculations in various fields of science and engineering. They are large computational facilities utilized by many users simultaneously, and high utilization often leads to long waiting times. Simulating users' behavior on such a system can inform future system design, help develop user interventions, and ultimately improve the user experience and resource utilization. Here, we present HPCMod, an agent-based modeling framework for modeling users on HPC resources. The key concept of the framework is the representation of the user's computational needs: a user project is represented as a collection of possibly dependent compute tasks. Each task can be executed as a single compute job or a series of jobs, depending on the task size. Some tasks can be too big to be executed in one chunk; such a situation often occurs in molecular dynamics simulations. There are multiple ways a task can be split into jobs, and users make their decisions based on previous experience, application parallel scalability, and available resources. For example, a compute task requiring 32 node-hours can be executed in multiple ways: a single 32-hour job on one node, two sequential 16-hour jobs on one node, one 16-hour job on two nodes, and so on. In HPCMod, we implemented three models: 1) historical replay of compute jobs, 2) simulation of reconstituted compute tasks using historical job sizes, and 3) adaptive compute-task splitting, where users can modify job parameters based on available resources until the next job in line executes. The framework was tested on a ten-node test system and a larger 1,736-node system modeled after a portion of TACC Stampede-2. The HPC resource model implements a first-in, first-out (FIFO) scheduler with backfill. Initial results showed that on the tiny system, adaptive task splitting is beneficial for the user but leads to a larger number of jobs. On the large system, adaptive task splitting was also very beneficial, decreasing waiting times almost twofold for users employing this strategy; however, other users saw a 5% increase in their wait times. Further investigation is needed, as the current task-reconstitution algorithm is deterministic and does not allow quantification of job-recombination uncertainties. The Julia-based implementation is fast: simulating five years of historical workload, consisting of a million jobs, at a one-hour time step took around three minutes.
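For context on the task-to-jobs splitting described in this abstract, here is a minimal Python sketch that enumerates the even ways a 32 node-hour task can be cut into equal jobs. HPCMod itself is Julia-based, and this enumeration, with its illustrative per-job limits, is an assumption rather than the framework's actual decision logic.

```python
# A minimal sketch of the task-to-jobs splitting idea: enumerate ways to
# execute a task of `node_hours` as a series of equal jobs, subject to
# illustrative per-job machine limits (not HPCMod's real constraints).

def split_options(node_hours=32, max_nodes=4, max_hours=16):
    """Yield (nodes, hours_per_job, n_jobs) triples that cover the task."""
    for nodes in range(1, max_nodes + 1):
        if node_hours % nodes:
            continue  # keep the example to even splits
        total_hours = node_hours // nodes  # wall-clock hours of work left
        for hours in range(1, min(total_hours, max_hours) + 1):
            if total_hours % hours == 0:
                yield (nodes, hours, total_hours // hours)

for nodes, hours, njobs in split_options():
    print(f"{njobs} job(s) x {nodes} node(s) x {hours} h = 32 node-hours")
# The abstract's "single 32-hour job on one node" is excluded here by
# max_hours=16, but "two 16 h jobs on one node" (1 node, 16 h, 2 jobs) and
# "one 16 h job on two nodes" (2 nodes, 16 h, 1 job) both appear.
```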
2. Dragonfly is an indispensable interconnect topology for exascale high-performance computing (HPC) systems. To link tens of thousands of compute nodes at a reasonable cost, Dragonfly shares network resources across the entire system, so network bandwidth is not exclusive to any single application. Since HPC systems are usually shared among multiple co-running applications, network competition between coexisting workloads is inevitable. This network contention manifests as workload interference, in which a job's network communication can be severely delayed by other jobs. This study presents a comprehensive examination of leveraging intelligent routing and flexible job placement to mitigate workload interference on Dragonfly systems. Specifically, we leverage a parallel discrete-event simulation toolkit, the Structural Simulation Toolkit (SST), to investigate workload interference on Dragonfly, with three contributions. First, we present Q-adaptive routing, a multi-agent reinforcement-learning routing scheme, and a flexible job placement strategy that, together, can mitigate workload interference based on workload communication characteristics. Next, we enhance SST with Q-adaptive routing and develop an automatic module that serves as the bridge between SST and the HPC job scheduler for automatic simulation configuration and automated simulation launching. Finally, we extensively examine workload interference under various job placement and routing configurations.
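To make the Q-adaptive routing idea concrete, here is a hedged Python sketch of a single router agent. It is not the paper's or SST's code: the state/action granularity (destination group, candidate port) and the reward signal (observed queueing delay) are illustrative assumptions about how such an agent could be structured.

```python
# Illustrative sketch of a Q-learning routing agent: each router keeps
# Q-values per (destination-group, candidate-port) pair and updates them
# from observed delay, steering traffic toward less congested minimal or
# non-minimal paths.
import random

class QRouter:
    def __init__(self, ports, alpha=0.1, epsilon=0.05):
        self.q = {}             # (dest_group, port) -> estimated delay
        self.ports = ports
        self.alpha = alpha      # learning rate
        self.epsilon = epsilon  # exploration probability

    def choose_port(self, dest_group):
        if random.random() < self.epsilon:
            return random.choice(self.ports)  # explore occasionally
        # exploit: pick the port with the lowest estimated delay
        return min(self.ports, key=lambda p: self.q.get((dest_group, p), 0.0))

    def update(self, dest_group, port, observed_delay):
        key = (dest_group, port)
        old = self.q.get(key, 0.0)
        self.q[key] = old + self.alpha * (observed_delay - old)

router = QRouter(ports=["min-0", "min-1", "nonmin-0", "nonmin-1"])
port = router.choose_port(dest_group=7)
router.update(dest_group=7, port=port, observed_delay=3.2)  # e.g., queue length
```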
3. Grid Engine is a Distributed Resource Manager (DRM) that manages the resources of distributed systems (such as Grid, HPC, or Cloud systems) and executes designated jobs that have requested those resources. Grid Engine applies scheduling policies to allocate resources for jobs while simultaneously attempting to maintain optimal utilization of all machines in the distributed system. However, due to the complexity of Grid Engine's job submission commands and its complicated resource management policies, the number of faulty job submissions in data centers grows with the number of jobs being submitted. To combat this increase, Grid Engine allows administrators to design and implement Job Submission Verifiers (JSVs) to verify jobs before they enter Grid Engine. In this paper, we discuss a JSV that was designed and implemented for Univa Grid Engine, a commercial version of Grid Engine, and thoroughly evaluated at the High Performance Computing Center of Texas Tech University. Our newly developed JSV communicates with Univa Grid Engine (UGE) components to verify whether a submitted job should be accepted as is, modified and then accepted, or rejected due to improper resource requests. It substantially reduced the number of faulty jobs submitted to UGE: for instance, from September 2018 to February 2019 it corrected 28.6% of job submissions and rejected 0.3% of all jobs, submissions that might otherwise have led to long or even indefinite waits in the job queue.
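Conceptually, a JSV is a function from a submitted job to an accept/correct/reject verdict. The Python sketch below is purely illustrative: UGE's real JSV protocol is a line-based exchange with the scheduler, and the dictionary interface, the MAX_CORES_PER_NODE limit, and the correction rule here are all assumptions made for the example.

```python
# Hedged sketch of what a job submission verifier does conceptually.
MAX_CORES_PER_NODE = 36  # illustrative site limit, not a UGE default

def verify(job):
    """Return ('accept'|'correct'|'reject', possibly-modified job, reason)."""
    cores = job.get("cores", 1)
    if cores <= 0:
        return "reject", job, "nonpositive core count"
    if cores > MAX_CORES_PER_NODE and job.get("parallel_env") is None:
        # a serial job asking for more cores than any node has would wait
        # forever; correct it down instead of letting it sit in the queue
        job = dict(job, cores=MAX_CORES_PER_NODE)
        return "correct", job, "capped cores at node size for serial job"
    return "accept", job, ""

verdict, job, why = verify({"cores": 64, "parallel_env": None})
print(verdict, job["cores"], "-", why)  # correct 36 - capped cores ...
```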
4. Twister2 is an open-source big data hosting environment designed to process both batch and streaming data at scale. Twister2 runs jobs in both high-performance computing (HPC) and big data clusters. It provides a cross-platform resource scheduler to run jobs in diverse environments. Twister2 is designed with a layered architecture to support various clusters and big data problems. In this paper, we present the cross-platform resource scheduler of Twister2. We identify the required services and explain implementation details. We present job startup delays for single jobs and multiple concurrent jobs in Kubernetes and OpenMPI clusters. We compare job startup delays for Twister2 and Spark on a Kubernetes cluster. In addition, we compare the performance of the TeraSort algorithm on Kubernetes and bare-metal clusters in the AWS cloud.
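The job startup delay reported in this abstract can be measured generically as the gap between submitting a job and its first worker reporting in. The Python sketch below makes that measurement concrete; it is not Twister2's API, and submit and wait_for_first_worker are hypothetical callables a reader would supply for their own cluster.

```python
# Generic sketch of measuring job startup delay: the time from submission
# until the first worker heartbeats, averaged over repeated submissions.
import time
import statistics

def measure_startup(submit, wait_for_first_worker, trials=5):
    """submit() launches a job (e.g., creates a Kubernetes Job) and returns a
    handle; wait_for_first_worker(handle) blocks until a worker reports in."""
    delays = []
    for _ in range(trials):
        t0 = time.monotonic()
        handle = submit()
        wait_for_first_worker(handle)
        delays.append(time.monotonic() - t0)
    return statistics.mean(delays), statistics.stdev(delays)
```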
5. The Sia scheduler efficiently assigns heterogeneous deep learning (DL) cluster resources to elastic resource-adaptive jobs. Although some recent schedulers address one aspect or another (e.g., heterogeneity or resource-adaptivity), none addresses all, and most scale poorly to large clusters and/or heavy workloads even without the full complexity of the combined scheduling problem. Sia introduces a new scheduling formulation that can scale to the required search-space sizes and intentionally match jobs and their configurations to GPU types and counts, while adapting to changes in cluster load and job mix over time. Sia also introduces a low-profiling-overhead approach to bootstrapping (for each new job) the throughput models used to evaluate possible resource assignments, and it is the first cluster scheduler to support elastic scaling of hybrid-parallel jobs. Extensive evaluations show that Sia outperforms state-of-the-art schedulers. For example, even on relatively small 44- to 64-GPU clusters with a mix of three GPU types, Sia reduces average job completion time (JCT) by 30–93%, 99th-percentile JCT and makespan by 28–95%, and GPU hours used by 12–55% for workloads derived from three real-world environments. Additional experiments demonstrate that Sia scales to at least 2,000-GPU clusters, provides improved fairness, and is not oversensitive to scheduler parameter settings.
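To illustrate the job-to-GPU-type matching problem Sia addresses, here is a deliberately simplified greedy sketch in Python. Sia itself solves a joint optimization over all jobs and GPU types; the throughput numbers and the greedy policy below are made-up assumptions for the example only.

```python
# Simplified greedy matching of jobs to GPU types: give each job the
# (gpu_type, count) configuration with the best modeled throughput per GPU
# among the resources still free. Throughput numbers are invented.

throughput = {  # job -> {(gpu_type, count): modeled samples/sec}
    "bert":   {("A100", 4): 400.0, ("V100", 8): 360.0},
    "resnet": {("A100", 4): 220.0, ("V100", 8): 300.0},
}
free = {"A100": 4, "V100": 8}

def greedy_assign(throughput, free):
    plan = {}
    # serve jobs in order of their best achievable throughput
    for job in sorted(throughput, key=lambda j: -max(throughput[j].values())):
        options = [(tp / cnt, gpu, cnt)
                   for (gpu, cnt), tp in throughput[job].items()
                   if free.get(gpu, 0) >= cnt]
        if options:
            _, gpu, cnt = max(options)  # best throughput per GPU
            free[gpu] -= cnt
            plan[job] = (gpu, cnt)
    return plan

print(greedy_assign(throughput, free))
# {'bert': ('A100', 4), 'resnet': ('V100', 8)}
```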