Predicting runtime and resource utilization of jobs on integrated cloud and HPC systems

Yildirim, Esma (ORCID:0000000194853714); Hussein, Mohab (ORCID:0009000386027393); Titov, Mikhail (ORCID:0000000323577382); Kilic, Ozgur Ozan (ORCID:000000032129408X)

doi:10.1016/j.future.2025.108230

Citation Details

This content will become publicly available on March 1, 2027

Predicting runtime and resource utilization of jobs on integrated cloud and HPC systems

Recent advances in virtualization technologies used in cloud computing offer performance that closely approaches bare-metal levels. Combined with specialized instance types and high-speed networking services for cluster computing, cloud platforms have become a compelling option for high-performance computing (HPC). However, most current batch job schedulers in HPC systems are designed for homogeneous clusters and make decisions based on limited information about jobs and system status. Scientists typically submit computational jobs to these schedulers with a requested runtime that is often over- or under-estimated. More accurate runtime predictions can help schedulers make better decisions and reduce job turnaround times. They can also support decisions about migrating jobs to the cloud to avoid long queue wait times in HPC systems. In this study, we design neural network models to predict the runtime and resource utilization of jobs on integrated cloud and HPC systems. We developed two monitoring strategies to collect job and system resource utilization data using a workload management system and a cloud monitoring service. We evaluated our models on two Department of Energy (DOE) HPC systems and Amazon Web Services (AWS). Our results show that we can predict the runtime of a job with 31–41 % mean absolute percentage error (MAPE), 14–17 seconds mean absolute value error (MAE), and 0.99 R-squared (R²) score. Having an MAE of less than a minute corresponds to 100 % accuracy since the requested time for batch jobs is always specified in hours and/or minutes more »

Award ID(s):: 2100027

PAR ID:: 10655298

Author(s) / Creator(s):: Yildirim, Esma; Hussein, Mohab; Titov, Mikhail; Kilic, Ozgur Ozan

Publisher / Repository:: Elsevvier

Date Published:: 2026-03-01

Journal Name:: Future Generation Computer Systems

Volume:: 176

Issue:: C

ISSN:: 0167-739X

Page Range / eLocation ID:: 108230

Format(s):: Medium: X

Sponsoring Org:: National Science Foundation

Free Publicly Accessible Full Text
This content will become publicly available on March 1, 2027
Journal Article:
https://doi.org/10.1016/j.future.2025.108230

More Like this