A Hierarchical Deep Learning Approach for Predicting Job Queue Times in HPC Systems

Lovell, Austin; Wisniewski, Philip; Rodenbeck, Sarah; Ashish

doi:10.1109/SCW63240.2024.00086

Citation Details

A Hierarchical Deep Learning Approach for Predicting Job Queue Times in HPC Systems

Accurate wait-time prediction for HPC jobs contributes to a positive user experience but has historically been a challenging task. Previous models lack the accuracy needed for confident predictions, and many were developed before the rise of deep learning. In this work, we investigate and develop TROUT, a neural network-based model to accurately predict wait times for jobs submitted to the Anvil HPC cluster. Data was taken from the Slurm Workload Manager on the cluster and transformed before performing additional feature engineering from jobs’ priorities, partitions, and states. We developed a hierarchical model that classifies job queue times into bins before applying regression, outperforming traditional methods. The model was then integrated into a CLI tool for queue time prediction. This study explores which queue time prediction methods are most applicable for modern HPC systems and shows that deep learning-based prediction models are viable solutions. more »

Award ID(s):: 2005632

PAR ID:: 10639606

Author(s) / Creator(s):: Lovell, Austin ; Wisniewski, Philip ; Rodenbeck, Sarah ; Ashish

Publisher / Repository:: IEEE Xplore

Date Published:: 2024-11-17

Page Range / eLocation ID:: 621 to 628

Format(s):: Medium: X

Sponsoring Org:: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript1.0
Conference Paper:
https://doi.org/10.1109/SCW63240.2024.00086

More Like this