RLScheduler: An Automated HPC Batch Job Scheduler Using Reinforcement Learning

Zhang, Di; Dai, Dong; He, Youbiao; Bao, Forrest Sheng; Xie, Bing

Today’s high-performance computing (HPC) platforms are still dominated by batch jobs. Accordingly, effective batch job scheduling is crucial to obtaining high system efficiency. Existing HPC batch job schedulers typically rely on heuristic priority functions to prioritize and schedule jobs. However, once configured and deployed by experts, such priority functions can hardly adapt to changes in job loads, optimization goals, or system settings, potentially leading to degraded system efficiency when changes occur. To address this fundamental issue, we present RLScheduler, an automated HPC batch job scheduler built on reinforcement learning. RLScheduler relies on minimal manual intervention or expert knowledge, but can learn high-quality scheduling policies via its own continuous ‘trial and error’. We introduce a new kernel-based neural network structure and a trajectory filtering mechanism in RLScheduler to improve and stabilize the learning process. Through extensive evaluations, we confirm that RLScheduler can learn high-quality scheduling policies for various workloads and various optimization goals with relatively low computation cost. Moreover, we show that the learned models perform stably even when applied to unseen workloads, making them practical for production use.

Award ID(s): 1817089

PAR ID: 10196073

Author(s) / Creator(s): Zhang, Di; Dai, Dong; He, Youbiao; Bao, Forrest Sheng; Xie, Bing

Date Published: 2020-11-16

Journal Name: SC'20: The International Conference for High Performance Computing, Networking, Storage, and Analysis 2020

Format(s): Medium: X

Sponsoring Org: National Science Foundation

