Hadar: Heterogeneity-Aware Optimization-Based Online Scheduling for Deep Learning Cluster

Sultana, Abeda; Xu, Fei; Yuan, Xu; Chen, Li; Tzeng, Nian-Feng

doi:10.1109/IPDPS57955.2024.00066

With the wide adoption of deep neural network (DNN) models for various applications, enterprises and cloud providers have built deep learning clusters and increasingly deployed specialized accelerators, such as GPUs and TPUs, for DNN training jobs. Existing schedulers fall short in arbitrating cluster resources among multi-user jobs: they either lack fine-grained heterogeneity awareness or generalize poorly to various scheduling policies. To fill this gap, we propose a novel design of a task-level heterogeneity-aware scheduler, Hadar, based on an online optimization framework that can express other scheduling algorithms. Hadar leverages the performance traits of DNN jobs on a heterogeneous cluster, characterizes the task-level performance heterogeneity in the optimization problem, and makes scheduling decisions across both spatial and temporal dimensions. The primal-dual framework is employed, with our design of a dual subroutine, to solve the optimization problem and guide the scheduling design. Extensive trace-driven simulations with representative DNN models demonstrate that Hadar improves the average job completion time (JCT) by 3× over an Apache YARN-based resource manager used in production. Moreover, Hadar outperforms Gavel [1], the state-of-the-art heterogeneity-aware scheduler, by 2.5× in average JCT, shortens the queuing delay by 13%, and improves finish-time fairness (FTF) by 1.5%.

Award ID(s): 2019511 2327452

PAR ID: 10529756

Author(s) / Creator(s): Sultana, Abeda; Xu, Fei; Yuan, Xu; Chen, Li; Tzeng, Nian-Feng

Publisher / Repository: IEEE

Date Published: 2024-05-27

ISBN: 979-8-3503-8711-7

Page Range / eLocation ID: 681 to 691

Subject(s) / Keyword(s): distributed deep learning, scheduling, optimization

Format(s): Medium: X

Location: San Francisco, CA, USA

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper:
https://doi.org/10.1109/IPDPS57955.2024.00066
