skip to main content

Title: Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads
Specialized accelerators such as GPUs, TPUs, FPGAs, and custom ASICs have been increasingly deployed to train deep learning models. These accelerators exhibit heterogeneous performance behavior across model architectures. Existing schedulers for clusters of accelerators, which are used to arbitrate these expensive training resources across many users, have shown how to optimize for various multi-job, multiuser objectives, like fairness and makespan. Unfortunately, existing schedulers largely do not consider performance heterogeneity. In this paper, we propose Gavel, a heterogeneity-aware scheduler that systematically generalizes a wide range of existing scheduling policies. Gavel expresses these policies as optimization problems and then systematically transforms these problems into heterogeneity-aware versions using an abstraction we call effective throughput. Gavel then uses a round-based scheduling mechanism to ensure jobs receive their ideal allocation given the target scheduling policy. Gavel’s heterogeneity-aware policies allow a heterogeneous cluster to sustain higher input load, and improve end objectives such as makespan and average job completion time by 1.4⇥ and 3.5⇥ compared to heterogeneity-agnostic policies.
; ; ; ;
Award ID(s):
Publication Date:
Journal Name:
OSDI 2020
Sponsoring Org:
National Science Foundation
More Like this
  1. Compute heterogeneity is increasingly gaining prominence in modern datacenters due to the addition of accelerators like GPUs and FPGAs. We observe that datacenter schedulers are agnostic of these emerging accelerators, especially their resource utilization footprints, and thus, not well equipped to dynamically provision them based on the application needs. We observe that the state-of-the-art datacenter schedulers fail to provide fine-grained resource guarantees for latency-sensitive tasks that are GPU-bound. Specifically for GPUs, this results in resource fragmentation and interference leading to poor utilization of allocated GPU resources. Furthermore, GPUs exhibit highly linear energy efficiency with respect to utilization and hence proactive management of these resources is essential to keep the operational costs low while ensuring the end-to-end Quality of Service (QoS) in case of user-facing queries.Towards addressing the GPU orchestration problem, we build Knots, a GPU-aware resource orchestration layer and integrate it with the Kubernetes container orchestrator to build Kube- Knots. Kube-Knots can dynamically harvest spare compute cycles through dynamic container orchestration enabling co-location of latency-critical and batch workloads together while improving the overall resource utilization. We design and evaluate two GPU-based scheduling techniques to schedule datacenter-scale workloads through Kube-Knots on a ten node GPU cluster. Our proposed Correlation Based Predictionmore »(CBP) and Peak Prediction (PP) schemes together improves both average and 99 th percentile cluster-wide GPU utilization by up to 80% in case of HPC workloads. In addition, CBP+PP improves the average job completion times (JCT) of deep learning workloads by up to 36% when compared to state-of-the-art schedulers. This leads to 33% cluster-wide energy savings on an average for three different workloads compared to state-of-the-art GPU-agnostic schedulers. Further, the proposed PP scheduler guarantees the end-to-end QoS for latency-critical queries by reducing QoS violations by up to 53% when compared to state-of-the-art GPU schedulers.« less
  2. Traditional workload managers do not have the capacity to consider how IO contention can increase job runtime and even cause entire resource allocations to be wasted. Whether from bursts of IO demand or parallel file systems (PFS) performance degradation, IO contention must be identified and addressed to ensure maximum performance. In this paper, we present AI4IO (AI for IO), a suite of tools using AI methods to prevent and mitigate performance losses due to IO contention. AI4IO enables existing workload managers to become IO-aware. Currently, AI4IO consists of two tools: PRIONN and CanarIO. PRIONN predicts IO contention and empowers schedulers to prevent it. CanarIO mitigates the impact of IO contention when it does occur. We measure the effectiveness of AI4IO when integrated into Flux, a next-generation scheduler, for both small- and large-scale IO-intensive job workloads. Our results show that integrating AI4IO into Flux improves the workload makespan up to 6.4%, which can account for more than 18,000 node-h of saved resources per week on a production cluster in our large-scale workload.

  3. Today’s high-performance computing (HPC) platforms are still dominated by batch jobs. Accordingly, effective batch job scheduling is crucial to obtain high system efficiency. Existing HPC batch job schedulers typically leverage heuristic priority functions to prioritize and schedule jobs. But, once configured and deployed by the experts, such priority function scan hardly adapt to the changes of job loads, optimization goals, or system settings, potentially leading to degraded system efficiency when changes occur. To address this fundamental issue, we present RLScheduler, an automated HPC batch job scheduler built on reinforcement learning. RLScheduler relies on minimal manual interventions or expert knowledge, but can learn high-quality scheduling policies via its own continuous ‘trial and error’. We introduce a new kernel-based neural network structure and trajectory filtering mechanism in RLScheduler to improve and stabilize the learning process. Through extensive evaluations,we confirm that RLScheduler can learn high-quality scheduling policies towards various workloads and various optimization goals with relatively low computation cost. Moreover, we show that the learned models perform stably even when applied to unseen workloads, making them practical for production use.
  4. Heterogeneous architectures have become increasingly common. From co-packaging small and large cores, to GPUs alongside CPUs, to general-purpose heterogeneous-ISA architectures with cores implementing different ISAs. As diversity of execution cores grows, predictive models become of paramount importance for scheduling and resource allocation. In this paper, we investigate the capabilities of performance predictors in a heterogeneous-ISA setting, as well as the predictors’ effects on scheduler quality. We follow an unbiased feature selection methodology to identify the optimal set of features for this task, instead of pre-selecting features before training. Finally, we incorporate our findings in ML-based schedulers and evaluate their sensitivity to the underlying system’s level of heterogeneity. We show our schedulers to perform within 2-11% of an oracular scheduler across a variety of underlying heterogeneous-ISA multicore systems without modification.
  5. We consider the problem of makespan minimization on unrelated machines when job sizes are stochastic. The goal is to find a fixed assignment of jobs to machines, to minimize the expected value of the maximum load over all the machines. For the identical-machines special case when the size of a job is the same across all machines, a constant-factor approximation algorithm has long been known. Our main result is the first constant-factor approximation algorithm for the general case of unrelated machines. This is achieved by (i) formulating a lower bound using an exponential-size linear program that is efficiently computable and (ii) rounding this linear program while satisfying only a specific subset of the constraints that still suffice to bound the expected makespan. We also consider two generalizations. The first is the budgeted makespan minimization problem, where the goal is to minimize the expected makespan subject to scheduling a target number (or reward) of jobs. We extend our main result to obtain a constant-factor approximation algorithm for this problem. The second problem involves q-norm objectives, where we want to minimize the expected q-norm of the machine loads. Here we give an [Formula: see text]-approximation algorithm, which is a constant-factor approximation formore »any fixed q.« less