Title: Mirage: Towards Low-interruption Services on Batch GPU Clusters with Reinforcement Learning
Accommodating long-running deep learning (DL) training and inference jobs is challenging on GPU clusters that use traditional batch schedulers, such as Slurm. Given fixed wall-clock time limits, DL researchers usually need to run a sequence of batch jobs and experience long interruptions on overloaded machines. Such interruptions significantly lower research productivity and the QoS of services deployed in production. To mitigate these interruptions, we propose the design of a proactive provisioner and investigate a set of statistical learning and reinforcement learning (RL) techniques, including random forest, XGBoost, Deep Q-Network, and policy gradient. Using production job traces from three GPU clusters, we train each model on a subset of the trace and then evaluate its generality on the remaining validation subset. We introduce Mirage, a Slurm-compatible resource provisioner that integrates the candidate ML methods. Our experiments show that Mirage can reduce interruptions by 17–100% and safeguard 23–76% of jobs with zero interruption across varying load levels on the three clusters.
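As a loose illustration of the trace-driven methodology described in the abstract, the sketch below trains one of the candidate statistical learners (a random forest) on a synthetic stand-in for a job trace and validates it on a held-out subset. The feature set and prediction target here are hypothetical, not Mirage's actual schema.

```python
# Sketch: train one candidate model (random forest) on a job-trace subset
# and validate on the held-out remainder, mirroring the paper's methodology.
# Feature names and the synthetic target below are hypothetical stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 5000
# Stand-in for a production trace: per-job (GPUs, wall-limit hours, queue depth).
X = np.column_stack([
    rng.integers(1, 9, n),       # GPUs requested
    rng.uniform(0.5, 48.0, n),   # wall-clock limit (hours)
    rng.integers(0, 200, n),     # jobs ahead in the queue
])
y = X[:, 2] * rng.uniform(0.05, 0.15, n)  # synthetic wait time (hours)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
print("validation MAE (hours):", mean_absolute_error(y_val, model.predict(X_val)))
```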
Award ID(s):
2401245
PAR ID:
10477901
Author(s) / Creator(s):
Publisher / Repository:
ACM
Date Published:
ISBN:
9798400701092
Page Range / eLocation ID:
1 to 13
Format(s):
Medium: X
Location:
Denver CO USA
Sponsoring Org:
National Science Foundation
More Like this
  1. The Sia scheduler efficiently assigns heterogeneous deep learning (DL) cluster resources to elastic, resource-adaptive jobs. Although some recent schedulers address one aspect or another (e.g., heterogeneity or resource-adaptivity), none addresses all of them, and most scale poorly to large clusters and/or heavy workloads even without the full complexity of the combined scheduling problem. Sia introduces a new scheduling formulation that scales to the required search-space sizes and intentionally matches jobs and their configurations to GPU types and counts, while adapting to changes in cluster load and job mix over time. Sia also introduces a low-profiling-overhead approach to bootstrapping (for each new job) the throughput models used to evaluate possible resource assignments, and it is the first cluster scheduler to support elastic scaling of hybrid-parallel jobs. Extensive evaluations show that Sia outperforms state-of-the-art schedulers. For example, even on relatively small 44- to 64-GPU clusters with a mix of three GPU types, Sia reduces average job completion time (JCT) by 30–93%, 99th-percentile JCT and makespan by 28–95%, and GPU hours used by 12–55% for workloads derived from three real-world environments. Additional experiments demonstrate that Sia scales to at least 2000-GPU clusters, provides improved fairness, and is not over-sensitive to scheduler parameter settings.
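Sia's real formulation is a scalable optimization over job configurations and GPU types guided by bootstrapped throughput models; as a minimal sketch of the matching idea only, the toy heuristic below greedily assigns each job to the free GPU type with the highest per-GPU throughput estimate. All names and numbers are hypothetical, and this greedy pass is not Sia's solver.

```python
# Toy illustration of matching jobs to GPU types by estimated throughput.
# Sia solves a much richer optimization; this only shows the shape of the
# problem. All model names and throughput numbers are hypothetical.
gpu_free = {"A100": 4, "V100": 8, "K80": 16}
# throughput[job][gpu_type]: estimated samples/sec on one GPU of that type.
throughput = {
    "resnet": {"A100": 900, "V100": 400, "K80": 90},
    "bert":   {"A100": 120, "V100": 50,  "K80": 8},
    "dlrm":   {"A100": 300, "V100": 180, "K80": 60},
}

assignment = {}
# Place the jobs with the largest best-case throughput first.
for job in sorted(throughput, key=lambda j: -max(throughput[j].values())):
    best = max((g for g in gpu_free if gpu_free[g] > 0),
               key=lambda g: throughput[job][g], default=None)
    if best is not None:
        assignment[job] = best
        gpu_free[best] -= 1

print(assignment)  # {'resnet': 'A100', 'dlrm': 'A100', 'bert': 'A100'}
```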
  2. Deep learning (DL) systems research is often impeded by the limited availability and high cost of GPUs. In this paper, we introduce GPEmu, a GPU emulator for faster and cheaper prototyping and evaluation of DL systems research without using real GPUs. GPEmu comes with four novel features: time emulation, memory emulation, distributed-system support, and sharing support. We support over 30 DL models and 6 GPU models, the largest scale to date. We demonstrate the power of GPEmu by successfully reproducing the main results of nine recent publications and easily prototyping three new micro-optimizations.
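The abstract does not spell out GPEmu's interfaces, but its "time emulation" feature can be illustrated with a minimal sketch: replay profiled per-iteration GPU times with sleeps instead of executing real kernels. The model names, GPU models, and timings below are assumed stand-ins, and GPEmu's real API may differ.

```python
# Sketch of the "time emulation" idea: instead of executing kernels on a
# real GPU, sleep for the time each step took on the emulated GPU model.
import time

# Hypothetical profiled per-iteration times (seconds) per (model, GPU) pair.
PROFILES = {
    ("resnet50", "V100"): 0.120,
    ("resnet50", "K80"):  0.480,
}

def emulated_train(model: str, gpu: str, iterations: int) -> float:
    """Emulate a training loop without a GPU by sleeping per iteration."""
    step = PROFILES[(model, gpu)]
    start = time.perf_counter()
    for _ in range(iterations):
        time.sleep(step)  # stands in for forward/backward on the real GPU
    return time.perf_counter() - start

print(f"emulated run took {emulated_train('resnet50', 'V100', 10):.2f}s")
```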
  3. Obeid, Iyad; Selesnick, Ivan; Picone, Joseph (Ed.)
    The goal of this work was to design a low-cost computing facility that can support the development of an open-source digital pathology corpus containing 1M images [1]. A single image from a clinical-grade digital pathology scanner can range in size from hundreds of megabytes to five gigabytes, so a 1M-image database requires over a petabyte (PB) of disk space. Doing meaningful work in this problem space requires a significant allocation of computing resources. The improvements and expansions to our HPC (high-performance computing) cluster, known as Neuronix [2], required to support working with digital pathology fall into two broad categories: computation and storage.

To handle the increased computational burden and increase job throughput, we use Slurm [3] as our scheduler and resource manager. Slurm has numerous features that allow it to generalize to a number of different scenarios; among the most notable is its support for GPU (graphics processing unit) scheduling. GPUs can offer a tremendous performance increase in machine learning applications [4], and Slurm's built-in mechanisms for handling them were a key factor in this choice. Slurm's general resource (GRES) mechanism can be used to configure and enable support for resources beyond the ones provided by a traditional HPC scheduler (e.g., memory, wall-clock time), and GPUs are among the GRES types Slurm supports [5]. In addition to tracking resources, Slurm strictly enforces resource allocation. This becomes very important as the computational demands of jobs increase: each job must receive all the resources it needs without taking resources from other jobs. It is common practice among GPU-enabled frameworks to query the CUDA runtime library/drivers and iterate over the list of GPUs, attempting to establish a context on all of them; Slurm is able to affect the hardware discovery process of these jobs, which enables a number of them to run alongside each other even when the GPUs are in exclusive-process mode.

To store large quantities of digital pathology slides, we developed a robust, extensible distributed storage solution. We used a number of open-source tools to create a single filesystem that can be mounted by any machine on the network. At the lowest layer of abstraction are the hard drives, which were split into four 60-disk chassis using 8TB drives; supporting these disks are two server units, each equipped with Intel Xeon CPUs and 128GB of RAM. At the filesystem level, we implemented a multi-layer solution that (1) connects the disks together into a single filesystem/mountpoint using ZFS (the Zettabyte File System) [6], and (2) connects filesystems on multiple machines together into a single mountpoint using Gluster [7]. ZFS, initially developed by Sun Microsystems, provides disk-level awareness and a filesystem that takes advantage of that awareness to provide fault tolerance; at the filesystem level, ZFS protects against data corruption and the infamous RAID write-hole bug by implementing a journaling scheme (the ZFS intent log, or ZIL) and copy-on-write functionality. Each machine (1 controller + 2 disk chassis) has its own separate ZFS filesystem, and Gluster, essentially a meta-filesystem, takes each of these and provides the means to connect them together over the network in distributed (similar to RAID 0, but without striping individual files) and mirrored (similar to RAID 1) configurations [8]. With these improvements, it has been possible to expand the storage and computational power of the Neuronix cluster arbitrarily by scaling horizontally, supporting our most computationally intensive endeavors while maintaining the cluster's excellent price/performance ratio [1].
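One concrete piece of the GPU-scheduling behavior described above: for jobs allocated GPUs through GRES, Slurm commonly exports CUDA_VISIBLE_DEVICES so that CUDA device discovery sees only the allocated devices. A minimal sketch, assuming the common configuration where the variable holds comma-separated device indices (some setups use UUIDs or MIG identifiers instead):

```python
# Sketch: how a framework launched under e.g. "sbatch --gres=gpu:2" sees
# only its allocated devices. Slurm's GRES plugin typically exports
# CUDA_VISIBLE_DEVICES, confining CUDA enumeration to those devices.
import os

def allocated_gpus() -> list[int]:
    """Return the GPU indices this job may use, per the environment."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [int(i) for i in visible.split(",") if i.strip().isdigit()]

gpus = allocated_gpus()
if gpus:
    print(f"job confined to GPUs {gpus}; other devices remain invisible")
else:
    print("no GPUs visible (or running outside a Slurm GPU allocation)")
```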
  4. Accurate wait-time prediction for HPC jobs contributes to a positive user experience but has historically been a challenging task. Previous models lack the accuracy needed for confident predictions, and many were developed before the rise of deep learning. In this work, we investigate and develop TROUT, a neural-network-based model that accurately predicts wait times for jobs submitted to the Anvil HPC cluster. Data were taken from the cluster's Slurm Workload Manager and transformed, with additional feature engineering from jobs' priorities, partitions, and states. We developed a hierarchical model that classifies job queue times into bins before applying regression, outperforming traditional methods, and integrated the model into a CLI tool for queue-time prediction. This study explores which queue-time prediction methods are most applicable to modern HPC systems and shows that deep-learning-based prediction models are viable solutions.
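Beyond "classify into bins, then regress", TROUT's architecture is not detailed here; the sketch below shows that hierarchical pattern in generic form, with synthetic features and bin edges standing in for the engineered Slurm features the abstract mentions (and tree models standing in for TROUT's neural networks).

```python
# Sketch of the classify-then-regress pattern: first predict a queue-time
# bin, then apply that bin's regressor. Features, bin edges, and the
# synthetic target are stand-ins, not TROUT's actual pipeline.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(1)
n = 4000
X = rng.uniform(0, 1, (n, 3))            # stand-in engineered features
wait = np.exp(4 * X[:, 0]) * 10          # synthetic wait times (minutes)
bins = np.digitize(wait, [30, 240])      # 0: <30m, 1: 30m-4h, 2: >4h

clf = RandomForestClassifier(random_state=0).fit(X, bins)
regs = {b: RandomForestRegressor(random_state=0).fit(X[bins == b], wait[bins == b])
        for b in np.unique(bins)}

def predict_wait(x: np.ndarray) -> float:
    """Route a job through the bin classifier, then the bin's regressor."""
    b = clf.predict(x.reshape(1, -1))[0]
    return regs[b].predict(x.reshape(1, -1))[0]

print(f"predicted wait: {predict_wait(X[0]):.1f} min (actual {wait[0]:.1f})")
```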
  5. Containerized applications have exploded in popularity in recent years due to their ease of deployment, reproducible nature, and speed of startup. Accordingly, container orchestration tools such as Kubernetes have emerged as resource providers and users alike try to organize and scale their work across clusters of systems. This paper documents real-world experiences of building, operating, and using self-hosted Kubernetes Linux clusters, and compares Kubernetes with single-node container solutions and with traditional multi-user, batch-queue Linux clusters. The authors have background experience first running traditional HPC Linux clusters and queuing systems like Slurm, and later virtual machines using technologies such as OpenStack; much of the experience and perspective below is informed by that background. We also provide a use case from a researcher who deployed on Kubernetes without being as opinionated about the other potential choices.