Title: GSLICE: controlled spatial sharing of GPUs for a scalable inference platform
The increasing demand for cloud-based inference services requires the use of Graphics Processing Units (GPUs). It is highly desirable to utilize GPUs efficiently by multiplexing different inference tasks on a GPU. Batched processing, CUDA streams, and the Multi-Process Service (MPS) help, but we find they are not adequate for achieving scalability through efficient GPU utilization, nor do they guarantee predictable performance.
GSLICE addresses these challenges with a dynamic GPU resource allocation and management framework that maximizes performance and resource utilization. We virtualize the GPU by apportioning GPU resources across different Inference Functions (IFs), thus providing isolation and guaranteeing performance. We develop self-learning, adaptive GPU resource allocation and batching schemes that account for network traffic characteristics while keeping inference latencies below service-level objectives. GSLICE adapts quickly to the workload intensity of streaming data and to variability in GPU processing costs. GSLICE scales IF processing on the GPU through efficient, controlled spatial multiplexing, coupled with a GPU resource re-allocation scheme with near-zero (< 100μs) downtime. Compared to default MPS and TensorRT, GSLICE improves GPU utilization efficiency by 60--800% and achieves a 2--13X improvement in aggregate throughput.
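Controlled spatial sharing of this kind builds on the per-client SM partitioning that CUDA MPS exposes on Volta and later GPUs. The following is a minimal sketch of that underlying mechanism, not GSLICE's own code: it caps one inference function's process at a fraction of the GPU's SMs via the documented CUDA_MPS_ACTIVE_THREAD_PERCENTAGE variable (the 30% default here is an arbitrary illustration).

```cpp
// Sketch: capping one MPS client's share of GPU SMs via the documented
// CUDA_MPS_ACTIVE_THREAD_PERCENTAGE knob (Volta+ MPS). Illustrates the
// mechanism GSLICE-style controlled spatial sharing builds on; it is
// not the GSLICE implementation itself.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    // Must be set before the CUDA context is created; the MPS server
    // reads it when this client connects. "30" = at most 30% of SMs.
    const char* pct = (argc > 1) ? argv[1] : "30";
    setenv("CUDA_MPS_ACTIVE_THREAD_PERCENTAGE", pct, 1 /*overwrite*/);

    cudaFree(0);  // force context creation under the cap

    // ... launch this inference function's kernels as usual; they can
    // now occupy at most `pct` percent of the GPU's SMs, limiting
    // interference with co-located inference functions.
    printf("client limited to %s%% of SMs\n", pct);
    return 0;
}
```

A GSLICE-style platform would set a different cap for each IF process and adjust caps as load shifts; since MPS applies the value at context creation, the cost of re-allocation is exactly why the near-zero-downtime re-allocation scheme above matters.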
Dhakal, Aditya; Kulkarni, Sameer; Ramakrishnan, K. K. (IEEE International Conference on Cloud Computing)
Deep neural networks (DNNs) are increasingly used for real-time inference, which demands low latency, yet DNNs require significant computational power as they continue to grow in complexity. Edge clouds promise lower latency due to their proximity to end-users, and they can host powerful accelerators such as GPUs to provide the computation DNNs need. But it is also important to ensure that edge-cloud resources are utilized well; multiplexing several DNN models through spatial sharing of the GPU can substantially improve edge-cloud resource usage. Typical GPU runtime environments interact heavily with the CPU, e.g., to transfer data to the GPU and to synchronize the CPU and GPU on inference task completions, and these interactions introduce overhead. We present a DNN inference framework with a set of software primitives that reduce this overhead, increase GPU utilization, and improve performance, with lower latency and higher throughput. Our first primitive uses the GPU DMA engine effectively, reducing the CPU cycles spent transferring data to the GPU. A second primitive uses asynchronous ‘events’ for faster task-completion notification. GPU runtimes typically preclude fine-grained user control over GPU resources, causing long GPU downtimes when adjusting resources; our third primitive supports overlapping of model loading and execution, allowing GPU resource re-allocation with very little GPU idle time. Our other primitives increase inference throughput by improving scheduling and processing more requests. Overall, our primitives decrease inference latency by more than 35% and increase DNN throughput by 2-3×.
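To make the first two primitives concrete, here is a hedged sketch; the function names (infer_async, inference_finished) and the commented-out kernel launch are illustrative, not the paper's API. A pinned staging buffer lets the GPU's DMA engine perform the host-to-device copy asynchronously, and a CUDA event recorded after the work gives a cheap, non-blocking completion signal.

```cpp
// Sketch of the first two primitives under stated assumptions:
// (1) pinned host buffers let the GPU's DMA engine pull data without
//     burning CPU cycles on the copy, and
// (2) a CUDA event gives a cheap, asynchronous completion signal.
#include <cstring>
#include <cuda_runtime.h>

void infer_async(const float* net_payload, size_t n,
                 float* d_input, cudaStream_t stream, cudaEvent_t done) {
    // Illustrative simplification: assumes a fixed maximum payload
    // size across calls, so one pinned buffer can be reused.
    static float* pinned = nullptr;
    if (!pinned) cudaMallocHost(&pinned, n * sizeof(float));  // DMA-able

    memcpy(pinned, net_payload, n * sizeof(float));   // cheap staging copy
    cudaMemcpyAsync(d_input, pinned, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);  // DMA engine does the work
    // launch_inference_kernels(d_input, stream);     // hypothetical model launch
    cudaEventRecord(done, stream);                    // marks end of this request
}

bool inference_finished(cudaEvent_t done) {
    // Non-blocking poll: cudaSuccess once all work before the event is done.
    return cudaEventQuery(done) == cudaSuccess;
}
```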
Dhakal, Aditya; Kulkarni, Sameer G; Ramakrishnan, K. K. (Proc. of Riding with AI towards Mission-Critical Communications and Computing at the Edge (AIMCOM2) Workshop in IEEE ICNP 2020)
Edge clouds can provide very responsive services for end-user devices that require more compute capability than they have. But edge-cloud resources such as CPUs and accelerators such as GPUs are limited and must be shared across multiple concurrently running clients. Multiplexing GPUs across applications, however, is challenging. Further, edge servers are likely to process considerable amounts of streaming data, and getting that data from the network stream to the GPU can be a bottleneck, limiting the amount of work GPUs do. Finally, the lack of prompt notification of job completion from the GPU also results in ineffective GPU utilization. We propose a framework that addresses these challenges as follows. We utilize spatial sharing of GPUs to multiplex the GPU more efficiently. While spatial sharing can increase GPU utilization, the uncontrolled spatial sharing currently available in state-of-the-art systems such as CUDA-MPS can cause interference between applications, resulting in unpredictable latency. Our framework instead uses controlled spatial sharing of the GPU, which limits interference across applications. It uses the GPU DMA engine to offload data transfers to the GPU, preventing the CPU from becoming a bottleneck while moving data from the network to the GPU, and it uses the CUDA event library to obtain timely, low-overhead GPU notifications. Preliminary experiments show that we can achieve low DNN inference latency and improve DNN inference throughput by a factor of ∼1.4.
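As one illustration of how event-based notification keeps the CPU productive, the sketch below (our own structure and names, assuming a fixed pipeline depth of 8 requests and omitting slot recycling) keeps several requests in flight on separate streams and reaps completions with non-blocking cudaEventQuery calls instead of blocking in cudaStreamSynchronize.

```cpp
// Hedged sketch: servicing many in-flight requests with one polling
// loop over CUDA events, so the CPU never stalls waiting on one request.
#include <cuda_runtime.h>

constexpr int K = 8;                 // assumed pipeline depth
cudaStream_t streams[K];
cudaEvent_t  done[K];

void init() {
    for (int i = 0; i < K; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaEventCreateWithFlags(&done[i], cudaEventDisableTiming); // cheapest event
    }
}

// Called from the request loop: issue work on slot i, then record.
void submit(int i /*, request data */) {
    // enqueue H2D copy + inference kernels on streams[i] here
    // (see the DMA sketch earlier), then mark the end of the request:
    cudaEventRecord(done[i], streams[i]);
}

// Reap finished slots without blocking; returns how many are complete.
int reap() {
    int completed = 0;
    for (int i = 0; i < K; ++i)
        if (cudaEventQuery(done[i]) == cudaSuccess) ++completed;
    return completed;
}
```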
Dhakal, Aditya; Kulkarni, Sameer G; Ramakrishnan, K. K. (2020 IEEE 9th International Conference on Cloud Networking (CloudNet))
Edge cloud data centers (Edge) are deployed to provide responsive services to end-users. The Edge can host powerful CPUs and DNN accelerators such as GPUs and may be used for offloading tasks from end-user devices that require more significant compute capabilities. But Edge resources may also be limited and must be shared across multiple applications that process requests concurrently from several clients. Multiplexing GPUs across applications, however, is challenging. With edge cloud servers needing to process large amounts of streaming data, and with the advent of multi-GPU systems, getting that data from the network to the GPUs can be a bottleneck, limiting the amount of work the GPU cluster can do. The lack of prompt notification of job completion from the GPU can also result in poor GPU utilization. We build on our recent work on controlled spatial sharing of a single GPU to support multi-GPU systems and propose a framework that addresses these challenges. Unlike the uncontrolled spatial sharing currently available in state-of-the-art systems such as CUDA-MPS, our controlled spatial sharing approach uses each GPU in the cluster efficiently by removing interference between applications, resulting in much better, predictable inference latency. We also use each cluster GPU's DMA engines to offload data transfers to the GPU complex, thereby preventing the CPU from becoming the bottleneck. Finally, our framework uses the CUDA event library to deliver timely, low-overhead GPU notifications. Our evaluations show we can achieve low DNN inference latency and improve DNN inference throughput by at least a factor of 2.
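A minimal sketch of the multi-GPU fan-out described here, under the assumption that the pinned buffers, device buffers, and streams were created per device beforehand: switching devices with cudaSetDevice and issuing cudaMemcpyAsync on each GPU's own stream lets every GPU's copy engine pull its share of the network data while the CPU merely enqueues work.

```cpp
// Hedged sketch of extending DMA offload to a multi-GPU node: each
// device gets its own stream (created on that device), so each GPU's
// own copy engine moves its share of the data concurrently.
#include <cuda_runtime.h>

void fan_out(float* pinned_chunks[], float* d_inputs[],
             size_t chunk_elems, int num_gpus,
             cudaStream_t per_gpu_stream[]) {
    for (int g = 0; g < num_gpus; ++g) {
        cudaSetDevice(g);  // subsequent calls target GPU g
        cudaMemcpyAsync(d_inputs[g], pinned_chunks[g],
                        chunk_elems * sizeof(float),
                        cudaMemcpyHostToDevice, per_gpu_stream[g]);
        // launch GPU g's inference kernels on per_gpu_stream[g] ...
    }
    // All copies proceed in parallel on each GPU's DMA engine; the CPU
    // only spent cycles enqueueing, not copying.
}
```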
Ravi, John; Nguyen, Tri; Zhou, Huiyang; Becchi, Michela (2021 IEEE 28th International Conference on High Performance Computing, Data, and Analytics (HiPC))
Concurrent kernel execution on GPUs has proven an effective technique for improving system throughput by maximizing resource utilization. To increase programmability and meet the growing memory requirements of data-intensive applications, current GPUs support Unified Virtual Memory (UVM), which provides a virtual memory abstraction with demand paging. By allowing applications to oversubscribe GPU memory, UVM provides increased opportunities to share GPU resources across applications. However, in the presence of applications with competing memory requirements, GPU sharing can lead to performance degradation due to thrashing. NVIDIA's Multi-Process Service (MPS) offers the capability to space-share bare-metal GPUs, thereby enabling cluster workload managers, such as Slurm, to share a single GPU across MPI ranks with limited control over resource partitioning. However, it is not possible to preempt, schedule, or throttle a running GPU process through MPS. These features would enable new OS-managed scheduling policies for GPU kernels that dynamically handle resource contention and offer consistent performance. The contribution of this paper is two-fold. We first show how memory oversubscription can impact the performance of concurrent GPU applications. Then, we propose three methods to transparently mitigate memory interference through kernel preemption and scheduling policies. To implement our policies, we developed our own runtime system (PILOT) to serve as an alternative to NVIDIA's MPS. In the presence of memory oversubscription, we observed a dramatic improvement in overall throughput when using our scheduling policies and runtime hints.
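The UVM behavior the paper targets can be reproduced with standard CUDA managed-memory calls. The sketch below is unrelated to PILOT's internals, and the sizes and read-mostly advice are illustrative assumptions: it oversubscribes GPU memory with cudaMallocManaged and then applies cudaMemAdvise and cudaMemPrefetchAsync, the kind of runtime hints that steer the demand pager.

```cpp
// Sketch of UVM oversubscription plus runtime hints (illustrative
// sizes; assumes the host has enough RAM to back the allocation).
#include <cuda_runtime.h>

int main() {
    int dev = 0;
    cudaSetDevice(dev);

    size_t bytes = size_t(32) << 30;      // 32 GiB: may exceed GPU memory
    float* data = nullptr;
    cudaMallocManaged(&data, bytes);      // demand-paged, oversubscribable

    // Hint: this region is mostly read, so read-only copies may be kept.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetReadMostly, dev);

    // Prefetch the working set before launch to avoid page-fault storms.
    size_t working = size_t(4) << 30;     // assumed 4 GiB hot region
    cudaMemPrefetchAsync(data, working, dev);

    // kernel<<<...>>>(data);  // the rest is faulted in on demand
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}
```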
Campbell, C.; Mecca, N.; Duong, T.; Obeid, I.; Picone, J. (IEEE Signal Processing in Medicine and Biology Symposium (SPMB)); Obeid, Iyad; Selesnick, Ivan; Picone, Joseph (Eds.)
The goal of this work was to design a low-cost computing facility that can support the development of an open source digital pathology corpus containing 1M images [1]. A single image from a clinical-grade digital pathology scanner can range in size from hundreds of megabytes to five gigabytes, so a 1M-image database requires over a petabyte (PB) of disk space. Doing meaningful work in this problem space requires a significant allocation of computing resources. The improvements and expansions to our HPC (high-performance computing) cluster, known as Neuronix [2], required to support working with digital pathology fall into two broad categories: computation and storage. To handle the increased computational burden and increase job throughput, we are using Slurm [3] as our scheduler and resource manager. For storage, we have designed and implemented a multi-layer filesystem architecture that distributes a filesystem across multiple machines. These enhancements, which are entirely based on open source software, have extended the capabilities of our cluster and increased its cost-effectiveness.

Slurm has numerous features that allow it to generalize to many different scenarios. Among the most notable is its support for GPU (graphics processing unit) scheduling. GPUs can offer a tremendous performance increase in machine learning applications [4], and Slurm's built-in mechanisms for handling them were a key factor in making this choice. Slurm has a general resource (GRES) mechanism that can be used to configure and enable support for resources beyond the ones provided by the traditional HPC scheduler (e.g., memory, wall-clock time), and GPUs are among the GRES types that Slurm supports [5]. In addition to tracking resources, Slurm strictly enforces resource allocations. This becomes very important as the computational demands of jobs increase, ensuring that each job has all the resources it needs and does not take resources from other jobs. It is a common practice among GPU-enabled frameworks to query the CUDA runtime library/drivers, iterate over the list of GPUs, and attempt to establish a context on all of them. Slurm is able to restrict the hardware discovery process of these jobs, which enables several such jobs to run alongside each other, even if the GPUs are in exclusive-process mode.
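From inside a job, this isolation is visible through the CUDA_VISIBLE_DEVICES variable Slurm exports for the allocated GPUs. A minimal sketch (not from the paper) of what the usual enumerate-all-GPUs pattern then sees:

```cpp
// Sketch: inside a Slurm job, the CUDA runtime only enumerates the
// devices listed in CUDA_VISIBLE_DEVICES, renumbered from 0, so a
// framework that tries to grab every GPU only touches its allocation.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const char* visible = getenv("CUDA_VISIBLE_DEVICES");
    printf("CUDA_VISIBLE_DEVICES=%s\n", visible ? visible : "(unset)");

    int n = 0;
    cudaGetDeviceCount(&n);   // counts only the devices Slurm granted
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, i);
        printf("device %d: %s\n", i, p.name);
    }
    return 0;
}
```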
To store large quantities of digital pathology slides, we developed a robust, extensible distributed storage solution. We used a number of open source tools to create a single filesystem that can be mounted by any machine on the network. At the lowest layer of abstraction are the hard drives, which are split into four 60-disk chassis using 8TB drives. To support these disks, we have two server units, each equipped with Intel Xeon CPUs and 128GB of RAM. At the filesystem level, we have implemented a multi-layer solution that: (1) connects the disks together into a single filesystem/mountpoint using ZFS (the Zettabyte File System) [6], and (2) connects filesystems on multiple machines together to form a single mountpoint using Gluster [7].

ZFS, initially developed by Sun Microsystems, provides disk-level awareness and a filesystem that takes advantage of that awareness to provide fault tolerance. At the filesystem level, ZFS protects against data corruption and the infamous RAID write-hole bug by implementing a journaling scheme (the ZFS intent log, or ZIL) and copy-on-write functionality. Each machine (1 controller + 2 disk chassis) has its own separate ZFS filesystem. Gluster, essentially a meta-filesystem, takes each of these filesystems and provides the means to connect them over the network using distributed (similar to RAID 0, but without striping individual files) and mirrored (similar to RAID 1) configurations [8].

By implementing these improvements, it has been possible to expand the storage and computational power of the Neuronix cluster arbitrarily, supporting the most computationally intensive endeavors by scaling horizontally. We have greatly improved the scalability of the cluster while maintaining its excellent price/performance ratio [1].
Dhakal, Aditya; Kulkarni, Sameer G; Ramakrishnan, K. K. "GSLICE: controlled spatial sharing of GPUs for a scalable inference platform." SoCC '20: Proceedings of the 11th ACM Symposium on Cloud Computing. https://doi.org/10.1145/3419111.3421284. https://par.nsf.gov/biblio/10299299.
@inproceedings{osti_10299299,
  title     = {GSLICE: controlled spatial sharing of GPUs for a scalable inference platform},
  author    = {Dhakal, Aditya and Kulkarni, Sameer G and Ramakrishnan, K. K.},
  booktitle = {SoCC '20: Proceedings of the 11th ACM Symposium on Cloud Computing},
  url       = {https://par.nsf.gov/biblio/10299299},
  doi       = {10.1145/3419111.3421284}
}