NSF PAR Search | NSF Public Access Repository

D-STACK: High Throughput DNN Inference by Effective Multiplexing and Spatio-Temporal Scheduling of GPUs

https://doi.org/10.1109/TCC.2024.3476210

Dhakal, Aditya; Kulkarni, Sameer G; Ramakrishnan, K K (October 2024, IEEE Transactions on Cloud Computing)

Full Text Available
Fine-grained accelerator partitioning for Machine Learning and Scientific Computing in Function as a Service Platform

https://doi.org/10.1145/3624062.3624238

Dhakal, Aditya; Raith, Philipp; Ward, Logan; Hong Enriquez, Rolando P.; Rattihalli, Gourav; Chard, Kyle; Foster, Ian; Milojicic, Dejan (November 2023, ACM)

Full Text Available
SLAM-share: visual simultaneous localization and mapping for real-time multi-user augmented reality

https://doi.org/10.1145/3555050.3569142

Dhakal, Aditya; Ran, Xukan; Wang, Yunshu; Chen, Jiasi; Ramakrishnan, K. K. (November 2022, Proceedings of the 18th International Conference on emerging Networking EXperiments and Technologies)

Full Text Available
Primitives Enhancing GPU Runtime Support for improved DNN Performance

Dhakal, Aditya; Kulkarni, Sameer; Ramakrishnan, K. K. (September 2021, IEEE International Conference on Cloud Computing)
Deep neural networks (DNNs) are increasingly used for real-time inference, which requires low latency, but they demand significant computational power as they continue to increase in complexity. Edge clouds promise to offer lower latency due to their proximity to end-users and their powerful accelerators, such as GPUs, which provide the computation power needed for DNNs. But it is also important to ensure that the edge-cloud resources are utilized well. For this, multiplexing several DNN models through spatial sharing of the GPU can substantially improve edge-cloud resource usage. Typical GPU runtime environments have significant interactions with the CPU, to transfer data to the GPU, for CPU-GPU synchronization on inference task completions, etc. These result in overheads. We present a DNN inference framework with a set of software primitives that reduce the overhead for DNN inference, increase GPU utilization and improve performance, with lower latency and higher throughput. Our first primitive uses the GPU DMA effectively, reducing the CPU cycles spent to transfer the data to the GPU. A second primitive uses asynchronous ‘events’ for faster task completion notification. GPU runtimes typically preclude fine-grained user control over GPU resources, causing long GPU downtimes when adjusting resources. Our third primitive supports overlapping of model-loading and execution, thus allowing GPU resource re-allocation with very little GPU idle time. Our other primitives increase inference throughput by improving scheduling and processing more requests. Overall, our primitives decrease inference latency by more than 35% and increase DNN throughput by 2-3×.
Full Text Available
ECML: Improving Efficiency of Machine Learning in Edge Clouds

https://doi.org/10.1109/CloudNet51028.2020.9335804

Dhakal, Aditya; Kulkarni, Sameer G; Ramakrishnan, K. K. (November 2020, 2020 IEEE 9th International Conference on Cloud Networking (CloudNet))
Edge cloud data centers (Edge) are deployed to provide responsive services to the end-users. Edge can host more powerful CPUs and DNN accelerators such as GPUs and may be used for offloading tasks from end-user devices that require more significant compute capabilities. But Edge resources may also be limited and must be shared across multiple applications that process requests concurrently from several clients. However, multiplexing GPUs across applications is challenging. With edge cloud servers needing to process a lot of streaming data and the advent of multi-GPU systems, getting that data from the network to the GPU can be a bottleneck, limiting the amount of work the GPU cluster can do. The lack of prompt notification of job completion from the GPU can also result in poor GPU utilization. We build on our recent work on controlled spatial sharing of a single GPU to expand to support multi-GPU systems and propose a framework that addresses these challenges. Unlike the state-of-the-art uncontrolled spatial sharing currently available with systems such as CUDA-MPS, our controlled spatial sharing approach uses each of the GPUs in the cluster efficiently by removing interference between applications, resulting in much better, predictable inference latency. We also use the DMA engines of each GPU in the cluster to offload data transfers to the GPU complex, thereby preventing the CPU from being the bottleneck. Finally, our framework uses the CUDA event library to give timely, low-overhead GPU notifications. Our evaluations show we can achieve low DNN inference latency and improve DNN inference throughput by at least a factor of 2.
Full Text Available
Machine Learning at the Edge: Efficient Utilization of Limited CPU/GPU Resources by Multiplexing

https://doi.org/10.1109/ICNP49622.2020.9259361

Dhakal, Aditya; Kulkarni, Sameer G; Ramakrishnan, K. K. (October 2020, Proc. of Riding with AI towards Mission-Critical Communications and Computing at the Edge (AIMCOM2) Workshop in IEEE ICNP 2020)
Edge clouds can provide very responsive services for end-user devices that require more significant compute capabilities than they have. But edge cloud resources such as CPUs and accelerators such as GPUs are limited and must be shared across multiple concurrently running clients. However, multiplexing GPUs across applications is challenging. Further, edge servers are likely to require considerable amounts of streaming data to be processed. Getting that data from the network stream to the GPU can be a bottleneck, limiting the amount of work GPUs do. Finally, the lack of prompt notification of job completion from the GPU also results in ineffective GPU utilization. We propose a framework that addresses these challenges in the following manner. We utilize spatial sharing of GPUs to multiplex the GPU more efficiently. While spatial sharing of the GPU can increase GPU utilization, the uncontrolled spatial sharing currently available with state-of-the-art systems such as CUDA-MPS can cause interference between applications, resulting in unpredictable latency. Our framework utilizes controlled spatial sharing of the GPU, which limits the interference across applications. Our framework uses the GPU DMA engine to offload data transfer to the GPU, thereby preventing the CPU from being a bottleneck while transferring data from the network to the GPU. Our framework uses the CUDA event library to provide timely, low-overhead GPU notifications. Preliminary experiments show that we can achieve low DNN inference latency and improve DNN inference throughput by a factor of ∼1.4.
Full Text Available
GSLICE: controlled spatial sharing of GPUs for a scalable inference platform

https://doi.org/10.1145/3419111.3421284

Dhakal, Aditya; Kulkarni, Sameer G; Ramakrishnan, K. K. (October 2020, SoCC '20: Proceedings of the 11th ACM Symposium on Cloud Computing)
The increasing demand for cloud-based inference services requires the use of Graphics Processing Units (GPUs). It is highly desirable to utilize the GPU efficiently by multiplexing different inference tasks on the GPU. Batched processing, CUDA streams and Multi-Process Service (MPS) help. However, we find that these are not adequate for achieving scalability by efficiently utilizing GPUs, and do not guarantee predictable performance. GSLICE addresses these challenges by incorporating a dynamic GPU resource allocation and management framework to maximize performance and resource utilization. We virtualize the GPU by apportioning the GPU resources across different Inference Functions (IFs), thus providing isolation and guaranteeing performance. We develop self-learning and adaptive GPU resource allocation and batching schemes that account for network traffic characteristics, while also keeping inference latencies below service level objectives. GSLICE adapts quickly to the streaming data's workload intensity and the variability of GPU processing costs. GSLICE provides scalability of the GPU for IF processing through efficient and controlled spatial multiplexing, coupled with a GPU resource re-allocation scheme with near-zero (< 100μs) downtime. Compared to default MPS and TensorRT, GSLICE improves GPU utilization efficiency by 60–800% and achieves a 2–13× improvement in aggregate throughput.
Full Text Available
NetML: An NFV Platform with Efficient Support for Machine Learning Applications

https://doi.org/10.1109/NETSOFT.2019.8806698

Dhakal, Aditya; Ramakrishnan, K. K. (June 2019, 2019 IEEE Conference on Network Softwarization (NetSoft))

Real-time applications such as autonomous and connected cars, surveillance, and online learning applications have to train on streaming data. They require low-latency, high-throughput machine learning (ML) functions resident in the network and in the cloud to perform learning and inference. NFV on edge cloud platforms can provide support for these applications by having heterogeneous computing including GPUs and other accelerators to offload ML-related computation. GPUs provide the necessary speedup for performing learning and inference to meet the needs of these latency-sensitive real-time applications. Supporting ML inference and learning efficiently for streaming data in NFV platforms has several challenges. In this paper, we present a framework, NetML, that runs existing ML applications on a heterogeneous NFV platform that includes both CPUs and GPUs. NetML efficiently transfers the appropriate packet payload to the GPU, minimizing overheads, avoiding locks, and avoiding CPU-based data copies. Additionally, NetML minimizes latency by maximizing overlap between the data movement and GPU computation. We evaluate the efficiency of our approach for training and inference using popular object detection algorithms on our platform. NetML reduces the latency for inferring images by more than 20% and increases the training throughput by 30% while reducing CPU utilization compared to other state-of-the-art alternatives.
Full Text Available
