NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

FAAStloop: Optimizing Loop-Based Applications for Serverless Computing

https://doi.org/10.1145/3698038.3698560

Mohanty, Shruti; Bhasi, Vivek M; Son, Myungjun; Kandemir, Mahmut Taylan; Das, Chita (November 2024, ACM)

Full Text Available
Paldia: Enabling SLO-Compliant and Cost-Effective Serverless Computing on Heterogeneous Hardware

https://doi.org/10.1109/IPDPS57955.2024.00018

Bhasi, Vivek M; Sharma, Aakash; Mohanty, Shruti; Kandemir, Mahmut Taylan; Das, Chita R (May 2024, IEEE)

Among the variety of applications (apps) being deployed on serverless platforms, apps such as Machine Learning (ML) inference serving can achieve better performance from leveraging accelerators like GPUs. Yet, major serverless providers, despite having GPU-equipped servers, do not offer GPU support for their serverless functions. Given that serverless functions are deployed on various generations of CPUs already, extending this to various (typically more expensive) GPU generations can offer providers a greater range of hardware to serve incoming requests according to the functions and request traffic. Here, providers are faced with the challenge of selecting hardware to reach a well-proportioned trade-off point between cost and performance. While recent works have attempted to address this, they often fail to do so as they overlook optimization opportunities arising from intelligently leveraging existing GPU sharing mechanisms. To address this point, we devise a heterogeneous serverless framework, PALDIA, which uses a prudent Hardware selection policy to acquire capable, costeffective hardware and perform intelligent request scheduling on it to yield high performance and cost savings. Specifically, our scheduling algorithm employs hybrid spatio-temporal GPU sharing that intelligently trades off job queueing delays and interference to allow the chosen cost-effective hardware to also be highly performant. We extensively evaluate PALDIA using 16 ML inference workloads with real-world traces on a 6 node heterogeneous cluster. Our results show that PALDIA significantly outperforms state-of-the-art works in terms of Service Level Objective (SLO) compliance (up to 13.3% more) and tail latency (up to ∼50% less), with cost savings up to 86%.
more » « less
Full Text Available
Stash: A Comprehensive Stall-Centric Characterization of Public Cloud VMs for Distributed Deep Learning

https://doi.org/10.1109/ICDCS57875.2023.00023

Sharma, Aakash; Bhasi, Vivek M; Singh, Sonali; Jain, Rishabh; Gunasekaran, Jashwant Raj; Mitra, Subrata; Kandemir, Mahmut Taylan; Kesidis, George; Das, Chita R (July 2023, IEEE)

Deep neural networks (DNNs) are increasingly popular owing to their ability to solve complex problems such as image recognition, autonomous driving, and natural language processing. Their growing complexity coupled with the use of larger volumes of training data (to achieve acceptable accuracy) has warranted the use of GPUs and other accelerators. Such accelerators are typically expensive, with users having to pay a high upfront cost to acquire them. For infrequent use, users can, instead, leverage the public cloud to mitigate the high acquisition cost. However, with the wide diversity of hardware instances (particularly GPU instances) available in public cloud, it becomes challenging for a user to make an appropriate choice from a cost/performance standpoint. In this work, we try to address this problem by (i) introducing a comprehensive distributed deep learning (DDL) profiler Stash, which determines the various execution stalls that DDL suffers from, and (ii) using Stash to extensively characterize various public cloud GPU instances by running popular DNN models on them. Specifically, it estimates two types of communication stalls, namely, interconnect and network stalls, that play a dominant role in DDL execution time. Stash is implemented on top of prior work, DS-analyzer, that computes only the CPU and disk stalls. Using our detailed stall characterization, we list the advantages and shortcomings of public cloud GPU instances for users to help them make an informed decision(s). Our characterization results indicate that the more expensive GPU instances may not be the most performant for all DNN models and that AWS can sometimes sub-optimally allocate hardware interconnect resources. Specifically, the intra-machine interconnect can introduce communication overheads of up to 90% of DNN training time and the network-connected instances can suffer from up to 5× slowdown compared to training on a single instance. Furthermore, (iii) we also model the impact of DNN macroscopic features such as the number of layers and the number of gradients on communication stalls, and finally, (iv) we briefly discuss a cost comparison with existing work.
more » « less
Full Text Available
Pushing the Performance Envelope of DNN-based Recommendation Systems Inference on GPUs

https://doi.org/10.1109/MICRO61859.2024.00091

Jain, Rishabh; Bhasi, Vivek M; Jog, Adwait; Sivasubramaniam, Anand; Kandemir, Mahmut T; Das, Chita R (November 2024, IEEE)

Personalized recommendation is a ubiquitous application on the internet, with many industries and hyperscalers extensively leveraging Deep Learning Recommendation Models (DLRMs) for their personalization needs (like ad serving or movie suggestions). With growing model and dataset sizes pushing computation and memory requirements, GPUs are being increasingly preferred for executing DLRM inference. However, serving newer DLRMs, while meeting acceptable latencies, continues to remain challenging, making traditional deployments increasingly more GPU-hungry, resulting in higher inference serving costs. In this paper, we show that the embedding stage continues to be the primary bottleneck in the GPU inference pipeline, leading up to a 3.2× embedding-only performance slowdown. To thoroughly grasp the problem, we conduct a detailed microarchitecture characterization and highlight the presence of low occupancy in the standard embedding kernels. By leveraging direct compiler optimizations, we achieve optimal occupancy, pushing the performance by up to 53%. Yet, long memory latency stalls continue to exist. To tackle this challenge, we propose specialized plug-and-play-based software prefetching and L2 pinning techniques, which help in hiding and decreasing the latencies. Further, we propose combining them, as they complement each other. Experimental evaluations using A100 GPUs with large models and datasets show that our proposed techniques improve performance by up to 103% for the embedding stage, and up to 77% for the overall DLRM inference pipeline.
more » « less
Full Text Available
Towards SLO-Compliant and Cost-Effective Serverless Computing on Emerging GPU Architectures

https://doi.org/10.1145/3652892.3700760

Bhasi, Vivek M; Sharma, Aakash; Jain, Rishabh; Gunasekaran, Jashwant Raj; Pattnaik, Ashutosh; Kandemir, Mahmut Taylan; Das, Chita (December 2024, ACM)

Full Text Available
Cypress: input size-sensitive container provisioning and request scheduling for serverless platforms

https://doi.org/10.1145/3542929.3563464

Bhasi, Vivek M.; Gunasekaran, Jashwant Raj; Sharma, Aakash; Kandemir, Mahmut Taylan; Das, Chita (November 2022, SoCC '22: Proceedings of the 13th Symposium on Cloud Computing)

The growing popularity of the serverless platform has seen an increase in the number and variety of applications (apps) being deployed on it. The majority of these apps process user-provided input to produce the desired results. Existing work in the area of input-sensitive profiling has empirically shown that many such apps have input size-dependent execution times which can be determined through modelling techniques. Nevertheless, existing serverless resource management frameworks are agnostic to the input size-sensitive nature of these apps. We demonstrate in this paper that this can potentially lead to container over-provisioning and/or end-to-end Service Level Objective (SLO) violations. To address this, we propose Cypress, an input size-sensitive resource management framework, that minimizes the containers provisioned for apps, while ensuring a high degree of SLO compliance. We perform an extensive evaluation of Cypress on top of a Kubernetes-managed cluster using 5 apps from the AWS Serverless Application Repository and/or Open-FaaS Function Store with real-world traces and varied input size distributions. Our experimental results show that Cypress spawns up to 66% fewer containers, thereby, improving container utilization and saving cluster-wide energy by up to 2.95X and 23%, respectively, versus state-of-the-art frameworks, while remaining highly SLO-compliant (up to 99.99%).
more » « less
Full Text Available
Stash: A comprehensive stall-centric characterization of public cloud VMs for distributed deep learning

Sharma, Aakash; Bhasi, Vivek; Singh, Sonali; Jain, Rishabh; Raj, Jashwant; Mitra, Subrata; Kandemir, Mahmut Taylan; Kesidis, George; Das, Chita (January 2023, Proceedings of the International Conference on Distributed Computing Systems)

Deep neural networks (DNNs) are increasingly popular owing to their ability to solve complex problems such as image recognition, autonomous driving, and natural language processing. Their growing complexity coupled with the use of larger volumes of training data (to achieve acceptable accuracy) has warranted the use of GPUs and other accelerators. Such accelerators are typically expensive, with users having to pay a high upfront cost to acquire them. For infrequent use, users can, instead, leverage the public cloud to mitigate the high acquisition cost. However, with the wide diversity of hardware instances (particularly GPU instances) available in public cloud, it becomes challenging for a user to make an appropriate choice from a cost/performance standpoint. In this work, we try to address this problem by (i) introducing a comprehensive distributed deep learning (DDL) profiler Stash, which determines the various execution stalls that DDL suffers from, and (ii) using Stash to extensively characterize various public cloud GPU instances by running popular DNN models on them. Specifically, it estimates two types of communication stalls, namely, interconnect and network stalls, that play a dominant role in DDL execution time. Stash is implemented on top of prior work, DS-analyzer, that computes only the CPU and disk stalls. Using our detailed stall characterization, we list the advantages and shortcomings of public cloud GPU instances for users to help them make an informed decision(s). Our characterization results indicate that the more expensive GPU instances may not be the most performant for all DNN models and that AWS can sometimes sub-optimally allocate hardware interconnect resources. Specifically, the intra-machine interconnect can introduce communication overheads of up to 90% of DNN training time and the network-connected instances can suffer from up to 5× slowdown compared to training on a single instance. Furthermore, (iii) we also model the impact of DNN macroscopic features such as the number of layers and the number of gradients on communication stalls, and finally, (iv) we briefly discuss a cost comparison with existing work.
more » « less
Full Text Available
Kraken: Adaptive Container Provisioning for Deploying Dynamic DAGs in Serverless Platforms

https://doi.org/10.1145/3472883.3486992

Bhasi, Vivek M.; Gunasekaran, Jashwant Raj; Thinakaran, Prashanth; Mishra, Cyan Subhra; Kandemir, Mahmut Taylan; Das, Chita (November 2021, SoCC '21: Proceedings of the ACM Symposium on Cloud Computing)

Full Text Available

Search for: All records