skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: AMPS-Inf: Automatic Model Partitioning for Serverless Inference with Cost Efficiency
The salient pay-per-use nature of serverless computing has driven its continuous penetration as an alternative computing paradigm for various workloads. Yet, challenges arise and remain open when shifting machine learning workloads to the serverless environment. Specifically, the restriction on the deployment size over serverless platforms combining with the complexity of neural network models makes it difficult to deploy large models in a single serverless function. In this paper, we aim to fully exploit the advantages of the serverless computing paradigm for machine learning workloads targeting at mitigating management and overall cost while meeting the response-time Service Level Objective (SLO). We design and implement AMPS-Inf, an autonomous framework customized for model inferencing in serverless computing. Driven by the cost-efficiency and timely-response, our proposed AMPS-Inf automatically generates the optimal execution and resource provisioning plans for inference workloads. The core of AMPS-Inf relies on the formulation and solution of a Mixed-Integer Quadratic Programming problem for model partitioning and resource provisioning with the objective of minimizing cost without violating response time SLO. We deploy AMPS-Inf on the AWS Lambda platform, evaluate with the state-of-the-art pre-trained models in Keras including ResNet50, Inception-V3 and Xception, and compare with Amazon SageMaker and three baselines. Experimental results demonstrate that AMPSInf achieves up to 98% cost saving without degrading response time performance.  more » « less
Award ID(s):
2019511
PAR ID:
10338304
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Proceedings of the 50th International Conference on Parallel Processing (ICPP 2021)
Page Range / eLocation ID:
1 to 12
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Serverless computing has become increasingly popular for cloud applications, due to its compelling properties of high-level abstractions, lightweight runtime, high elasticity and pay-per-use billing. In this revolutionary computing paradigm shift, challenges arise when adapting data analytics applications to the serverless environment, due to the lack of support for efficient state sharing, which attract ever-growing research attention. In this paper, we aim to exploit the advantages of task level orchestration and fine-grained resource provisioning for data analytics on serverless platforms, with the hope of fulfilling the promise of serverless deployment to the maximum extent. To this end, we present ACTS, an autonomous cost-efficient task orchestration framework for serverless analytics. ACTS judiciously schedules and coordinates function tasks to mitigate cold-start latency and state sharing overhead. In addition, ACTS explores the optimization space of fine-grained workload distribution and function resource configuration for cost efficiency. We have deployed and implemented ACTS on AWS Lambda, evaluated with various data analytics workloads. Results from extensive experiments demonstrate that ACTS achieves up to 98% monetary cost reduction while maintaining superior job completion time performance, in comparison with the state-of-the-art baselines. 
    more » « less
  2. The advances of Machine Learning (ML) have sparked a growing demand of ML-as-a-Service: developers train ML models and publish them in the cloud as online services to provide low-latency inference at scale. The key challenge of ML model serving is to meet the response-time Service-Level Objectives (SLOs) of inference workloads while minimizing the serving cost. In this paper, we tackle the dual challenge of SLO compliance and cost effectiveness with MArk (Model Ark), a general-purpose inference serving system built in Amazon Web Services (AWS). MArk employs three design choices tailor-made for inference workload. First, MArk dynamically batches requests and opportunistically serves them using expensive hardware accelerators (e.g., GPU) for improved performance-cost ratio. Second, instead of relying on feedback control scaling or over-provisioning to serve dynamic workload, which can be too slow or too expensive for inference serving, MArk employs predictive autoscaling to hide the provisioning latency at low cost. Third, given the stateless nature of inference serving, MArk exploits the flexible, yet costly serverless instances to cover the occasional load spikes that are hard to predict. We evaluated the performance of MArk using several state-of-the-art ML models trained in popular frameworks including TensorFlow, MXNet, and Keras. Compared with the premier industrial ML serving platform SageMaker, MArk reduces the serving cost up to 7.8× while achieving even better latency performance. 
    more » « less
  3. Recent years have witnessed increasing interest in machine learning (ML) inferences on serverless computing due to its auto-scaling and cost-effective properties. However, one critical aspect, function granularity, has been largely overlooked, limiting the potential of serverless ML. This paper explores the impact of function granularity on serverless ML, revealing its important effects on the SLO hit rates and resource costs of serverless applications. It further proposes adaptive granularity as an approach to addressing the phenomenon that no single granularity fits all applications and situations. It explores three predictive models and presents programming tools and runtime extensions to facilitate the integration of adaptive granularity into existing serverless platforms. Experiments show adaptive granularity produces up to a 29.2% improvement in SLO hit rates and up to a 24.6% reduction in resource costs over the state-of-the-art serverless ML which uses fixed granularity. 
    more » « less
  4. Serverless computing is an emerging event-driven programming model that accelerates the development and deployment of scalable web services on cloud computing systems. Though widely integrated with the public cloud, serverless computing use is nascent for edge-based, IoT deployments. In this work, we design and develop STOIC (Serverless TeleOperable HybrId Cloud), an IoT application deployment and offloading system that extends the serverless model in three ways. First, STOIC adopts a dynamic feedback control mechanism to precisely predict latency and dispatch workloads uniformly across edge and cloud systems using a distributed serverless framework. Second, STOIC leverages hardware acceleration (e.g. GPU resources) for serverless function execution when available from the underlying cloud system. Third, STOIC can be configured in multiple ways to overcome deployment variability associated with public cloud use. Finally, we empirically evaluate STOIC using real-world machine learning applications and multi-tier IoT deployments (edge and cloud). We show that STOIC can be used for training image processing workloads (for object recognition) – once thought too resource intensive for edge deployments. We find that STOIC reduces overall execution time (response latency) and achieves placement accuracy that ranges from 92% to 97%. 
    more » « less
  5. Serverless computing is a new pay-per-use cloud service paradigm that automates resource scaling for stateless functions and can potentially facilitate bursty machine learning serving. Batching is critical for latency performance and cost-effectiveness of machine learning inference, but unfortunately it is not supported by existing serverless platforms due to their stateless design. Our experiments show that without batching, machine learning serving cannot reap the benefits of serverless computing. In this paper, we present BATCH, a framework for supporting efficient machine learning serving on serverless platforms. BATCH uses an optimizer to provide inference tail latency guarantees and cost optimization and to enable adaptive batching support. We prototype BATCH atop of AWS Lambda and popular machine learning inference systems. The evaluation verifies the accuracy of the analytic optimizer and demonstrates performance and cost advantages over the state-of-the-art method MArk and the state-of-the-practice tool SageMaker. 
    more » « less