Serverless computing is gaining popularity for machine learning (ML) serving workloads due to its autonomous resource scaling, ease of use, and pay-per-use cost model. Existing serverless platforms work well for image-based ML inference, where requests are homogeneous in service demands. However, recent advances in natural language processing cannot fully benefit from existing serverless platforms, as their requests are intrinsically heterogeneous. Batching requests for processing can significantly increase ML serving efficiency while reducing monetary cost, thanks to the pay-per-use pricing model adopted by serverless platforms. Yet, batching heterogeneous ML requests leads to additional computation overhead, as small requests must be "padded" to the same size as the large requests in their batch. Reaching effective batching decisions (i.e., which requests should be batched together, and why) is non-trivial: the padding overhead coupled with serverless auto-scaling forms a complex optimization problem. To address this, we develop Multi-Buffer Serving (MBS), a framework that optimizes the batching of heterogeneous ML inference requests to minimize their monetary cost while meeting their service level objectives (SLOs). The core of MBS is a performance and cost estimator driven by analytical models and supercharged by a Bayesian optimizer. MBS is prototyped and evaluated on AWS using bursty workloads. Experimental results show that MBS preserves SLOs while outperforming the state of the art by up to 8x in cost savings, reducing padding overhead by up to 37x, and issuing 3x fewer serverless function invocations.
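To make the padding overhead concrete, here is a minimal, illustrative sketch (not MBS code; all names and numbers are invented) of why batching heterogeneous requests together wastes computation, and why size-aware grouping, the intuition behind multiple buffers, reduces it:

```python
def padded_tokens(batch):
    """Tokens actually processed: every request is padded to the batch maximum."""
    return max(batch) * len(batch)

def padding_overhead(batch):
    """Wasted tokens relative to the unpadded total."""
    return padded_tokens(batch) - sum(batch)

# Request sizes (e.g., token counts) of a heterogeneous NLP workload.
requests = [12, 15, 14, 120, 128, 125]

# One mixed batch: the small requests are padded up to 128 tokens.
print(padding_overhead(requests))                         # 354 wasted tokens

# Two size-homogeneous buffers (the multi-buffer intuition): far less padding.
small, large = [12, 15, 14], [120, 128, 125]
print(padding_overhead(small) + padding_overhead(large))  # 15 wasted tokens
```

Fewer, larger, size-matched batches also mean fewer function invocations, which is where the cost savings under pay-per-use pricing come from.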
BATCH: Machine Learning Inference Serving on Serverless Platforms with Adaptive Batching
Serverless computing is a new pay-per-use cloud service paradigm that automates resource scaling for stateless functions and can potentially facilitate bursty machine learning serving. Batching is critical for the latency performance and cost-effectiveness of machine learning inference, but unfortunately it is not supported by existing serverless platforms due to their stateless design. Our experiments show that, without batching, machine learning serving cannot reap the benefits of serverless computing. In this paper, we present BATCH, a framework for supporting efficient machine learning serving on serverless platforms. BATCH uses an optimizer to provide inference tail-latency guarantees and cost optimization, and to enable adaptive batching support. We prototype BATCH atop AWS Lambda and popular machine learning inference systems. The evaluation verifies the accuracy of the analytic optimizer and demonstrates performance and cost advantages over the state-of-the-art method MArk and the state-of-the-practice tool SageMaker.
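As an illustration of the adaptive batching BATCH enables, the sketch below shows a size-or-timeout buffering loop. The dispatcher, parameter names, and defaults are assumptions for the example, not BATCH's actual implementation; its optimizer is what would pick values like `max_batch` and `timeout_s` against the SLO.

```python
import queue
import time

def batching_dispatcher(requests: queue.Queue, invoke, max_batch=8, timeout_s=0.5):
    """Collect requests until max_batch is reached or timeout_s elapses, then invoke once."""
    while True:
        batch = [requests.get()]                # block until the first request arrives
        deadline = time.monotonic() + timeout_s
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        invoke(batch)                           # e.g., one serverless invocation per batch
```

The trade-off the optimizer navigates is visible in the two parameters: a larger `max_batch` or `timeout_s` lowers per-request cost but delays the first request in the batch, eating into the tail-latency budget.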
- PAR ID: 10206149
- Date Published: 2020
- Journal Name: SC20: International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Atlanta, GA, US, 2020, pp. 972-986. doi: 10.1109/SC41405.2020.00073
- Volume: 1
- Page Range / eLocation ID: 972-986
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
The advances of Machine Learning (ML) have sparked a growing demand for ML-as-a-Service: developers train ML models and publish them in the cloud as online services that provide low-latency inference at scale. The key challenge of ML model serving is to meet the response-time Service-Level Objectives (SLOs) of inference workloads while minimizing the serving cost. In this paper, we tackle the dual challenge of SLO compliance and cost effectiveness with MArk (Model Ark), a general-purpose inference serving system built in Amazon Web Services (AWS). MArk employs three design choices tailor-made for inference workloads. First, MArk dynamically batches requests and opportunistically serves them using expensive hardware accelerators (e.g., GPUs) for an improved performance-cost ratio. Second, instead of relying on feedback-control scaling or over-provisioning to serve dynamic workloads, which can be too slow or too expensive for inference serving, MArk employs predictive autoscaling to hide the provisioning latency at low cost. Third, given the stateless nature of inference serving, MArk exploits the flexible, yet costly, serverless instances to cover the occasional load spikes that are hard to predict. We evaluated the performance of MArk using several state-of-the-art ML models trained in popular frameworks including TensorFlow, MXNet, and Keras. Compared with the premier industrial ML serving platform SageMaker, MArk reduces the serving cost by up to 7.8× while achieving even better latency performance.
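A hypothetical sketch of the division of labor MArk describes: reserve long-lived instances for the predicted load and spill unpredicted bursts to serverless. The capacity model, function names, and numbers below are invented for illustration, not taken from MArk.

```python
def plan_capacity(predicted_rps, actual_rps, per_instance_rps=50):
    """Return (reserved instances for the forecast, requests/s spilled to serverless)."""
    reserved = -(-predicted_rps // per_instance_rps)    # ceiling division
    provisioned_rps = reserved * per_instance_rps
    spillover = max(0, actual_rps - provisioned_rps)    # burst beyond the prediction
    return reserved, spillover

# A burst 150 req/s above the forecast is absorbed by serverless instances.
print(plan_capacity(predicted_rps=430, actual_rps=600))  # (9, 150)
```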
Among the variety of applications (apps) being deployed on serverless platforms, apps such as Machine Learning (ML) inference serving can achieve better performance by leveraging accelerators like GPUs. Yet, major serverless providers, despite having GPU-equipped servers, do not offer GPU support for their serverless functions. Given that serverless functions are already deployed on various generations of CPUs, extending this to various (typically more expensive) GPU generations can offer providers a greater range of hardware to serve incoming requests according to the functions and request traffic. Here, providers face the challenge of selecting hardware that reaches a well-proportioned trade-off point between cost and performance. While recent works have attempted to address this, they often fail to do so because they overlook optimization opportunities arising from intelligently leveraging existing GPU sharing mechanisms. To address this point, we devise a heterogeneous serverless framework, PALDIA, which uses a prudent hardware selection policy to acquire capable, cost-effective hardware and performs intelligent request scheduling on it to yield high performance and cost savings. Specifically, our scheduling algorithm employs hybrid spatio-temporal GPU sharing that intelligently trades off job queueing delays and interference to allow the chosen cost-effective hardware to also be highly performant. We extensively evaluate PALDIA using 16 ML inference workloads with real-world traces on a 6-node heterogeneous cluster. Our results show that PALDIA significantly outperforms state-of-the-art works in terms of Service Level Objective (SLO) compliance (up to 13.3% more) and tail latency (up to ∼50% less), with cost savings of up to 86%.
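The hardware-selection trade-off could be sketched as follows; the scoring rule, option list, latencies, and prices are assumptions for the example, not PALDIA's actual policy (which additionally models GPU sharing, queueing, and interference):

```python
def select_hardware(options, slo_ms):
    """options: list of (name, expected_latency_ms, dollars_per_hour)."""
    feasible = [o for o in options if o[1] <= slo_ms]
    if not feasible:
        return min(options, key=lambda o: o[1])    # no option meets the SLO: pick fastest
    return min(feasible, key=lambda o: o[2])       # cheapest SLO-compliant option

# Invented placeholder hardware tiers, including a spatially shared GPU slice.
options = [("cpu", 180.0, 0.10), ("t4-shared", 60.0, 0.18), ("v100", 25.0, 0.90)]
print(select_hardware(options, slo_ms=100.0))      # ('t4-shared', 60.0, 0.18)
```

The point of the sketch is that a shared, mid-tier accelerator can dominate both extremes: it meets the SLO that the CPU misses at a fraction of the dedicated GPU's price.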
Serverless computing allows developers to easily deploy and scale stateless functions in ephemeral workers. As a result, serverless computing has been widely used for many applications, such as computer vision, video processing, and HTML generation. However, we find that the stateless nature of serverless computing wastes many of the important benefits modern language runtimes have to offer. A notable example is the extensive profiling and Just-in-Time (JIT) compilation effort that runtimes implement to achieve acceptable performance of popular high-level languages, such as Java, JavaScript, and Python. Unfortunately, when modern language runtimes are naively adopted in serverless computing, all of these efforts are lost upon worker eviction. Checkpoint-restore methods alleviate the problem by resuming workers from snapshots taken after initialization. However, production-grade language runtimes can take up to thousands of invocations to fully optimize a single function, rendering naive checkpoint-restore policies ineffective. This paper proposes Pronghorn, a snapshot orchestrator for serverless platforms that automatically monitors function performance and decides (1) when it is the right moment to take a snapshot and (2) which snapshot to use for new workers. Pronghorn is agnostic to the underlying platform and JIT runtime, easing its integration into existing runtimes and worker deployment environments (container, virtual machine, etc.). On a set of representative serverless benchmarks, Pronghorn provides end-to-end median latency improvements of 37.2% across 9 out of 13 benchmarks (20-58% latency reduction) when compared to state-of-the-art checkpointing policies.
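One way to picture the "when to snapshot" decision is a plateau test on observed invocation latencies: JIT-compiled functions keep getting faster for many invocations, so snapshotting too early bakes an under-optimized runtime into every future worker. This heuristic is illustrative only, not Pronghorn's actual policy.

```python
def should_snapshot(latencies, window=50, improvement_threshold=0.02):
    """Snapshot once the rolling median latency has stopped improving."""
    if len(latencies) < 2 * window:
        return False                                     # not enough history yet
    recent = sorted(latencies[-window:])[window // 2]    # median of latest window
    prior = sorted(latencies[-2 * window:-window])[window // 2]
    return (prior - recent) / prior < improvement_threshold
```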
Inference on pre-trained neural network models must meet low-latency requirements, which are often at odds with achieving high throughput. Existing deep learning systems use batching to improve throughput, but batching does not perform well when serving Recurrent Neural Networks (RNNs) with dynamic dataflow graphs. We propose the technique of cellular batching, which improves both the latency and throughput of RNN inference. Unlike existing systems that batch a fixed set of dataflow graphs, cellular batching makes batching decisions at the granularity of an RNN "cell" (a subgraph with shared weights) and dynamically assembles a batched cell for execution as requests join and leave the system. We implemented our approach in a system called BatchMaker. Experiments show that BatchMaker achieves much lower latency and higher throughput than existing systems.
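A simplified rendering of the cellular batching idea: execute the shared RNN cell one step at a time and batch whichever requests still have input left, so requests of different lengths can join and leave mid-stream instead of being padded to a common length. The `cell` interface and the toy example are assumptions for the sketch, not BatchMaker's API.

```python
def run_cellular(cell, requests):
    """requests: dict of request id -> token list; the cell consumes one token per step."""
    states = {rid: None for rid in requests}
    step = 0
    while any(step < len(tokens) for tokens in requests.values()):
        active = [rid for rid in requests if step < len(requests[rid])]
        batch = [(states[rid], requests[rid][step]) for rid in active]
        new_states = cell(batch)                 # one batched execution of the shared cell
        for rid, state in zip(active, new_states):
            states[rid] = state
        step += 1
    return states

# Toy "cell" that just counts how many tokens each request has consumed.
toy_cell = lambda batch: [(0 if s is None else s) + 1 for s, _ in batch]
print(run_cellular(toy_cell, {"a": [1, 2, 3], "b": [7]}))  # {'a': 3, 'b': 1}
```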