skip to main content


Title: InferLine: latency-aware provisioning and scaling for prediction serving pipelines
Serving ML prediction pipelines spanning multiple models and hardware accelerators is a key challenge in production machine learning. Optimally configuring these pipelines to meet tight end-to-end latency goals is complicated by the interaction between model batch size, the choice of hardware accelerator, and variation in the query arrival process. In this paper we introduce InferLine, a system which provisions and manages the individual stages of prediction pipelines to meet end-to-end tail latency constraints while minimizing cost. InferLine consists of a low-frequency combinatorial planner and a high-frequency auto-scaling tuner. The low-frequency planner leverages stage-wise profiling, discrete event simulation, and constrained combinatorial search to automatically select hardware type, replication, and batching parameters for each stage in the pipeline. The high-frequency tuner uses network calculus to auto-scale each stage to meet tail latency goals in response to changes in the query arrival process. We demonstrate that InferLine outperforms existing approaches by up to 7.6x in cost while achieving up to 34.5x lower latency SLO miss rate on realistic workloads and generalizes across state-of-the-art model serving frameworks.  more » « less
Award ID(s):
1846431
NSF-PAR ID:
10245792
Author(s) / Creator(s):
; ; ; ; ; ;
Date Published:
Journal Name:
SoCC '20: Proceedings of the 11th ACM Symposium on Cloud Computing
Page Range / eLocation ID:
477 to 491
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Systems for ML inference are widely deployed today, but they typically optimize ML inference workloads using techniques designed for conventional data serving workloads and miss critical opportunities to leverage the statistical nature of ML. In this paper, we present WILLUMP, an optimizer for ML inference that introduces two statistically-motivated optimizations targeting ML applications whose performance bottleneck is feature computation. First, WILLUMP automatically cascades feature computation for classification queries: WILLUMP classifies most data inputs using only high-value, low-cost features selected through empirical observations of ML model performance, improving query performance by up to 5× without statistically significant accuracy loss. Second, WILLUMP accurately approximates ML top-K queries, discarding low-scoring inputs with an automatically constructed approximate model and then ranking the remainder with a more powerful model, improving query performance by up to 10× with minimal accuracy loss. WILLUMP automatically tunes these optimizations’ parameters to maximize query performance while meeting an accuracy target. Moreover, WILLUMP complements these statistical optimizations with compiler optimizations to automatically generate fast inference code for ML applications. We show that WILLUMP improves the end-to-end performance of real-world ML inference pipelines curated from major data science competitions by up to 16× without statistically significant loss of accuracy. 
    more » « less
  2. The advances of Machine Learning (ML) have sparked a growing demand of ML-as-a-Service: developers train ML models and publish them in the cloud as online services to provide low-latency inference at scale. The key challenge of ML model serving is to meet the response-time Service-Level Objectives (SLOs) of inference workloads while minimizing the serving cost. In this paper, we tackle the dual challenge of SLO compliance and cost effectiveness with MArk (Model Ark), a general-purpose inference serving system built in Amazon Web Services (AWS). MArk employs three design choices tailor-made for inference workload. First, MArk dynamically batches requests and opportunistically serves them using expensive hardware accelerators (e.g., GPU) for improved performance-cost ratio. Second, instead of relying on feedback control scaling or over-provisioning to serve dynamic workload, which can be too slow or too expensive for inference serving, MArk employs predictive autoscaling to hide the provisioning latency at low cost. Third, given the stateless nature of inference serving, MArk exploits the flexible, yet costly serverless instances to cover the occasional load spikes that are hard to predict. We evaluated the performance of MArk using several state-of-the-art ML models trained in popular frameworks including TensorFlow, MXNet, and Keras. Compared with the premier industrial ML serving platform SageMaker, MArk reduces the serving cost up to 7.8× while achieving even better latency performance. 
    more » « less
  3. Serverless applications represented as DAGs have been growing in popularity. For many of these applications, it would be useful to estimate the end-to-end (E2E) latency and to allocate resources to individual functions so as to meet probabilistic guarantees for the E2E latency. This goal has not been met till now due to three fundamental challenges. The first is the high variability and correlation in the execution time of individual functions, the second is the skew in execution times of the parallel invocations, and the third is the incidence of cold starts. In this paper, we introduce ORION to achieve these goals. We first analyze traces from a production FaaS infrastructure to identify three characteristics of serverless DAGs. We use these to motivate and design three features. The first is a performance model that accounts for runtime variabilities and dependencies among functions in a DAG. The second is a method for co-locating multiple parallel invocations within a single VM thus mitigating content-based skew among these invocations. The third is a method for pre-warming VMs for subsequent functions in a DAG with the right look-ahead time. We integrate these three innovations and evaluate ORION on AWS Lambda with three serverless DAG applications. Our evaluation shows that compared to three competing approaches, ORION achieves up to 90% lower P95 latency without increasing $ cost, or up to 53% lower $ cost without increasing tail latency. 
    more » « less
  4. Online recommender systems have proven to have ubiquitous applications in various domains. To provide accurate recommendations in real time it is imperative to constantly train and deploy models with the latest data samples. This retraining involves adjusting the model weights by incorporating newly-arrived streaming data into the model to bridge the accuracy gap. To provision resources for the retraining, typically the compute is hosted on VMs, however, due to the dynamic nature of the data arrival patterns, stateless functions would be an ideal alternative over VMs, as they can instantaneously scale on demand. However, it is non-trivial to statically configure the stateless functions because the model retraining exhibits varying resource needs during different phases of retraining. Therefore, it is crucial to dynamically configure the functions to meet the resource requirements, while bridging the accuracy gap. In this paper, we propose Sandpiper, an adaptive framework that leverages stateless functions to deliver accurate predictions at low cost for online recommender systems. The three main ideas in Sandpiper are (i) we design a data-drift monitor that automatically triggers model retraining at required time intervals to bridge the accuracy gap due to incoming data drifts; (ii) we develop an online configuration model that selects the appropriate function configurations while maintaining the model serving accuracy within the latency and cost budget; and (iii) we propose a dynamic synchronization policy for stateless functions to speed up the distributed model retraining leading to cloud cost minimization. A prototype implementation on AWS shows that Sandpiper maintains the average accuracy above 90%, while 3.8× less expensive than the traditional VM-based schemes. 
    more » « less
  5. null (Ed.)
    Low-latency online services have strict Service Level Objectives (SLOs) that require datacenter systems to support high throughput at microsecond-scale tail latency. Dataplane operating systems have been designed to scale up multi-core servers with minimal overhead for such SLOs. However, as application demands continue to increase, scaling up is not enough, and serving larger demands requires these systems to scale out to multiple servers in a rack. We present RackSched, the first rack-level microsecond-scale scheduler that provides the abstraction of a rack-scale computer (i.e., a huge server with hundreds to thousands of cores) to an external service with network-system co-design. The core of RackSched is a two-layer scheduling framework that integrates inter-server scheduling in the top-of-rack (ToR) switch with intra-server scheduling in each server. We use a combination of analytical results and simulations to show that it provides near-optimal performance as centralized scheduling policies, and is robust for both low-dispersion and high-dispersion workloads. We design a custom switch data plane for the inter-server scheduler, which realizes power-of-k- choices, ensures request affinity, and tracks server loads accurately and efficiently. We implement a RackSched prototype on a cluster of commodity servers connected by a Barefoot Tofino switch. End-to-end experiments on a twelve-server testbed show that RackSched improves the throughput by up to 1.44x, and scales out the throughput near linearly, while maintaining the same tail latency as one server until the system is saturated. 
    more » « less