Title: Proteus: A High-Throughput Inference-Serving System with Accuracy Scaling
Existing machine learning inference-serving systems largely rely on hardware scaling by adding more devices or using more powerful accelerators to handle increasing query demands. However, hardware scaling might not be feasible for fixed-size edge clusters or private clouds due to their limited hardware resources. A viable alternative is accuracy scaling, which adapts the accuracy of ML models instead of hardware resources to handle varying query demands. This work studies the design of a high-throughput inference-serving system with accuracy scaling that can meet throughput requirements while maximizing accuracy. To achieve this goal, this work proposes to identify the right amount of accuracy scaling by jointly optimizing three sub-problems: how to select model variants, how to place them on heterogeneous devices, and how to assign query workloads to each device. It also proposes a new adaptive batching algorithm to handle variations in query arrival times and minimize SLO violations. Based on the proposed techniques, we build an inference-serving system called Proteus and empirically evaluate it on real-world and synthetic traces. We show that Proteus reduces accuracy drop by up to 3× and latency timeouts by 2-10× with respect to baseline schemes, while meeting throughput requirements.
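The three sub-problems named in the abstract (variant selection, placement, and workload assignment) can be illustrated with a toy sketch. This is not Proteus's optimizer: the brute-force enumeration and the greedy load assignment are stand-ins chosen for brevity, and every variant name, accuracy, throughput figure, and device list below is a hypothetical value made up for illustration.

```python
from itertools import product

# Hypothetical profiled variants: name -> (accuracy, {device type: queries/sec}).
VARIANTS = {
    "small": (0.70, {"cpu": 40, "gpu": 400}),
    "medium": (0.76, {"cpu": 15, "gpu": 200}),
    "large": (0.80, {"cpu": 5, "gpu": 90}),
}
# Hypothetical fixed-size cluster: hardware cannot be added, only models swapped.
DEVICES = ["gpu", "gpu", "cpu"]


def plan(demand_qps):
    """Return (effective accuracy, placement, per-device load), or None if
    no placement can sustain demand_qps on the fixed cluster."""
    best = None
    # Selection + placement: try every way of putting one variant on each device.
    for placement in product(VARIANTS, repeat=len(DEVICES)):
        caps = [VARIANTS[v][1][d] for v, d in zip(placement, DEVICES)]
        if sum(caps) < demand_qps:
            continue  # this placement cannot meet the throughput requirement
        # Workload assignment: fill the most accurate variants first.
        order = sorted(range(len(DEVICES)), key=lambda i: -VARIANTS[placement[i]][0])
        remaining = demand_qps
        load = [0] * len(DEVICES)
        for i in order:
            load[i] = min(caps[i], remaining)
            remaining -= load[i]
        # Effective accuracy is the load-weighted average over devices.
        acc = sum(l * VARIANTS[v][0] for l, v in zip(load, placement)) / demand_qps
        if best is None or acc > best[0]:
            best = (acc, placement, load)
    return best


if __name__ == "__main__":
    for qps in (100, 500, 1000):
        print(qps, plan(qps))
```

At low demand the sketch serves every query with the most accurate variant; as demand rises it trades accuracy for throughput, and it returns `None` once even the smallest variants cannot keep up. Exhaustive enumeration only works for a handful of devices, which is why a real system would need a proper optimization formulation.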
Award ID(s):
2338512 2312396 2220211 2224054
PAR ID:
10538823
Publisher / Repository:
ASPLOS'24
ISBN:
979-8-4007-0372-0
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  2. Modern Internet of Things (IoT) applications, from contextual sensing to voice assistants, rely on ML-based training and serving systems using pre-trained models to render predictions. However, real-world IoT environments are diverse, with rich IoT sensors, and need ML models to be personalized for each setting using relatively little training data. Most existing general-purpose ML systems are optimized for specific and dedicated hardware resources and do not adapt to changing resources and different IoT application requirements. To address this gap, we propose MLIoT, an end-to-end Machine Learning System tailored towards supporting the entire lifecycle of IoT applications. MLIoT adapts to different IoT data sources, IoT tasks, and compute resources by automatically training, optimizing, and serving models based on expressive application-specific policies. MLIoT also adapts to changes in IoT environments or compute resources by enabling re-training and updating models served on the fly while maintaining accuracy and performance. Our evaluation across a set of benchmarks shows that MLIoT can handle multiple IoT tasks, each with individual requirements, in a scalable manner while maintaining high accuracy and performance. We compare MLIoT with two state-of-the-art hand-tuned systems and a commercial ML system, showing that MLIoT improves accuracy from 50% - 75% while reducing or maintaining latency.
  3. Power capping is an important technique for high-density servers to safely oversubscribe the power infrastructure in a data center. However, power capping is commonly accomplished by dynamically lowering the server processors’ frequency levels, which can result in degraded application performance. For servers that run important machine learning (ML) applications with Service-Level Objective (SLO) requirements, inference performance such as recognition accuracy must be optimized within a certain latency constraint, which demands high server performance. In order to achieve the best inference accuracy under the desired latency and server power constraints, this paper proposes OptimML, a multi-input-multi-output (MIMO) control framework that jointly controls both inference latency and server power consumption by flexibly adjusting the machine learning model size (and so its required computing resources) when server frequency needs to be lowered for power capping. Our results on a hardware testbed with widely adopted ML frameworks (including PyTorch, TensorFlow, and MXNet) show that OptimML achieves higher inference accuracy compared with several well-designed baselines, while respecting both latency and power constraints. Furthermore, an adaptive control scheme with online model switching and estimation is designed to achieve analytic assurance of control accuracy and system stability, even in the face of significant workload/hardware variations.
  4. Deep learning models are increasingly used for end-user applications, supporting both novel features such as facial recognition, and traditional features, e.g. web search. To accommodate high inference throughput, it is common to host a single pre-trained Convolutional Neural Network (CNN) in dedicated cloud-based servers with hardware accelerators such as Graphics Processing Units (GPUs). However, GPUs can be orders of magnitude more expensive than traditional Central Processing Unit (CPU) servers. These resources could also be under-utilized facing dynamic workloads, which may result in inflated serving costs. One potential way to alleviate this problem is by allowing hosted models to share the underlying resources, which we refer to as multi-tenant inference serving. One of the key challenges is maximizing the resource efficiency for multi-tenant serving given hardware with diverse characteristics, models with unique response time Service Level Agreement (SLA), and dynamic inference workloads. In this paper, we present PERSEUS, a measurement framework that provides the basis for understanding the performance and cost trade-offs of multi-tenant model serving. We implemented PERSEUS in Python atop a popular cloud inference server called Nvidia TensorRT Inference Server. Leveraging PERSEUS, we evaluated the inference throughput and cost for serving various models and demonstrated that multi-tenant model serving led to up to 12% cost reduction. 
  5. Model-serving systems expose machine learning (ML) models to applications programmatically via a high-level API. Cloud platforms use these systems to mask the complexities of optimally managing resources and servicing inference requests across multiple applications. Model serving at the edge is now also becoming increasingly important to support inference workloads with tight latency requirements. However, edge model serving differs substantially from cloud model serving in its latency, energy, and accuracy constraints: these systems must support multiple applications with widely different latency and accuracy requirements on embedded edge accelerators with limited computational and energy resources. To address the problem, this paper presents Dělen, a flexible and adaptive model-serving system for multi-tenant edge AI. Dělen exposes a high-level API that enables individual edge applications to specify a bound at runtime on the latency, accuracy, or energy of their inference requests. We efficiently implement Dělen using conditional execution in multi-exit deep neural networks (DNNs), which enables granular control over inference requests, and evaluate it on a resource-constrained Jetson Nano edge accelerator. We evaluate Dělen's flexibility by implementing state-of-the-art adaptation policies using Dělen’s API, and evaluate its adaptability under different workload dynamics and goals when running single and multiple applications.