Proteus: A High-Throughput Inference-Serving System with Accuracy Scaling

Ahmad, Sohaib; Guan, Hui; Friedman, Brian; Williams, Thomas; Sitaraman, Ramesh; Woo, Thomas


Existing machine learning inference-serving systems largely rely on hardware scaling, adding more devices or using more powerful accelerators to handle increasing query demand. However, hardware scaling might not be feasible for fixed-size edge clusters or private clouds, whose hardware resources are limited. A viable alternative is accuracy scaling, which adapts the accuracy of ML models instead of the hardware resources to handle varying query demand. This work studies the design of a high-throughput inference-serving system with accuracy scaling that can meet throughput requirements while maximizing accuracy. To achieve this goal, this work proposes to identify the right amount of accuracy scaling by jointly optimizing three sub-problems: how to select model variants, how to place them on heterogeneous devices, and how to assign query workloads to each device. It also proposes a new adaptive batching algorithm to handle variations in query arrival times and minimize SLO violations. Based on the proposed techniques, we build an inference-serving system called Proteus and empirically evaluate it on real-world and synthetic traces. We show that Proteus reduces accuracy drop by up to 3× and latency timeouts by 2-10× relative to baseline schemes, while meeting throughput requirements.
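
As a rough illustration of accuracy scaling, the sketch below picks one model variant per device so that total capacity covers the query demand while the accuracy of the served queries is as high as possible. It is a toy exhaustive search, not the optimization method used by Proteus, which the abstract does not specify. The variant names, accuracy values, throughput figures, and the function `plan` are all hypothetical.

```python
from itertools import product

# Hypothetical profile data (not from the paper): accuracy of each model variant.
ACCURACY = {"small": 0.70, "medium": 0.76, "large": 0.80}

# Hypothetical peak throughput (queries per second) per device type and variant.
THROUGHPUT = {
    "cpu": {"small": 40, "medium": 20, "large": 10},
    "gpu": {"small": 200, "medium": 120, "large": 60},
}


def plan(devices, demand_qps):
    """Return (accuracy, variants, loads) for the best plan, or None if infeasible.

    Tries every assignment of one variant per device (selection and placement),
    then routes queries to the most accurate devices first (workload assignment).
    """
    best = None
    for choice in product(ACCURACY, repeat=len(devices)):
        caps = [THROUGHPUT[d][v] for d, v in zip(devices, choice)]
        if sum(caps) < demand_qps:
            continue  # this combination cannot meet the throughput requirement
        # Fill the most accurate devices first, up to each device's capacity.
        order = sorted(range(len(devices)),
                       key=lambda i: ACCURACY[choice[i]], reverse=True)
        remaining = demand_qps
        loads = [0] * len(devices)
        for i in order:
            loads[i] = min(caps[i], remaining)
            remaining -= loads[i]
        # Effective accuracy: load-weighted average over all served queries.
        acc = sum(l * ACCURACY[v] for l, v in zip(loads, choice)) / demand_qps
        if best is None or acc > best[0]:
            best = (acc, choice, loads)
    return best


if __name__ == "__main__":
    for demand in (60, 150, 300):
        print(demand, plan(["gpu", "cpu"], demand))
```

With these made-up numbers, low demand can be served entirely by the most accurate variant, higher demand forces a switch to faster and less accurate variants, and demand beyond the cluster's maximum capacity has no feasible plan. Exhaustive search grows exponentially with the number of devices, so it is only meant to make the trade-off concrete.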

Award ID(s): 2338512 2312396 2220211 2224054

PAR ID: 10538823

Author(s) / Creator(s): Ahmad, Sohaib; Guan, Hui; Friedman, Brian; Williams, Thomas; Sitaraman, Ramesh; Woo, Thomas

Publisher / Repository: ASPLOS'24

Date Published: 2024-04-07

ISBN: 979-8-4007-0372-0

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Conference Paper: The DOI is not currently available.
