AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving.

Li, Zhuohan; Lianmin, Zheng; Zhong, Yinmin; Liu, Vincent; Sheng, Ying; Jin, Xin; Huang, Yanping; Chen, Zhifeng; Zhang, Hao; Gonzalez, Joseph; Stoica, Ion

Citation Details

Model parallelism is conventionally viewed as a method to scale a single large deep learning model beyond the memory limits of a single device. In this paper, we demonstrate that model parallelism can be additionally used for the statistical multiplexing of multiple devices when serving multiple models, even when a single model can fit into a single device. Our work reveals a fundamental trade-off between the overhead introduced by model parallelism and the opportunity to exploit statistical multiplexing to reduce serving latency in the presence of bursty workloads. We explore the new trade-off space and present a novel serving system, AlpaServe, that determines an efficient strategy for placing and parallelizing collections of large deep learning models across a distributed cluster. Evaluation results on production workloads show that AlpaServe can process requests at up to 10× higher rates or 6× more burstiness while staying within latency constraints for more than 99% of requests. more »

Award ID(s):: 1730628

PAR ID:: 10523926

Author(s) / Creator(s):: Li, Zhuohan; Lianmin, Zheng; Zhong, Yinmin; Liu, Vincent; Sheng, Ying; Jin, Xin; Huang, Yanping; Chen, Zhifeng; Zhang, Hao; Gonzalez, Joseph; Stoica, Ion

Publisher / Repository:: USENIX Association

Date Published:: 2023-07-12

ISBN:: 978-1-939133-34-2

Format(s):: Medium: X

Sponsoring Org:: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript1.0
Conference Paper:
The DOI is not currently available.

More Like this