NSF PAR Search | NSF Public Access Repository


Fairness in Serving Large Language Models

Sheng, Ying; Cao, Shiyi; Li, Dacheng; Zhu, Banghua; Li, Zhuohan; Zhuo, Danyang; Gonzalez, Joseph E; Stoica, Ion (July 2024, 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24))

High-demand LLM inference services (e.g., ChatGPT and Bard) support a wide range of requests, from short chat conversations to long document reading. To process all client requests fairly, most major LLM inference services impose request rate limits so that no client can dominate the request queue. However, this rudimentary notion of fairness also results in under-utilization of resources and a poor client experience when there is spare capacity. While there is a rich literature on fair scheduling, serving LLMs presents new challenges due to unpredictable request lengths and the unique batching characteristics of parallel accelerators. This paper introduces a definition of LLM serving fairness based on a cost function that accounts for the number of input and output tokens processed. To achieve fairness in serving, we propose a novel scheduling algorithm, the Virtual Token Counter (VTC), a fair scheduler based on the continuous batching mechanism. We prove a 2× tight upper bound on the service difference between two backlogged clients while satisfying the work-conserving requirement. Through extensive experiments, we demonstrate the superior performance of VTC in ensuring fairness compared with baseline methods, which exhibit shortcomings under various conditions. The reproducible code is available at https://github.com/Ying1123/VTC-artifact.
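The token-based cost function and per-client counters described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `VirtualTokenCounter`, the weights `w_in` and `w_out`, and the counter-lift rule are assumptions made for illustration, and the full cost is charged at dispatch even though a real server does not know output lengths in advance and schedules within continuous batching.

```python
from collections import deque


class VirtualTokenCounter:
    """Toy fair scheduler: always serve the backlogged client with the least service so far."""

    def __init__(self, w_in=1.0, w_out=2.0):
        # Hypothetical weights for the cost of one input token and one output token.
        self.w_in = w_in
        self.w_out = w_out
        self.counters = {}  # client -> weighted tokens served so far
        self.queues = {}    # client -> deque of (n_in, n_out) requests

    def submit(self, client, n_in, n_out):
        queue = self.queues.setdefault(client, deque())
        if not queue:
            # Assumed counter lift: a client that was idle may not bank unused
            # service, so raise its counter to the minimum over backlogged clients.
            active = [self.counters[c] for c, q in self.queues.items() if q]
            floor = min(active) if active else 0.0
            self.counters[client] = max(self.counters.get(client, 0.0), floor)
        queue.append((n_in, n_out))

    def next_request(self):
        backlogged = [c for c, q in self.queues.items() if q]
        if not backlogged:
            return None  # nothing is waiting
        # Work-conserving: some request is dispatched whenever any is waiting.
        client = min(backlogged, key=lambda c: self.counters[c])
        n_in, n_out = self.queues[client].popleft()
        # Simplification: charge input and output cost together at dispatch.
        self.counters[client] += self.w_in * n_in + self.w_out * n_out
        return client, n_in, n_out
```

With two clients submitting equal-cost requests, repeated calls to `next_request()` alternate between them until one queue drains, after which the remaining client receives all of the capacity.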
Full Text Available
SLoRA: Scalable Serving of Thousands of LoRA Adapters

Sheng, Ying; Cao, Shiyi; Li, Dacheng; Hooper, Coleman; Lee, Nicholas; Yang, Shuo; Chou, Christopher; Zhu, Banghua; Zheng, Lianmin; Keutzer, Kurt; et al (May 2024, Proceedings of Machine Learning and Systems 6 (MLSys 2024))

The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present S-LoRA, a system designed for the scalable serving of many LoRA adapters. S-LoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, S-LoRA proposes Unified Paging. Unified Paging uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), S-LoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, S-LoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services. The code is available at this https URL
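The idea of one memory pool shared by adapter weights and KV cache tensors can be sketched as follows. This is an illustrative toy, not S-LoRA's code: the class `UnifiedPagePool`, its method names, and the sizing rule (one page per unit of adapter rank and one page per cached token) are assumptions chosen to show why non-contiguous, fixed-size pages limit fragmentation.

```python
class UnifiedPagePool:
    """Toy pool of fixed-size pages shared by adapter weights and KV cache entries."""

    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.owner = {}  # handle -> list of page ids held by that handle

    def allocate(self, handle, n_pages):
        if handle in self.owner:
            raise ValueError(f"{handle!r} already holds pages")
        if n_pages > len(self.free_pages):
            return None  # not enough room: the caller must evict or wait
        # Pages need not be contiguous, so any free page can satisfy any request.
        pages = [self.free_pages.pop() for _ in range(n_pages)]
        self.owner[handle] = pages
        return pages

    def release(self, handle):
        # Freed pages return to the shared pool and can serve either kind of tensor.
        self.free_pages.extend(self.owner.pop(handle))

    def load_adapter(self, name, rank):
        # Assumption: an adapter of rank r occupies r pages.
        return self.allocate(("adapter", name), rank)

    def reserve_kv(self, request_id, seq_len):
        # Assumption: a KV cache for seq_len tokens occupies seq_len pages.
        return self.allocate(("kv", request_id), seq_len)
```

Because both tensor types draw from the same free list, releasing an adapter immediately frees space for a longer sequence, and the reverse also holds.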
Full Text Available
Efficient Memory Management for Large Language Model Serving with PagedAttention

https://doi.org/10.1145/3600006.3613165

Kwon, Woosuk; Li, Zhuohan; Zhuang, Siyuan; Sheng, Ying; Zheng, Lianmin; Yu, Cody Hao; Gonzalez, Joseph; Zhang, Hao; Stoica, Ion (October 2023, ACM)

Full Text Available
Leveraging Cloud Computing to Make Autonomous Vehicles Safer

https://doi.org/10.1109/IROS55552.2023.10341821

Schafhalter, Peter; Kalra, Sukrit; Xu, Le; Gonzalez, Joseph E; Stoica, Ion (October 2023, IEEE)

Full Text Available
D3: a dynamic deadline-driven approach for building autonomous vehicles

https://doi.org/10.1145/3492321.3519576

Gog, Ionel; Kalra, Sukrit; Schafhalter, Peter; Gonzalez, Joseph E.; Stoica, Ion (March 2022, EuroSys '22: Proceedings of the Seventeenth European Conference on Computer Systems)

Autonomous vehicles (AVs) must drive across a variety of challenging environments that impose continuously-varying deadlines and runtime-accuracy tradeoffs on their software pipelines. A deadline-driven execution of such AV pipelines requires a new class of systems that enable the computation to maximize accuracy under dynamically-varying deadlines. Designing these systems presents interesting challenges that arise from combining ease-of-development of AV pipelines with deadline specification and enforcement mechanisms. Our work addresses these challenges through D3 (Dynamic Deadline-Driven), a novel execution model that centralizes deadline management and allows applications to adjust their computation by modeling missed deadlines as exceptions. Further, we design and implement ERDOS, an open-source realization of D3 for AV pipelines that exposes fine-grained execution events to applications, and provides mechanisms to speculatively execute computation and enforce deadlines between an arbitrary set of events. Finally, we address the crucial lack of AV benchmarks through our state-of-the-art open-source AV pipeline, Pylot, that works seamlessly across simulators and real AVs. We evaluate the efficacy of D3 and ERDOS by driving Pylot across challenging driving scenarios spanning 50 km, and observe a 68% reduction in collisions as compared to prior execution models.
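Modeling a missed deadline as an exception can be shown with a small sketch. It does not use the ERDOS API: `DeadlineMissed`, `run_with_deadline`, and `plan` are hypothetical names, the deadline is checked cooperatively between stages rather than enforced preemptively, and the clock is passed in so the behavior is deterministic.

```python
class DeadlineMissed(Exception):
    """Raised when the deadline passes; carries the best result computed so far."""

    def __init__(self, partial):
        super().__init__("deadline missed")
        self.partial = partial


def run_with_deadline(stages, deadline, clock):
    """Run refinement stages in order, checking the deadline before each stage."""
    result = None
    for stage in stages:
        if clock() >= deadline:
            raise DeadlineMissed(result)
        result = stage(result)  # each stage refines the previous result
    return result


def plan(stages, deadline, clock, fallback):
    """Application-level handler: trade accuracy for timeliness on a miss."""
    try:
        return run_with_deadline(stages, deadline, clock)
    except DeadlineMissed as miss:
        # Use the partial result if one exists, otherwise a safe default.
        return miss.partial if miss.partial is not None else fallback
```

The handler is where the application adjusts its computation: a tight deadline yields the coarse result or the fallback, while a loose deadline lets every refinement stage run.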
Full Text Available
Cloudburst: stateful functions-as-a-service

https://doi.org/10.14778/3407790.3407836

Sreekanti, Vikram; Wu, Chenggang; Lin, Xiayue Charles; Schleier-Smith, Johann; Gonzalez, Joseph E.; Hellerstein, Joseph M.; Tumanov, Alexey (August 2020, Proceedings of the VLDB Endowment)
Full Text Available
FBNetV2: Differentiable Neural Architecture Search for Spatial and Channel Dimensions

https://doi.org/10.1109/CVPR42600.2020.01298

Wan, Alvin; Dai, Xiaoliang; Zhang, Peizhao; He, Zijian; Tian, Yuandong; Xie, Saining; Wu, Bichen; Yu, Matthew; Xu, Tao; Chen, Kan; et al (June 2020, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
Differentiable Neural Architecture Search (DNAS) has demonstrated great success in designing state-of-the-art, efficient neural networks. However, DARTS-based DNAS's search space is small when compared to other search methods', since all candidate network layers must be explicitly instantiated in memory. To address this bottleneck, we propose a memory- and computationally-efficient DNAS variant: DMaskingNAS. This algorithm expands the search space by up to 10^14x over conventional DNAS, supporting searches over spatial and channel dimensions that are otherwise prohibitively expensive: input resolution and number of filters. We propose a masking mechanism for feature map reuse, so that memory and computational costs stay nearly constant as the search space expands. Furthermore, we employ effective shape propagation to maximize per-FLOP or per-parameter accuracy. The searched FBNetV2s yield state-of-the-art performance when compared with all previous architectures. With up to 421x less search cost, DMaskingNAS finds models with 0.9% higher accuracy and 15% fewer FLOPs than MobileNetV3-Small, and models with similar accuracy but 20% fewer FLOPs than EfficientNet-B0. Furthermore, our FBNetV2 outperforms MobileNetV3 by 2.6% in accuracy, with equivalent model size. FBNetV2 models are open-sourced at https://github.com/facebookresearch/mobile-vision.
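The masking mechanism for searching over the number of filters can be sketched in plain Python. This is a simplified illustration rather than the released FBNetV2 code: the function names are hypothetical, a feature map is reduced to one value per channel, and an ordinary softmax stands in for the sampling-based relaxation a differentiable search would typically use.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def channel_mask(candidates, logits, max_channels):
    """Blend one binary mask per candidate channel count into a single soft mask."""
    weights = softmax(logits)
    mask = [0.0] * max_channels
    for k, w in zip(candidates, weights):
        # The candidate with k channels keeps the first k channels and zeros the rest.
        for c in range(k):
            mask[c] += w
    return mask


def masked_output(features, candidates, logits):
    """Compute the widest layer once, then apply the blended mask to its output."""
    mask = channel_mask(candidates, logits, len(features))
    return [f * m for f, m in zip(features, mask)]
```

Only the output of the widest candidate is ever computed, so adding more candidate channel counts adds mask entries rather than additional feature maps, which is why memory stays nearly constant as the search space grows.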
Full Text Available
InferLine: latency-aware provisioning and scaling for prediction serving pipelines

https://doi.org/10.1145/3419111.3421285

Crankshaw, Daniel; Sela, Gur-Eyal; Mo, Xiangxi; Zumar, Corey; Stoica, Ion; Gonzalez, Joseph; Tumanov, Alexey (January 2020, SoCC '20: Proceedings of the 11th ACM Symposium on Cloud Computing)
Serving ML prediction pipelines spanning multiple models and hardware accelerators is a key challenge in production machine learning. Optimally configuring these pipelines to meet tight end-to-end latency goals is complicated by the interaction between model batch size, the choice of hardware accelerator, and variation in the query arrival process. In this paper we introduce InferLine, a system which provisions and manages the individual stages of prediction pipelines to meet end-to-end tail latency constraints while minimizing cost. InferLine consists of a low-frequency combinatorial planner and a high-frequency auto-scaling tuner. The low-frequency planner leverages stage-wise profiling, discrete event simulation, and constrained combinatorial search to automatically select hardware type, replication, and batching parameters for each stage in the pipeline. The high-frequency tuner uses network calculus to auto-scale each stage to meet tail latency goals in response to changes in the query arrival process. We demonstrate that InferLine outperforms existing approaches by up to 7.6x in cost while achieving up to 34.5x lower latency SLO miss rate on realistic workloads and generalizes across state-of-the-art model serving frameworks.
Full Text Available
Train Big, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers

Li, Zhuohan; Wallace, Eric; Shen, Sheng; Lin, Kevin; Keutzer, Kurt; Klein, Dan; Gonzalez, Joseph E. (January 2020, Proceedings of the International Conference on Machine Learning (ICML))
Since hardware resources are limited, the objective of training deep learning models is typically to maximize accuracy subject to the time and memory constraints of training and inference. We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute: self-supervised pretraining and high-resource machine translation. We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps. Moreover, this acceleration in convergence typically outpaces the additional computational overhead of using larger models. Therefore, the most compute-efficient training strategy is to counterintuitively train extremely large models but stop after a small number of iterations. This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models. However, we show that large models are more robust to compression techniques such as quantization and pruning than small models. Consequently, one can get the best of both worlds: heavily compressed, large models achieve higher accuracy than lightly compressed, small models.
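The two compression techniques named in this abstract, pruning and quantization, can be illustrated generically. The sketch below shows textbook magnitude pruning and uniform quantization on a flat list of weights; the function names are hypothetical and it makes no claim to match the exact procedures used in the paper.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def quantize_uniform(weights, bits):
    """Round each weight to one of 2**bits evenly spaced levels over [min, max]."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)  # a constant tensor needs no rounding
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    return [lo + round((w - lo) / scale) * scale for w in weights]
```

Raising `sparsity` or lowering `bits` compresses more heavily; the abstract's claim is that a large model tolerates heavier settings of this kind than a small model does.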
Full Text Available
Train Big, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers

Li, Z.; Wallace, E.; Shen, S.; Lin, K.; Keutzer, K.; Klein, D.; Gonzalez, J.E. (January 2020, Proceedings of Machine Learning Research)
Daumé III, Hal (Ed.)
Full Text Available
