NSF PAR Search | NSF Public Access Repository

SLoRA: Scalable Serving of Thousands of LoRA Adapters

Sheng, Ying; Cao, Shiyi; Li, Dacheng; Hooper, Coleman; Lee, Nicholas; Yang, Shuo; Chou, Christopher; Zhu, Banghua; Zheng, Lianmin; Keutzer, Kurt; et al (May 2024, Proceedings of Machine Learning and Systems 6 (MLSys 2024))

The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present S-LoRA, a system designed for the scalable serving of many LoRA adapters. S-LoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, S-LoRA proposes Unified Paging. Unified Paging uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), S-LoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, S-LoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services. The code is publicly available.
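The unified memory pool described in the abstract can be illustrated with a toy allocator. This is only a minimal sketch under stated assumptions, not S-LoRA's implementation: the class `UnifiedPagePool`, its methods, and the choice of one page per adapter rank and one page per cached token are hypothetical and chosen for illustration.

```python
class UnifiedPagePool:
    """Toy pool of fixed-size pages shared by adapter weights and KV cache.

    Hypothetical illustration: an adapter of rank r is assumed to need r pages
    and a sequence with s cached tokens is assumed to need s pages.
    """

    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.owner_pages = {}  # owner key -> list of page indices it holds

    def allocate(self, owner, num_pages):
        """Give `owner` any `num_pages` free pages; they need not be contiguous."""
        if num_pages > len(self.free_pages):
            raise MemoryError("not enough free pages in the pool")
        pages = [self.free_pages.pop() for _ in range(num_pages)]
        self.owner_pages.setdefault(owner, []).extend(pages)
        return pages

    def release(self, owner):
        """Return all pages held by `owner` to the pool; report how many."""
        pages = self.owner_pages.pop(owner, [])
        self.free_pages.extend(pages)
        return len(pages)


pool = UnifiedPagePool(num_pages=16)
adapter_pages = pool.allocate(("adapter", "a1"), 8)  # assumed rank-8 adapter
kv_pages = pool.allocate(("kv", "request-0"), 5)     # assumed 5 cached tokens
```

Because both kinds of tensors draw from the same page list, pages freed by an evicted adapter can be reused by a growing KV cache (and vice versa) without requiring a contiguous region, which is the fragmentation concern the abstract mentions.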
Full Text Available
22.9 A 12nm 18.1TFLOPs/W Sparse Transformer Processor with Entropy-Based Early Exit, Mixed-Precision Predication and Fine-Grained Power Management

https://doi.org/10.1109/ISSCC42615.2023.10067817

Tambe, Thierry; Zhang, Jeff; Hooper, Coleman; Jia, Tianyu; Whatmough, Paul N.; Zuckerman, Joseph; Santos, Maico Cassel; Loscalzo, Erik Jens; Giri, Davide; Shepard, Kenneth; et al (February 2023, 2023 IEEE International Solid-State Circuits Conference (ISSCC))

Large language models have substantially advanced nuance and context understanding in natural language processing (NLP), further fueling the growth of intelligent conversational interfaces and virtual assistants. However, their hefty computational and memory demands make them potentially expensive to deploy on cloudless edge platforms with strict latency and energy requirements. For example, an inference pass using the state-of-the-art BERT-base model must serially traverse through 12 computationally intensive transformer layers, each layer containing 12 parallel attention heads whose outputs concatenate to drive a large feed-forward network. To reduce computation latency, several algorithmic optimizations have been proposed, e.g., a recent algorithm dynamically matches linguistic complexity with model sizes via entropy-based early exit. Deploying such transformer models on edge platforms requires careful co-design and optimizations from algorithms to circuits, where energy consumption is a key design consideration.
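The entropy-based early exit mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the processor's or the cited algorithm's actual implementation: it assumes each transformer layer has an exit classifier producing logits, and the function names and the threshold value are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def early_exit(layer_logits, threshold):
    """Return (layer_index, predicted_class) for the first confident layer.

    `layer_logits` holds one logit vector per layer (assumed exit classifiers).
    If no layer's entropy falls below `threshold`, the last layer is used.
    """
    for index, logits in enumerate(layer_logits):
        probs = softmax(logits)
        if entropy(probs) < threshold:
            return index, probs.index(max(probs))
    probs = softmax(layer_logits[-1])
    return len(layer_logits) - 1, probs.index(max(probs))


# Two uncertain layers followed by a confident one (made-up logits).
example_logits = [[0.0, 0.0], [0.1, 0.0], [5.0, 0.0]]
exit_layer, predicted = early_exit(example_logits, threshold=0.3)
```

Low entropy means the classifier's output distribution is concentrated on one class, so easy inputs can stop after a few layers while harder ones continue through more of the 12-layer stack.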
Full Text Available
A 16-nm SoC for Noise-Robust Speech and NLP Edge AI Inference With Bayesian Sound Source Separation and Attention-Based DNNs

https://doi.org/10.1109/JSSC.2022.3179303

Tambe, Thierry; Yang, En-Yu; Ko, Glenn G.; Chai, Yuji; Hooper, Coleman; Donato, Marco; Whatmough, Paul N.; Rush, Alexander M.; Brooks, David; Wei, Gu-Yeon (June 2022, IEEE Journal of Solid-State Circuits)

Full Text Available
9.8 A 25mm² SoC for IoT Devices with 18ms Noise-Robust Speech-to-Text Latency via Bayesian Speech Denoising and Attention-Based Sequence-to-Sequence DNN Speech Recognition in 16nm FinFET

https://doi.org/10.1109/ISSCC42613.2021.9366062

Tambe, Thierry; Yang, En-Yu; Ko, Glenn G.; Chai, Yuji; Hooper, Coleman; Donato, Marco; Whatmough, Paul N.; Rush, Alexander M.; Brooks, David; Wei, Gu-Yeon (February 2021, International Solid-State Circuits Conference)

Full Text Available
EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference

https://doi.org/10.1145/3466752.3480095

Tambe, Thierry; Hooper, Coleman; Pentecost, Lillian; Jia, Tianyu; Yang, En-Yu; Donato, Marco; Sanh, Victor; Whatmough, Paul; Rush, Alexander M.; Brooks, David; et al (October 2021, MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture)

Full Text Available
