NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal

Xie, Tinghao; Qi, Xiangyu; Zeng, Yi; Huang, Yangsibo; Sehwag, Udari; Huang, Kaixuan; He, Luxi; Wei, Boyi; Li, Dacheng; Sheng, Ying; et al (January 2025, International Conference on Learning Representations (ICLR))

Free, publicly-accessible full text available January 22, 2026
Does compressing activations help model parallel training?

Bian, Song; Li, Dacheng; Wang, Hongyi; Xing, Eric P; Venkataraman, Shivaram (May 2024, Seventh Annual Conference on Machine Learning and Systems)

Foundation models have superior performance across a wide array of machine learning tasks. The training of these models typically involves model parallelism (MP) to navigate the constraints of GPU memory capacity. However, MP strategies involve transmitting model activations between GPUs, which can hinder training speed in large clusters. Previous research has examined gradient compression in data-parallel contexts, but its applicability in MP settings remains largely unexplored. In this paper, we investigate the unique characteristics of compression in MP and study why strategies from gradient compression might not be directly applicable to MP scenarios. Subsequently, to systematically understand the capabilities and limitations of Model Parallelism Compression, we present a benchmarking framework MCBench. MCBench not only includes four major categories of compression algorithms but also includes several widely used models spanning language and vision tasks on a well-established distributed training framework, Megatron-LM. We initiate the first comprehensive empirical study by using MCBench. Our empirical study encompasses both the fine-tuning and pre-training of FMs. We probe over 200 unique training configurations and present results using 10 widely used datasets. To comprehend the scalability of compression advantages with the expansion of model size and cluster size, we propose a novel cost model designed specifically for training with MP compression. The insights derived from our findings can help direct the future development of new MP compression algorithms for distributed training. Our code is available at https://github.com/uw-mad-dash/MCBench
more » « less
Full Text Available
Fairness in Serving Large Language Models

Sheng, Ying; Cao, Shiyi; Li, Dacheng; Zhu, Banghua; Li, Zhuohan; Zhuo, Danyang; Gonzalez, Joseph E; Stoica, Ion (July 2024, 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24))

High-demand LLM inference services (e.g., ChatGPT and BARD) support a wide range of requests from short chat conversations to long document reading. To ensure that all client requests are processed fairly, most major LLM inference services have request rate limits, to ensure that no client can dominate the request queue. However, this rudimentary notion of fairness also results in under-utilization of the resources and poor client experience when there is spare capacity. While there is a rich literature on fair scheduling, serving LLMs presents new challenges due to their unpredictable request lengths and their unique batching characteristics on parallel accelerators. This paper introduces the definition of LLM serving fairness based on a cost function that accounts for the number of input and output tokens processed. To achieve fairness in serving, we propose a novel scheduling algorithm, the Virtual Token Counter (VTC), a fair scheduler based on the continuous batching mechanism. We prove a 2× tight upper bound on the service difference between two backlogged clients, adhering to the requirement of work-conserving. Through extensive experiments, we demonstrate the superior performance of VTC in ensuring fairness, especially in contrast to other baseline methods, which exhibit shortcomings under various conditions. The reproducible code is available at https://github.com/Ying1123/VTC-artifact.
more » « less
Full Text Available
SLoRA: Scalable Serving of Thousands of LoRA Adapters

Sheng, Ying; Cao, Shiyi; Li, Dacheng; Hooper, Coleman; Lee, Nicholas; Yang, Shuo; Chou, Christopher; Zhu, Banghua; Zheng, Lianmin; Keutzer, Kurt; et al (May 2024, Proceedings of Machine Learning and Systems 6 (MLSys 2024))

The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present S-LoRA, a system designed for the scalable serving of many LoRA adapters. S-LoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, S-LoRA proposes Unified Paging. Unified Paging uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), S-LoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, S-LoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services. The code is available at this https URL
more » « less
Full Text Available
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Zheng, Lianmin; Chiang, Wei-Lin; Sheng, Ying; Zhuang, Siyuan; Wu, Zhanghao; Zhuang, Yonghao; Lin, Zi; Li, Zhuohan; Li, Dacheng; Xing, Eric_P; et al (June 2023, arXiv)

Full Text Available
Dual Contradistinctive Generative Autoencoder

Parmar, Gaurav; Li, Dacheng; Lee, Kwonjoon; Tu, Zhuowen (June 2021, IEEE Computer Society Conference on Computer Vision and Pattern Recognition)
null (Ed.)
Full Text Available

Search for: All records