NSF PAR Search | NSF Public Access Repository

SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal

Xie, Tinghao; Qi, Xiangyu; Zeng, Yi; Huang, Yangsibo; Sehwag, Udari; Huang, Kaixuan; He, Luxi; Wei, Boyi; Li, Dacheng; Sheng, Ying; et al (January 2025, International Conference on Learning Representations (ICLR))

Free, publicly-accessible full text available January 22, 2026
Fairness in Serving Large Language Models

Sheng, Ying; Cao, Shiyi; Li, Dacheng; Zhu, Banghua; Li, Zhuohan; Zhuo, Danyang; Gonzalez, Joseph E; Stoica, Ion (July 2024, 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24))

High-demand LLM inference services (e.g., ChatGPT and BARD) support a wide range of requests from short chat conversations to long document reading. To ensure that all client requests are processed fairly, most major LLM inference services have request rate limits, to ensure that no client can dominate the request queue. However, this rudimentary notion of fairness also results in under-utilization of the resources and poor client experience when there is spare capacity. While there is a rich literature on fair scheduling, serving LLMs presents new challenges due to their unpredictable request lengths and their unique batching characteristics on parallel accelerators. This paper introduces the definition of LLM serving fairness based on a cost function that accounts for the number of input and output tokens processed. To achieve fairness in serving, we propose a novel scheduling algorithm, the Virtual Token Counter (VTC), a fair scheduler based on the continuous batching mechanism. We prove a 2× tight upper bound on the service difference between two backlogged clients, adhering to the requirement of work-conserving. Through extensive experiments, we demonstrate the superior performance of VTC in ensuring fairness, especially in contrast to other baseline methods, which exhibit shortcomings under various conditions. The reproducible code is available at https://github.com/Ying1123/VTC-artifact.
Full Text Available
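
As a rough illustration of the counter-based fair scheduling described in the abstract above, the Python sketch below keeps one virtual counter per client, always serves the backlogged client with the smallest counter, and charges each request a weighted sum of its input and output tokens. The weights W_IN and W_OUT, the class and method names, and the simplification of charging output tokens up front (a real server only learns the output length as tokens are generated) are assumptions made for illustration. This is not the paper's VTC algorithm or its artifact code.

```python
from collections import deque

# Hypothetical per-token weights; the abstract only says the cost
# accounts for both input and output tokens.
W_IN, W_OUT = 1.0, 2.0


def cost(n_in, n_out):
    """Weighted token cost of one request (assumed linear form)."""
    return W_IN * n_in + W_OUT * n_out


class CounterScheduler:
    """Simplified counter-based fair scheduler (illustrative only)."""

    def __init__(self):
        self.counters = {}  # client -> service charged so far
        self.queues = {}    # client -> pending (n_in, n_out) requests

    def submit(self, client, n_in, n_out):
        self.counters.setdefault(client, 0.0)
        self.queues.setdefault(client, deque()).append((n_in, n_out))

    def next_request(self):
        # Work-conserving: serve someone whenever any queue is non-empty.
        backlogged = [c for c, q in self.queues.items() if q]
        if not backlogged:
            return None
        # Pick the least-served client; break ties by client name.
        client = min(backlogged, key=lambda c: (self.counters[c], c))
        n_in, n_out = self.queues[client].popleft()
        self.counters[client] += cost(n_in, n_out)
        return client, n_in, n_out


def drain(scheduler):
    """Return the order in which clients are served until all queues empty."""
    order = []
    while (item := scheduler.next_request()) is not None:
        order.append(item[0])
    return order
```

With one client sending a few long requests and another sending short ones, the short requests are served back to back until the counters roughly even out, rather than alternating strictly by request count.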
S-LoRA: Scalable Serving of Thousands of LoRA Adapters

Sheng, Ying; Cao, Shiyi; Li, Dacheng; Hooper, Coleman; Lee, Nicholas; Yang, Shuo; Chou, Christopher; Zhu, Banghua; Zheng, Lianmin; Keutzer, Kurt; et al (May 2024, Proceedings of Machine Learning and Systems 6 (MLSys 2024))

The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present S-LoRA, a system designed for the scalable serving of many LoRA adapters. S-LoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, S-LoRA proposes Unified Paging. Unified Paging uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), S-LoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, S-LoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services. The code is available at this https URL
Full Text Available
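
The unified memory pool idea in the abstract above can be pictured with a small Python sketch: a single pool of fixed-size pages is shared by adapter weights and KV cache entries, so either kind of tensor can take any free page and the pages need not be contiguous. The class name, the one-page-per-rank and one-page-per-token sizing, and the bookkeeping are assumptions for illustration only; they are not taken from the S-LoRA code.

```python
class UnifiedPool:
    """Toy page pool shared by adapter weights and KV cache (illustrative)."""

    def __init__(self, num_pages):
        self.free = list(range(num_pages))  # indices of free pages
        self.owner = {}                     # page -> (kind, key)

    def alloc(self, kind, key, n_pages):
        """Reserve n_pages for an object; pages may be non-contiguous."""
        if n_pages > len(self.free):
            raise MemoryError("pool exhausted")
        pages = [self.free.pop() for _ in range(n_pages)]
        for page in pages:
            self.owner[page] = (kind, key)
        return pages

    def release(self, pages):
        """Return pages to the pool so any kind of object can reuse them."""
        for page in pages:
            del self.owner[page]
            self.free.append(page)


pool = UnifiedPool(num_pages=8)
# Assumed sizing: a rank-4 adapter takes 4 pages, a 3-token sequence takes 3.
adapter_pages = pool.alloc("adapter", "adapter-0", 4)
kv_pages = pool.alloc("kv", "request-0", 3)
free_after_alloc = len(pool.free)
# Evicting the adapter frees pages that a longer sequence can then use.
pool.release(adapter_pages)
longer_kv_pages = pool.alloc("kv", "request-1", 5)
```

Because both kinds of object draw from the same free list, memory released by an evicted adapter is immediately usable for KV cache growth, which is the fragmentation benefit the abstract alludes to.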
Combining Stable Infiniteness and (Strong) Politeness

https://doi.org/10.1007/s10817-023-09684-0

Sheng, Ying; Zohar, Yoni; Ringeissen, Christophe; Reynolds, Andrew; Barrett, Clark; Tinelli, Cesare (December 2023, Journal of Automated Reasoning)

Polite theory combination is a method for obtaining a solver for a combination of two (or more) theories using the solvers of each individual theory as black boxes. Unlike the earlier Nelson–Oppen method, which is usable only when both theories are stably infinite, only one of the theories needs to be strongly polite in order to use the polite combination method. In its original presentation, politeness was required from one of the theories rather than strong politeness, which was later proven to be insufficient. The first contribution of this paper is a proof that indeed these two notions are different, obtained by presenting a polite theory that is not strongly polite. We also study several variants of this question. The cost of the generality afforded by the polite combination method, compared to the Nelson–Oppen method, is a larger space of arrangements to consider, involving variables that are not necessarily shared between the purified parts of the input formula. The second contribution of this paper is a hybrid method (building on both polite and Nelson–Oppen combination), which aims to reduce the number of considered variables when a theory is stably infinite with respect to some of its sorts but not all of them. The time required to reason about arrangements is exponential in the worst case, so reducing the number of variables considered has the potential to improve performance significantly. We show preliminary evidence for this by demonstrating significant speed-up on a smart contract verification benchmark.
Full Text Available
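
To make the abstract's point about the worst-case cost of arrangements concrete: an arrangement over a set of variables amounts to choosing which variables are equal, that is, a partition of the set into equivalence classes. The Python sketch below enumerates these partitions for a single sort and counts them. The function names are hypothetical, and the sketch ignores sorts and any solver-side pruning; it only shows how quickly the search space grows with the number of variables.

```python
def partitions(items):
    """Yield every partition of `items` into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Place `first` into each existing block in turn...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ...or give it a block of its own.
        yield [[first]] + part


def count_arrangements(n):
    """Number of arrangements over n variables of one sort."""
    return sum(1 for _ in partitions(list(range(n))))


# Growth of the arrangement space for 1..6 variables.
counts = [count_arrangements(n) for n in range(1, 7)]
```

The counts are the Bell numbers (1, 2, 5, 15, 52, 203 for one to six variables), so dropping even a few variables from consideration shrinks the space substantially.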
Efficient Memory Management for Large Language Model Serving with PagedAttention

https://doi.org/10.1145/3600006.3613165

Kwon, Woosuk; Li, Zhuohan; Zhuang, Siyuan; Sheng, Ying; Zheng, Lianmin; Yu, Cody Hao; Gonzalez, Joseph; Zhang, Hao; Stoica, Ion (October 2023, ACM)

Full Text Available
Efficiently Programming Large Language Models using SGLang

Zheng, Lianmin; Yin, Liangsheng; Xie, Zhiqiang; Huang, Jeff; Sun, Chuyue; Yu, Cody Hao; Cao, Shiyi; Kozyrakis, Christos; Stoica, Ion; Gonzalez, Joseph E; et al (December 2023, arXiv)

Full Text Available
Reasoning About Vectors: Satisfiability Modulo a Theory of Sequences

https://doi.org/10.1007/s10817-023-09682-2

Sheng, Ying; Nötzli, Andres; Reynolds, Andrew; Zohar, Yoni; Dill, David; Grieskamp, Wolfgang; Park, Junkil; Qadeer, Shaz; Barrett, Clark; Tinelli, Cesare (September 2023, Journal of Automated Reasoning)
Blanchette, Jasmin; Kovács, Laura; Pattinson, Dirk (Ed.)
Dynamic arrays, also referred to as vectors, are fundamental data structures used in many programs. Modeling their semantics efficiently is crucial when reasoning about such programs. The theory of arrays is widely supported but is not ideal, because the number of elements is fixed (determined by its index sort) and cannot be adjusted, which is a problem, given that the length of vectors often plays an important role when reasoning about vector programs. In this paper, we propose reasoning about vectors using a theory of sequences. We introduce the theory, propose a basic calculus adapted from one for the theory of strings, and extend it to efficiently handle common vector operations. We prove that our calculus is sound and show how to construct a model when it terminates with a saturated configuration. Finally, we describe an implementation of the calculus in cvc5 and demonstrate its efficacy by evaluating it on verification conditions for smart contracts and benchmarks derived from existing array benchmarks.
Full Text Available
AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving

Li, Zhuohan; Zheng, Lianmin; Zhong, Yinmin; Liu, Vincent; Sheng, Ying; Jin, Xin; Huang, Yanping; Chen, Zhifeng; Zhang, Hao; Gonzalez, Joseph; et al (July 2023, USENIX Association)

Model parallelism is conventionally viewed as a method to scale a single large deep learning model beyond the memory limits of a single device. In this paper, we demonstrate that model parallelism can be additionally used for the statistical multiplexing of multiple devices when serving multiple models, even when a single model can fit into a single device. Our work reveals a fundamental trade-off between the overhead introduced by model parallelism and the opportunity to exploit statistical multiplexing to reduce serving latency in the presence of bursty workloads. We explore the new trade-off space and present a novel serving system, AlpaServe, that determines an efficient strategy for placing and parallelizing collections of large deep learning models across a distributed cluster. Evaluation results on production workloads show that AlpaServe can process requests at up to 10× higher rates or 6× more burstiness while staying within latency constraints for more than 99% of requests.
Full Text Available
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Zheng, Lianmin; Chiang, Wei-Lin; Sheng, Ying; Zhuang, Siyuan; Wu, Zhanghao; Zhuang, Yonghao; Lin, Zi; Li, Zhuohan; Li, Dacheng; Xing, Eric P; et al (June 2023, arXiv)

Full Text Available
Reasoning About Vectors Using an SMT Theory of Sequences

https://doi.org/10.1007/978-3-031-10769-6_9

Sheng, Ying; Nötzli, Andres; Reynolds, Andrew; Zohar, Yoni; Dill, David; Grieskamp, Wolfgang; Park, Junkil; Qadeer, Shaz; Barrett, Clark; Tinelli, Cesare (July 2022, International Joint Conference on Automated Reasoning (IJCAR))
Blanchette, Jasmin; Kovács, Laura; Pattinson, Dirk (Ed.)
Dynamic arrays, also referred to as vectors, are fundamental data structures used in many programs. Modeling their semantics efficiently is crucial when reasoning about such programs. The theory of arrays is widely supported but is not ideal, because the number of elements is fixed (determined by its index sort) and cannot be adjusted, which is a problem, given that the length of vectors often plays an important role when reasoning about vector programs. In this paper, we propose reasoning about vectors using a theory of sequences. We introduce the theory, propose a basic calculus adapted from one for the theory of strings, and extend it to efficiently handle common vector operations. We prove that our calculus is sound and show how to construct a model when it terminates with a saturated configuration. Finally, we describe an implementation of the calculus in cvc5 and demonstrate its efficacy by evaluating it on verification conditions for smart contracts and benchmarks derived from existing array benchmarks.
Full Text Available
