NSF PAR Search | NSF Public Access Repository


CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving

Liu, Yuhan; Li, Hanchen; Cheng, Yihua; Ray, Siddhant; Huang, Yuyang; Zhang, Qizheng; Du, Kuntai; Yao, Jiayi; Lu, Shan; Ananthanarayanan, Ganesh; et al (August 2024, Association for Computing Machinery, New York, NY, United States)

As large language models (LLMs) take on complex tasks, their inputs are supplemented with longer contexts that incorporate domain knowledge. Yet using long contexts is challenging as nothing can be generated until the whole context is processed by the LLM. While the context-processing delay can be reduced by reusing the KV cache of a context across different inputs, fetching the KV cache, which contains large tensors, over the network can cause high extra network delays. CacheGen is a fast context-loading module for LLM systems. First, CacheGen uses a custom tensor encoder, leveraging KV cache's distributional properties to encode a KV cache into more compact bitstream representations with negligible decoding overhead, to save bandwidth usage. Second, CacheGen adapts the compression level of different parts of a KV cache to cope with changes in available bandwidth, in order to maintain low context-loading delay and high generation quality. We test CacheGen on popular LLMs and datasets. Compared to the recent systems that reuse the KV cache, CacheGen reduces the KV cache size by 3.5-4.3x and the total delay in fetching and processing contexts by 3.2-3.7x with negligible impact on the LLM response quality. Our code is at: https://github.com/UChi-JCL/CacheGen.
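The abstract's second idea, adapting the compression level of each part of the KV cache to the available bandwidth, can be illustrated with a minimal sketch. This is not CacheGen's implementation: the level names, the per-chunk sizes, and the functions `pick_level` and `plan_stream` are all hypothetical, and the policy shown (pick the highest-quality level whose remaining chunks still fit the time budget) is only one plausible reading of the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    name: str
    bytes_per_chunk: int  # hypothetical encoded size of one KV-cache chunk


# Hypothetical compression levels, ordered from highest quality (largest)
# to lowest quality (smallest). The sizes are made up for illustration.
LEVELS = [
    Level("high", 8_000_000),
    Level("medium", 4_000_000),
    Level("low", 2_000_000),
]


def pick_level(bandwidth_bps: float, remaining_chunks: int, time_left_s: float) -> Level:
    """Return the highest-quality level whose remaining chunks fit the time budget."""
    for level in LEVELS:
        estimated_s = remaining_chunks * level.bytes_per_chunk * 8 / bandwidth_bps
        if estimated_s <= time_left_s:
            return level
    # Nothing fits: fall back to the most aggressive compression.
    return LEVELS[-1]


def plan_stream(bandwidth_trace_bps, num_chunks, deadline_s):
    """Simulate streaming chunk by chunk, re-choosing the level before each chunk."""
    elapsed = 0.0
    plan = []
    for i in range(num_chunks):
        # Use the measured bandwidth for this step (last value if the trace is short).
        bw = bandwidth_trace_bps[min(i, len(bandwidth_trace_bps) - 1)]
        level = pick_level(bw, num_chunks - i, deadline_s - elapsed)
        elapsed += level.bytes_per_chunk * 8 / bw
        plan.append(level.name)
    return plan, elapsed


if __name__ == "__main__":
    # Bandwidth drops from 1 Gbps to 100 Mbps after the first chunk.
    plan, elapsed = plan_stream([1e9, 1e8, 1e8, 1e8], num_chunks=4, deadline_s=1.0)
    print(plan, round(elapsed, 3))
```

Re-evaluating the level before every chunk lets the sketch react to a bandwidth drop mid-transfer instead of committing to one level for the whole cache.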
Full Text Available
CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving

https://doi.org/10.1145/3651890.3672274

Liu, Yuhan; Li, Hanchen; Cheng, Yihua; Ray, Siddhant; Huang, Yuyang; Zhang, Qizheng; Du, Kuntai; Yao, Jiayi; Lu, Shan; Ananthanarayanan, Ganesh; et al (August 2024, ACM)

Full Text Available
Toward Human Readable Prompt Tuning: Kubrick’s The Shining is a good movie, and a good prompt too?

https://doi.org/10.18653/v1/2023.findings-emnlp.733

Shi, Weijia; Han, Xiaochuang; Gonen, Hila; Holtzman, Ari; Tsvetkov, Yulia; Zettlemoyer, Luke (January 2023, Association for Computational Linguistics)

Large language models can perform downstream tasks in a zero-shot fashion, given natural language prompts that specify the desired behavior. Such prompts are typically hand engineered, but can also be learned with gradient-based methods from labeled data. However, it is underexplored what factors make the prompts effective, especially when the prompts are in natural language. In this paper, we investigate common attributes shared by effective prompts in classification problems. We first propose a human readable prompt tuning method (FluentPrompt) based on Langevin dynamics that incorporates a fluency constraint to find a distribution of effective and fluent prompts. Our analysis reveals that effective prompts are topically related to the task domain and calibrate the prior probability of output labels. Based on these findings, we also propose a method for generating prompts using only unlabeled data, outperforming strong baselines by an average of 7.0% accuracy across three tasks.
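The abstract does not state the update rule. As a rough sketch only, assuming the standard unadjusted Langevin dynamics update and a hypothetical energy that adds a task loss and a fluency loss with weights $\lambda_{\text{task}}$ and $\lambda_{\text{fluency}}$ (these symbols and the exact form of the energy are assumptions, not taken from the abstract), one step on the prompt parameters $\theta$ with step size $\eta$ could look like:

```latex
\begin{aligned}
E(\theta) &= \lambda_{\text{task}}\, \mathcal{L}_{\text{task}}(\theta)
           + \lambda_{\text{fluency}}\, \mathcal{L}_{\text{fluency}}(\theta), \\
\theta_{t+1} &= \theta_t - \eta\, \nabla_\theta E(\theta_t)
              + \sqrt{2\eta}\, \epsilon_t,
\qquad \epsilon_t \sim \mathcal{N}(0, I).
\end{aligned}
```

The Gaussian noise term is what separates this from plain gradient descent: the iterates approximately sample from a distribution proportional to $\exp(-E(\theta))$, which fits the abstract's goal of finding a distribution of prompts and not a single optimum.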
Full Text Available
What Do NLP Researchers Believe? Results of the NLP Community Metasurvey

https://doi.org/10.18653/v1/2023.acl-long.903

Michael, Julian; Holtzman, Ari; Parrish, Alicia; Mueller, Aaron; Wang, Alex; Chen, Angelica; Madaan, Divyam; Nangia, Nikita; Pang, Richard Yuanzhe; Phang, Jason; et al (January 2023, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers))

Full Text Available
Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

https://doi.org/10.18653/v1/2022.emnlp-main.759

Min, Sewon; Lyu, Xinxi; Holtzman, Ari; Artetxe, Mikel; Lewis, Mike; Hajishirzi, Hannaneh; Zettlemoyer, Luke (January 2022, EMNLP)

Full Text Available
TuringAdvice: A Generative and Dynamic Evaluation of Language Use

https://doi.org/10.18653/v1/2021.naacl-main.386

Zellers, Rowan; Holtzman, Ari; Clark, Elizabeth; Qin, Lianhui; Farhadi, Ali; Choi, Yejin (June 2021, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies)

We propose TuringAdvice, a new challenge task and dataset for language understanding models. Given a written situation that a real person is currently facing, a model must generate helpful advice in natural language. Our evaluation framework tests a fundamental aspect of human language understanding: our ability to use language to resolve open-ended situations by communicating with each other. Empirical results show that today’s models struggle at TuringAdvice, even multibillion parameter models finetuned on 600k in-domain training examples. The best model, T5, writes advice that is at least as helpful as human-written advice in only 14% of cases; a much larger non-finetunable GPT3 model does even worse at 4%. This low performance reveals language understanding errors that are hard to spot outside of a generative setting, showing much room for progress.
Full Text Available
Defending Against Neural Fake News

Zellers, Rowan; Holtzman, Ari; Rashkin, Hannah; Bisk, Yonatan; Farhadi, Ali; Roesner, Franziska; Choi, Yejin (December 2020, NeurIPS)

Recent progress in natural language generation has raised dual-use concerns. While applications like summarization and translation are positive, the underlying technology also might enable adversaries to generate neural fake news: targeted propaganda that closely mimics the style of real news. Modern computer security relies on careful threat modeling: identifying potential threats and vulnerabilities from an adversary's point of view, and exploring potential mitigations to these threats. Likewise, developing robust defenses against neural fake news requires us first to carefully investigate and characterize the risks of these models. We thus present a model for controllable text generation called Grover. Given a headline like 'Link Found Between Vaccines and Autism,' Grover can generate the rest of the article; humans find these generations to be more trustworthy than human-written disinformation. Developing robust verification techniques against generators like Grover is critical. We find that best current discriminators can classify neural fake news from real, human-written, news with 73% accuracy, assuming access to a moderate level of training data. Counterintuitively, the best defense against Grover turns out to be Grover itself, with 92% accuracy, demonstrating the importance of public release of strong generators. We investigate these results further, showing that exposure bias, and sampling strategies that alleviate its effects, both leave artifacts that similar discriminators can pick up on. We conclude by discussing ethical issues regarding the technology, and plan to release Grover publicly, helping pave the way for better detection of neural fake news.
Full Text Available
