-
Large language models (LLMs) often incorporate multiple text chunks in their inputs to provide the necessary context. To speed up the prefill of long LLM inputs, one can pre-compute the KV cache of a text and re-use that KV cache when the text appears as the prefix of another LLM input. However, the reused text chunks are not always the input prefix, which makes precomputed KV caches not directly usable, since they ignore the text's cross-attention with the preceding texts. Thus, the benefits of reusing KV caches remain largely unrealized. This paper tackles just one challenge: when an LLM input contains multiple text chunks, how to quickly combine their precomputed KV caches in order to achieve the same generation quality as the expensive full prefill (i.e., without reusing KV cache)? This challenge naturally arises in retrieval-augmented generation (RAG), where the input is supplemented with multiple retrieved texts as the context. We present CacheBlend, a scheme that reuses the pre-computed KV caches, whether or not they belong to the prefix, and selectively recomputes the KV values of a small subset of tokens to partially update each reused KV cache. Meanwhile, the small extra delay for recomputing some tokens can be pipelined with the retrieval of KV caches within the same job, allowing CacheBlend to store KV caches in slower devices with more storage capacity while retrieving them without increasing the inference delay. By comparing CacheBlend with the state-of-the-art KV cache reusing schemes on six open-source LLMs of various sizes and five popular benchmark datasets of different tasks, we show that CacheBlend reduces time-to-first-token (TTFT) by 2.2–3.3× and increases the inference throughput by 2.8–5× compared with full KV recompute, without compromising generation quality. The code is available at https://github.com/LMCache/LMCache.
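The selective-recompute idea in the abstract above can be illustrated with a minimal sketch. It is not the CacheBlend implementation: the abstract does not say how tokens are chosen, so the per-token `deviation_scores`, the `recompute_ratio` budget, and the `recompute_token` callback are all hypothetical names introduced here for illustration. The sketch only shows the shape of the technique: keep most cached entries as they are and refresh a small, ranked subset.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def blend_kv_cache(
    cached_kv: Sequence[Vector],
    deviation_scores: Sequence[float],
    recompute_token: Callable[[int], Vector],
    recompute_ratio: float = 0.15,
) -> List[Vector]:
    """Return a KV cache in which only the highest-scoring tokens are recomputed.

    Assumptions (not taken from the abstract): each token has one score that
    estimates how stale its cached entry is, and the recompute budget is a
    fixed fraction of the tokens.
    """
    if len(cached_kv) != len(deviation_scores):
        raise ValueError("cached_kv and deviation_scores must have equal length")
    if not 0.0 <= recompute_ratio <= 1.0:
        raise ValueError("recompute_ratio must be between 0 and 1")

    n = len(cached_kv)
    budget = math.ceil(recompute_ratio * n)

    # Rank token positions from most to least deviating and keep the top `budget`.
    ranked = sorted(range(n), key=lambda i: deviation_scores[i], reverse=True)
    selected = set(ranked[:budget])

    # Recompute the selected positions; copy the cached entry everywhere else.
    return [
        recompute_token(i) if i in selected else list(cached_kv[i])
        for i in range(n)
    ]


cached = [[0.0], [1.0], [2.0], [3.0]]
scores = [0.1, 0.9, 0.2, 0.8]
blended = blend_kv_cache(cached, scores, lambda i: [100.0 + i], recompute_ratio=0.5)
print(blended)
```

With a budget of half the tokens, only positions 1 and 3 (the two highest scores) are refreshed; the other entries are reused unchanged.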
-
Far-memory techniques that enable applications to use remote memory are increasingly appealing in modern data centers, supporting applications' large memory footprints and improving machines' resource utilization. Unfortunately, most far-memory techniques focus on OS-level optimizations and are agnostic to the managed runtimes and garbage collection (GC) underneath applications written in high-level languages. Because its object-access patterns differ from those of the application, GC can severely interfere with existing far-memory techniques, breaking remote-memory prefetching algorithms and causing severe local-memory misses. We developed MemLiner, a runtime technique that improves the performance of far-memory systems by aligning memory accesses from application and GC threads so that they follow similar memory access paths, thereby (1) reducing the local-memory working set and (2) improving remote-memory prefetching through simplified memory access patterns. We implemented MemLiner in two widely used GCs in OpenJDK: G1 and Shenandoah. Our evaluation with a range of widely deployed cloud systems shows that MemLiner improves applications' end-to-end performance by up to 3.3× and reduces applications' tail latency by up to 220.0×.
-
ML APIs have greatly relieved application developers of the burden of designing and training their own neural network models: classifying objects in an image can now be as simple as one line of Python code that calls an API. However, these APIs offer the same pre-trained models regardless of how their output is used by different applications. This can be suboptimal, as not all ML inference errors cause application failures, and the distinction between inference errors that can and cannot cause failures varies greatly across applications. To tackle this problem, we first study 77 real-world applications, which collectively use six ML APIs from two providers, to reveal common patterns of how ML API output affects applications' decision processes. Inspired by the findings, we propose ChameleonAPI, an optimization framework for ML APIs that takes effect without changing the application source code. ChameleonAPI provides application developers with a parser that automatically analyzes the application to produce an abstract of its decision process, which is then used to devise an application-specific loss function that only penalizes API output errors critical to the application. ChameleonAPI uses the loss function to efficiently train a neural network model customized for each application and deploys it to serve API invocations from the respective application via the existing interface. Compared to a baseline that selects the best of all commercial ML APIs, we show that ChameleonAPI reduces incorrect application decisions by 43%.
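As a rough illustration of a loss that only penalizes errors critical to the application, the sketch below groups labels by the application decision they lead to and charges the model only for probability mass that falls outside the correct decision group. This is an assumption made for illustration, not the loss function from the paper; `decision_aware_loss`, `label_to_decision`, and `default_decision` are hypothetical names.

```python
import math
from typing import Mapping


def decision_aware_loss(
    probs: Mapping[str, float],
    true_label: str,
    label_to_decision: Mapping[str, str],
    default_decision: str = "other",
) -> float:
    """Negative log of the probability mass that yields the correct decision.

    Confusing two labels that trigger the same application branch costs
    nothing extra; only mass assigned to labels that would change the
    application's decision is penalized.
    """
    true_decision = label_to_decision.get(true_label, default_decision)
    correct_mass = sum(
        p
        for label, p in probs.items()
        if label_to_decision.get(label, default_decision) == true_decision
    )
    # Clamp to avoid log(0) when no mass lands on the correct decision.
    return -math.log(max(correct_mass, 1e-12))


# Hypothetical application: it only checks whether the image shows an animal.
label_to_decision = {"cat": "animal", "dog": "animal", "car": "vehicle"}
probs = {"cat": 0.5, "dog": 0.3, "car": 0.2}

loss = decision_aware_loss(probs, "dog", label_to_decision)
plain_cross_entropy = -math.log(probs["dog"])
print(loss, plain_cross_entropy)
```

In this example the model prefers "cat" over the true label "dog", but both labels lead to the same branch, so the decision-aware loss is smaller than the plain cross-entropy on the true label.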