<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>FIER: Fine-Grained and Efficient KV Cache Retrieval for Long-context LLM Inference</title></titleStmt>
			<publicationStmt>
				<publisher>Association for Computational Linguistics</publisher>
				<date>11/04/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10654606</idno>
					<idno type="doi"></idno>
					
					<author>Dongwei Wang</author><author>Zijie Liu</author><author>Song Wang</author><author>Yuxin Ren</author><author>Jianing Deng</author><author>Jingtong Hu</author><author>Tianlong Chen</author><author>Huanrui Yang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[The Key-Value (KV) cache reading latency increases significantly with context lengths, hindering the efficiency of long-context LLM inference. To address this, previous works propose retaining a small fraction of KV cache based on token importance. For example, KV eviction uses static heuristics to retain tokens, while KV retrieval dynamically selects query-relevant tokens for more adaptive cache management. However, we observe that important tokens are often sparsely distributed across the long context. This sparsity makes existing page-level KV retrieval inaccurate, as each page may include irrelevant tokens and miss critical ones. In this work, we propose Fier, a Fine-Grained and Efficient KV cache Retrieval method. Fier uses 1-bit quantized keys to estimate the importance of each token, resulting in efficient and precise retrieval. Experiments show that Fier matches full KV performance using only 11% of the cache budget across various long-context tasks, achieving a 1.2× to 1.5× decoding speedup.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>KV caching is a memory-for-computation acceleration technique for LLM inference, enabling faster decoding by reusing intermediate hidden states <ref type="bibr">(Waddington et al., 2013)</ref>. However, during inference, each decoded token must attend to the full KV cache, causing cache reading latency to grow significantly with context length. For example, in LLaMA 7B <ref type="bibr">(Touvron et al., 2023)</ref>, a 32ktoken KV cache requires 16GB of memory and over 11ms to read-accounting for more than half of the total inference time <ref type="bibr">(Tang et al., 2024)</ref>.</p><p>To address this issue, previous works have proposed to selectively retain only a subset of KV cache entries based on token importance. Among them, one line of work-known as KV &#8224;  While existing retrieval methods suffer from coarse granularity, Fier achieves higher accuracy through fine-grained token-level retrieval, and preserves selection efficiency by quantization.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Corresponding author</head><p>eviction (Fig. <ref type="figure">1(b</ref>))-focuses on retaining fixedposition tokens that typically receive higher attention weights, such as the initial tokens (due to the attention sink phenomenon) and the most recent tokens <ref type="bibr">(Xiao et al., 2023;</ref><ref type="bibr">Zhang et al., 2023;</ref><ref type="bibr">Li et al., 2024;</ref><ref type="bibr">Liu et al., 2023)</ref>. However, these approaches overlook the dynamic nature of KV criticality, i.e., tokens that are evicted earlier may become important in the future. Their inability to recall evicted tokens often results in degraded performance in multi-round QA or long conversation applications. Motivated by this, another line of work-KV retrieval (Fig. <ref type="figure">1(c</ref>))-has been proposed to dynamically recall tokens that are most relevant to the current query during generation <ref type="bibr">(Tang et al., 2024;</ref><ref type="bibr">Chen et al., 2024a)</ref>. Despite achieving better performance, KV retrieval requires frequent estimation of the token importance for every new query, introducing additional computational overhead. To mitigate this, existing methods perform page-level retrieval, where all tokens that belong to a certain page of the KV cache are retrieved (or not re-trieved) simultaneously to avoid computing full attention scores, leading to a coarser granularity of retrieval.</p><p>However, we observe that in long-context tasks, information relevant to the current query may be scattered throughout the entire input, i.e., important tokens are often sparsely distributed across the KV cache (shown in Fig. <ref type="figure">2</ref>). 
Consequently, page-level retrieval inevitably leads to imprecise selection: retrieved pages may include irrelevant tokens, while evicted pages may exclude critical ones, thereby affecting the model performance.</p><p>In this paper, we aim to address the retrieval inaccuracy while preserving selection efficiency. Specifically, we find that quantizing the key cache to as low as 1-bit has minimal impact on the accuracy of Top-k selection in token importance estimation (shown in Fig. <ref type="figure">3</ref>). Despite quantization truncates large values, important tokens are still preserved in the Top-k after computing quantized attention. Based on this insight, we propose Fine-Grained and Efficient KV cache Retrieval (Fier), a 1-bit quantization-based KV retrieval method. As shown in Fig. <ref type="figure">1</ref>(d), Fier enables more accurate recovery of important tokens and reduces the selection cost. This leads to improved model performance under the same cache budget.</p><p>We evaluate Fier across PG19 (Rae et al., 2019), LongBench <ref type="bibr">(Bai et al., 2023)</ref>, and the passkey retrieval benchmarks <ref type="bibr">(Peng et al., 2023)</ref>. The results demonstrate the effectiveness of Fier in both generative and retrieval-focused settings. Experiments show that Fier achieves performance comparable to using the full KV cache while requiring only 11% of the cache budget, and consistently outperforms existing KV eviction and retrieval baselines. Additionally, Fier achieves 1.2&#215; to 1.5&#215; decoding speedup across different context lengths on a single RTX 4090 GPU. 
In summary, we make the following contributions in this work:</p><p>&#8226; We observe the sparse distribution of important tokens within the KV cache in longcontext scenarios, highlighting the necessity of fine-grained retrieval.</p><p>&#8226; We propose Fier, a KV retrieval method built on 1-bit key quantization, which enables efficient and accurate token-level retrieval.</p><p>&#8226; We conduct comprehensive evaluations of Fier across diverse long-context tasks and model architectures, demonstrating its superior performance and efficiency.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Long-Context LLMs. Large Language Models (LLMs) have transformed the landscape of natural language processing, largely due to their strong ability to deal with long context. Their context length capacity has increased dramatically-from 4k to 128k <ref type="bibr">(Grattafiori et al., 2024)</ref>, 1M <ref type="bibr">(Yang et al., 2025)</ref>, and even 10M <ref type="bibr">(Team et al., 2024)</ref> tokens. This expansion unlocks a range of advanced capabilities, including o1 long-range reasoning <ref type="bibr">(Guo et al., 2025;</ref><ref type="bibr">OpenAI, 2024)</ref>, incontext learning <ref type="bibr">(Li et al., 2025)</ref>, and multimodal intelligence <ref type="bibr">(Weng et al., 2024)</ref>. Fier aims to improve the inference efficiency of long-context LLMs by exploiting the sparsity of the KV cache. KV Cache Eviction. Previous work identified the sparsity of attention matrices, showing that retaining only a small fraction of tokens is sufficient for the performance. For example, <ref type="bibr">Xiao et al. (2023)</ref> propose to retain the first few tokens based on the "attention sink" phenomenon. H2O <ref type="bibr">(Zhang et al., 2023)</ref> retains a limited set of KV cache by selecting tokens with the highest cumulative attention scores. SnapKV <ref type="bibr">(Li et al., 2024)</ref> selects clustered historical tokens along with a localized window of recent tokens. However, these approaches ignore the fact that tokens evicted can become important in the future. Fier addresses this via query-specific KV retrieval, enabling dynamic reassessment of token importance at each decoding step. KV Cache Retrieval. KV retrieval methods, including our proposed Fier, dynamically select tokens relevant to the current query. 
However, existing approaches like Quest <ref type="bibr">(Tang et al., 2024)</ref>, ArkVale <ref type="bibr">(Chen et al., 2024a)</ref>, PQCache <ref type="bibr">(Zhang et al., 2025)</ref>, SparQ Attention <ref type="bibr">(Ribar et al., 2023)</ref> and MagicPIG <ref type="bibr">(Chen et al., 2024b)</ref> retrieve at page or cluster granularity, or estimate criticality for only a subset of tokens, for the sake of efficiency, overlooking the sparse distribution of important tokens. In this paper, we propose a fine-grained, token-level retrieval strategy that maintains efficiency while improving accuracy. This design better captures critical information missed by existing methods. KV Cache Quantization. Another related line of work is KV quantization <ref type="bibr">(Liu et al., 2024;</ref><ref type="bibr">Dong et al., 2024)</ref>, which compresses the cache by performing self-attention in low-bit space. The objective of KV quantization is to minimize global reconstruction error. In contrast, Fier achieves cache compression by retaining only a subset of the full KV cache, and adopts a relaxed quantization objective focused on preserving Top-ranked tokens, enabling the use of extremely low bit-widths.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Methodology</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Preliminaries</head><p>In an autoregressive LLM, the inference process typically consists of two stages: the prefill stage and the decoding stage.</p><p>Prefill Stage. Given an input context X &#8712; R lprompt&#215;d , the model computes the initial key and value representations, denoted as K 0 &#8712; R lprompt&#215;d and V 0 &#8712; R lprompt&#215;d , which together form the initial KV cache. Decoding Stage. At each step, a new query q &#8712; R 1&#215;d is generated and the corresponding key k and value v are appended to the initial cache (K 0 , V 0 ) to form the current KV cache:</p><p>The attention output is then computed as:</p><p>The major overhead of decoding comes from computation of attention score s, which reflects the importance of KV token k j to the query. At each step, the current query must attend to the entire KV cache. This cost becomes higher in longcontext tasks. To address this, previous works have demonstrated attention sparsity, observing that initial tokens (attention sink) and recent tokens (locality) tend to receive higher attention scores. Based on this, they retain a fixed subset of tokens, denoted as (K &#8242; , V &#8242; ) &#8712; R n&#215;d at these positions for all queries, where n is cache budget. However, subsequent studies show that token criticality varies across different queries. As a result, query-specific token selection is necessary, where token importance needs to be recomputed for each query. To improve importance estimation efficiency, existing works tend to perform selection at a coarser, page-level granularity. For example, Quest <ref type="bibr">(Tang et al., 2024)</ref> partitions the key cache K &#8712; R l&#215;d into l L pages (L is typically 16 or 32). 
For each page $P$, it extracts maximum and minimum vectors $k_P^{\max}, k_P^{\min} \in \mathbb{R}^{1 \times d}$, performs point-wise multiplication with $q$,</p><p>and takes the maximum across both the hidden dimension and the two vectors to obtain the page importance score: $s_P = \max_{j \in [d]} \max\left(q_j\, k_{P,j}^{\max},\; q_j\, k_{P,j}^{\min}\right)$.</p><p>Pages with the highest importance scores are then selected for self-attention computation.</p><p>During the selection, Quest loads only two key vectors per page. Assuming $K$ is stored in float16, this results in a key cache load ratio of $2/L$.</p><p>It is clear that larger page sizes reduce importance estimation costs, but lead to coarser granularity by grouping more tokens together. There exists a trade-off between efficiency and accuracy.</p></div>
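To make the page-scoring step concrete, the following is a minimal NumPy sketch of the min/max-based page scoring described above. It follows the description in this section, not the released Quest code; the function names and toy dimensions are our own.

```python
import numpy as np

def quest_page_scores(q, K, L):
    # q: (d,) query; K: (l, d) key cache, with l divisible by page size L.
    l, d = K.shape
    pages = K.reshape(l // L, L, d)
    k_max = pages.max(axis=1)                # per-page max vector, (l//L, d)
    k_min = pages.min(axis=1)                # per-page min vector, (l//L, d)
    # Channel-wise products with the query, keeping the larger of the two
    # bounds per channel, then the maximum over the hidden dimension.
    cand = np.maximum(q * k_max, q * k_min)  # (l//L, d)
    return cand.max(axis=1)                  # one importance score per page

def quest_select_pages(q, K, L, num_pages):
    # Pages with the highest scores are kept for exact self-attention.
    scores = quest_page_scores(q, K, L)
    return np.argsort(scores)[::-1][:num_pages]
```

Note how every token in a selected page is retrieved together, which is exactly the coarse granularity that Fier's token-level estimation avoids.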
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Fine-grained and Efficient KV Retrieval</head><p>To improve upon existing page-level KV retrieval methods, we make the following two observations on the token importance estimation process. OB1: Important Token Sparsity Makes Page Retrieval Inaccurate. To understand the trade-off of page granularity, we visualize both the highestattended tokens and those selected under pagelevel partitioning by mapping them back to the original context. As shown in Fig. <ref type="figure">2</ref>, queries Q1 and Q2 attend to different regions of the context, which aligns with prior findings on the dynamic nature of token criticality. Moreover, tokens with high attention scores are sparsely distributed across the context, and we observe that pages 7, 16, and 54 each contain a mixture of both important and unimportant tokens. This overlap makes inaccurate retrieval, which pinpoints the necessity of fine-grained retrieval strategies that identify important information at the token level, rather than relying on coarse-grained page grouping.</p><p>Motivated by this, we aim to design a retrieval strategy that operates at fine-grained token level while incurring minimal additional computation overhead. Notably, quantization enables low-bit computation, achieving high efficiency while still allowing every token to participate in the criticality estimation. We begin by validating this insight through the following observation. OB2: Quantization has Minimal Impact on Top-k Selection, even at 1-bit. Quantizing KV cache values to a lower precision significantly reduces the cost of data movement and attention computation. 
Let $K \in \mathbb{R}^{l \times d}$ denote the original key cache, where $k_i \in \mathbb{R}^{1 \times d}$ is the key vector corresponding to the $i$-th token. Quantization converts each $k_i$ to its dequantized counterpart $\hat{k}_i$ as $\hat{k}_i = s^K_i \odot \mathrm{round}\!\left((k_i - z^K_i) \oslash s^K_i\right) + z^K_i$, where $s^K_i, z^K_i \in \mathbb{R}^{1 \times d}$ are the per-channel calibrated scaling and bias vectors specific to the $i$-th key vector. Previous KV quantization methods <ref type="bibr">(Zhang et al., 2024)</ref> aim to optimize these factors through the calibration process to minimize the impact of quantization on the computed attention scores. This objective can be formulated as an $\ell_2$ loss: $\mathcal{L}_{\ell_2} = \sum_i \left(q k_i^\top - q \hat{k}_i^\top\right)^2$.</p></div>
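The group-wise asymmetric round-to-nearest quantizer implied above can be sketched as follows. This is a simplified, uncalibrated illustration (scale and zero-point taken directly from each group's min/max, 1-bit by default); the function names are our own.

```python
import numpy as np

def rtn_quantize(K, g, bits=1):
    # Group-wise asymmetric round-to-nearest quantization of a key cache.
    # K: (l, d) with l divisible by g; each group of g tokens (per channel)
    # stores a scale s and zero-point z alongside `bits`-bit integer codes.
    l, d = K.shape
    groups = K.reshape(l // g, g, d)
    z = groups.min(axis=1, keepdims=True)               # zero-point (bias)
    levels = 2 ** bits - 1                              # a single step at 1-bit
    s = (groups.max(axis=1, keepdims=True) - z) / levels
    s = np.where(s == 0, 1.0, s)                        # guard constant groups
    codes = np.clip(np.rint((groups - z) / s), 0, levels)
    return codes.astype(np.uint8), s, z

def rtn_dequantize(codes, s, z):
    # k_hat = s * round((k - z) / s) + z, restored to the (l, d) layout.
    groups = codes.astype(np.float64) * s + z
    n, g, d = groups.shape
    return groups.reshape(n * g, d)
```

At 1 bit, each dequantized entry snaps to whichever of the group's per-channel min or max is nearer, so the per-entry error is at most half the group range.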
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Top ones remain</head><p>However, quantization introduces perturbations on the values, drawing computation results away from intended. The impact of quantization is more severe if outliers exist in the distribution. Previous methods like Kivi <ref type="bibr">(Liu et al., 2024)</ref>, equipped with advanced channel-wise grouping and rescaling scheme, cannot quantize the KV cache below 2 bit while retaining model performance.</p><p>Meanwhile, we observe that quantizing the key cache to low bits has significantly less impact on the token importance estimation than retaining the attention scores. Here we explore an extreme case by quantizing K to just 1-bit. Using the fullprecision attention scores as ground truth, we evaluate whether Top-k token selection can still be accurately recovered under such an aggressive quantization setting. Specifically, we feed a long input context (14k tokens) into LlaMA during the prefill stage, and compute attention scores under both full-precision and 1-bit quantized K using the same query. As shown in Fig. <ref type="figure">3</ref>, despite the fact that low-bit quantization truncates large values and distorts the overall distribution, the Top-k tokens remain largely unchanged. This suggests that token criticality is still well captured, even un-der extreme quantization.</p><p>To understand the reason, we analyze the quantization objective implied by the goal of token importance estimation. For importance estimation, we aim to maintain the ranking of the Top-k tokens rather than preserving all attention scores precisely. Assume m is the minimum margin between the attention scores of Top-k and non-Topk tokens in the full-precision setting. To preserve this ranking after quantization, it is sufficient to ensure that the attention score of each token deviates from its full-precision counterpart by at most m/2. 
This leads to the following hinge objective: $\mathcal{L}_{\text{hinge}} = \sum_i \max\left(0,\; \left|q k_i^\top - q \hat{k}_i^\top\right| - \frac{m}{2}\right)$.</p><p>Compared to the $\ell_2$ loss, the hinge loss imposes a relaxed objective that prioritizes preserving the relative ranking of Top-k tokens (Fig. <ref type="figure">4</ref>). More importantly, outlier tokens that lead to large attention scores enjoy larger margins under the hinge loss, making their quantization errors less impactful. This enables quantization with 1 bit and larger group sizes g while still maintaining Top-k accuracy.</p></div>
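The m/2 sufficiency argument can be checked numerically. The sketch below uses an illustrative score vector (the numbers are ours, not from the paper): any per-score perturbation strictly smaller than half the Top-k margin leaves the Top-k set intact.

```python
import numpy as np

def topk_set(scores, k):
    return set(np.argsort(scores)[::-1][:k].tolist())

# Illustrative full-precision attention scores for 8 tokens.
s_full = np.array([0.9, 0.1, 0.7, 0.05, 0.8, 0.2, 0.15, 0.3])
k = 3
top = topk_set(s_full, k)  # tokens 0, 2, 4

# m: minimum margin between the lowest Top-k score (0.7)
# and the highest non-Top-k score (0.3).
m = sorted(s_full)[-k] - sorted(s_full)[-k - 1]

# Perturbing every score by strictly less than m/2 can never swap a
# Top-k token with a non-Top-k token, so the selected set is unchanged.
rng = np.random.default_rng(0)
for _ in range(100):
    noise = rng.uniform(-m / 2 + 1e-9, m / 2 - 1e-9, size=s_full.shape)
    assert topk_set(s_full + noise, k) == top
```

This is exactly why the hinge objective only needs score deviations bounded by m/2 rather than exact score preservation.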
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Fier Workflow</head><p>Motivated by previous observations, we propose Fier, a token-level retrieval method based on 1-bit linear quantization. Fier compresses the key cache into 1-bit using a simple round-to-nearest (RTN) quantizer. Given a query q, approximate attention scores are computed efficiently using quantized keys. Based on these scores, Fier selects the Top-k tokens and performs full-precision selfattention over the selected subset. We summarize the workflow of Fier in Algorithm 1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Theoretical Analysis of Efficiency</head><p>Beyond retrieval accuracy, the efficiency of the selection and decoding stage is also critical for prac-Algorithm 1 Fier: Token-Level KV Retrieval via 1-Bit RTN Quantization 1: Input: Query q, full-precision (K, V), group size g 2: Output: Attention output o 3: // Step 1: Quantize K to 1-bit 4: Partition K into groups of size g along each channel 5: For each group, compute the scaling factors (s, z) and broadcast them to construct s</p><p>Step 2: Compute Approximate Attention Scores 9: s = q &#8226; K&#8868; 10: // Step 3: Select Top-k Tokens 11: Sq = Top-k(s) 12: // Step 4: Compute Real Attention on Selected Tokens 13:</p><p>tical deployment. We measure the efficiency using the key cache access ratio (CAR), defined as the fraction of the key cache accessed throughout the pipeline. Specifically, we compare both retrievaland eviction-based methods under the same top-k setting. In the decoding phase, all methods (Fier, Quest, and H2O) access k tokens, so their CAR is identical. The key difference lies in the selection phase. For K stored in float16, we quantize it to 1-bit with group size g. Note that in addition to the 1-bit K Q , each group also needs to store a pair of (s, z) in float16. Therefore, the CAR of Fier will be calculated as:</p><p>which decreases with a larger group size. Recall that Quest has a load ratio of 2/L. For fairness, we set g = 32, which matches the load ratio of 1/8 with page size L = 16 as implemented in the Quest baseline. In contrast, H2O maintains a dynamic pool of size k/l during the selection stage. Therefore, when k = l/10, the CAR of the three methods can be summarized in Tab. 1. While retrieval does introduce additional overhead during the selection compared to eviction, its goal is to improve performance with only a modest CAR increase.  Datasets. We evaluate Fier on the language modeling task PG19 (Rae et al., 2019). 
To assess its performance on long-context QA, we further conduct experiments on LongBench <ref type="bibr">(Bai et al., 2023)</ref> using six representative datasets: NarrativeQA, HotpotQA, Qasper, TriviaQA, GovReport, and MultifieldQA. Detailed information about the six datasets is given in Appendix A. We evaluate Fier on the passkey retrieval task <ref type="bibr">(Peng et al., 2023)</ref> to assess its ability to model long-range dependencies. We also compare responses from Fier- and Quest-enabled chatbots in Appendix E.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Models.</head><p>We apply our method to three open-sourced models:</p><p>LLaMA-3-8B-Instruct <ref type="bibr">(Grattafiori et al., 2024)</ref>, LongChat-v1.5-7B-32k <ref type="bibr">(Li et al., 2023)</ref>, and Mistral-7B-Instruct <ref type="bibr">(Jiang et al., 2023)</ref>. Following the same setup as in Quest, neither Fier nor the baseline methods are applied to the first two layers of the model. We evaluate the performance of each method under varying KV cache budgets. Baselines. We thoroughly compare the performance of Fier and Quest <ref type="bibr">(Tang et al., 2024)</ref> across various benchmarks. Note that in all performance evaluation, we set the page size of Quest to 16 and the grouping size of Fier to 32 for a fair comparison. We also compare Fier with four KV eviction baselines: H2O <ref type="bibr">(Zhang et al., 2023)</ref>, StreamingLLM <ref type="bibr">(Xiao et al., 2023)</ref>, SnapKV <ref type="bibr">(Li et al., 2024) and</ref><ref type="bibr">TOVA (Oren et al., 2024)</ref>. The results in the experiments are either taken from the original paper or obtained by running open-source code. More implementation details can be found in Appendix B.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Insight Verification</head><p>More Accurate Retrieval of Important Tokens. In Fig. <ref type="figure">6</ref>, we visualize the positions of Top-64 tokens selected by Quest with different page sizes and by Fier-1bit-g32, all mapped back onto the full KV cache. We then compute the recall rate, defined as the overlap between the retrieved tokens and those selected using the full attention score. The experiment is conducted on LLaMA.</p><p>We observe that Quest with either small or large page sizes tends to retain unimportant tokens and evict important ones, due to its coarse-grained page-level retrieval. In contrast, Fier performs token-level retrieval through low-bit quantized attention computation, resulting in a significantly higher recall rate and better alignment with the full attention distribution.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Performance Evaluation</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.1">PG19 Results</head><p>We begin by evaluating language modeling perplexity on PG19, a benchmark consisting of 100 books with an average length of 70k tokens. We evaluate three different models by feeding them texts of varying lengths and compare the results against both Full KV and Quest (Fig. <ref type="figure">5</ref>). Note that both Fier and Quest are evaluated under the same KV cache budget of 4096 tokens. Fier achieves performance close to that of Full KV and significantly outperforms Quest on both the LLaMA and Mistral models.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.2">Longbench Results</head><p>We evaluate on the LongBench benchmark using LLaMA-3-8B-Instruct across a diverse set of long-context tasks, including single-document QA (NarrativeQA, Qasper, MultiFieldQA), multidocument QA (HotpotQA), summarization (Gov-Report), and few-shot learning (TriviaQA). We also compare Fier with H2O, StreamingLLM, SnapKV, and Quest under varying KV cache budget settings. In addition, we perform evaluations using the Mistral-7B and LongChat-7B models to verify the generality of our method across different model architectures.</p><p>As shown in Fig. <ref type="figure">7</ref>, Fier consistently achieves superior performance compared to all baselines across six long-context datasets under various KV cache budgets. Overall, Fier surpasses all baselines at cache budgets of 512, 1024, and 2048 tokens, and achieves comparable performance to full KV using only 1k tokens, which is only 11% of the full cache. This suggests that Fier benefits from accurately retrieving critical tokens, enabling efficient use of limited cache resources without compromising model quality. Similar results are observed on the other two models. As shown in Tab. 2, Fier shows consistent improvements on both single-document and multi-document tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.3">Passkey Retrieval</head><p>We further evaluate Fier's ability to handle longdistance dependencies. Specifically, we employ the passkey retrieval task <ref type="bibr">(Peng et al., 2023)</ref> as our benchmark. This task assesses whether the model can retrieve a simple passkey from a large amount of irrelevant content. Following the setup in <ref type="bibr">Tang et al. (2024)</ref>, we use a context length of 10k tokens for evaluation with both LongChat-7B and a smaller model, LLaMA3-1B. KV eviction methods perform poorly due to their inability to recall discarded tokens, while Quest provides noticeable improvements over them. Fier, however, achieves even higher retrieval accuracy, performing well under extremely low budgets of just  32 64 tokens, especially improving the longcontext processing capability of smaller models (shown in Tab. 3).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Ablation Study</head><p>Token Granularity vs. Quantized Attention.</p><p>To understand whether Fier's performance gain primarily stems from its fine-grained token-level selection (as opposed to page-level) or from the use of quantized attention as an importance metric, we conduct an ablation study on LLaMA-3-8B-Instruct, isolating these two factors. Specifically, we reduce the page size in Quest to approximate finer granularity and compare performance. In addition, we apply the averaged quantized attention score as the page-level importance metric under the same page size, and evaluate its effect on Quest. As shown in Tab. 4, using smaller page sizes in Quest leads to improved per- formance. However, it also increases the cache load ratio. Additionally, incorporating quantized attention for scoring further enhances its effectiveness. Notably, Fier can be viewed as quantized attention with a page size of 1, achieving the best overall results. These results suggest that Fier benefits from both the token-level granularity and the use of quantized attention as a lightweight yet effective importance estimator. Fier w/ Different Group Sizes. We also investigate how the group size used during key quantization affects Fier's performance. We find that as the group size increases, the cache load ratio decreases, but this comes at the cost of reduced performance. Nevertheless, Fier consistently outperforms Quest under the same cache load ratio.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Efficiency Profiling</head><p>Inference Efficiency. In Fig. <ref type="figure">8</ref>, we present the decoding latency of 256 tokens on LLaMA-2-7B under different prefill context lengths. To ensure a fair comparison with Full KV, we include both the time spent on computing quantized attention and the time required to recall the selected Topk tokens. We implement the group-wise quantization kernel using Triton, and employ the Top-k CUDA operator to efficiently perform Top-k token recall. Fier's efficiency gain is mainly attributed to the speedup in the self-attention computation, as it restricts attention to only a small subset of the KV cache. This acceleration becomes more pronounced as the context length increases; for instance, at a context length of 32k tokens, Fier achieves over 1.5&#215; decoding speedup. We also provide a detailed latency breakdown comparison of Fier and Quest in Appendix D.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>We present Fier, a fine-grained and efficient KV cache retrieval algorithm that selects important tokens using 1-bit quantization. By involving all tokens in the computation, Fier enables token-level criticality estimation, leading to improved recall rate and model performance. Extensive experiments across various tasks and model architectures show that Fier consistently outperforms existing methods. Notably, Fier matches full cache performance using only 11% of the KV budget and achieves a 1.2&#215;-1.5&#215; decoding speedup.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Limitations</head><p>Model Scale. Due to limited computational resources, our experiments are restricted to models up to 8B parameters. Evaluating Fier on larger models (e.g., 13B, 70B) may reveal further insights into its scalability and effectiveness. System Optimization. Our current implementation uses Triton to develop low-bit operators for quantized attention. While Triton offers flexibility and ease of development, it does not match the low-level optimization potential of custom CUDA kernels, potentially limiting the achievable inference speedup. Compatibility with GQA. Fier is not yet integrated with grouped-query attention (GQA) <ref type="bibr">(Ainslie et al., 2023)</ref>. This is because token pruning and grouped-query attention are orthogonal in principle: GQA reduces the number of KV heads, while token pruning reduces the number of tokens.</p><p>Exploring their compatibility remains an important direction for future work.</p><p>Tianyi Zhang, Jonah Yi, Zhaozhuo Xu, and Anshumali Shrivastava. 2024. Kv cache is 1 bit per channel: Efficient large language model inference with coupled quantization. Advances in Neural Information Processing Systems, 37:3304-3331. Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher R&#233;, Clark Barrett, and 1 others. 2023. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661-34710.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Dataset Details</head><p>We use a subset of LongBench <ref type="bibr">(Bai et al., 2023)</ref> for long-context QA evaluation. Tab. 5 shows the statistics and evaluation metrics used in the experiments. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Implementation Details</head><p>For experiments on LongBench <ref type="bibr">(Bai et al., 2023)</ref> and PG19 (Rae et al., 2019), we use three models: LLaMA-3-8B-Instruct <ref type="bibr">(Grattafiori et al., 2024)</ref>, LongChat-v1.5-7B-32k <ref type="bibr">(Li et al., 2023)</ref>, and Mistral-7B-Instruct <ref type="bibr">(Jiang et al., 2023)</ref>, with the maximum input length uniformly set to 32k tokens to ensure a fair comparison. All inference runs are conducted on NVIDIA A6000 GPUs. During the self-attention phase, we utilize FlashAttention <ref type="bibr">(Dao, 2023)</ref> for acceleration, except for H2O <ref type="bibr">(Zhang et al., 2023)</ref>, which requires computation of historical attention scores and therefore cannot benefit from FlashAttention. Except for H2O <ref type="bibr">(Zhang et al., 2023)</ref> and Quest <ref type="bibr">(Tang et al., 2024)</ref>, which are run using their publicly released implementations, all other baselines are run using the unified KV Cache-Factory framework <ref type="bibr">(Cai et al., 2024)</ref>. Experiments on observations and decoding latency are conducted separately on NVIDIA RTX 4090 GPUs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C Comparison of Fier and MiKV</head><p>Different from the methods discussed in the main text, MiKV <ref type="bibr">(Yang et al., 2024)</ref> adopts a hybrid strategy: during decoding, it stores top tokens in high precision while retaining the remaining tokens in a low-bit format. The main distinction between Fier and MiKV is that MiKV maintains the entire KV cache in mixed precision, whereas Fier only preserves top tokens during decoding. To ensure fairness, we evaluated Fier on the MMLU dataset <ref type="bibr">(Hendrycks et al., 2020)</ref> using the same cache size as reported for MiKV. Note that the cache size of Fier accounts for the CAR in the selection phase, which is not explicitly stated in the MiKV. As shown in Tab. 6, Fier demonstrates better performance compared to MiKV under the same cache budget. We provide a detailed latency breakdown between the pipelines of Fier and Quest in Tab. 7. The primary difference lies in the importance estimation stage: Fier uses an extreme 1-bit quantized attention, while Quest adopts page-level attention. In the top-k recall stage: Quest retrieves pages, while Fier directly retrieves top-k tokens. The final topk self-attention is identical for both methods. In Tab. 7, we compare the average per-layer latency for each token, as well as the end-to-end generation latency. With the same k, Fier has higher latency than Quest due to its fine-grained top-k sorting in the recall stage. However, Fier achieves better performance. By reducing k , Fier can achieve lower latency than Quest while still maintaining better performance. Overall, Fier offers a better trade-off between latency and performance compared to Quest.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E Comparison of Chatbot Responses</head><p>To better illustrate the practical differences between the two retrieval methods, we deploy a LLaMA-3-8B-Instruct chatbot using Fier and Quest, respectively. Given the same long context and user question, we present the corresponding responses from each chatbot for a qualitative comparison (Fig. <ref type="figure">9</ref>). In the first example, which asks about the orders in which Mufti-e-Azam-e-Hind received Khilafat, the Fier-enabled chatbot</p></div></body>
		</text>
</TEI>
