skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


This content will become publicly available on May 27, 2026

Title: Understanding Synthetic Context Extension via Retrieval Heads
Long-context LLMs are increasingly in demand for applications such as retrieval-augmented generation. To defray the cost of pretraining LLMs over long contexts, recent work takes an approach of synthetic context extension: fine-tuning LLMs with synthetically generated long-context data in a post-training stage. However, it remains unclear how and why this synthetic context extension imparts abilities for downstream long-context tasks. In this paper, we investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning. We vary the realism of "needle" concepts to be retrieved and diversity of the surrounding "haystack" context, from using LLMs to construct synthetic documents to using templated relations and creating symbolic datasets. We find that models trained on synthetic data fall short of the real data, but surprisingly, the mismatch can be interpreted and even predicted in terms of a special set of attention heads that are responsible for retrieval over long context, retrieval heads (Wu et al., 2024). The retrieval heads learned on synthetic data have high overlap with retrieval heads learned on real data, and there is a strong correlation between the recall of heads learned and the downstream performance of a model. Furthermore, with attention knockout and activation patching, we mechanistically show that retrieval heads are necessary and explain model performance, although they are not totally sufficient. Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world capabilities over long contexts.  more » « less
Award ID(s):
2505865
PAR ID:
10631890
Author(s) / Creator(s):
; ;
Publisher / Repository:
ICML 2025
Date Published:
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Retrieval-augmented generation (RAG) systems can effectively address user queries by leveraging indexed document corpora to retrieve the relevant contexts. Ranking techniques have been adopted in RAG systems to sort the retrieved contexts by their relevance to the query so that users can select the most useful contexts for their downstream tasks. While many existing ranking methods rely on the similarity between the embedding vectors of the context and query to measure relevance, it is important to note that similarity does not equate to relevance in all scenarios. Some ranking methods use large language models (LLMs) to rank the contexts by putting the query and the candidate contexts in the prompt and asking LLM about their relevance. The scalability of those methods is contingent on the number of candidate contexts and the context window of those LLMs. Also, those methods require fine-tuning the LLMs, which can be computationally expensive and require domain-related data. In this work, we propose a scalable ranking framework that does not involve LLM training. Our framework uses an off-the-shelf LLM to hypothesize the user's query based on the retrieved contexts and ranks the contexts based on the similarity between the hypothesized queries and the user query. Our framework is efficient at inference time and is compatible with many other context retrieval and ranking techniques. Experimental results show that our method improves the ranking performance of retrieval systems in multiple benchmarks. 
    more » « less
  2. Recent work in interpretability shows that large language models (LLMs) can be adapted for new tasks in a learning-free way: it is possible to intervene on LLM representations to elicit desired behaviors for alignment. For instance, adding certain bias vectors to the outputs of certain attention heads is reported to boost the truthfulness of models. In this work, we show that localized fine-tuning serves as an effective alternative to such representation intervention methods. We introduce a framework called Localized Fine-Tuning on LLM Representations (LoFiT), which identifies a subset of attention heads that are most important for learning a specific task, then trains offset vectors to add to the model's hidden representations at those selected heads. LoFiT localizes to a sparse set of heads (3%-10%) and learns the offset vectors from limited training data, comparable to the settings used for representation intervention. For truthfulness and reasoning tasks, we find that LoFiT's intervention vectors are more effective for LLM adaptation than vectors from representation intervention methods such as Inference-time Intervention. We also find that the localization step is important: selecting a task-specific set of attention heads can lead to higher performance than intervening on heads selected for a different task. Finally, across 7 tasks we study, LoFiT achieves comparable performance to other parameter-efficient fine-tuning methods such as LoRA, despite modifying 20x-200x fewer parameters than these methods. 
    more » « less
  3. Large language models (LLMs) such as GPT-4 sometimes appear to be creative, solving novel tasks often with a few demonstrations in the prompt. These tasks require the models to generalize on distributions different from those from training data—which is known as out-of-distribution (OOD) generalization. Despite the tremendous success of LLMs, how they approach OOD generalization remains an open and underexplored question. We examine OOD generalization in settings where instances are generated according to hidden rules, including in-context learning with symbolic reasoning. Models are required to infer the hidden rules behind input prompts without any fine-tuning. We empirically examined the training dynamics of Transformers on a synthetic example and conducted extensive experiments on a variety of pretrained LLMs, focusing on a type of component known as induction heads. We found that OOD generalization and composition are tied together—models can learn rules by composing two self-attention layers, thereby achieving OOD generalization. Furthermore, a shared latent subspace in the embedding (or feature) space acts as a bridge for composition by aligning early layers and later layers, which we refer to as the common bridge representation hypothesis. 
    more » « less
  4. Prompt-tuning is an emerging strategy to adapt large language models (LLM) to downstream tasks by learning a (soft-)prompt parameter from data. Despite its success in LLMs, there is limited theoretical understanding of the power of prompt-tuning and the role of the attention mechanism in prompting. In this work, we explore prompt-tuning for one-layer attention architectures and study contextual mixture-models where each input token belongs to a context-relevant or -irrelevant set. We isolate the role of prompttuning through a self-contained prompt-attention model. Our contributions are as follows: (1) We show that softmax-prompt-attention is provably more expressive than softmax-self-attention and linear-prompt-attention under our contextual data model. (2) We analyze the initial trajectory of gradient descent and show that it learns the prompt and prediction head with near-optimal sample complexity and demonstrate how the prompt can provably attend to sparse context-relevant tokens. (3) Assuming a known prompt but an unknown prediction head, we characterize the exact finite sample performance of prompt-attention which reveals the fundamental performance limits and the precise benefit of the context information. We also provide experiments that verify our theoretical insights on real datasets and demonstrate how prompt-tuning enables the model to attend to context-relevant information. 
    more » « less
  5. Large language models (LLMs) with billions of parameters and pretrained on massive amounts of data are now capable of near or better than state-of-the-art performance in a variety of downstream natural language processing tasks. Neural machine translation (NMT) is one such task that LLMs have been applied to with great success. However, little research has focused on applying LLMs to the more difficult subset of NMT called simultaneous translation (SimulMT), where translation begins before the entire source context is available to the model. In this paper, we address key challenges facing LLMs fine-tuned for SimulMT, validate classical SimulMT concepts and practices in the context of LLMs, explore adapting LLMs that are fine-tuned for NMT to the task of SimulMT, and introduce Simul-LLM, the first open-source fine-tuning and evaluation pipeline development framework for LLMs focused on SimulMT. 
    more » « less