An emerging solution to address hallucination and enhance accuracy in large language models (LLMs) is Retrieval-Augmented Generation (RAG), which augments LLMs with information retrieved from an external knowledge source, such as the web. This paper profiles several RAG execution pipelines and demystifies the complex interplay between their retrieval and generation phases. We demonstrate that while exact retrieval schemes are expensive, they can reduce inference time compared to approximate retrieval variants, because an exact retrieval model can send a smaller but more accurate list of documents to the generative model while maintaining the same end-to-end accuracy. This observation motivates accelerating exact nearest neighbor search for RAG. In this work, we design the Intelligent Knowledge Store (IKS), a type-2 CXL device that implements a scale-out near-memory acceleration architecture with a novel cache-coherent interface between the host CPU and near-memory accelerators. IKS offers 13.4–27.9× faster exact nearest neighbor search over a 512GB vector database compared with executing the search on Intel Sapphire Rapids CPUs. This higher search performance translates to 1.7–26.3× lower end-to-end inference time for representative RAG applications. IKS is inherently a memory expander; its internal DRAM can be disaggregated and used by other applications running on the server, preventing DRAM, the most expensive component in today's servers, from being stranded.
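The claim above turns on the cost profile of exact search: a brute-force scan touches every vector in the database, which is the work IKS offloads to near-memory accelerators. As a rough, hedged illustration (not code from the paper), exact top-k inner-product search over an embedding matrix looks like this in NumPy:

```python
import numpy as np

def exact_topk(query: np.ndarray, corpus: np.ndarray, k: int = 5) -> np.ndarray:
    """Brute-force exact nearest neighbor search by inner product.

    query:  (d,) embedding of the question
    corpus: (n, d) matrix of document embeddings
    Returns the indices of the k highest-scoring documents.
    """
    scores = corpus @ query                    # one dot product per document: O(n*d)
    top = np.argpartition(scores, -k)[-k:]     # unordered top-k in O(n)
    return top[np.argsort(scores[top])[::-1]]  # order the k winners by score

# Toy example: 10k documents with 768-dim embeddings
rng = np.random.default_rng(0)
docs = rng.standard_normal((10_000, 768), dtype=np.float32)
q = rng.standard_normal(768, dtype=np.float32)
print(exact_topk(q, docs, k=5))
```

The O(n·d) scan is the expense exact retrieval pays; approximate indexes skip it but, per the abstract, must send longer candidate lists to the generative model to reach the same end-to-end accuracy.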
In-Storage Acceleration of Retrieval Augmented Generation as a Service
Retrieval-augmented generation (RAG) services are rapidly gaining adoption in enterprise settings as they combine information retrieval systems (e.g., databases) with large language models (LLMs) to enhance response generation and reduce hallucinations. By augmenting an LLM’s fixed pre-trained knowledge with real-time information retrieval, RAG enables models to effectively extend their context to large knowledge bases by selectively retrieving only the most relevant information. As a result, RAG provides the effect of dynamic updates to the LLM’s knowledge without requiring expensive and time-consuming retraining. While some deployments keep the entire database in memory, RAG services are increasingly shifting toward persistent storage to accommodate ever-growing knowledge bases, enhance utility, and improve cost-efficiency. However, this transition fundamentally reshapes the system’s performance profile: empirical analysis reveals that the Search & Retrieval phase emerges as the dominant contributor to end-to-end latency. This phase typically involves (1) running a smaller language model to generate query embeddings, (2) executing similarity and relevance checks over varying data structures, and (3) performing frequent, long-latency accesses to persistent storage. To address this triad of challenges, we propose a metamorphic in-storage accelerator architecture that provides the necessary programmability to support diverse RAG algorithms, dynamic data structures, and varying computational patterns. The architecture also supports in-storage execution of smaller language models for query embedding generation, while final LLM generation is executed on DGX A100 systems. Experimental results show up to 4.3× and 1.5× improvement in end-to-end throughput compared to conventional retrieval pipelines using Xeon CPUs with NVMe storage and A100 GPUs with DRAM, respectively.
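To make the three-step decomposition concrete, here is a minimal, self-contained sketch of a Search & Retrieval phase in that shape. The toy embedding function, the in-memory "storage" dict, and all names are illustrative assumptions, not components of the proposed accelerator:

```python
import hashlib
import numpy as np

DIM = 64
STORE = {0: "CXL memory expansion ...", 1: "RAG pipeline overview ...", 2: "NVMe access latency ..."}

def embed(text: str) -> np.ndarray:
    """Step 1: stand-in for the smaller embedding language model."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "little")
    v = np.random.default_rng(seed).standard_normal(DIM)
    return v / np.linalg.norm(v)

INDEX = np.stack([embed(t) for t in STORE.values()])  # (n, DIM) vector index

def search(qvec: np.ndarray, k: int) -> list[int]:
    """Step 2: similarity/relevance check; exact cosine scan over unit vectors."""
    return list(np.argsort(INDEX @ qvec)[::-1][:k])

def fetch(doc_ids: list[int]) -> list[str]:
    """Step 3: stand-in for long-latency reads from persistent storage."""
    return [STORE[i] for i in doc_ids]

passages = fetch(search(embed("how does RAG reduce hallucination?"), k=2))
print(passages)  # these passages would be spliced into the final LLM prompt
```

In the storage-resident deployments the abstract describes, step 3 becomes NVMe reads rather than dict lookups, which is why it dominates end-to-end latency.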
- Award ID(s): 2107598
- PAR ID: 10633314
- Publisher / Repository: Proceedings of the 52nd Annual International Symposium on Computer Architecture
- Date Published:
- ISBN: 9798400712616
- Format(s): Medium: X
- Location: Tokyo, Japan
- Sponsoring Org: National Science Foundation
More Like this
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external data sources beyond their training sets and querying predefined knowledge bases to generate accurate, context-rich responses. Most RAG implementations use vector similarity searches, but the effectiveness of this approach and the representation of knowledge bases remain underexplored. Emerging research suggests knowledge graphs as a promising solution. Therefore, this paper presents StructuGraphRAG, which leverages document structures to inform the extraction process and constructs knowledge graphs to enhance RAG for social science research, specifically using NSDUH datasets. Our method parses document structures to extract entities and relationships, constructing comprehensive and relevant knowledge graphs. Experimental results show that StructuGraphRAG outperforms traditional RAG methods in accuracy, comprehensiveness, and contextual relevance. This approach provides a robust tool for social science researchers, facilitating precise analysis of social determinants of health and justice, and underscores the potential of structured document-informed knowledge graph construction in AI and social science research.
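A minimal sketch of the structure-informed idea using networkx: section headings are kept as provenance on each extracted edge so retrieval can respect the document's structure. The sections, triples, and relation names below are invented for illustration and are not drawn from NSDUH:

```python
import networkx as nx

# Toy "document" already parsed into sections -> (head, relation, tail) triples.
doc = {
    "Substance Use": [("adult", "reports_use_of", "alcohol")],
    "Mental Health": [("adult", "reports_symptom", "anxiety")],
}

G = nx.MultiDiGraph()
for section, triples in doc.items():
    for head, rel, tail in triples:
        # Record the section as provenance so graph queries can stay within
        # the structural context each entity came from.
        G.add_edge(head, tail, relation=rel, section=section)

for u, v, data in G.edges(data=True):
    print(f"{u} -[{data['relation']} @ {data['section']}]-> {v}")
```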
As IoT device adoption grows, ensuring cybersecurity compliance with IoT standards, like National Institute of Standards and Technology Interagency (NISTIR) 8259A, has become increasingly complex. These standards are typically presented in lengthy, text-based formats that are difficult to process and query automatically. We built a knowledge graph to address this challenge by representing the key concepts, relationships, and references within NISTIR 8259A. We further integrate this knowledge graph with Retrieval-Augmented Generation (RAG) techniques that can be used by large language models (LLMs) to enhance the accuracy and contextual relevance of information retrieval. Additionally, we evaluate the performance of RAG using both graph-based queries and vector database embeddings. Our framework, implemented in Neo4j, was tested using multiple LLMs, including LLAMA2, Mistral-7B, and GPT-4. Our findings show that combining knowledge graphs with RAG significantly improves query precision and contextual relevance compared to unstructured vector-based retrieval methods. While traditional rule-based compliance tools were not evaluated in this study, our results demonstrate the advantages of structured, graph-driven querying for security standards like NISTIR 8259A.
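A hedged sketch of what the graph-driven retrieval step could look like with the Neo4j Python driver; the connection details, the Control label, the REFERENCES relationship, and the Cypher schema are assumptions for illustration, not the paper's actual graph model:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Hypothetical schema: (:Control)-[:REFERENCES]->(related concept or document).
CYPHER = """
MATCH (c:Control {standard: 'NISTIR 8259A'})-[:REFERENCES]->(r)
WHERE toLower(c.text) CONTAINS toLower($term)
RETURN c.text AS control, collect(r.name) AS refs
LIMIT $k
"""

def graph_retrieve(term: str, k: int = 5) -> list[dict]:
    """Pull matching clauses plus their graph neighborhood as LLM context."""
    with driver.session() as session:
        return [record.data() for record in session.run(CYPHER, term=term, k=k)]

# The structured rows would be serialized into the prompt of the downstream
# LLM (LLAMA2, Mistral-7B, or GPT-4 in the study) instead of raw text chunks.
context = graph_retrieve("device identity")
driver.close()
```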
Large language models (LLMs) have achieved impressive performance but face high computational costs and latency, limiting their deployment in resource-constrained settings. In contrast, small-scale LLMs (SLMs) are more efficient yet struggle to capture evolving real-world knowledge. Retrieval-augmented generation (RAG) helps by integrating external knowledge, but imperfect retrieval can introduce distracting noise that misleads SLMs. We propose a robust RAG framework for SLMs via Margin-aware Preference Optimization. The framework employs multi-turn prompting for detailed reasoning, rejection sampling for high-quality explanations, and contrastive preference selection to refine responses by maximizing the likelihood gap between preferred and non-preferred outputs.
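One plausible reading of "maximizing the likelihood gap between preferred and non-preferred outputs" is a DPO-style pairwise objective with an explicit margin; the sketch below is an assumption about the loss shape, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def margin_preference_loss(logp_chosen: torch.Tensor,
                           logp_rejected: torch.Tensor,
                           beta: float = 0.1,
                           margin: float = 1.0) -> torch.Tensor:
    """Hypothetical margin-aware preference loss: push the SLM's
    log-likelihood of the preferred response above the rejected one
    by at least `margin` before the penalty vanishes."""
    gap = logp_chosen - logp_rejected  # likelihood gap per pair
    return -F.logsigmoid(beta * gap - margin).mean()

# Toy batch of summed token log-probs under the SLM being tuned
chosen = torch.tensor([-12.0, -9.5, -15.2])
rejected = torch.tensor([-14.0, -9.0, -20.1])
print(margin_preference_loss(chosen, rejected))
```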
Classical information retrieval systems such as BM25 rely on exact lexical match and carry out search efficiently with an inverted-list index. Recent neural IR models shift toward soft semantic matching of all query-document terms, but they lose the computational efficiency of exact-match systems. This paper presents COIL, a contextualized exact-match retrieval architecture that brings semantics into lexical matching. COIL scoring is based on overlapping query-document tokens' contextualized representations. The new architecture stores contextualized token representations in inverted lists, bringing together the efficiency of exact match and the representation power of deep language models. Our experimental results show COIL outperforms classical lexical retrievers and state-of-the-art deep LM retrievers with similar or smaller latency.
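To illustrate COIL's scoring rule: each document token's contextualized vector is stored in that token's inverted list, and a document is scored only on tokens it lexically shares with the query, taking the best-matching occurrence per document. The vectors and dimensions below are made up for illustration:

```python
import numpy as np

# inverted_index[token] -> list of (doc_id, contextualized token vector)
inverted_index = {
    "bank": [(0, np.array([0.9, 0.1])), (1, np.array([0.2, 0.8]))],
    "river": [(1, np.array([0.7, 0.3]))],
}

def coil_score(query_tokens: dict[str, np.ndarray]) -> dict[int, float]:
    """Sum, over overlapping tokens, the best occurrence match per document."""
    scores: dict[int, float] = {}
    for tok, qvec in query_tokens.items():
        best: dict[int, float] = {}
        for doc_id, dvec in inverted_index.get(tok, []):
            s = float(qvec @ dvec)
            best[doc_id] = max(s, best.get(doc_id, float("-inf")))  # max over occurrences
        for doc_id, s in best.items():
            scores[doc_id] = scores.get(doc_id, 0.0) + s
    return scores

query = {"bank": np.array([1.0, 0.0]), "river": np.array([0.5, 0.5])}
print(coil_score(query))  # doc 1 matches both tokens; doc 0 only "bank"
```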