skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


This content will become publicly available on April 24, 2026

Title: MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models
Artificial Intelligence (AI) has demonstrated significant potential in healthcare, particularly in disease diagnosis and treatment planning. Recent progress in Medical Large Vision-Language Models (Med-LVLMs) has opened up new possibilities for interactive diagnostic tools. However, these models often suffer from factual hallucination, which can lead to incorrect diagnoses. Fine-tuning and retrieval-augmented generation (RAG) have emerged as methods to address these issues. However, the amount of high-quality data and distribution shifts between training data and deployment data limit the application of fine-tuning methods. Although RAG is lightweight and effective, existing RAG-based approaches are not sufficiently general to different medical domains and can potentially cause misalignment issues, both between modalities and between the model and the ground truth. In this paper, we propose a versatile multimodal RAG system, MMed-RAG, designed to enhance the factuality of Med-LVLMs. Our approach introduces a domain-aware retrieval mechanism, an adaptive retrieved contexts selection, and a provable RAG-based preference fine-tuning strategy. These innovations make the RAG process sufficiently general and reliable, significantly improving alignment when introducing retrieved contexts. Experimental results across five medical datasets (involving radiology, ophthalmology, pathology) on medical VQA and report generation demonstrate that MMed-RAG can achieve an average improvement of 43.8% in factual accuracy in the factual accuracy of Med-LVLMs.  more » « less
Award ID(s):
2340241
PAR ID:
10601193
Author(s) / Creator(s):
; ; ; ; ; ; ; ;
Publisher / Repository:
ICLR
Date Published:
ISBN:
978-1-7138-9865-8
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The recent emergence of Medical Large Vision Language Models (Med-LVLMs) has enhanced medical diagnosis. However, current Med-LVLMs frequently encounter factual issues, often generating responses that do not align with established medical facts. Retrieval-Augmented Generation (RAG), which utilizes external knowledge, can improve the factual accuracy of these models but introduces two major challenges. First, limited retrieved contexts might not cover all necessary information, while excessive retrieval can introduce irrelevant and inaccurate references, interfering with the model’s generation. Second, in cases where the model originally responds correctly, applying RAG can lead to an over-reliance on retrieved contexts, resulting in incorrect answers. To address these issues, we propose RULE, which consists of two components. First, we introduce a provably effective strategy for controlling factuality risk through the calibrated selection of the number of retrieved contexts. Second, based on samples where over-reliance on retrieved contexts led to errors, we curate a preference dataset to fine-tune the model, balancing its dependence on inherent knowledge and retrieved contexts for generation. We demonstrate the effectiveness of RAFE on three medical VQA datasets, achieving an average improvement of 20.8% in factual accuracy. 
    more » « less
  2. Retrieval-augmented generation (RAG) systems can effectively address user queries by leveraging indexed document corpora to retrieve the relevant contexts. Ranking techniques have been adopted in RAG systems to sort the retrieved contexts by their relevance to the query so that users can select the most useful contexts for their downstream tasks. While many existing ranking methods rely on the similarity between the embedding vectors of the context and query to measure relevance, it is important to note that similarity does not equate to relevance in all scenarios. Some ranking methods use large language models (LLMs) to rank the contexts by putting the query and the candidate contexts in the prompt and asking LLM about their relevance. The scalability of those methods is contingent on the number of candidate contexts and the context window of those LLMs. Also, those methods require fine-tuning the LLMs, which can be computationally expensive and require domain-related data. In this work, we propose a scalable ranking framework that does not involve LLM training. Our framework uses an off-the-shelf LLM to hypothesize the user's query based on the retrieved contexts and ranks the contexts based on the similarity between the hypothesized queries and the user query. Our framework is efficient at inference time and is compatible with many other context retrieval and ranking techniques. Experimental results show that our method improves the ranking performance of retrieval systems in multiple benchmarks. 
    more » « less
  3. Misinformation regarding climate change is a key roadblock in addressing one of the most serious threats to humanity. This paper investigates factual accuracy in large language models (LLMs) regarding climate information. Using true/false labeled Q&A data for fine-tuning and evaluating LLMs on climate-related claims, we compare open-source models, assessing their ability to generate truthful responses to climate change questions. We investigate the detectability of models intentionally poisoned with false climate information, finding that such poisoning may not affect the accuracy of a model’s responses in other domains. Furthermore, we compare the effectiveness of unlearning algorithms, fine-tuning, and Retrieval-Augmented Generation (RAG) for factually grounding LLMs on climate change topics. Our evaluation reveals that unlearning algorithms can be effective for nuanced conceptual claims, despite previous findings suggesting their inefficacy in privacy contexts. These insights aim to guide the development of more factually reliable LLMs and highlight the need for additional work to secure LLMs against misinformation attacks. 
    more » « less
  4. In closed-domain Question Answering (QA), Large Language Models (LLMs) often fail to deliver responses specialized enough for niche subdomains. Broadly trained models may not capture the nuanced terminology and contextual precision required in these fields, which frequently lack domain-specific conversational data and face computational constraints. To address this, we propose a methodology leveraging a Retrieval-Augmented Generation (RAG) framework that integrates data extraction with fine-tuning using domain-specific question-answer pairs. Our approach employs Question-Answer Generation (QAG) to create tailored training datasets, enabling fine-tuned models to incorporate specialized jargon and context while remaining computationally accessible to domain experts. To exemplify this methodology, we demonstrate its application within the medical domain through a case study centered on the creation of a dementia care chat assistant. A significant benefit of this approach lies in its ease of replication across various domains and scalability for integration into diverse user groups, making it a versatile solution for enhancing chat assistants. 
    more » « less
  5. Retrieval Augmented Generation (RAG) has been a recent improvement in providing recent and accurate data to Large Language Models (LLMs). Although RAG has been successful in reducing hallucinations within LLMs, it remains susceptible to inaccurate and maliciously manipulated data. In this paper, we present Distributed-RAG (D-RAG), a novel blockchain-based framework designed to increase the integrity of the RAG system. D-RAG addresses the risks of malicious data by replacing the RAG’s traditionally centralized database with communities, each consisting of a database and a permissioned blockchain. The communities are based on different subjects, each containing experts in the field who verify data through a privacy-preserving consensus protocol before it is added to the database. A Retrieval Blockchain is also designed to communicate between the multiple communities. The miners on this Retrieval Blockchain are responsible for retrieving documents from the database for each query and ranking them using an LLM. These rankings are agreed upon, and the top ranked documents are provided to the LLM with the query to generate a response. We perform experiments on our proposed D-RAG framework, and our results show that our Retrieval Blockchain is scalable and our privacy-preserving consensus protocol maintains efficiency as community members increase. These results demonstrate that in a real-world application setting D-RAG is scalable in maintaining data integrity. 
    more » « less