Title: MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models
Artificial Intelligence (AI) has demonstrated significant potential in healthcare, particularly in disease diagnosis and treatment planning. Recent progress in Medical Large Vision-Language Models (Med-LVLMs) has opened up new possibilities for interactive diagnostic tools. However, these models often suffer from factual hallucination, which can lead to incorrect diagnoses. Fine-tuning and retrieval-augmented generation (RAG) have emerged as methods to address these issues. However, the limited amount of high-quality data and distribution shifts between training data and deployment data limit the application of fine-tuning methods. Although RAG is lightweight and effective, existing RAG-based approaches do not generalize sufficiently across medical domains and can cause misalignment, both between modalities and between the model and the ground truth. In this paper, we propose a versatile multimodal RAG system, MMed-RAG, designed to enhance the factuality of Med-LVLMs. Our approach introduces a domain-aware retrieval mechanism, an adaptive method for selecting retrieved contexts, and a provable RAG-based preference fine-tuning strategy. These innovations make the RAG process sufficiently general and reliable, significantly improving alignment when introducing retrieved contexts. Experimental results across five medical datasets (involving radiology, ophthalmology, and pathology) on medical VQA and report generation demonstrate that MMed-RAG can achieve an average improvement of 43.8% in the factual accuracy of Med-LVLMs.
Award ID(s):
2340241
PAR ID:
10601193
Author(s) / Creator(s):
; ; ; ; ; ; ; ;
Publisher / Repository:
ICLR
Date Published:
ISBN:
978-1-7138-9865-8
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The recent emergence of Medical Large Vision Language Models (Med-LVLMs) has enhanced medical diagnosis. However, current Med-LVLMs frequently encounter factual issues, often generating responses that do not align with established medical facts. Retrieval-Augmented Generation (RAG), which utilizes external knowledge, can improve the factual accuracy of these models but introduces two major challenges. First, too few retrieved contexts might not cover all necessary information, while excessive retrieval can introduce irrelevant and inaccurate references, interfering with the model’s generation. Second, in cases where the model originally responds correctly, applying RAG can lead to an over-reliance on retrieved contexts, resulting in incorrect answers. To address these issues, we propose RULE, which consists of two components. First, we introduce a provably effective strategy for controlling factuality risk through the calibrated selection of the number of retrieved contexts. Second, based on samples where over-reliance on retrieved contexts led to errors, we curate a preference dataset to fine-tune the model, balancing its dependence on inherent knowledge and retrieved contexts for generation. We demonstrate the effectiveness of RULE on three medical VQA datasets, achieving an average improvement of 20.8% in factual accuracy.
  2. Large vision language models (LVLMs) have achieved superior performance on natural image and text tasks, inspiring extensive fine‐tuning research. However, their robustness against hallucination in clinical contexts remains understudied. We propose the Medical Visual Hallucination Test (MedVH), a novel evaluation framework assessing hallucination tendencies in both medical‐specific and general‐purpose LVLMs. MedVH encompasses six tasks targeting medical hallucinations, including two traditional tasks and four novel tasks formatted as multi‐choice visual question answering and long response generation. Our extensive experiments with six evaluation metrics reveal that medical LVLMs, despite promising performance on standard medical tasks, are particularly susceptible to hallucinations—often more so than general models. This raises significant concerns about domain‐specific model reliability. For real‐world applications, medical LVLMs must accurately integrate medical knowledge while maintaining robust reasoning to prevent hallucination. We explore mitigation methods without model‐specific fine‐tuning, including prompt engineering and collaboration between general and domain‐specific models. Our work provides a foundation for future evaluation studies. The dataset is available at PhysioNet: https://physionet.org/content/medvh. 
  3. Retrieval-augmented generation (RAG) systems can effectively address user queries by leveraging indexed document corpora to retrieve the relevant contexts. Ranking techniques have been adopted in RAG systems to sort the retrieved contexts by their relevance to the query so that users can select the most useful contexts for their downstream tasks. While many existing ranking methods rely on the similarity between the embedding vectors of the context and query to measure relevance, similarity does not equate to relevance in all scenarios. Some ranking methods use large language models (LLMs) to rank the contexts by putting the query and the candidate contexts in the prompt and asking the LLM about their relevance. The scalability of those methods is contingent on the number of candidate contexts and the context window of those LLMs. Also, those methods require fine-tuning the LLMs, which can be computationally expensive and require domain-related data. In this work, we propose a scalable ranking framework that does not involve LLM training. Our framework uses an off-the-shelf LLM to hypothesize the user's query based on the retrieved contexts and ranks the contexts based on the similarity between the hypothesized queries and the user query. Our framework is efficient at inference time and is compatible with many other context retrieval and ranking techniques. Experimental results show that our method improves the ranking performance of retrieval systems in multiple benchmarks.
  4. Misinformation regarding climate change is a key roadblock in addressing one of the most serious threats to humanity. This paper investigates factual accuracy in large language models (LLMs) regarding climate information. Using true/false labeled Q&A data for fine-tuning and evaluating LLMs on climate-related claims, we compare open-source models, assessing their ability to generate truthful responses to climate change questions. We investigate the detectability of models intentionally poisoned with false climate information, finding that such poisoning may not affect the accuracy of a model’s responses in other domains. Furthermore, we compare the effectiveness of unlearning algorithms, fine-tuning, and Retrieval-Augmented Generation (RAG) for factually grounding LLMs on climate change topics. Our evaluation reveals that unlearning algorithms can be effective for nuanced conceptual claims, despite previous findings suggesting their inefficacy in privacy contexts. These insights aim to guide the development of more factually reliable LLMs and highlight the need for additional work to secure LLMs against misinformation attacks. 
  5. Retrieval-Augmented Generation (RAG) improves factual grounding in large language models but suffers from substantial latency due to synchronous retrieval. While recent work explores asynchronous retrieval, existing approaches rely on heuristic coordination between retrieval and generation and assume stable information demands during decoding that often break in complex, multi-domain settings. In this paper, we propose an advanced asynchronous retrieval framework that enables predictive prefetching aligned with evolving information needs. The framework explicitly predicts when retrieval should be triggered and what information should be retrieved using three components, a retrieval predictor, a context monitor, and a query generator, by exploiting semantic precursors in generation dynamics that emerge several tokens before uncertainty becomes critical. Experiments on multiple benchmarks demonstrate up to 43.5% end-to-end latency reduction and 62.4% improvement in time-to-first-token, while maintaining answer quality comparable to synchronous RAG baselines. 
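The ranking idea described in item 3 (hypothesize a query from each retrieved context, then rank contexts by how close the hypothesized query is to the user's query) can be illustrated with a minimal sketch. This is not the authors' implementation: `rank_contexts`, `toy_hypothesizer`, `bag_of_words`, and `cosine` are hypothetical names, the bag-of-words vectors stand in for a real embedding model, and `toy_hypothesizer` stands in for the off-the-shelf LLM call.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding model (assumption): lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_contexts(
    user_query: str,
    contexts: List[str],
    hypothesize_query: Callable[[str], str],
) -> List[Tuple[float, str]]:
    """Score each context by the similarity between the user's query and the
    query hypothesized from that context, then sort by descending score."""
    q_vec = bag_of_words(user_query)
    scored = [
        (cosine(q_vec, bag_of_words(hypothesize_query(ctx))), ctx)
        for ctx in contexts
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_hypothesizer(context: str) -> str:
    """Hypothetical placeholder for the LLM: turns the first sentence of the
    context into a question. A real system would prompt an LLM here."""
    first_sentence = context.split(".")[0]
    return f"What is known about {first_sentence}?"


if __name__ == "__main__":
    docs = [
        "Sea ice extent has declined in the Arctic. Satellites track it.",
        "Diabetic retinopathy damages retinal blood vessels. It is graded on fundus images.",
    ]
    for score, ctx in rank_contexts(
        "How is diabetic retinopathy graded?", docs, toy_hypothesizer
    ):
        print(f"{score:.3f}  {ctx}")
```

Passing the hypothesizer in as a callable keeps the ranking step independent of any particular LLM, which matches the abstract's point that no LLM training is involved; swapping in a real model and real embeddings would not change the structure of `rank_contexts`.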