This content will become publicly available on May 23, 2026

Title: Advancing Medical Video Question Answering Through Large Language Models, Temporal Localization and Causal Reasoning
In this research, we take a novel approach to the Video Corpus Visual Answer Localization (VCVAL) task on the MedVidQA dataset, extending it with causal inference for medical videos. Leveraging the state-of-the-art GPT-4 and Gemini 1.5 Pro models, our system localizes temporal segments in videos and analyzes cause-effect relationships in subtitles to support medical decision-making. This paper extends the work from the MedVidQA challenge by introducing causality extraction to improve the interpretability of localized video content: subtitles are segmented into causal units such as cause, effect, condition, action, and signal, and prompts guide GPT-4 and Gemini 1.5 Pro in detecting and quantifying causal structures, covering both explicit and implicit relationships, including those spanning multiple subtitle fragments. Our results show that both models perform better when handling queries individually than in batch processing, for temporal localization as well as causality extraction; preliminary results also indicate that performance varies widely across videos, with both models doing well on some and struggling on most. Integrating temporal localization with causal inference can substantially improve the scalability and overall performance of medical video analysis. Our work demonstrates how AI systems can uncover valuable insights from medical videos, driving progress in medical AI applications and Health Informatics.
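The two-stage pipeline the abstract describes can be pictured with a short sketch. The following is a minimal illustration, not the authors' released code: it assumes the OpenAI Python client and an OPENAI_API_KEY in the environment, and the helper ask() and both prompts are hypothetical wording, not the paper's actual prompts.

# Minimal sketch of the described pipeline: per-query temporal
# localization plus causal-unit extraction from subtitles.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CAUSAL_PROMPT = (
    "You are given subtitles from a medical instructional video.\n"
    "Identify causal units and label each span as one of: cause, "
    "effect, condition, action, signal. Link explicit and implicit "
    "cause-effect pairs, including pairs that span multiple subtitle "
    "fragments. Return JSON: "
    '{"units": [{"text": ..., "label": ...}], '
    '"relations": [{"cause": ..., "effect": ...}]}'
)

LOCALIZE_PROMPT = (
    "Given a medical question and timestamped subtitles, return the "
    "start and end times of the answer segment as JSON: "
    '{"start": "MM:SS", "end": "MM:SS"}'
)

def ask(system_prompt: str, user_content: str) -> dict:
    """Send one query per call; the abstract reports that per-query
    prompting outperforms batching several queries in one request."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
        ],
    )
    # Sketch only: assumes the model returns valid JSON.
    return json.loads(response.choices[0].message.content)

subtitles = "00:45 Apply firm pressure to the wound. 00:52 This slows the bleeding."
segment = ask(LOCALIZE_PROMPT, "How do you stop bleeding?\n" + subtitles)
causes = ask(CAUSAL_PROMPT, subtitles)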
Award ID(s):
2141124
PAR ID:
10584359
Author(s) / Creator(s):
Publisher / Repository:
7th International Congress on Human-Computer Interaction, Optimization and Robotic Applications
Date Published:
Subject(s) / Keyword(s):
Medical Video QA, Causal Reasoning, Multimodal LLMs, Health Informatics, AI in Health
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. This article examines the challenges and opportunities in extracting causal information from text with Large Language Models (LLMs). It first establishes the importance of causality extraction and then explores different views on causality, including common sense ideas informing different data annotation schemes, Aristotle’s Four Causes, and Pearl’s Ladder of Causation. The paper notes the relevance of this conceptual variety for the task. The text reviews datasets and work related to finding causal expressions, both using traditional machine learning methods and LLMs. Although the known limitations of LLMs—hallucinations and lack of common sense—affect the reliability of causal findings, GPT and Gemini models (GPT-5 and Gemini 2.5 Pro and others) show the ability to conduct causality analysis; moreover, they can even apply different perspectives, such as counterfactual and Aristotelian. They are also capable of explaining and critiquing causal analyses: we report an experiment showing that in addition to largely flawless analyses, the newer models exhibit very high agreement of 88–91% on causal relationships between events—much higher than the typically reported inter-annotator agreement of 30–70%. The article concludes with a discussion of the lessons learned about these challenges and questions how LLMs might help address them in the future. For example, LLMs should help address the sparsity of annotated data. Moreover, LLMs point to a future where causality analysis in texts focuses not on annotations but on understanding, as causality is about semantics and not word spans. The Appendices and shared data show examples of LLM outputs on tasks involving causal reasoning and causal information extraction, demonstrating the models’ current abilities and limits. 
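As a concrete illustration of the agreement figures mentioned above, a simple percentage-agreement computation over candidate event pairs might look like the sketch below; the labels shown are invented examples, not the article's data.

# Minimal sketch: percentage agreement between two models' binary
# causal/not-causal judgments over the same candidate event pairs.

def percent_agreement(labels_a: list[str], labels_b: list[str]) -> float:
    """Fraction of candidate event pairs on which two annotators
    (here, two LLMs) assign the same label."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

gpt_labels    = ["causal", "causal", "not-causal", "causal"]
gemini_labels = ["causal", "not-causal", "not-causal", "causal"]
print(f"Agreement: {percent_agreement(gpt_labels, gemini_labels):.0%}")  # 75%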
  2. The need to efficiently find the video content a user wants is growing with the eruption of user-generated videos on the Web. Existing keyword-based or content-based video retrieval methods usually determine what occurs in a video, but not when and where. In this paper, we answer the question of when and where by formulating a new task, namely spatio-temporal video re-localization. Specifically, given a query video and a reference video, spatio-temporal video re-localization aims to localize tubelets in the reference video such that the tubelets semantically correspond to the query. To accurately localize the desired tubelets in the reference video, we propose a novel warp LSTM network, which propagates spatio-temporal information over a long period and thereby captures the corresponding long-term dependencies. Another issue for spatio-temporal video re-localization is the lack of properly labeled video datasets; we therefore reorganize the videos in the AVA dataset to form a new dataset for spatio-temporal video re-localization research. Extensive experimental results show that the proposed model achieves superior performance over the designed baselines on the spatio-temporal video re-localization task.
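For readers unfamiliar with tubelet-level scoring, the sketch below shows one generic way to measure spatio-temporal overlap (per-frame box IoU averaged over the union of frames); it is an illustrative metric, not necessarily this paper's exact evaluation protocol.

# Minimal sketch of spatio-temporal tubelet overlap scoring.

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)

def box_iou(a: Box, b: Box) -> float:
    """Standard intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def tubelet_iou(pred: dict[int, Box], gt: dict[int, Box]) -> float:
    """Tubelets map frame index -> box; frames present in only one
    tubelet count as zero overlap, penalizing temporal misalignment."""
    frames = set(pred) | set(gt)
    return sum(
        box_iou(pred[f], gt[f]) if f in pred and f in gt else 0.0
        for f in frames
    ) / len(frames)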
  3. Temporal grounding, also known as video moment retrieval, aims at locating video segments corresponding to a given query sentence. The compositional nature of natural language enables localization beyond predefined events, posing a certain challenge to the compositional generalizability of existing methods. Recent studies establish the correspondence between videos and queries through a decompose-reconstruct manner to achieve compositional generalization. However, they only consider dominant primitives and build negative queries through random sampling and recombination, resulting in semantically implausible negatives that hinder the models from learning rational compositions. In addition, recent DETR-based methods still underperform in compositional temporal grounding, showing irrational saliency responses when given negative queries that have subtle differences from positive queries. To address these limitations, we first propose a large language model-driven method for negative query construction, utilizing GPT-3.5 Turbo to generate semantically plausible hard negative queries. Subsequently, we introduce a coarse-to-fine saliency ranking strategy, which encourages the model to learn the multi-granularity semantic relationships between videos and hierarchical negative queries to boost compositional generalization. Extensive experiments on two challenging benchmarks validate the effectiveness and generalizability of our proposed method. Our code is available at https://github.com/zxccade/SHINE.
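The negative-query construction step can be sketched as a single LLM call; the prompt wording below is illustrative, not the authors' exact SHINE prompt, and the sketch assumes the OpenAI Python client with an API key in the environment.

# Minimal sketch: LLM-driven hard negative query construction.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def hard_negative(query: str) -> str:
    """Ask GPT-3.5 Turbo to minimally edit one primitive (an action or
    object) so the query stays fluent and plausible but no longer
    matches the grounded moment."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": (
                "Rewrite this video moment query by changing exactly one "
                "action or object so it describes a different but "
                "plausible moment. Return only the rewritten query.\n"
                f"Query: {query}"
            ),
        }],
    )
    return response.choices[0].message.content.strip()

print(hard_negative("a man picks up the guitar and starts playing"))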
  4. Exposing students to low-quality assessments such as multiple-choice questions (MCQs) and short answer questions (SAQs) is detrimental to their learning, making it essential to evaluate these assessments accurately. Existing evaluation methods are often challenging to scale and fail to consider the assessments' pedagogical value within course materials. Online crowds offer a scalable and cost-effective source of intelligence but often lack the necessary domain expertise. Advancements in Large Language Models (LLMs) offer automation and scalability but may also lack precise domain knowledge. To explore these trade-offs, we compare the effectiveness and reliability of crowdsourced and LLM-based methods for assessing the quality of 30 MCQs and SAQs across six educational domains using two standardized evaluation rubrics. We analyzed the performance of 84 crowdworkers from Amazon's Mechanical Turk and Prolific, comparing their quality evaluations to those made by three LLMs: GPT-4, Gemini 1.5 Pro, and Claude 3 Opus. We found that crowdworkers on Prolific consistently delivered the highest-quality assessments, and GPT-4 emerged as the most effective LLM for this task. Our study reveals that while traditional crowdsourced methods often yield more accurate assessments, LLMs can match this accuracy on specific evaluative criteria. These results provide evidence for a hybrid approach to educational content evaluation, integrating the scalability of AI with the nuanced judgment of humans. We offer feasibility considerations for using AI to supplement human judgment in educational assessment.
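The LLM side of such an evaluation can be pictured as a rubric-scoring call. The sketch below is illustrative only: the criteria are invented placeholders, not the study's standardized rubrics, and it assumes the OpenAI Python client.

# Minimal sketch: rubric-based MCQ quality rating with an LLM.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

RUBRIC = [
    "The stem asks a single, clear question.",
    "Exactly one option is defensibly correct.",
    "Distractors are plausible and grammatically parallel.",
]

def rate_mcq(question: str) -> dict:
    """Score one multiple-choice question on each rubric criterion (1-5)."""
    criteria = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(RUBRIC))
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                "Rate the question below on each criterion from 1 to 5 "
                'and return JSON like {"scores": [...], "rationale": "..."}.\n'
                f"Criteria:\n{criteria}\n\nQuestion:\n{question}"
            ),
        }],
    )
    # Sketch only: assumes the model returns valid JSON.
    return json.loads(response.choices[0].message.content)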