
Title: Medical Image Interpretation with Large Multimodal Models
This working note documents the participation of CS_Morgan in the ImageCLEFmedical 2024 Caption subtasks, focusing on the Caption Prediction and Concept Detection challenges. The primary objectives were to train, validate, and test multimodal Artificial Intelligence (AI) models intended to automate caption generation and multi-concept identification for radiology images. The dataset used is a subset of the Radiology Objects in COntext version 2 (ROCOv2) dataset and contains image-caption pairs with corresponding Unified Medical Language System (UMLS) concepts. For the caption prediction challenge, different variants of the Large Language and Vision Assistant (LLaVA) models were experimented with and tailored to the medical domain. Additionally, a lightweight Large Multimodal Model (LMM) and MoonDream2, a small Vision Language Model (VLM), were explored; the former is the instruct variant of the Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS (IDEFICS) 9B, obtained through quantization. Besides LMMs, conventional encoder-decoder models such as Vision Generative Pre-trained Transformer 2 (visionGPT2) and Convolutional Neural Network-Transformer (CNN-Transformer) architectures were considered. In total, this enabled 10 submissions for the caption prediction task, with the first submission, LLaVA 1.6 on the Mistral 7B weights, securing the 2nd position among the participants. This model was adapted by fine-tuning 40.1M parameters and achieved the best performance on the test data across the metrics of BERTScore (0.628059), ROUGE (0.250801), BLEU-1 (0.209298), BLEURT (0.317385), METEOR (0.092682), CIDEr (0.245029), and RefCLIPScore (0.815534). For the concept detection task, our single submission, based on the ConvMixer architecture, a hybrid approach that leverages the advantages of CNNs and Transformers, ranked 9th with an F1-score of 0.107645. Overall, the test-data evaluations for the caption prediction submissions suggest that LMMs, quantized LMMs, and small VLMs, when adapted and selectively fine-tuned with fewer parameters, have ample potential for understanding the medical concepts present in images.
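The 2nd-place caption-prediction run described above was obtained by adapting LLaVA 1.6 on the Mistral 7B weights with only about 40.1M trainable parameters. The paper's actual training code is not reproduced here; the following is a minimal, hypothetical sketch of how such a parameter-efficient adaptation could be set up with Hugging Face Transformers and PEFT. The checkpoint id, LoRA rank, and target modules are assumptions for illustration, not the authors' settings.

```python
# Hypothetical sketch: parameter-efficient adaptation of LLaVA 1.6 (Mistral 7B).
# Checkpoint id, LoRA rank, and target modules are assumptions, not the authors' settings.
import torch
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
from peft import LoraConfig, get_peft_model

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # assumed public checkpoint
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Freeze the 7B base model and attach low-rank adapters to the attention projections,
# so only a few tens of millions of parameters are updated during fine-tuning.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports trainable vs. total parameter counts
```

After training on image-caption pairs, captions would be generated by running the processor and model.generate on each test image; decoding settings would again be implementation choices rather than anything specified in the abstract.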
Award ID(s):
2131207
PAR ID:
10561037
Editor(s):
Faggioli, G; Ferro, N; Galuščáková, P; García Seco de Herrera, A
Publisher / Repository:
CEUR Workshop Proceedings 3740, CEUR-WS.org 2024
Date Published:
Journal Name:
CEUR workshop proceedings
ISSN:
1613-0073
Format(s):
Medium: X
Location:
https://ceur-ws.org/Vol-3740/paper-151.pdf
Sponsoring Org:
National Science Foundation
More Like this
  1. Aliannejadi, M; Faggioli, G; Ferro, N; Vlachos, M. (Ed.)
    This work discusses the participation of CS_Morgan in the Concept Detection and Caption Prediction tasks of the ImageCLEFmedical 2023 Caption benchmark evaluation campaign. The goal of this task is to automatically identify relevant concepts and their locations in images, as well as to generate coherent captions for the images. The dataset used for this task is a subset of the extended Radiology Objects in COntext (ROCO) dataset. Our implementation approach used pre-trained Convolutional Neural Networks (CNNs), Vision Transformer (ViT), and Text-to-Text Transfer Transformer (T5) architectures, leveraged to handle the different aspects of the tasks, such as concept detection and caption generation. In the Concept Detection task, the objective was to classify multiple concepts associated with each image. We utilized several deep learning architectures with 'sigmoid' activation to enable multilabel classification using the Keras framework (a minimal illustration of this set-up appears after this list). We submitted a total of five (5) runs for this task, and the best run achieved an F1 score of 0.4834, indicating its effectiveness in detecting relevant concepts in the images. For the Caption Prediction task, we successfully submitted eight (8) runs, combining the ViT and T5 models to generate captions for the images. The caption prediction ranking is based on BERTScore, and our best run achieved a score of 0.5819 by generating captions with the fine-tuned T5 model from keywords produced by the pretrained ViT encoder.
  2. Faggioli, G.; Ferro, N.; Hanbury, A.; Potthast, M. (Ed.)
    This paper describes the participation of Morgan_CS in both the Concept Detection and Caption Prediction tasks of the ImageCLEFmedical 2022 Caption task. The task required participants to automatically identify the presence and location of relevant concepts and to compose coherent captions for images in a large corpus, a subset of the extended Radiology Objects in COntext (ROCO) dataset. Our implementation uses encoder-decoder based sequence-to-sequence models for caption and concept generation, built on pre-trained Text and Vision Transformers (ViTs). In addition, the Concept Detection task is treated as a multi-label classification problem, where several deep learning architectures with "sigmoid" activation are used to enable multilabel classification with Keras. We successfully submitted eight runs for the Concept Detection task and four runs for the Caption Prediction task. For the Concept Detection task, our best model achieved an F1 score of 0.3519, and for the Caption Prediction task, our best model achieved a BLEU score of 0.2549 while using a fusion of Transformers.
  3. Biomedical images are crucial for diagnosing and planning treatments, as well as for advancing scientific understanding of various ailments. To effectively highlight regions of interest (RoIs) and convey medical concepts, annotation markers like arrows, letters, or symbols are employed. However, annotating these images with appropriate medical labels poses a significant challenge. In this study, we propose a framework that leverages multimodal input features, including text/label features and visual features, to facilitate accurate annotation of biomedical images with multiple labels. Our approach integrates state-of-the-art models such as ResNet50 and Vision Transformers (ViT) to extract informative features from the images. Additionally, we employ the Distilled Generative Pre-trained Transformer 2 (DistilGPT2), a Transformer-based natural language processing architecture, to extract textual features, leveraging its natural language understanding capabilities. This combination of image and text modalities allows for a more comprehensive representation of the biomedical data, leading to improved annotation accuracy. By combining the features extracted from both modalities, we trained a simplified Convolutional Neural Network (CNN) based multi-classifier to learn the image-text relations and predict multiple labels for multimodal radiology images (a sketch of this fusion step appears after this list). We used the ImageCLEFmedical 2022 and 2023 datasets to demonstrate the effectiveness of our framework; these datasets contain a diverse range of biomedical images, enabling evaluation of the framework's performance under realistic conditions. We achieved promising results with an F1 score of 0.508. Our proposed framework shows promising performance in annotating biomedical images with multiple labels, contributing to improved image understanding and analysis in the medical image processing domain.
  4. Color vision deficiency (CVD) affects a significant portion of the population, yet its impact is often overlooked in medical education, especially in visually demanding specialties like dermatology, pathology, and radiology. In this study, we investigated the potential of ChatGPT to comprehend CVD-simulated images in image-based diagnostic tasks. Notably, the model successfully adapted its diagnostic reasoning to match CVD-modified color perception while preserving high prediction accuracy. These findings highlight the potential of using ChatGPT to foster more inclusive learning environments for individuals with CVD in visually intensive medical specialties.
  5. This paper presents the Hallucination Recognition Model for New Experiment Evaluation (HaRMoNEE) team's winning (#1) and #10 submissions for the two subtasks of SemEval-2024 Task 6: Shared-task on Hallucinations and Related Observable Overgeneration Mistakes (SHROOM). This task challenged participants to design systems that detect hallucinations in Large Language Model (LLM) outputs. Team HaRMoNEE proposes two architectures: (1) fine-tuning an off-the-shelf transformer-based model and (2) prompt tuning large-scale Large Language Models (LLMs). One submission from the fine-tuning approach outperformed all other submissions for the model-aware subtask; one submission from the prompt-tuning approach is the 10th-best submission on the leaderboard for the model-agnostic subtask. Our systems also include pre-processing, system-specific tuning, post-processing, and evaluation.
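Related records 1 and 2 above treat concept detection as multi-label classification with a 'sigmoid' output layer in Keras. The original configurations are not given in the abstracts; purely as an illustration of that set-up (backbone choice, layer sizes, and decision threshold are assumptions), a minimal version could look like:

```python
# Illustrative multi-label concept classifier in Keras.
# Backbone choice, layer sizes, and threshold are assumptions, not the papers' configurations.
import tensorflow as tf

NUM_CONCEPTS = 1000  # assumed size of the concept vocabulary

# Pretrained CNN backbone used as a frozen feature extractor.
backbone = tf.keras.applications.EfficientNetB0(
    include_top=False, pooling="avg", input_shape=(224, 224, 3)
)
backbone.trainable = False

model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dropout(0.3),
    # Sigmoid (rather than softmax) gives each concept an independent probability,
    # which is what makes multi-label prediction possible.
    tf.keras.layers.Dense(NUM_CONCEPTS, activation="sigmoid"),
])

# Binary cross-entropy treats every concept as its own yes/no decision.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["binary_accuracy"])

# At inference, concepts whose probability exceeds a threshold (e.g. 0.5) are reported:
# predicted = model.predict(images) > 0.5
```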
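Related record 3 describes fusing visual features (ResNet50/ViT) with textual features (DistilGPT2) before a multi-label classifier. Again as a hedged sketch only: the feature dimensions are assumptions, and a dense fusion head stands in here for the paper's simplified CNN classifier.

```python
# Illustrative image-text feature fusion for multi-label annotation.
# Feature dimensions and the dense head are assumptions; the paper uses a simplified CNN classifier.
import tensorflow as tf

NUM_LABELS = 100     # assumed label-set size
IMG_FEAT_DIM = 2048  # e.g. pooled ResNet50 features
TXT_FEAT_DIM = 768   # e.g. pooled DistilGPT2 hidden states

# Precomputed visual and textual feature vectors enter as two inputs.
img_feats = tf.keras.Input(shape=(IMG_FEAT_DIM,), name="image_features")
txt_feats = tf.keras.Input(shape=(TXT_FEAT_DIM,), name="text_features")

# Concatenation is the simplest fusion: the classifier sees both modalities at once.
fused = tf.keras.layers.Concatenate()([img_feats, txt_feats])
hidden = tf.keras.layers.Dense(512, activation="relu")(fused)
hidden = tf.keras.layers.Dropout(0.3)(hidden)
outputs = tf.keras.layers.Dense(NUM_LABELS, activation="sigmoid")(hidden)

model = tf.keras.Model(inputs=[img_feats, txt_feats], outputs=outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
```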