-
Faggioli, G ; Ferro, N ; Galuščáková, P ; Herrera, A (Ed.) This working note documents the participation of CS_Morgan in the ImageCLEFmedical 2024 Caption subtasks, focusing on the Caption Prediction and Concept Detection challenges. The primary objectives included training, validating, and testing multimodal Artificial Intelligence (AI) models intended to automate the generation of captions and the identification of multiple concepts in radiology images. The dataset used is a subset of the Radiology Objects in COntext version 2 (ROCOv2) dataset and contains image-caption pairs with corresponding Unified Medical Language System (UMLS) concepts. To address the caption prediction challenge, we experimented with different variants of the Large Language and Vision Assistant (LLaVA) model, tailoring them to the medical domain. Additionally, a lightweight Large Multimodal Model (LMM) and MoonDream2, a small Vision Language Model (VLM), were explored; the former is the instruct variant of the Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS (IDEFICS) 9B obtained through quantization. Besides LMMs, conventional encoder-decoder models such as Vision Generative Pre-trained Transformer 2 (visionGPT2) and Convolutional Neural Network-Transformer (CNN-Transformer) architectures were considered. In total, this enabled ten submissions for the caption prediction task, with the first submission, LLaVA 1.6 on the Mistral 7B weights, securing 2nd position among the participants. This model was adapted using 40.1M parameters and achieved the best performance on the test data across the metrics BERTScore (0.628059), ROUGE (0.250801), BLEU-1 (0.209298), BLEURT (0.317385), METEOR (0.092682), CIDEr (0.245029), and RefCLIPScore (0.815534). For the concept detection task, our single submission, based on the ConvMixer architecture (a hybrid approach leveraging the advantages of CNNs and Transformers), ranked 9th with an F1-score of 0.107645. Overall, the evaluations on the test data for the caption prediction task suggest that LMMs, quantized LMMs, and small VLMs, when adapted and selectively fine-tuned using fewer parameters, have ample potential for understanding the medical concepts present in images.
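A minimal sketch of the kind of parameter-efficient adaptation described in this abstract, using the Hugging Face transformers and peft libraries. The checkpoint name, target modules, and LoRA hyperparameters below are illustrative assumptions, not the configuration reported in the working note.

```python
# Hedged sketch: parameter-efficient adaptation of a LLaVA 1.6 (Mistral 7B)
# model via low-rank adapters. All hyperparameters are assumptions.
import torch
from transformers import LlavaNextForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",   # assumed public checkpoint
    torch_dtype=torch.float16,
)

lora_config = LoraConfig(
    r=16,                                  # low-rank dimension (assumed)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # adapt attention projections only
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the 7B weights train
```

This is the general mechanism by which an adaptation on the order of tens of millions of parameters (such as the 40.1M reported above) becomes feasible on a 7B-parameter model.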
-
Faggioli, G ; Ferro, N ; Galuščáková, P ; Herrera, A (Ed.) In the ever-changing realm of medical image processing, ImageCLEF brought a new dimension with the Identifying GAN Fingerprint task, catering to the advancement of visual media analysis. This year, the organizers presented the task of detecting training image fingerprints to control the quality of synthetic images for the second time (as task 1) and introduced the task of detecting generative model fingerprints for the first time (as task 2). Both tasks aim to discern these fingerprints from images, covering both the real training images and the generative models. The dataset utilized encompassed 3D CT images of lung tuberculosis patients, with the development dataset featuring a mix of real and generated images alongside a separate test dataset. Our team 'CSMorgan' contributed several approaches, leveraging multiformer networks (combining features extracted using BLIP2 and DINOv2), additive and mode thresholding techniques, and late fusion methodologies, bolstered by morphological operations. In Task 1, our optimal performance was attained through a late fusion-based reranking strategy, achieving an F1 score of 0.51, while the additive average thresholding approach followed closely with a score of 0.504. In Task 2, our multiformer model garnered an Adjusted Rand Index (ARI) score of 0.90, and a fine-tuned variant of the multiformer yielded a score of 0.8137. These outcomes underscore the efficacy of the multiformer-based approach in accurately discerning both real image and generative model fingerprints.
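As a rough illustration of the "multiformer" idea (combining BLIP2 and DINOv2 features), the sketch below concatenates pooled embeddings from the two pretrained backbones. The checkpoints and pooling strategy are assumptions, not the team's exact pipeline.

```python
# Hedged sketch: concatenating pooled BLIP-2 and DINOv2 image embeddings
# into a single "multiformer" feature vector. Checkpoints are assumptions.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, Blip2Processor, Blip2Model

dino_proc = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
dino = AutoModel.from_pretrained("facebook/dinov2-base")

blip_proc = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b")

def multiformer_feature(image: Image.Image) -> torch.Tensor:
    with torch.no_grad():
        d_out = dino(**dino_proc(images=image, return_tensors="pt"))
        dino_vec = d_out.last_hidden_state.mean(dim=1)   # pooled DINOv2 tokens
        b_out = blip.get_image_features(
            **blip_proc(images=image, return_tensors="pt"), return_dict=True
        )
        blip_vec = b_out.last_hidden_state.mean(dim=1)   # pooled BLIP-2 vision tokens
    return torch.cat([dino_vec, blip_vec], dim=-1)       # combined feature

# A downstream classifier (not shown) would consume this vector to decide
# whether a CT slice carries a real-image or generative-model fingerprint.
```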
-
Faggioli, G ; Ferro, N ; Galuščáková, P ; Herrera, A (Ed.) The MEDVQA-GI challenge addresses the integration of AI-driven text-to-image generative models in medical diagnostics, aiming to enhance diagnostic capabilities through synthetic image generation. Existing methods primarily focus on static image analysis and lack the dynamic generation of medical imagery from textual descriptions. This study intends to partially close this gap by introducing a novel approach based on fine-tuned generative models to generate dynamic, scalable, and precise images from textual descriptions. In particular, our system integrates fine-tuned Stable Diffusion and DreamBooth models, as well as Low-Rank Adaptation (LoRA), to generate high-fidelity medical images. The problem comprises two sub-tasks: image synthesis (IS) and optimal prompt generation (OPG). The former creates medical images from verbal prompts, whereas the latter produces prompts that yield high-quality images in specified categories. The study emphasizes the limitations of traditional medical image generation methods, such as hand sketching, constrained datasets, static procedures, and generic models. Our evaluation showed that Stable Diffusion surpasses CLIP and DreamBooth + LoRA in producing high-quality, diversified images. Specifically, Stable Diffusion had the lowest Fréchet Inception Distance (FID) scores (0.099 for single-center, 0.064 for multi-center, and 0.067 for combined), indicating higher image quality, and the highest average Inception Score (2.327 across all datasets), indicating exceptional diversity and quality. This advances the field of AI-powered medical diagnosis. Future research will concentrate on model refinement, dataset augmentation, and ethical considerations for efficiently implementing these advances in clinical practice.
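For context, generating an image from a text prompt with a (fine-tuned) Stable Diffusion checkpoint looks roughly like the following diffusers sketch. The base checkpoint, prompt, and output path are placeholders rather than the study's actual artifacts.

```python
# Hedged sketch: prompt-driven medical image synthesis with Stable Diffusion
# via the diffusers library. Checkpoint and prompt are illustrative only.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # assumed base model
    torch_dtype=torch.float16,
).to("cuda")

# In the paper's setting the pipeline would first be fine-tuned (e.g. with
# DreamBooth and/or LoRA) on gastrointestinal endoscopy images.
prompt = "an endoscopy image showing a polyp in the colon"  # illustrative
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("generated_gi_image.png")
```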
-
Aliannejadi, M ; Faggioli, G ; Ferro, N ; Vlachos, M. (Ed.) This work discusses the participation of CS_Morgan in the Concept Detection and Caption Prediction tasks of the ImageCLEFmedical 2023 Caption benchmark evaluation campaign. The goal of this task is to automatically identify relevant concepts and their locations in images, as well as generate coherent captions for the images. The dataset used for this task is a subset of the extended Radiology Objects in COntext (ROCO) dataset. Our implementation employed pre-trained Convolutional Neural Network (CNN), Vision Transformer (ViT), and Text-to-Text Transfer Transformer (T5) architectures. These models were leveraged to handle the different aspects of the tasks, such as concept detection and caption generation. In the Concept Detection task, the objective was to classify multiple concepts associated with each image. We utilized several deep learning architectures with ‘sigmoid’ activation to enable multi-label classification using the Keras framework. We submitted a total of five (5) runs for this task, and the best run achieved an F1 score of 0.4834, indicating its effectiveness in detecting relevant concepts in the images. For the Caption Prediction task, we successfully submitted eight (8) runs. Our approach combined the ViT and T5 models to generate captions for the images; the ranking for this task is based on BERTScore, and our best run achieved a score of 0.5819 by generating captions with the fine-tuned T5 model from keywords produced by the pre-trained ViT encoder.
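A minimal sketch of the sigmoid-based multi-label concept classifier described above, written with Keras. The backbone, input size, and concept-vocabulary size are assumptions for illustration.

```python
# Hedged sketch: a pretrained CNN backbone with a sigmoid output layer for
# multi-label UMLS concept detection. Backbone and sizes are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CONCEPTS = 1000  # placeholder for the concept vocabulary size

backbone = tf.keras.applications.DenseNet121(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3)
)

model = models.Sequential([
    backbone,
    layers.GlobalAveragePooling2D(),
    layers.Dense(NUM_CONCEPTS, activation="sigmoid"),  # one probability per concept
])

# Binary cross-entropy treats each concept as an independent yes/no label,
# which is what makes the sigmoid head a multi-label classifier.
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(multi_label=True)])
```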
-
Aliannejadi, M ; Faggioli, G ; Ferro, N ; Vlachos, M. (Ed.) The field of computer vision plays a key role in managing, processing, analyzing, and interpreting multimedia data in diverse applications. Visual interestingness in multimedia content is crucial for many practical applications, such as search and recommendation. Determining the interestingness of a particular piece of media content and selecting the highest-value item in terms of content analysis, viewers' perspective, content classification, and scoring media are sophisticated tasks due to their heavily subjective nature. This work presents the approaches of the CS_Morgan team in the media interestingness prediction task of the ImageCLEFfusion 2023 benchmark evaluation. We experimented with two ensemble methods: one based on a dense architecture and one on a gradient-boosting scaled architecture. For the dense architecture, several hyperparameter tunings were performed, and the output scores of all the inducers after the dense layers were combined using the min-max rule. The gradient boosting estimator builds an additive model in a forward stage-wise fashion, allowing optimization of a differentiable loss function; at every step of the ensemble gradient boosting scaled (EGBS) architecture, a regression tree is fitted to the negative gradient of the loss function. We achieved our best result, a MAP@10 score of 0.1287, using the EGBS ensemble.
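The stage-wise procedure just described (fitting a regression tree to the negative gradient of the loss at each boosting step) corresponds to standard gradient boosting. A minimal scikit-learn sketch follows, with placeholder inducer scores standing in for the ImageCLEFfusion inputs.

```python
# Hedged sketch: gradient boosting over inducer scores. Each boosting stage
# fits a regression tree to the negative gradient of the loss. The data and
# hyperparameters below are illustrative placeholders.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.random((500, 29))   # per-item scores from the provided inducers (assumed shape)
y = rng.random(500)         # ground-truth interestingness scores (placeholder)

egbs = GradientBoostingRegressor(
    n_estimators=300,        # number of boosting stages (regression trees)
    learning_rate=0.05,      # shrinkage applied to each tree's contribution
    max_depth=3,
    loss="squared_error",
)
egbs.fit(X, y)
fused_scores = egbs.predict(X[:10])  # fused interestingness predictions
```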
-
Faggioli, G. ; Ferro, N. ; Hanbury, A. ; Potthast, M. (Ed.) This paper describes the participation of Morgan_CS in both the Concept Detection and Caption Prediction tasks under the ImageCLEFmedical 2022 Caption task. The task required participants to automatically identify the presence and location of relevant concepts and to compose coherent captions for entire images in a large corpus, a subset of the extended Radiology Objects in COntext (ROCO) dataset. Our implementation centers on an encoder-decoder sequence-to-sequence model for caption and concept generation using pre-trained Text and Vision Transformers (ViTs). In addition, the Concept Detection task is treated as a multi-label concept classification problem, where several deep learning architectures with “sigmoid” activation are used to enable multi-label classification with Keras. We successfully submitted eight runs for the Concept Detection task and four runs for the Caption Prediction task. For the Concept Detection task, our best model achieved an F1 score of 0.3519, and for the Caption Prediction task, our best model achieved a BLEU score of 0.2549 while using a fusion of Transformers.
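A hedged sketch of the encoder-decoder sequence-to-sequence setup described above, pairing a pre-trained ViT encoder with a Transformer text decoder via Hugging Face's VisionEncoderDecoderModel. The checkpoints and file path are illustrative assumptions, and the combined model would still require fine-tuning on ROCO-style image-caption pairs before its captions are meaningful.

```python
# Hedged sketch: ViT encoder + Transformer decoder for image captioning.
# Checkpoints and the image path are placeholders, not the paper's setup.
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # vision encoder (assumed)
    "gpt2",                               # text decoder (assumed)
)
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# GPT-2 lacks dedicated start/pad tokens, so reuse its BOS/EOS for generation.
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.eos_token_id

image = Image.open("radiology_image.png").convert("RGB")  # placeholder path
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Fine-tuning on image-caption pairs would come first; generation then is:
ids = model.generate(pixel_values, max_length=64)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```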