Title: Concept Detection and Caption Prediction in ImageCLEFmedical Caption 2023 with Convolutional Neural Networks; Vision and Text-to-Text Transfer Transformers
This work discusses the participation of CS_Morgan in the Concept Detection and Caption Prediction tasks of the ImageCLEFmedical 2023 Caption benchmark evaluation campaign. The goal of these tasks is to automatically identify relevant concepts and their locations in images and to generate coherent captions for the images. The dataset used for the tasks is a subset of the extended Radiology Objects in COntext (ROCO) dataset. Our implementation approach employed pre-trained Convolutional Neural Networks (CNNs), Vision Transformer (ViT), and Text-to-Text Transfer Transformer (T5) architectures, which were leveraged to handle the different aspects of the tasks, namely concept detection and caption generation. In the Concept Detection task, the objective was to classify multiple concepts associated with each image. We utilized several deep learning architectures with ‘sigmoid’ activation to enable multilabel classification using the Keras framework. We submitted a total of five (5) runs for this task, and the best run achieved an F1 score of 0.4834, indicating its effectiveness in detecting relevant concepts in the images. For the Caption Prediction task, we successfully submitted eight (8) runs. Our approach combined the ViT and T5 models to generate captions for the images. Caption Prediction submissions are ranked by BERTScore, and our best run achieved a score of 0.5819 by generating captions with the fine-tuned T5 model from keywords produced by the pretrained ViT encoder.
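Below is a minimal sketch of the multilabel concept-detection setup described in the abstract, assuming a pre-trained DenseNet121 backbone, a 224x224 input size, and a placeholder concept-vocabulary size; the actual backbones, image size, and training configuration of the submitted runs are not specified here.

```python
# Minimal sketch of multilabel concept detection with a pre-trained CNN in Keras.
# The DenseNet121 backbone, input size, and NUM_CONCEPTS are illustrative
# placeholders, not the exact configuration used in the submitted runs.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CONCEPTS = 2000  # hypothetical size of the concept vocabulary

def build_concept_detector(num_concepts: int = NUM_CONCEPTS) -> tf.keras.Model:
    backbone = tf.keras.applications.DenseNet121(
        include_top=False, weights="imagenet", pooling="avg",
        input_shape=(224, 224, 3))
    x = layers.Dropout(0.3)(backbone.output)
    # A sigmoid output lets each concept be predicted independently (multilabel).
    outputs = layers.Dense(num_concepts, activation="sigmoid")(x)
    model = models.Model(inputs=backbone.input, outputs=outputs)
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])
    return model

model = build_concept_detector()
# At inference time, concepts whose predicted probability exceeds a chosen
# threshold (e.g. 0.5) are reported for the image.
```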
Award ID(s):
2131207
NSF-PAR ID:
10476132
Author(s) / Creator(s):
; ;
Editor(s):
Aliannejadi, M.; Faggioli, G.; Ferro, N.; Vlachos, M.
Publisher / Repository:
https://ceur-ws.org/Vol-3497/
Date Published:
Journal Name:
Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023)
ISSN:
1613-0073
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Faggioli, G.; Ferro, N.; Hanbury, A.; Potthast, M. (Ed.)
    This paper describes the participation of Morgan_CS in both the Concept Detection and Caption Prediction tasks under the ImageCLEFmedical 2022 Caption task. The task required participants to automatically identify the presence and location of relevant concepts and to compose coherent captions for images in a large corpus, which is a subset of the extended Radiology Objects in COntext (ROCO) dataset. Our implementation is motivated by an encoder-decoder based sequence-to-sequence model for caption and concept generation using both pre-trained Text and Vision Transformers (ViTs). In addition, the Concept Detection task is treated as a multi-label classification problem in which several deep learning architectures with “sigmoid” activation are used to enable multilabel classification with Keras. We successfully submitted eight runs for the Concept Detection task and four runs for the Caption Prediction task. For the Concept Detection task, our best model achieved an F1 score of 0.3519, and for the Caption Prediction task, our best model achieved a BLEU score of 0.2549 while using a fusion of Transformers.
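As a rough illustration of the encoder-decoder captioning setup described above, the sketch below pairs a pre-trained ViT encoder with a pre-trained text decoder using Hugging Face's VisionEncoderDecoderModel. The specific checkpoints (a ViT image encoder and a GPT-2 decoder standing in for the text transformer) and the file name are assumptions for illustration, not the models used in the submitted runs, and the model would still need fine-tuning on the ROCO captions.

```python
# Hedged sketch: ViT encoder + text decoder for image captioning with Hugging Face
# Transformers. The checkpoints are illustrative stand-ins, not the lab's actual models.
from PIL import Image
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2")
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# GPT-2 has no pad token by default, so reuse EOS and set the decoder start token.
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.eos_token_id

# After fine-tuning on (image, caption) pairs, caption generation could look like:
image = Image.open("example_radiology_image.png").convert("RGB")  # hypothetical file
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values, max_length=64, num_beams=4)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```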
  2. Evaluating the quality of accessible image captions with human raters is challenging: a visually impaired user may not know how comprehensive a caption is, whereas a sighted assistant may not know what information a user will need from a caption. To explore how image captioners and caption consumers assess caption content, we conducted a series of collaborative captioning sessions in which six pairs, each consisting of a blind person and their sighted partner, worked together to discuss, create, and evaluate image captions. By making captioning a collaborative task, we were able to observe captioning strategies, to elicit questions and answers about image captions, and to explore blind users’ caption preferences. Our findings provide insight into the process of creating good captions and serve as a case study for cross-ability collaboration between blind and sighted people.
  3. We propose JECL, a method for clustering image-caption pairs by training parallel encoders with regularized clustering and alignment objectives, simultaneously learning both representations and cluster assignments. These image-caption pairs arise frequently in high-value applications where structured training data is expensive to produce but free-text descriptions are common. JECL trains by minimizing the Kullback-Leibler divergence from the image and text distributions to a combined joint target distribution, and by optimizing the Jensen-Shannon divergence between the soft cluster assignments of the images and text. Regularizers are also applied to JECL to prevent trivial solutions. Experiments show that JECL outperforms both single-view and multi-view methods on large benchmark image-caption datasets, and is remarkably robust to missing captions and varying data sizes.
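The following is a minimal sketch, under stated assumptions, of how JECL-style objectives could be written in PyTorch: a KL divergence term pulling each view's soft cluster assignments toward a shared target distribution, plus a Jensen-Shannon term aligning the image and text assignments. The target construction (a DEC-style sharpened distribution), the loss weighting, and the variable names are assumptions, not the authors' exact formulation.

```python
# Sketch of JECL-style clustering/alignment losses (assumed formulation, PyTorch).
import torch
import torch.nn.functional as F

def sharpen_target(p: torch.Tensor) -> torch.Tensor:
    # DEC-style auxiliary target built from soft assignments p of shape (n, k).
    w = (p ** 2) / p.sum(dim=0, keepdim=True)
    return w / w.sum(dim=1, keepdim=True)

def jecl_loss(p_img: torch.Tensor, p_txt: torch.Tensor) -> torch.Tensor:
    eps = 1e-8
    p_img = p_img.clamp_min(eps)
    p_txt = p_txt.clamp_min(eps)
    # Joint target distribution built from the combined soft assignments of both views.
    target = sharpen_target(0.5 * (p_img + p_txt)).detach()
    kl_img = F.kl_div(p_img.log(), target, reduction="batchmean")
    kl_txt = F.kl_div(p_txt.log(), target, reduction="batchmean")
    # Jensen-Shannon divergence between the image and text cluster assignments.
    m = 0.5 * (p_img + p_txt)
    js = 0.5 * F.kl_div(m.log(), p_img, reduction="batchmean") \
       + 0.5 * F.kl_div(m.log(), p_txt, reduction="batchmean")
    return kl_img + kl_txt + js
```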
  4. Background: Diabetic retinopathy (DR) is a leading cause of blindness in American adults. If detected, DR can be treated to prevent further damage causing blindness. There is increasing interest in developing artificial intelligence (AI) technologies to help detect DR using electronic health records. The lesion-related information documented in fundus image reports is a valuable resource that could help diagnoses of DR in clinical decision support systems. However, most studies of AI-based DR diagnosis are based mainly on medical images; there are limited studies exploring the lesion-related information captured in free-text image reports. Methods: In this study, we examined two state-of-the-art transformer-based natural language processing (NLP) models, BERT and RoBERTa, and compared them with a recurrent neural network implemented using long short-term memory (LSTM) to extract DR-related concepts from clinical narratives. We identified four categories of DR-related clinical concepts (lesions, eye parts, laterality, and severity), developed annotation guidelines, annotated a DR corpus of 536 image reports, and developed transformer-based NLP models for clinical concept extraction and relation extraction. We also examined relation extraction under two settings: a ‘gold-standard’ setting, in which gold-standard concepts were used, and an end-to-end setting. Results: For concept extraction, the BERT model pretrained with the MIMIC III dataset achieved the best performance (0.9503 and 0.9645 for strict and lenient evaluation, respectively). For relation extraction, the BERT model pretrained using general English text achieved the best strict/lenient F1-score of 0.9316. The end-to-end system, BERT_general_e2e, achieved the best strict/lenient F1-scores of 0.8578 and 0.8881, respectively. Another end-to-end system based on the RoBERTa architecture, RoBERTa_general_e2e, achieved the same strict score as BERT_general_e2e. Conclusions: This study demonstrated the efficiency of transformer-based NLP models for clinical concept extraction and relation extraction. Our results show that it is necessary to pretrain transformer models on clinical text to optimize performance for clinical concept extraction, whereas for relation extraction, transformers pretrained on general English text perform better.
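As a hedged sketch of the concept-extraction step, the snippet below frames DR concept extraction as BIO token classification with Hugging Face Transformers. The checkpoint name, label scheme, and example sentence are illustrative assumptions; the study's actual models, clinically pretrained weights, and annotation details are described in the paper.

```python
# Assumed sketch: transformer-based clinical concept extraction as token classification.
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Hypothetical BIO label set covering the four concept categories from the abstract.
labels = ["O",
          "B-LESION", "I-LESION",
          "B-EYEPART", "I-EYEPART",
          "B-LATERALITY", "I-LATERALITY",
          "B-SEVERITY", "I-SEVERITY"]

model_name = "bert-base-uncased"  # a clinically pretrained checkpoint would be used in practice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)})

# After fine-tuning on the annotated image reports, inference could look like this:
ner = pipeline("token-classification", model=model, tokenizer=tokenizer,
               aggregation_strategy="simple")
print(ner("Scattered microaneurysms and dot hemorrhages in the right macula."))
```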
  5. This paper describes the participation of the Document and Pattern Recognition Lab from the Rochester Institute of Technology in the CLEF 2020 ARQMath lab. Two tasks are defined for ARQMath: (1) Question Answering and (2) Formula Retrieval. Four runs were submitted for Task 1 using systems that take advantage of text and formula embeddings. For Task 2, three runs were submitted: one uses only formula embeddings, another uses formula and text embeddings, and the final one uses formula embeddings followed by re-ranking results by tree-edit distance. The Task 2 runs yielded strong results, while the Task 1 results were less competitive.
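A rough sketch of the general pattern described for Task 2 (embedding-based formula retrieval followed by re-ranking with tree-edit distance) is shown below; the similarity measure, data layout, and tree_edit_distance function are assumptions, not the lab's actual implementation.

```python
# Assumed sketch of embedding-based formula retrieval with tree-edit-distance re-ranking.
import numpy as np

def retrieve(query_vec: np.ndarray, formula_vecs: np.ndarray, top_k: int = 1000):
    # Rank candidate formulas by cosine similarity to the query formula embedding.
    sims = formula_vecs @ query_vec / (
        np.linalg.norm(formula_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-12)
    return np.argsort(-sims)[:top_k]

def rerank(candidate_ids, query_tree, candidate_trees, tree_edit_distance):
    # Re-order the embedding hits by tree-edit distance between formula parse trees;
    # tree_edit_distance is a placeholder for any tree-edit-distance implementation.
    return sorted(candidate_ids,
                  key=lambda i: tree_edit_distance(query_tree, candidate_trees[i]))
```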