Aliannejadi, M; Faggioli, G; Ferro, N; Vlachos, M. (Ed.)
This work discusses the participation of CS_Morgan in the Concept Detection and Caption
Prediction tasks of the ImageCLEFmedical 2023 Caption benchmark evaluation campaign.
The goal of this task is to automatically identify relevant concepts and their locations in images,
as well as generate coherent captions for the images. The dataset used for this task is a subset
of the extended Radiology Objects in Context (ROCO) dataset. Our implementation
used pre-trained Convolutional Neural Networks (CNNs), Vision Transformer (ViT),
and Text-to-Text Transfer Transformer (T5) architectures. These models
handled different aspects of the tasks, such as concept detection
and caption generation. In the Concept Detection task, the objective was to classify multiple
concepts associated with each image. We used several deep learning architectures with
a sigmoid output activation, implemented in the Keras framework, to enable multilabel
classification. We submitted five (5) runs for this task, and the best run achieved an
F1 score of 0.4834. For the Caption Prediction task, we submitted eight (8) runs. Our
approach combined the ViT and T5 models: the pretrained ViT served as the encoder to
produce keywords for each image, and a fine-tuned T5 model generated the caption from
those keywords. Runs in this task are ranked by BERTScore, and our best run achieved
a score of 0.5819.
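
The multilabel decision step described above can be illustrated with a minimal sketch. It is not the authors' Keras implementation: it assumes each concept receives an independent sigmoid probability, that a concept is predicted when its probability reaches a threshold of 0.5 (an assumed value), and that F1 is computed per image and then averaged (also an assumption about the scoring). The function names and the concept labels "a", "b", "c" are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_concepts(logits, concept_ids, threshold=0.5):
    """Keep every concept whose sigmoid probability reaches the threshold.

    Each concept is decided independently, which is what makes the
    classifier multilabel rather than multiclass. The 0.5 threshold
    is an assumed default, not a value taken from the paper.
    """
    return {c for c, z in zip(concept_ids, logits) if sigmoid(z) >= threshold}


def f1(predicted, gold):
    """F1 between a predicted and a reference concept set for one image."""
    if not predicted and not gold:
        return 1.0  # assumed convention: both empty counts as a perfect match
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def mean_f1(predicted_sets, gold_sets):
    """Average the per-image F1 scores over all images."""
    scores = [f1(p, g) for p, g in zip(predicted_sets, gold_sets)]
    return sum(scores) / len(scores)
```

For example, logits of `[2.0, -1.0, 0.3]` over the hypothetical concepts `["a", "b", "c"]` select `{"a", "c"}`, because only the first and third probabilities exceed 0.5.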