Biomedical images are crucial for diagnosing and planning treatments, as well as advancing scientific understanding of various ailments. To effectively highlight regions of interest (RoIs) and convey medical concepts, annotation markers like arrows, letters, or symbols are employed. However, annotating these images with appropriate medical labels poses a significant challenge. In this study, we propose a framework that leverages multimodal input features, including text/label features and visual features, to facilitate accurate annotation of biomedical images with multiple labels. Our approach integrates state-of-the-art models such as ResNet50 and Vision Transformers (ViT) to extract informative features from the images. Additionally, we employ Generative Pre-trained Distilled-GPT2 (Transformer based Natural Language Processing architecture) to extract textual features, leveraging their natural language understanding capabilities. This combination of image and text modalities allows for a more comprehensive representation of the biomedical data, leading to improved annotation accuracy. By combining the features extracted from both image and text modalities, we trained a simplified Convolutional Neural Network (CNN) based multi-classifier to learn the image-text relations and predict multi-labels for multi-modal radiology images. We used ImageCLEFmedical 2022 and 2023 datasets to demonstrate the effectiveness of our framework. This dataset likely contains a diverse range of biomedical images, enabling the evaluation of the framework’s performance under realistic conditions. We have achieved promising results with the F1 score of 0.508. Our proposed framework exhibits potential performance in annotating biomedical images with multiple labels, contributing to improved image understanding and analysis in the medical image processing domain.
more »
« less
Visualizing the Obvious: A Concreteness-based Ensemble Model for Noun Property Prediction
Neural language models encode rich knowledge about entities and their relationships which can be extracted from their representations using probing. Common properties of nouns (e.g., red strawberries, small ant) are, however, more challenging to extract compared to other types of knowledge because they are rarely explicitly stated in texts. We hypothesize this to mainly be the case for perceptual properties which are obvious to the participants in the communication. We propose to extract these properties from images and use them in an ensemble model, in order to complement the information that is extracted from language models. We consider perceptual properties to be more concrete than abstract properties (e.g., interesting, flawless). We propose to use the adjectives’ concreteness score as a lever to calibrate the contribution of each source (text vs. images). We evaluate our ensemble model in a ranking task where the actual properties of a noun need to be ranked higher than other non-relevant properties. Our results show that the proposed combination of text and images greatly improves noun property prediction compared to powerful text-based language models.
more »
« less
- Award ID(s):
- 1928474
- PAR ID:
- 10390707
- Date Published:
- Journal Name:
- Findings of The 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022)
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
A word in natural language can be polysemous, having multiple meanings, as well as synonymous, meaning the same thing as other words. Word sense induction attempts to find the senses of pol- ysemous words. Synonymy detection attempts to find when two words are interchangeable. We com- bine these tasks, first inducing word senses and then detecting similar senses to form word-sense synonym sets (synsets) in an unsupervised fashion. Given pairs of images and text with noun phrase labels, we perform synset induction to produce col- lections of underlying concepts described by one or more noun phrases. We find that considering multi- modal features from both visual and textual context yields better induced synsets than using either con- text alone. Human evaluations show that our unsu- pervised, multi-modally induced synsets are com- parable in quality to annotation-assisted ImageNet synsets, achieving about 84% of ImageNet synsets’ approval.more » « less
-
One hallmark of human reasoning is that we can bring to bear a diverse web of common-sense knowledge in any situation. The vastness of our knowledge poses a challenge for the practical implementation of reasoning systems as well as for our cognitive theories – how do people represent their common-sense knowledge? On the one hand, our best models of sophisticated reasoning are top-down, making use primarily of symbolically-encoded knowledge. On the other, much of our understanding of the statistical properties of our environment may arise in a bottom-up fashion, for example through asso- ciationist learning mechanisms. Indeed, recent advances in AI have enabled the development of billion-parameter language models that can scour for patterns in gigabytes of text from the web, picking up a surprising amount of common-sense knowledge along the way—but they fail to learn the structure of coherent reasoning. We propose combining these approaches, by embedding language-model-backed primitives into a state- of-the-art probabilistic programming language (PPL). On two open-ended reasoning tasks, we show that our PPL models with neural knowledge components characterize the distribution of human responses more accurately than the neural language models alone, raising interesting questions about how people might use language as an interface to common-sense knowledge, and suggesting that building probabilistic models with neural language-model components may be a promising approach for more human-like AI.more » « less
-
It is crucial to automatically construct knowledge graphs (KGs) of diverse new relations to support knowledge discovery and broad applications. Previous KG construction methods, based on either crowdsourcing or text mining, are often limited to a small predefined set of relations due to manual cost or restrictions in text corpus. Recent research proposed to use pretrained language models (LMs) as implicit knowledge bases that accept knowledge queries with prompts. Yet, the implicit knowledge lacks many desirable properties of a full-scale symbolic KG, such as easy access, navigation, editing, and quality assurance. In this paper, we propose a new approach of harvesting massive KGs of arbitrary relations from pretrained LMs. With minimal input of a relation definition (a prompt and a few shot of example entity pairs), the approach efficiently searches in the vast entity pair space to extract diverse accurate knowledge of the desired relation. We develop an effective search-and-rescore mechanism for improved efficiency and accuracy. We deploy the approach to harvest KGs of over 400 new relations, from LMs of varying capacities such as RoBERTaNet. Extensive human and automatic evaluations show our approach manages to extract diverse accurate knowledge, including tuples of complex relations (e.g., “A is capable of but not good at B”). The resulting KGs as a symbolic interpretation of the source LMs also reveal new insights into the LMs’ knowledge capacities.more » « less
-
null (Ed.)Neural Machine Translation (NMT) models have been observed to produce poor translations when there are few/no parallel sentences to train the models. In the absence of parallel data, several approaches have turned to the use of images to learn translations. Since images of words, e.g., horse may be unchanged across languages, translations can be identified via images associated with words in different languages that have a high degree of visual similarity. However, translating via images has been shown to improve upon text-only models only marginally. To better understand when images are useful for translation, we study image translatability of words, which we define as the translatability of words via images, by measuring intra- and inter-cluster similarities of image representations of words that are translations of each other. We find that images of words are not always invariant across languages, and that language pairs with shared culture, meaning having either a common language family, ethnicity or religion, have improved image translatability (i.e., have more similar images for similar words) compared to its converse, regardless of their geographic proximity. In addition, in line with previous works that show images help more in translating concrete words, we found that concrete words have improved image translatability compared to abstract ones.more » « less
An official website of the United States government

