Understanding the relationship between figures and text is key to scientific document understanding. Medical figures in particular are quite complex, often consisting of several subfigures (75% of figures in our dataset), with detailed text describing their content. Previous work studying figures in scientific papers focused on classifying figure content rather than understanding how images relate to the text. To address challenges in figure retrieval and figure-to-text alignment, we introduce MedICaT, a dataset of medical images in context. MedICaT consists of 217K images from 131K open access biomedical papers, and includes captions, inline references for 74% of figures, and manually annotated subfigures and subcaptions for a subset of figures. Using MedICaT, we introduce the task of subfigure to subcaption alignment in compound figures and demonstrate the utility of inline references in image-text matching. Our data and code can be accessed at https://github.com/allenai/medicat.
A picture is worth a thousand words: applying natural language processing tools for creating a quantum materials database map
This paper demonstrates the application of Natural Language Processing (NLP) tools to explore large libraries of documents and to correlate text descriptions in figure captions with interpretations of images and figures. Visualization tools based on NLP methods make it possible to quickly assess the extent of the research described in the literature on a specific topic. The authors demonstrate how applying NLP methods to the figure captions alone, without having to navigate the entire text of a document, can accelerate the assessment of the literature in a given domain.
- Award ID(s): 1640867
- PAR ID: 10188862
- Date Published:
- Journal Name: MRS Communications
- Volume: 9
- Issue: 4
- ISSN: 2159-6859
- Page Range / eLocation ID: 1134 to 1141
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Publishers are increasingly using graphical abstracts to facilitate scientific search, especially across disciplinary boundaries. They are presented on various media, easily shared, and information rich. However, only a small fraction of scientific publications are equipped with graphical abstracts. What can we do with the vast majority of papers with no selected graphical abstract? In this paper, we first hypothesize that scientific papers actually include a "central figure" that serves as a graphical abstract. These figures convey the key results and provide a visual identity for the paper. Using survey data collected from 6,263 authors regarding 8,353 papers over 15 years, we find that over 87% of papers are considered to contain a central figure, and that these central figures are primarily used to summarize important results, explain the key methods, or provide additional discussion. We then train a model to automatically recognize the central figure, achieving top-3 accuracy of 78% and exact match accuracy of 34%. We find that the primary boost in accuracy comes from figure captions that resemble the abstract. We make all our data and results publicly available at https://github.com/viziometrics/centraul_figure. Our goal is to automate central figure identification to improve search engine performance and to help scientists connect ideas across the literature.
-
Mitkov, Ruslan; Angelova, Galia (Ed.) This study investigates the use of Natural Language Processing (NLP) methods to analyze politics, conflicts and violence in the Middle East using domain-specific pre-trained language models. We introduce Arabic text and present ConfliBERT-Arabic, a pre-trained language model that can efficiently analyze political, conflict and violence-related texts. Our technique hones a pre-trained model using a corpus of Arabic texts about regional politics and conflicts. Performance of our models is compared to baseline BERT models. Our findings show that the performance of NLP models for Middle Eastern politics and conflict analysis is enhanced by the use of domain-specific pre-trained local language models. This study offers political and conflict analysts, including policymakers, scholars, and practitioners, new approaches and tools for deciphering the intricate dynamics of local politics and conflicts directly in Arabic.
-
As Natural Language Processing (NLP) and Machine Learning (ML) tools rise in popularity, it becomes increasingly vital to recognize the role they play in shaping societal biases and stereotypes. Although NLP models have shown success in modeling various applications, they propagate and may even amplify gender bias found in text corpora. While the study of bias in artificial intelligence is not new, methods to mitigate gender bias in NLP are relatively nascent. In this paper, we review contemporary studies on recognizing and mitigating gender bias in NLP. We discuss gender bias based on four forms of representation bias and analyze methods for recognizing gender bias. Furthermore, we discuss the advantages and drawbacks of existing gender debiasing methods. Finally, we discuss future studies for recognizing and mitigating gender bias in NLP.
-
It is widely recognized that the ability to exploit Natural Language Processing (NLP) text mining strategies has the potential to increase productivity and innovation in the sciences by orders of magnitude, by enabling scientists to pull information from research articles in scientific disciplines such as genomics and biomedicine. The Language Applications (LAPPS) Grid is an infrastructure for rapid development of NLP applications that provides an ideal platform to support mining scientific literature. Its Galaxy interface and the interoperability among tools together provide an intuitive and easy-to-use platform, and users can experiment with and exploit NLP tools and resources without the need to determine which are suited to a particular task, and without the need for significant computer expertise. The LAPPS Grid has collaborated with the developers of PubAnnotation to integrate the services and resources provided by each in order to greatly enhance the user’s ability to annotate scientific publications and share the results. This poster/demo shows how the LAPPS Grid can facilitate mining scientific publications, including identification and extraction of relevant entities, relations, and events; iterative manual correction and evaluation of automatically produced annotations; and customization of supporting resources to accommodate specific domains.

