Title: Identifying the Central Figure of a Scientific Paper
Publishers are increasingly using graphical abstracts to facilitate scientific search, especially across disciplinary boundaries. They are presented on various media, easily shared, and information-rich. However, only a small fraction of scientific publications are equipped with graphical abstracts. What can we do with the vast majority of papers that have no selected graphical abstract? In this paper, we first hypothesize that scientific papers actually include a "central figure" that serves as a graphical abstract. These figures convey the key results and provide a visual identity for the paper. Using survey data collected from 6,263 authors regarding 8,353 papers over 15 years, we find that over 87% of papers are considered to contain a central figure, and that these central figures are primarily used to summarize important results, explain the key methods, or provide additional discussion. We then train a model to automatically recognize the central figure, achieving top-3 accuracy of 78% and exact match accuracy of 34%. We find that the primary boost in accuracy comes from figure captions that resemble the abstract. We make all our data and results publicly available at https://github.com/viziometrics/centraul_figure. Our goal is to automate central figure identification to improve search engine performance and to help scientists connect ideas across the literature.
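The abstract's two evaluation measures, and its observation about captions that resemble the abstract, can be illustrated with a small sketch. This is not the authors' model: it assumes a plain bag-of-words cosine similarity between each caption and the abstract, and all function names (`tokenize`, `cosine_similarity`, `rank_figures`, `top_k_accuracy`) are hypothetical, introduced here only for illustration. Exact match accuracy corresponds to `k=1`, and top-3 accuracy to `k=3`.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the bag-of-words counts of two strings."""
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def rank_figures(abstract, captions):
    """Return figure indices ordered from most to least similar to the abstract.

    Assumption: caption/abstract similarity is the only signal used here.
    """
    scores = [cosine_similarity(abstract, caption) for caption in captions]
    return sorted(range(len(captions)), key=lambda i: scores[i], reverse=True)


def top_k_accuracy(rankings, labels, k):
    """Fraction of papers whose labeled central figure is in the top k of its ranking."""
    hits = sum(1 for ranking, label in zip(rankings, labels) if label in ranking[:k])
    return hits / len(labels)
```

Given one ranking per paper and the index of the author-selected central figure for each, `top_k_accuracy(rankings, labels, 1)` gives the exact match rate and `top_k_accuracy(rankings, labels, 3)` the top-3 rate.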
Award ID(s):
1740996 1915774
PAR ID:
10188257
Date Published:
Journal Name:
2019 International Conference on Document Analysis and Recognition (ICDAR)
Volume:
September. 2019
Page Range / eLocation ID:
1063 to 1070
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Understanding the relationship between figures and text is key to scientific document understanding. Medical figures in particular are quite complex, often consisting of several subfigures (75% of figures in our dataset), with detailed text describing their content. Previous work studying figures in scientific papers focused on classifying figure content rather than understanding how images relate to the text. To address challenges in figure retrieval and figure-to-text alignment, we introduce MedICaT, a dataset of medical images in context. MedICaT consists of 217K images from 131K open access biomedical papers, and includes captions, inline references for 74% of figures, and manually annotated subfigures and subcaptions for a subset of figures. Using MedICaT, we introduce the task of subfigure to subcaption alignment in compound figures and demonstrate the utility of inline references in image-text matching. Our data and code can be accessed at https://github.com/allenai/medicat. 
  2. Westenberg, Dave J (Ed.)
    Integrating primary scientific literature into Science, Technology, Engineering, and Mathematics (STEM) curricula enhances critical thinking, scientific literacy, and communication skills but presents challenges due to complex terminology and data interpretation barriers. To address these challenges, a scaffolded journal club approach was implemented in a Cancer Biology course. The course utilized Hypothes.is web-based annotations, methods presentations, figure annotations, and structured discussions to promote active engagement with the literature. Additionally, integrated science communication assignments, including written, graphical, and video abstracts, provided diverse opportunities for students to develop scientific literacy. This structured approach is designed to facilitate comprehension, encourage proactive learning, and foster confidence in engaging with primary scientific literature. Student feedback highlighted improved ability to dissect research articles, enhanced presentation skills, and increased enjoyment of scientific reading. The journal club model and science communication assignments offer a replicable framework for enhancing primary scientific literature engagement across various STEM disciplines and educational levels.
  3. Introduction: There is an overwhelming number of journal articles for modern researchers to parse through. For instance, there have already been 168,168 cancer-related papers archived on PubMed this year. To keep up with this volume of literature, there is emerging interest in applying artificial intelligence (AI) to facilitate paper reading and the drafting of new scientific ideas. Here, we extend the application of state-of-the-art automatic research assistants to the cancer field. Using training datasets composed of over 5,000 cancer-related journal paper abstracts, we evaluated AI-based background knowledge extraction and abstract writing. In a survey of university cancer researchers, the best AI performance was rated on par with human writers. This automatic research assistant tool can potentially speed up scientific discovery and production by helping researchers to efficiently read existing papers, create new ideas, and write up new discoveries.
  4. Subject categories of scholarly papers generally refer to the knowledge domain(s) to which the papers belong, examples being computer science or physics. Subject category classification is a prerequisite for bibliometric studies, organizing scientific publications for domain knowledge extraction, and facilitating faceted searches for digital library search engines. Unfortunately, many academic papers do not have such information as part of their metadata. Most existing methods for solving this task focus on unsupervised learning that often relies on citation networks. However, a complete list of papers citing the current paper may not be readily available. In particular, new papers that have few or no citations cannot be classified using such methods. Here, we propose a deep attentive neural network (DANN) that classifies scholarly papers using only their abstracts. The network is trained using nine million abstracts from Web of Science (WoS). We also use the WoS schema that covers 104 subject categories. The proposed network consists of two bi-directional recurrent neural networks followed by an attention layer. We compare our model against baselines by varying the architecture and text representation. Our best model achieves a micro-F1 measure of 0.76, with F1 of individual subject categories ranging from 0.50 to 0.95. The results showed the importance of retraining word embedding models to maximize the vocabulary overlap and the effectiveness of the attention mechanism. The combination of word vectors with TFIDF outperforms character and sentence level embedding models. We discuss imbalanced samples and overlapping categories and suggest possible strategies for mitigation. We also determine the subject category distribution in CiteSeerX by classifying a random sample of one million academic papers.
  5. Emerging studies underscore the promising capabilities of large language model-based chatbots in conducting basic bioinformatics data analyses. ChatGPT's recent ability to accept image inputs, also known as GPT-4V(ision), motivated us to explore its efficacy in deciphering bioinformatics scientific figures. Our evaluation with examples in cancer research, including sequencing data analysis, multimodal network-based drug repositioning, and tumor clonal evolution, revealed that ChatGPT can proficiently explain different plot types and apply biological knowledge to enrich interpretations. However, it struggled to provide accurate interpretations when color perception and quantitative analysis of visual elements were involved. Furthermore, while the chatbot can draft figure legends and summarize findings from the figures, stringent proofreading is imperative to ensure the accuracy and reliability of the content.