Title: Identifying the Central Figure of a Scientific Paper
Publishers are increasingly using graphical abstracts to facilitate scientific search, especially across disciplinary boundaries. Graphical abstracts are presented on various media, easily shared, and information rich. However, only a small fraction of scientific publications are equipped with graphical abstracts. What can we do with the vast majority of papers with no selected graphical abstract? In this paper, we first hypothesize that scientific papers actually include a "central figure" that serves as a graphical abstract. These figures convey the key results and provide a visual identity for the paper. Using survey data collected from 6,263 authors regarding 8,353 papers over 15 years, we find that over 87% of papers are considered to contain a central figure, and that these central figures are primarily used to summarize important results, explain the key methods, or provide additional discussion. We then train a model to automatically recognize the central figure, achieving top-3 accuracy of 78% and exact match accuracy of 34%. We find that the primary boost in accuracy comes from figure captions that resemble the abstract. We make all our data and results publicly available at https://github.com/viziometrics/centraul_figure. Our goal is to automate central figure identification to improve search engine performance and to help scientists connect ideas across the literature.
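The abstract reports that captions resembling the abstract are the main source of accuracy, and it evaluates with top-3 and exact-match accuracy. The sketch below illustrates that idea only; it is not the authors' model. It assumes a plain bag-of-words cosine similarity as a stand-in for whatever text features the paper uses, and the function names (`rank_figures`, `top_k_accuracy`) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_figures(abstract, captions):
    """Return figure indices sorted by caption-abstract similarity, best first.

    Assumption: similarity is raw word-count cosine, not the paper's features.
    """
    abstract_vec = Counter(tokenize(abstract))
    scores = [cosine_similarity(abstract_vec, Counter(tokenize(c))) for c in captions]
    return sorted(range(len(captions)), key=lambda i: scores[i], reverse=True)


def top_k_accuracy(rankings, labels, k):
    """Fraction of papers whose labeled central figure is in the top k ranks.

    With k=1 this is exact-match accuracy; with k=3 it is top-3 accuracy.
    """
    hits = sum(1 for ranked, label in zip(rankings, labels) if label in ranked[:k])
    return hits / len(labels)
```

To use it, call `rank_figures` once per paper with the abstract and the list of figure captions, then pass the collected rankings and the author-selected figure indices to `top_k_accuracy`.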
Award ID(s):
1740996 1915774
NSF-PAR ID:
10188257
Date Published:
Journal Name:
2019 International Conference on Document Analysis and Recognition (ICDAR)
Volume:
September. 2019
Page Range / eLocation ID:
1063 to 1070
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Understanding the relationship between figures and text is key to scientific document understanding. Medical figures in particular are quite complex, often consisting of several subfigures (75% of figures in our dataset), with detailed text describing their content. Previous work studying figures in scientific papers focused on classifying figure content rather than understanding how images relate to the text. To address challenges in figure retrieval and figure-to-text alignment, we introduce MedICaT, a dataset of medical images in context. MedICaT consists of 217K images from 131K open access biomedical papers, and includes captions, inline references for 74% of figures, and manually annotated subfigures and subcaptions for a subset of figures. Using MedICaT, we introduce the task of subfigure to subcaption alignment in compound figures and demonstrate the utility of inline references in image-text matching. Our data and code can be accessed at https://github.com/allenai/medicat. 
  2. Subject categories of scholarly papers generally refer to the knowledge domain(s) to which the papers belong, examples being computer science or physics. Subject category classification is a prerequisite for bibliometric studies, organizing scientific publications for domain knowledge extraction, and facilitating faceted searches for digital library search engines. Unfortunately, many academic papers do not have such information as part of their metadata. Most existing methods for solving this task focus on unsupervised learning that often relies on citation networks. However, a complete list of papers citing the current paper may not be readily available. In particular, new papers that have few or no citations cannot be classified using such methods. Here, we propose a deep attentive neural network (DANN) that classifies scholarly papers using only their abstracts. The network is trained using nine million abstracts from Web of Science (WoS). We also use the WoS schema that covers 104 subject categories. The proposed network consists of two bi-directional recurrent neural networks followed by an attention layer. We compare our model against baselines by varying the architecture and text representation. Our best model achieves a micro-F1 measure of 0.76, with the F1 of individual subject categories ranging from 0.50 to 0.95. The results showed the importance of retraining word embedding models to maximize the vocabulary overlap and the effectiveness of the attention mechanism. The combination of word vectors with TF-IDF outperforms character- and sentence-level embedding models. We discuss imbalanced samples and overlapping categories and suggest possible strategies for mitigation. We also determine the subject category distribution in CiteSeerX by classifying a random sample of one million academic papers.
  3. Background

    We performed a systematic review that identified at least 9,000 scientific papers on PubMed that include immunofluorescent images of cells from the central nervous system (CNS). These CNS papers contain tens of thousands of immunofluorescent neural images supporting the findings of over 50,000 associated researchers. While many existing reviews discuss different aspects of immunofluorescent microscopy, such as image acquisition and staining protocols, few papers discuss immunofluorescent imaging from an image-processing perspective. We analyzed the literature to determine the image processing methods that were commonly published alongside the associated CNS cell, microscopy technique, and animal model, and highlight gaps in image processing documentation and reporting in the CNS research field.

    Methods

    We completed a comprehensive search of PubMed publications using Medical Subject Headings (MeSH) terms and other general search terms for CNS cells and common fluorescent microscopy techniques. Publications were found on PubMed using a combination of column description terms and row description terms. We manually tagged the comma-separated values file (CSV) metadata of each publication with the following categories: animal or cell model, quantified features, threshold techniques, segmentation techniques, and image processing software.

    Results

    Of the almost 9,000 immunofluorescent imaging papers identified in our search, only 856 explicitly include image processing information. Moreover, hundreds of the 856 papers are missing thresholding, segmentation, and morphological feature details necessary for explainable, unbiased, and reproducible results. In our assessment of the literature, we visualized current image processing practices, compiled the image processing options from the top twelve software programs, and designed a road map to enhance image processing. We determined that thresholding and segmentation methods were often left out of publications and underreported or underutilized for quantifying CNS cell research.

    Discussion

    Fewer than 10% of papers with immunofluorescent images include image processing in their methods. A few authors are implementing advanced methods in image analysis to quantify over 40 different CNS cell features, which can provide quantitative insights into CNS cell features that will advance CNS research. However, our review puts forward that image analysis methods will remain limited in rigor and reproducibility without more rigorous and detailed reporting of image processing methods.

    Conclusion

    Image processing is a critical part of CNS research that must be improved to increase scientific insight, explainability, reproducibility, and rigor.

  4. Abstract Background

    Aminoglycosides are potent bactericidal antibiotics naturally produced by soil microorganisms and are commonly used in agriculture. Exposure to these antibiotics has the potential to cause shifts in the microorganisms that impact plant health. The systematic review described in this protocol will compile and synthesize literature on soil and plant root-associated microbiota, with special attention to aminoglycoside exposure. The systematic review should provide insight into how the soil and plant microbiota are impacted by aminoglycoside exposure with specific attention to the changes in the overall species richness and diversity (microbial composition), changes of the resistome (i.e. the changes in the quantification of resistance genes), and maintenance of plant health through suppression of pathogenic bacteria. Moreover, the proposed contribution will provide comprehensive information about data available to guide future primary research studies. This systematic review protocol is based on the question, “What is the impact of aminoglycoside exposure on the soil and plant root-associated microbiota?”.

    Methods

    A Boolean search of academic databases and specific websites will be used to identify research articles, conference presentations, and grey literature meeting the search criteria. All search results will be compiled and duplicates removed before title and abstract screening. Two reviewers will screen all the included titles and abstracts using a set of predefined inclusion criteria. Full texts of all titles and abstracts meeting the eligibility criteria will be screened independently by two reviewers. Inclusion criteria will describe the eligible soil and plant root-associated microbiome populations of interest and eligible aminoglycosides constituting our exposure. Study validity will be evaluated using the CEE Critical Appraisal Tool Version 0.2 (Prototype) to evaluate the risk of bias in publications. Data from studies with a low risk of bias will be extracted and compiled into a narrative synthesis and summarized into tables and figures. If sufficient evidence is available, findings will be used to perform a meta-analysis.

  5. Abstract Background

    Climate change presents an imminent threat to almost all biological systems across the globe. In recent years there have been a series of studies showing how changes in climate can impact infectious disease transmission. Many of these publications focus on simulations based on in silico data, overshadowing empirical research based on field and laboratory data. A synthesis of empirical climate change and infectious disease research is still lacking.

    Methods

    We conducted a systematic review of research from the 2015 to 2020 period on climate change and infectious diseases to identify major trends and current gaps in research. Literature was sourced from the Web of Science and PubMed repositories using a keyword search, and was reviewed by a team of reviewers using delineated inclusion criteria.

    Results

    Our review revealed that both taxonomic and geographic biases are present in climate and infectious disease research, specifically with regard to types of disease transmission and localities studied. Empirical investigations on vector-borne diseases associated with mosquitoes comprised the majority of the climate change and infectious disease literature. Furthermore, demographic trends in the publishing institutions and individuals revealed a bias towards research conducted in temperate, high-income countries. We also identified key trends in funding sources for the most recent literature and a discrepancy in the gender identities of publishing authors, which may reflect current systemic inequities in the scientific field.

    Conclusions

    Future research lines on climate change and infectious diseases should consider diseases of direct transmission (non-vector-borne) and devote more research effort to the tropics. Inclusion of local research in low- and middle-income countries was generally neglected. Research on climate change and infectious disease has failed to be socially inclusive, geographically balanced, and broad in terms of the disease systems studied, limiting our capacity to better understand the actual effects of climate change on health.
