Title: Text to Insight: Accelerating Organic Materials Knowledge Extraction via Deep Learning
Abstract

Scientific literature is one of the most significant resources for sharing knowledge, and researchers turn to it as a first step in designing an experiment. Given the extensive and growing volume of literature, the common approach of reading and manually extracting knowledge is too time-consuming, creating a bottleneck in the research cycle. This challenge spans nearly every scientific domain. In materials science, experimental data distributed across millions of publications are extremely helpful for predicting materials properties and designing novel materials. However, only recently have researchers explored computational approaches to knowledge extraction, and primarily for inorganic materials. This study explores knowledge extraction for organic materials. We built a research dataset composed of 855 annotated and 708,376 unannotated sentences drawn from 92,667 abstracts, and used named-entity recognition (NER) with a BiLSTM-CNN-CRF deep learning model to automatically extract key knowledge from the literature. Early-phase results show a high potential for automated knowledge extraction. The paper presents our findings and a framework for supervised knowledge extraction that can be adapted to other scientific domains.
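Training data for a sentence-level NER model of this kind is typically encoded with per-token labels. A minimal sketch of the common BIO tagging convention is below; the paper does not specify its label set, so the MATERIAL and PROPERTY entity types and the example sentence are illustrative only:

```python
def bio_tags(tokens, spans):
    """Convert entity spans [(start, end, type)] over a token list
    into per-token BIO labels: B-TYPE opens an entity, I-TYPE
    continues it, and O marks tokens outside any entity."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags

# Hypothetical annotated sentence (entity types are assumptions,
# not the paper's actual schema).
tokens = ["Pentacene", "thin", "films", "show", "high", "hole", "mobility"]
spans = [(0, 1, "MATERIAL"), (5, 7, "PROPERTY")]
print(bio_tags(tokens, spans))
# → ['B-MATERIAL', 'O', 'O', 'O', 'O', 'B-PROPERTY', 'I-PROPERTY']
```

A BiLSTM-CNN-CRF tagger is then trained to predict one such label per token, with the CRF layer enforcing valid label transitions (e.g. I-PROPERTY cannot follow O).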

 
Award ID(s):
1940239
NSF-PAR ID:
10306121
Author(s) / Creator(s):
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Proceedings of the Association for Information Science and Technology
Volume:
58
Issue:
1
ISSN:
2373-9231
Page Range / eLocation ID:
p. 558-562
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    Purpose The output of academic literature has increased significantly due to digital technology, presenting researchers across every discipline, including materials science, with a challenge: it is impossible to manually read and extract knowledge from millions of published articles. The purpose of this study is to address this challenge by exploring knowledge extraction in materials science, as applied to digital scholarship. An overriding goal is to help inform readers about the status of knowledge extraction in materials science. Design/methodology/approach The authors conducted a two-part analysis, comparing knowledge extraction methods applied to materials science scholarship across a sample of 22 articles, followed by a comparison of HIVE-4-MAT, an ontology-based knowledge extraction application, and MatScholar, a named entity recognition (NER) application. This paper covers contextual background and a review of three tiers of knowledge extraction (ontology-based, NER and relation extraction), followed by the research goals and approach. Findings The results indicate three key needs for researchers to consider in advancing knowledge extraction: the need for materials science focused corpora; the need for researchers to define the scope of the research being pursued; and the need to understand the tradeoffs among different knowledge extraction methods. This paper also points to future materials science research potential with relation extraction and increased availability of ontologies. Originality/value To the best of the authors' knowledge, there are very few studies examining knowledge extraction in materials science. This work makes an important contribution to this underexplored research area.
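At its simplest, the ontology-based tier of knowledge extraction matches a controlled vocabulary against text. The sketch below illustrates that idea only; HIVE-4-MAT's actual pipeline, vocabularies, and matching rules are more sophisticated, and the terms here are invented for illustration:

```python
import re

def match_terms(text, vocabulary):
    """Return (term, start) hits for each vocabulary term found in text.
    Case-insensitive whole-word matching; a real ontology-based extractor
    would also normalize synonyms and walk the concept hierarchy."""
    hits = []
    for term in vocabulary:
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE):
            hits.append((term, m.start()))
    return sorted(hits, key=lambda h: h[1])

# Toy controlled vocabulary and sentence (not from HIVE-4-MAT).
vocab = ["band gap", "perovskite", "thin film"]
text = "Perovskite thin film solar cells exhibit a tunable band gap."
print(match_terms(text, vocab))
# → [('perovskite', 0), ('thin film', 11), ('band gap', 51)]
```

The NER tier, by contrast, learns to recognize entities it has never seen in a vocabulary, which is one of the tradeoffs the study compares.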
  2. Abstract

    Conceptual models are necessary to synthesize what is known about a topic, identify gaps in knowledge and improve understanding. The process of developing conceptual models that summarize the literature using ad hoc approaches has high potential to be incomplete due to the challenges of tracking information and hypotheses across the literature.

    We present a novel, systematic approach to conceptual model development through qualitative synthesis and graphical analysis of hypotheses already present in the scientific literature. Our approach has five stages: researchers explicitly define the scope of the question, conduct a systematic review, extract hypotheses from prior studies, assemble hypotheses into a single network model and analyse trends in the model through network analysis.

    The resulting network can be analysed to identify shifts in thinking over time, variation in the application of ideas over different axes of investigation (e.g. geography, taxonomy, ecosystem type) and the most important hypotheses based on the network structure. To illustrate the approach, we present examples from a case study that applied the method to synthesize decades of research on the effects of forest fragmentation on birds.

    This approach can be used to synthesize scientific thinking across any field of research, guide future research to fill knowledge gaps efficiently and help researchers systematically build conceptual models representing alternative hypotheses.
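    The final stage above (analysing trends through network analysis) can be sketched as a weighted-degree tally over a directed hypothesis network. The edges below are invented placeholders, not results from the forest-fragmentation case study:

```python
from collections import Counter

# Directed edges: (cause, effect) hypotheses extracted from studies.
# Edge multiplicity = how many papers proposed the link (node names
# are illustrative, not drawn from the case study).
edges = [
    ("fragmentation", "patch_isolation"),
    ("fragmentation", "edge_effects"),
    ("edge_effects", "nest_predation"),
    ("patch_isolation", "reduced_dispersal"),
    ("fragmentation", "edge_effects"),  # second paper, same hypothesis
]

# A variable's importance can be proxied by how often it appears
# across all proposed links (its weighted degree in the network).
degree = Counter()
for cause, effect in edges:
    degree[cause] += 1
    degree[effect] += 1

print(degree.most_common(2))
```

    Richer analyses on the same structure (centrality over time, per-subfield subgraphs) would support the temporal and geographic comparisons the approach describes.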

     
  3. Introduction: There is an overwhelming number of journal articles for modern researchers to parse through. For instance, there have already been 168,168 cancer-related papers archived on PubMed this year. To keep up with this substantial amount of literature, there is emerging interest in applying artificial intelligence (AI) to facilitate paper reading and the drafting of new scientific ideas. Here, we extend the application of state-of-the-art automatic research assistants to the cancer field. Using training datasets composed of over 5,000 cancer-related journal paper abstracts, we evaluated AI-based background knowledge extraction and abstract writing. The best AI performance is rated on par with human writers in a survey of university cancer researchers. This automatic research assistant tool can potentially speed up scientific discovery and production by helping researchers efficiently read existing papers, create new ideas and write up new discoveries.
  4. Abstract

    A tool that could suggest new personalized research directions and ideas by taking insights from the scientific literature could profoundly accelerate the progress of science. A field that might benefit from such an approach is artificial intelligence (AI) research, where the number of scientific publications has been growing exponentially over recent years, making it challenging for human researchers to keep track of the progress. Here we use AI techniques to predict the future research directions of AI itself. We introduce a graph-based benchmark based on real-world data—the Science4Cast benchmark, which aims to predict the future state of an evolving semantic network of AI. For that, we use more than 143,000 research papers and build up a knowledge network with more than 64,000 concept nodes. We then present ten diverse methods to tackle this task, ranging from pure statistical to pure learning methods. Surprisingly, the most powerful methods use a carefully curated set of network features rather than an end-to-end AI approach. These results indicate great untapped potential for purely ML approaches that do not rely on human domain knowledge. Ultimately, better predictions of new future research directions will be a crucial component of more advanced research suggestion tools.
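    One member of the "pure statistical" family is a common-neighbors baseline: score each currently unconnected concept pair by how many neighbors the two concepts share, and predict the highest-scoring pairs as future edges. The toy graph below is illustrative and not drawn from the Science4Cast data:

```python
from itertools import combinations

# Toy undirected concept co-occurrence graph (adjacency sets);
# node names are placeholders, not Science4Cast concepts.
graph = {
    "transformers": {"attention", "nlp"},
    "attention": {"transformers", "vision"},
    "nlp": {"transformers", "vision"},
    "vision": {"attention", "nlp"},
}

def common_neighbors(g, u, v):
    """Score a candidate future edge by the number of shared neighbors."""
    return len(g[u] & g[v])

# Rank all currently unconnected pairs; the top pair is the
# predicted next edge in the evolving semantic network.
candidates = [(u, v) for u, v in combinations(graph, 2) if v not in graph[u]]
ranked = sorted(candidates, key=lambda p: -common_neighbors(graph, *p))
print(ranked[0])
```

    The benchmark's stronger methods replace this single heuristic with a curated set of such network features feeding a learned ranker.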

     
  5. Abstract

    The recognition of dataset names is a critical task for automatic information extraction in scientific literature, enabling researchers to understand and identify research opportunities. However, existing corpora for dataset mention detection are limited in size and naming diversity. In this paper, we introduce the Dataset Mentions Detection Dataset (DMDD), the largest publicly available corpus for this task. DMDD consists of the DMDD main corpus, comprising 31,219 scientific articles with over 449,000 dataset mentions weakly annotated as in-text spans, and an evaluation set of 450 scientific articles manually annotated for evaluation purposes. We use DMDD to establish baseline performance for dataset mention detection and linking. By analyzing the performance of various models on DMDD, we identify open problems in dataset mention detection. We invite the community to use our dataset as a challenge to develop novel dataset mention detection models.
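    Weak annotation of in-text spans can be approximated by string-matching a list of known dataset names against article text, as sketched below; DMDD's actual weak-labeling pipeline is more involved, and the example sentence and name list are invented:

```python
def weak_annotate(text, dataset_names):
    """Return character-level (start, end, name) spans for every
    occurrence of a known dataset name -- a crude stand-in for
    distantly supervised (weak) span labels."""
    spans = []
    for name in dataset_names:
        start = text.find(name)
        while start != -1:
            spans.append((start, start + len(name), name))
            start = text.find(name, start + 1)
    return sorted(spans)

text = "We train on ImageNet and evaluate on COCO and ImageNet-R."
# Note the third hit lands inside "ImageNet-R" -- a substring false
# positive that illustrates why weak labels are noisy and why a
# manually annotated evaluation set is still needed.
print(weak_annotate(text, ["ImageNet", "COCO"]))
# → [(12, 20, 'ImageNet'), (37, 41, 'COCO'), (46, 54, 'ImageNet')]
```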

     