

Title: Named Entity Recognition Models for Machine Learning Interatomic Potentials: A User‐Centric Approach to Knowledge Extraction from Scientific Literature
Machine learning interatomic potential (MLIP) is an emerging technique that has helped achieve molecular dynamics simulations with unprecedented balance between efficiency and accuracy. Recently, the body of MLIP literature has been growing rapidly, which propels the need to automatically process relevant information for researchers to understand and utilize. Named entity recognition (NER), a natural language processing technique that identifies and categorizes information from texts, may help summarize key approaches and findings of relevant papers. In this work, we develop an NER model for MLIP literature by fine‐tuning a pre‐trained language model. To streamline text annotation, we build a user‐friendly web application for annotation and proofreading, which is seamlessly integrated into the training procedure. Our model can identify technical entities with an F1 score of 0.8 for new MLIP paper abstracts using only 60 training paper abstracts and up to 0.75 for scientific texts on different topics. Notably, some “errors” in predictions are actually reasonable decisions, showcasing the model's ability beyond what the performance metrics indicate. This work demonstrates the linguistic capabilities of the NER approach in processing textual information of a specific scientific domain and has the potential to accelerate materials research using language models and contribute to a user‐centric workflow.
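The F1 scores quoted above are entity-level scores. As an illustrative sketch (not the authors' code; all names are hypothetical), exact-match precision, recall, and F1 over predicted versus gold (start, end, type) spans can be computed as:

```python
def span_f1(gold, pred):
    """Entity-level P/R/F1 over exact-match (start, end, label) spans.

    gold, pred: iterables of (start, end, label) tuples for one corpus.
    """
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                       # exact span + type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: two gold entities; the model finds one plus a spurious span.
gold = [(0, 2, "MLIP"), (5, 7, "METHOD")]
pred = [(0, 2, "MLIP"), (9, 10, "DATASET")]
p, r, f = span_f1(gold, pred)   # -> (0.5, 0.5, 0.5)
```

Exact-match scoring is strict: a prediction with the right type but a boundary off by one token counts as both a false positive and a false negative, which is one reason some counted "errors" can still be reasonable decisions.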
Award ID(s):
2119276
PAR ID:
10599195
Author(s) / Creator(s):
 ;  
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Advanced Intelligent Discovery
ISSN:
2943-9981
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like This
  1. Scientific literature analysis needs fine-grained named entity recognition (NER) to provide a wide range of information for scientific discovery. For example, chemistry research needs to study dozens to hundreds of distinct, fine-grained entity types, making consistent and accurate annotation difficult even for crowds of domain experts. On the other hand, domain-specific ontologies and knowledge bases (KBs) can be easily accessed, constructed, or integrated, which makes distant supervision realistic for fine-grained chemistry NER. In distant supervision, training labels are generated by matching mentions in a document with the concepts in the KBs. However, this kind of KB-matching suffers from two major challenges: incomplete annotation and noisy annotation. We propose ChemNER, an ontology-guided, distantly-supervised method for fine-grained chemistry NER to tackle these challenges. It leverages the chemistry type ontology structure to generate distant labels with novel methods of flexible KB-matching and ontology-guided multi-type disambiguation. It significantly improves the distant label generation for the subsequent sequence labeling model training. We also provide an expert-labeled chemistry NER dataset with 62 fine-grained chemistry types (e.g., chemical compounds and chemical reactions). Experimental results show that ChemNER is highly effective, substantially outperforming state-of-the-art NER methods (with a 0.25 absolute F1-score improvement).
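The KB-matching step of distant supervision can be sketched as longest-first dictionary matching that emits BIO labels (a minimal illustration, not ChemNER's actual matcher; the KB entries are hypothetical):

```python
def distant_labels(tokens, kb):
    """Generate BIO training labels by matching token n-grams against a
    KB mapping surface forms -> entity types (distant supervision)."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate mention first to avoid partial matches.
        for j in range(len(tokens), i, -1):
            mention = " ".join(tokens[i:j]).lower()
            if mention in kb:
                labels[i] = "B-" + kb[mention]
                for k in range(i + 1, j):
                    labels[k] = "I-" + kb[mention]
                i = j
                matched = True
                break
        if not matched:
            i += 1
    return labels

kb = {"sodium chloride": "COMPOUND", "hydrolysis": "REACTION"}
toks = "sodium chloride undergoes hydrolysis".split()
distant_labels(toks, kb)
# -> ['B-COMPOUND', 'I-COMPOUND', 'O', 'B-REACTION']
```

The two failure modes named in the abstract are visible here: mentions missing from the KB stay "O" (incomplete annotation), and an ambiguous surface form mapped to the wrong type yields a noisy label — the problems ChemNER's flexible matching and multi-type disambiguation target.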
  2. This work introduces TrialSieve, a novel framework for biomedical information extraction that enhances clinical meta-analysis and drug repurposing. By extending traditional PICO (Patient, Intervention, Comparison, Outcome) methodologies, TrialSieve incorporates hierarchical, treatment group-based graphs, enabling more comprehensive and quantitative comparisons of clinical outcomes. TrialSieve was used to annotate 1609 PubMed abstracts, producing 170,557 annotations and 52,638 final spans, and incorporates 20 unique annotation categories that capture a diverse range of biomedical entities relevant to systematic reviews and meta-analyses. The performance (accuracy, precision, recall, F1-score) of four natural-language processing (NLP) models (BioLinkBERT, BioBERT, KRISSBERT, PubMedBERT) and the large language model (LLM) GPT-4o was evaluated using the human-annotated TrialSieve dataset. BioLinkBERT had the best accuracy (0.875) and recall (0.679) for biomedical entity labeling, whereas PubMedBERT had the best precision (0.614) and F1-score (0.639). Error analysis showed that NLP models trained on noisy, human-annotated data can match or, in most cases, surpass human performance. This finding highlights the feasibility of fully automating biomedical information extraction, even when relying on imperfectly annotated datasets. An annotator user study (n = 39) revealed significant (p < 0.05) gains in efficiency and human annotation accuracy with the unique TrialSieve tree-based annotation approach. In summary, TrialSieve provides a foundation to improve automated biomedical information extraction for frontend clinical research.
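Comparisons like the one above reduce to computing precision, recall, and F1 from per-model true/false positive and false negative counts and ranking models per metric. A minimal sketch (the counts below are hypothetical, not TrialSieve's numbers):

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw entity-labeling counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Hypothetical (tp, fp, fn) counts for two models on the same test set.
models = {"model_a": (60, 40, 20), "model_b": (50, 10, 50)}
scores = {name: prf(*counts) for name, counts in models.items()}

# A different model can win on each metric, as in the abstract:
best_precision = max(scores, key=lambda m: scores[m][0])  # model_b
best_f1 = max(scores, key=lambda m: scores[m][2])         # model_a
```

This is why the abstract can report one model winning on accuracy and recall while another wins on precision and F1: the metrics trade off differently against false positives and false negatives.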
  3. Scientific literature is one of the most significant resources for sharing knowledge. Researchers turn to scientific literature as a first step in designing an experiment. Given the extensive and growing volume of literature, the common approach of reading and manually extracting knowledge is too time consuming, creating a bottleneck in the research cycle. This challenge spans nearly every scientific domain. For materials science, experimental data distributed across millions of publications are extremely helpful for predicting materials properties and designing novel materials. However, only recently have researchers explored computational approaches for knowledge extraction, primarily for inorganic materials. This study aims to explore knowledge extraction for organic materials. We built a research dataset composed of 855 annotated and 708,376 unannotated sentences drawn from 92,667 abstracts. We used named‐entity recognition (NER) with a BiLSTM‐CNN‐CRF deep learning model to automatically extract key knowledge from literature. Early‐phase results show a high potential for automated knowledge extraction. The paper presents our findings and a framework for supervised knowledge extraction that can be adapted to other scientific domains.
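In a BiLSTM-CNN-CRF tagger, the CRF layer picks the label sequence that maximizes the sum of per-token emission scores (from the BiLSTM) and learned label-transition scores, decoded with the Viterbi algorithm. A minimal, dictionary-based decoder for illustration (not the paper's implementation; all scores below are made up):

```python
def viterbi(emissions, transitions, labels):
    """Viterbi decoding for a linear-chain CRF tagging layer.

    emissions:   list of {label: score} dicts, one per token.
    transitions: {(prev_label, label): score}; missing pairs score 0.
    labels:      the tag inventory, e.g. ["O", "B", "I"].
    """
    # best[t][y] = (score of best path ending in label y at step t, backpointer)
    best = [{y: (emissions[0][y], None) for y in labels}]
    for t in range(1, len(emissions)):
        step = {}
        for y in labels:
            prev, score = max(
                ((p, best[t - 1][p][0] + transitions.get((p, y), 0.0))
                 for p in labels),
                key=lambda item: item[1],
            )
            step[y] = (score + emissions[t][y], prev)
        best.append(step)
    # Backtrack from the best final label.
    y = max(best[-1], key=lambda lab: best[-1][lab][0])
    path = [y]
    for t in range(len(emissions) - 1, 0, -1):
        y = best[t][y][1]
        path.append(y)
    return list(reversed(path))
```

For example, penalizing the illegal transition O→I steers the decoder toward a well-formed B-I sequence even when the per-token scores alone are ambiguous — the main benefit of the CRF layer over independent per-token classification.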
  4. We present PAPERCLIP (Proposal Abstracts Provide an Effective Representation for Contrastive Language-Image Pre-training), a method which associates astronomical observations imaged by telescopes with natural language using a neural network model. The model is fine-tuned from a pre-trained Contrastive Language–Image Pre-training (CLIP) model using successful observing proposal abstracts and corresponding downstream observations, with the abstracts optionally summarized via guided generation using large language models (LLMs). Using observations from the Hubble Space Telescope (HST) as an example, we show that the fine-tuned model embodies a meaningful joint representation between observations and natural language through quantitative evaluation as well as tests targeting image retrieval (i.e., finding the most relevant observations using natural language queries) and description retrieval (i.e., querying for astrophysical object classes and use cases most relevant to a given observation). Our study demonstrates the potential for using generalist foundation models rather than task-specific models for interacting with astronomical data by leveraging text as an interface.
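CLIP-style retrieval works by ranking candidates in a shared embedding space by similarity to the query embedding. A toy sketch with hand-made 2-D vectors standing in for CLIP outputs (hypothetical values, not PAPERCLIP's code or embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u))
            * math.sqrt(sum(b * b for b in v)))
    return dot / norm

def retrieve(query_emb, obs_embs):
    """Rank observation indices by similarity to a text query embedding,
    mirroring CLIP-style image retrieval in a joint embedding space."""
    sims = [cosine(query_emb, e) for e in obs_embs]
    return sorted(range(len(obs_embs)), key=lambda i: -sims[i])

# Toy 2-D embeddings; the second observation aligns best with the query.
query = [1.0, 0.0]
observations = [[0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]
retrieve(query, observations)  # -> [1, 2, 0]
```

Description retrieval is the same operation in the other direction: embed an observation and rank candidate text descriptions by similarity.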
  5.
    The number of published manufacturing science digital articles available from scientific journals and the broader web has increased exponentially every year since the 1990s. For a novice engineer or an experienced researcher to assimilate all of this knowledge requires significant synthesis of the existing knowledge space contained within published material to find answers to basic and complex queries. Algorithmic approaches through machine learning, and specifically Natural Language Processing (NLP), on a domain-specific area such as manufacturing are lacking. One of the significant challenges to analyzing manufacturing vocabulary is the lack of a named entity recognition model that enables algorithms to classify the manufacturing corpus of words under various manufacturing semantic categories. This work presents a supervised machine learning approach to categorize unstructured text from 500K+ manufacturing science related scientific abstracts and label them under various manufacturing topic categories. A neural network model using a bidirectional long-short term memory plus a conditional random field (BiLSTM+CRF) is trained to extract information from manufacturing science abstracts. Our classifier achieves an overall accuracy (F1-score) of 88%, which is quite near to the state-of-the-art performance. Two use case examples are presented that demonstrate the value of the developed NER model as a Technical Language Processing (TLP) workflow on manufacturing science documents. The long term goal is to extract valuable knowledge regarding the connections and relationships between key manufacturing concepts/entities available within millions of manufacturing documents into a structured labeled-property graph data structure that allows for programmatic query and retrieval.
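A first step toward the labeled-property graph described above is counting co-occurrence edges between entities extracted from the same abstract. A toy sketch under that assumption (not the paper's pipeline; the example entities are made up):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_graph(docs):
    """Build weighted edges between entities that co-occur in the same
    abstract; a simple precursor to a queryable property graph."""
    edges = Counter()
    for entities in docs:
        # One undirected edge per unique entity pair in the document.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges

docs = [
    ["milling", "titanium", "tool wear"],   # entities from abstract 1
    ["milling", "titanium"],                # entities from abstract 2
]
g = cooccurrence_graph(docs)
# g[("milling", "titanium")] == 2
```

In practice the edge weights, entity types, and provenance (source document IDs) would be stored as properties on nodes and edges in a graph database, enabling the programmatic query and retrieval the abstract envisions.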