skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Fine-Tuning BERT Model for Materials Named Entity Recognition
Scientific literature presents a wellspring of cutting-edge knowledge for materials science, including valuable data (e.g., numerical data from experiment results, material properties and structure). These data are critical for accelerating materials discovery by data-driven machine learning (ML) methods. The challenge is, it is impossible for humans to manually extract and retain this knowledge due to the extensive and growing volume of publications.To this end, we explore a fine-tuned BERT model for extracting knowledge. Our preliminary results show that our fine-tuned Bert model reaches an f-score of 85% for the materials named entity recognition task. The paper covers background, related work, methodology including tuning parameters, and our overall performance evaluation. Our discussion offers insights into our results, and points to directions for next steps.  more » « less
Award ID(s):
1940239
PAR ID:
10393959
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
2021 IEEE International Conference on Big Data (Big Data)
Page Range / eLocation ID:
3717 to 3720
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We present CyBERT, a domain-specific Bidirectional Encoder Representations from Transformers (BERT) model, fine-tuned with a large corpus of textual cybersecurity data. State-of-the-art natural language models that can process dense, fine-grained textual threat, attack, and vulnerability information can provide numerous benefits to the cybersecurity community. The primary contribution of this paper is providing the security community with an initial fine-tuned BERT model that can perform a variety of cybersecurity-specific downstream tasks with high accuracy and efficient use of resources. We create a cybersecurity corpus from open-source unstructured and semi-unstructured Cyber Threat Intelligence (CTI) data and use it to fine-tune a base BERT model with Masked Language Modeling (MLM) to recognize specialized cybersecurity entities. We evaluate the model using various downstream tasks that can benefit modern Security Operations Centers (SOCs). The finetuned CyBERT model outperforms the base BERT model in the domain-specific MLM evaluation. We also provide use-cases of CyBERT application in cybersecurity based downstream tasks. 
    more » « less
  2. null (Ed.)
    Large Transformer-based models were shown to be reducible to a smaller number of self-attention heads and layers. We consider this phenomenon from the perspective of the lottery ticket hypothesis, using both structured and magnitude pruning. For fine-tuned BERT, we show that (a) it is possible to find subnetworks achieving performance that is comparable with that of the full model, and (b) similarly-sized subnetworks sampled from the rest of the model perform worse. Strikingly, with structured pruning even the worst possible subnetworks remain highly trainable, indicating that most pre-trained BERT weights are potentially useful. We also study the “good” subnetworks to see if their success can be attributed to superior linguistic knowledge, but find them unstable, and not explained by meaningful self-attention patterns. 
    more » « less
  3. null (Ed.)
    We investigate the extent to which individual attention heads in pretrained transformer language models, such as BERT and RoBERTa, implicitly capture syntactic dependency relations. We employ two methods---taking the maximum attention weight and computing the maximum spanning tree---to extract implicit dependency relations from the attention weights of each layer/head, and compare them to the ground-truth Universal Dependency (UD) trees. We show that, for some UD relation types, there exist heads that can recover the dependency type significantly better than baselines on parsed English text, suggesting that some self-attention heads act as a proxy for syntactic structure. We also analyze BERT fine-tuned on two datasets---the syntax-oriented CoLA and the semantics-oriented MNLI---to investigate whether fine-tuning affects the patterns of their self-attention, but we do not observe substantial differences in the overall dependency relations extracted using our methods. Our results suggest that these models have some specialist attention heads that track individual dependency types, but no generalist head that performs holistic parsing significantly better than a trivial baseline, and that analyzing attention weights directly may not reveal much of the syntactic knowledge that BERT-style models are known to learn. 
    more » « less
  4. Abstract The prediction of crystal properties plays a crucial role in materials science and applications. Current methods for predicting crystal properties focus on modeling crystal structures using graph neural networks (GNNs). However, accurately modeling the complex interactions between atoms and molecules within a crystal remains a challenge. Surprisingly, predicting crystal properties from crystal text descriptions is understudied, despite the rich information and expressiveness that text data offer. In this paper, we develop and make public a benchmark dataset (TextEdge) that contains crystal text descriptions with their properties. We then propose LLM-Prop, a method that leverages the general-purpose learning capabilities of large language models (LLMs) to predict properties of crystals from their text descriptions. LLM-Prop outperforms the current state-of-the-art GNN-based methods by approximately 8% on predicting band gap, 3% on classifying whether the band gap is direct or indirect, and 65% on predicting unit cell volume, and yields comparable performance on predicting formation energy per atom, energy per atom, and energy above hull. LLM-Prop also outperforms the fine-tuned MatBERT, a domain-specific pre-trained BERT model, despite having 3 times fewer parameters. We further fine-tune the LLM-Prop model directly on CIF files and condensed structure information generated by Robocrystallographer and found that LLM-Prop fine-tuned on text descriptions provides a better performance on average. Our empirical results highlight the importance of having a natural language input to LLMs to accurately predict crystal properties and the current inability of GNNs to capture information pertaining to space group symmetry and Wyckoff sites for accurate crystal property prediction. 
    more » « less
  5. RNN Tranducer (RNN-T) technology is very popular for building deployable models for end-to-end (E2E) automatic speech recognition (ASR) and spoken language understanding (SLU). Since these are E2E models operating on speech directly, there remains a potential to improve their performance using purely text based models like BERT, which have strong language understanding capabilities. In this paper, we propose a new training criteria for RNN-T based E2E ASR and SLU to transfer BERT’s knowledge into these systems. In the first stage of our proposed mechanism, we improve ASR performance by using a fine-grained, tokenwise knowledge transfer from BERT. In the second stage, we fine-tune the ASR model for SLU such that the above knowledge is explicitly utilized by the RNN-T model for improved performance. Our techniques improve ASR performance on the Switchboard and CallHome test sets of the NIST Hub5 2000 evaluation and on the recently released SLURP dataset on which we achieve a new state-of-the-art performance. For SLU, we show significant improvements on the SLURP slot filling task, outperforming HuBERT-base and reaching a performance close to HuBERTlarge. Compared to large transformer based speech models like HuBERT, our model is significantly more compact and uses only 300 hours of speech pretraining data. 
    more » « less