Title: Leveraging language representation for materials exploration and discovery
Abstract: Data-driven approaches to materials exploration and discovery are building momentum due to emerging advances in machine learning. However, parsimonious representations of crystals for navigating the vast materials search space remain limited. To address this limitation, we introduce a materials discovery framework that utilizes natural language embeddings from language models as representations of compositional and structural features. The contextual knowledge encoded in these language representations conveys information about material properties and structures, enabling both similarity analysis to recall relevant candidates based on a query material and multi-task learning to share information across related properties. Applying this framework to thermoelectrics, we demonstrate diversified recommendations of prototype crystal structures and identify under-studied material spaces. Validation through first-principles calculations and experiments confirms the potential of the recommended materials as high-performance thermoelectrics. Language-based frameworks offer versatile and adaptable embedding structures for effective materials exploration and discovery, applicable across diverse material systems.
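The similarity-based recall step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each material is already represented by an embedding vector and ranks candidates by cosine similarity to a query; the choice of cosine similarity, the function names, the placeholder material names, and the toy three-dimensional vectors are all assumptions for illustration and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def recall_candidates(query, embeddings, top_k=3):
    """Return the top_k (name, score) pairs most similar to the query material."""
    query_vec = embeddings[query]
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in embeddings.items()
        if name != query  # do not recommend the query itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings: real language-model embeddings would have
# hundreds of dimensions; these values are made up.
toy_embeddings = {
    "material_A": [0.9, 0.1, 0.0],
    "material_B": [0.8, 0.2, 0.1],
    "material_C": [0.0, 1.0, 0.0],
    "material_D": [0.1, 0.0, 0.9],
}

ranked = recall_candidates("material_A", toy_embeddings, top_k=2)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

In practice the vectors would come from a language model applied to a text description of each material, and the ranked list would serve only as a shortlist for further screening.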
Award ID(s):
2118201
PAR ID:
10496435
Publisher / Repository:
Nature Publishing Group
Date Published:
Journal Name:
npj Computational Materials
Volume:
10
Issue:
1
ISSN:
2057-3960
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Thermoelectric materials harvest waste heat and convert it into reusable electricity. Thermoelectrics are also widely used in the reverse mode, for example in refrigerators and for cooling electronics. However, most well-known thermoelectric materials to date were proposed and found by intuition, mostly through experiments. Unfortunately, synthesizing materials and measuring their thermoelectric properties through trial-and-error experiments is extremely time- and resource-consuming. Here, we develop a convolutional neural network (CNN) classification model that uses the fused orbital field matrix and composition descriptors to screen a large pool of materials and discover new thermoelectric candidates with a power factor higher than 10 μW cm⁻¹ K⁻². The model used our own data, generated by high-throughput density functional theory calculations coupled with an ab initio scattering and transport package to obtain electronic transport properties without assuming a constant electron relaxation time, which yields more reliable electronic transport calculations than previous studies. The classification model was also compared with traditional machine learning algorithms such as gradient boosting and random forest. We deployed the classification model on 3465 cubic, dynamically stable structures with non-zero bandgap screened from the Open Quantum Materials Database. We identified many high-performance thermoelectric materials with ZT > 1 or close to 1 across a wide temperature range from 300 to 700 K, for both n- and p-type doping at different doping concentrations. Moreover, our feature importance and maximal information coefficient analyses reveal two previously unreported material descriptors, namely mean melting temperature and low average deviation of electronegativity, that are strongly correlated with power factor and thus provide a new route for quickly screening potential thermoelectrics with a high success rate. Our deep CNN model with fused orbital field matrix and composition descriptors is very promising for screening high-power-factor thermoelectrics from large-scale hypothetical structures.
  2. Although machine learning (ML) methods have been widely used in recent years, the predicted materials properties usually cannot exceed the range of the original training data. We deployed a boundless objective-free exploration approach that combines traditional ML and density functional theory (DFT) to search for extreme material properties. This combination not only improves the efficiency of screening large-scale materials with minimal DFT inquiry, but also yields properties beyond the original training range. We use Stein novelty to recommend outliers and then verify them using DFT. Validated data are then added to the training dataset for the next iteration. We test the training-recommendation-validation loop in mechanical property space. By screening 85,707 crystal structures, we identify 21 ultrahigh-hardness structures and 11 negative Poisson’s ratio structures. The algorithm is very promising for future materials discovery, as it can push materials properties to the limit with DFT calculations on only ~1% of the structures in the screening pool.
  3. Modern data mining methods have demonstrated effectiveness in comprehending and predicting materials properties. An essential component of materials discovery is knowing which material(s) will possess desirable properties. For many materials properties, performing experiments and density functional theory computations is costly and time-consuming. Hence, it is challenging to build accurate predictive models for such properties using conventional data mining methods due to the small amount of available data. Here we present a framework for materials property prediction tasks using structure information that leverages a graph neural network-based architecture along with deep-transfer-learning techniques to drastically improve the model’s predictive ability on diverse materials (3D/2D, inorganic/organic, computational/experimental) data. We evaluated the proposed framework in cross-property and cross-materials-class scenarios using 115 datasets and found that transfer learning models outperform models trained from scratch in 104 cases, i.e., ≈90%, with additional performance benefits for extrapolation problems. We believe the proposed framework can be widely useful in accelerating materials discovery in materials science.
  4. The development of next-generation energy storage systems relies on discovering new materials that support multivalent-ion transport. Transition metal oxides (TMOs) are promising due to their structural versatility, high ionic conductivity, and ability to accommodate multiple charge carriers. However, their vast compositional and structural diversity makes traditional exploration inefficient. This work presents a generative AI framework combining a crystal diffusion variational autoencoder (CDVAE) and a fine-tuned large language model (LLM) to discover porous oxide materials. Thousands of candidate structures are generated and screened for structural validity, thermodynamic stability, and electronic properties using a graph-based machine learning model and density functional theory (DFT) calculations. The CDVAE identifies a broader variety of structures, including five novel TMO-based candidates, while the LLM excels at generating highly stable structures near equilibrium. This approach demonstrates the power of generative AI in accelerating the discovery of advanced battery materials for multivalent-ion storage.
  5. Pretraining molecular representations is crucial for drug and material discovery. Recent methods focus on learning representations from geometric structures, effectively capturing 3D position information. Yet, they overlook the rich information in biomedical texts, which detail molecules’ properties and substructures. With this in mind, we set up a data collection effort for 200K pairs of ground-state geometric structures and biomedical texts, resulting in the PubChem3D dataset. Based on this dataset, we propose the GeomCLIP framework to enhance geometric pretraining and understanding with biomedical texts. During pretraining, we design two types of tasks, i.e., multimodal representation alignment and unimodal denoising pretraining, to align the 3D geometric encoder with textual information while preserving its original representation power. Experimental results show the effectiveness of GeomCLIP in various tasks such as molecule property prediction, zero-shot text-molecule retrieval, and 3D molecule captioning. Our code and collected dataset are available at https://github.com/xiaocui3737/GeomCLIP.
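
The training-recommendation-validation loop described in item 2 can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' algorithm: nearest-neighbor distance in feature space is swapped in for Stein novelty, no surrogate ML model is trained, and a toy function stands in for the DFT validation step. The names `novelty`, `explore`, and `toy_oracle` are hypothetical.

```python
import math


def novelty(candidate, labeled):
    """Distance from a candidate to its nearest already-labeled point.

    Stand-in for Stein novelty: larger means less like known data.
    """
    return min(math.dist(candidate, known) for known in labeled)


def explore(pool, labeled, oracle, rounds=3, batch=2):
    """Repeatedly pick the most novel candidates, validate them, and
    add the validated results to the labeled set.

    Assumes `labeled` is a non-empty dict mapping feature tuples to values.
    """
    labeled = dict(labeled)  # copy so the caller's data is untouched
    pool = list(pool)
    for _ in range(rounds):
        if not pool:
            break
        # Recommendation: rank remaining candidates by novelty.
        pool.sort(key=lambda c: novelty(c, labeled), reverse=True)
        picked, pool = pool[:batch], pool[batch:]
        # Validation: the oracle plays the role of an expensive calculation.
        for candidate in picked:
            labeled[candidate] = oracle(candidate)
    return labeled


def toy_oracle(features):
    """Hypothetical property function used in place of DFT."""
    return features[0] ** 2 + features[1] ** 2


start = {(0.0, 0.0): 0.0}
candidates = [(0.1, 0.0), (1.0, 1.0), (2.0, 2.0), (0.2, 0.1), (3.0, 0.0)]
result = explore(candidates, start, toy_oracle, rounds=2, batch=1)
print(sorted(result.items()))
```

Because each round favors points far from the labeled set, the validated values can move beyond the range of the starting data, which is the behavior the abstract highlights.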