This content will become publicly available on May 28, 2025

Title: Discovering melting temperature prediction models of inorganic solids by combining supervised and unsupervised learning

The melting temperature is important for materials design because of its relationship with thermal stability, synthesis, and processing conditions. Current empirical and computational melting point estimation techniques are limited in scope, computational feasibility, or interpretability. We report the development of a machine learning methodology for predicting melting temperatures of binary ionic solid materials. We evaluated different machine-learning models trained on a dataset of the melting points of 476 non-metallic crystalline binary compounds using materials embeddings constructed from elemental properties and density-functional theory calculations as model inputs. A direct supervised-learning approach yields a mean absolute error of around 180 K but suffers from low interpretability. We find that the fidelity of predictions can further be improved by introducing an additional unsupervised-learning step that first classifies the materials before the melting-point regression. Not only does this two-step model exhibit improved accuracy, but the approach also provides a level of interpretability with insights into feature importance and different types of melting that depend on the specific atomic bonding inside a material. Motivated by this finding, we used a symbolic learning approach to find interpretable physical models for the melting temperature, which recovered the best-performing features from both prior models and provided additional interpretability.
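
The two-step idea described above (group the materials without labels first, then fit a separate regressor within each group) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name the clustering algorithm, the regressors, or the features, so one-dimensional k-means and per-cluster least-squares lines stand in for them, and the numbers are synthetic rather than taken from the 476-compound dataset. The names `kmeans_1d`, `fit_line`, and `TwoStepModel` are hypothetical.

```python
from statistics import mean


def nearest(x, centroids):
    """Index of the centroid closest to the scalar feature x."""
    return min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))


def kmeans_1d(xs, k, iters=50):
    """Lloyd's algorithm on scalar features with a deterministic initialization."""
    s = sorted(xs)
    # Spread the initial centroids evenly across the sorted feature values.
    centroids = [s[(2 * j + 1) * len(s) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[nearest(x, centroids)].append(x)
        updated = [mean(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx else 0.0
    return a, my - a * mx


class TwoStepModel:
    """Step 1: unsupervised clustering. Step 2: one regressor per cluster."""

    def fit(self, xs, ys, k=2):
        self.centroids = kmeans_1d(xs, k)
        fallback = fit_line(xs, ys)  # used only if a cluster ends up empty
        self.lines = []
        for j in range(k):
            idx = [i for i, x in enumerate(xs) if nearest(x, self.centroids) == j]
            if idx:
                self.lines.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
            else:
                self.lines.append(fallback)
        return self

    def predict(self, x):
        a, b = self.lines[nearest(x, self.centroids)]
        return a * x + b


def mae(predict, xs, ys):
    """Mean absolute error of a prediction function on (xs, ys)."""
    return mean([abs(predict(x) - y) for x, y in zip(xs, ys)])


# Synthetic data (not from the paper): two groups whose target depends on
# the feature in different ways, mimicking distinct "types of melting".
xs = [i / 10 for i in range(10)] + [4 + i / 10 for i in range(10)]
ys = [1000 + 500 * x for x in xs[:10]] + [3000 - 200 * x for x in xs[10:]]

model = TwoStepModel().fit(xs, ys, k=2)
a_all, b_all = fit_line(xs, ys)

mae_two_step = mae(model.predict, xs, ys)
mae_global = mae(lambda x: a_all * x + b_all, xs, ys)
print(f"single regression MAE: {mae_global:.1f}")
print(f"cluster-then-regress MAE: {mae_two_step:.1f}")
```

On this toy data the per-cluster fits recover each group's trend, while the single regression has to compromise between the two. The cluster assignments also show which relationship applies to which group, which is the kind of interpretability the abstract attributes to the two-step model.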

 
Award ID(s):
1940303
PAR ID:
10532097
Author(s) / Creator(s):
; ; ; ; ; ; ;
Publisher / Repository:
AIP Publishing
Date Published:
Journal Name:
The Journal of Chemical Physics
Volume:
160
Issue:
20
ISSN:
0021-9606
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Machine learning methods have helped to advance a wide range of scientific and technological fields in recent years, including computational chemistry. Because chemical systems can be complex and high-dimensional, feature selection is critical but challenging when developing reliable machine-learning-based prediction models, especially for proteins as bio‐macromolecules. In this study, we applied the sparse group lasso (SGL) method as a general feature selection method to develop a classification model for an allosteric protein in different functional states. This results in a much improved model with comparable accuracy (Acc) and only 28 selected features, compared to 289 selected features in a previous study. The Acc reaches 91.50% with 1936 selected features, which is far higher than that of baseline methods. In addition, grouping protein amino acids into secondary structures provides additional interpretability of the selected features. The selected features are verified as associated with key allosteric residues through comparison with both experimental and computational works on the model protein, and demonstrate the effectiveness and necessity of applying rigorous feature selection and evaluation methods to complex chemical systems.

     
  2. Abstract Background

    Advanced machine learning models have received wide attention in assisting medical decision making due to the greater accuracy they can achieve. However, their limited interpretability is a barrier to adoption by practitioners. Recent advancements in interpretable machine learning tools allow us to look inside the black box of advanced prediction methods to extract interpretable models while maintaining similar prediction accuracy, but few studies have investigated the specific hospital readmission prediction problem in this spirit.

    Methods

    Our goal is to develop a machine-learning (ML) algorithm that can predict 30- and 90-day hospital readmissions as accurately as black box algorithms while providing medically interpretable insights into readmission risk factors. Leveraging a state-of-the-art interpretable ML model, we use a two-step Extracted Regression Tree approach to achieve this goal. In the first step, we train a black box prediction algorithm. In the second step, we extract a regression tree from the output of the black box algorithm that allows direct interpretation of medically relevant risk factors. We use data from a large teaching hospital in Asia to learn the ML model and verify our two-step approach.

    Results

    The two-step method achieves prediction performance similar to that of the best black box model, such as Neural Networks, as measured by three metrics: accuracy, the Area Under the Curve (AUC) and the Area Under the Precision-Recall Curve (AUPRC), while maintaining interpretability. Further, to examine whether the prediction results match known medical insights (i.e., the model is truly interpretable and produces reasonable results), we show that key readmission risk factors extracted by the two-step approach are consistent with those found in the medical literature.

    Conclusions

    The proposed two-step approach yields meaningful prediction results that are both accurate and interpretable. This study suggests a viable means to improve the trust of machine learning based models in clinical practice for predicting readmissions through the two-step approach.

     
  3. Abstract

    The budding field of materials informatics has coincided with a shift towards artificial intelligence to discover new solid-state compounds. The steady expansion of repositories for crystallographic and computational data has set the stage for developing data-driven models capable of predicting a bevy of physical properties. Machine learning methods, in particular, have already shown the ability to identify materials with near ideal properties for energy-related applications by screening crystal structure databases. However, examples of the data-guided discovery of entirely new, never-before-reported compounds remain limited. The critical step for determining if an unknown compound is synthetically accessible is obtaining the formation energy and constructing the associated convex hull. Fortunately, this information has become widely available through density functional theory (DFT) data repositories to the point that they can be used to develop machine learning models. In this Review, we discuss the specific design choices for developing a machine learning model capable of predicting formation energy, including the thermodynamic quantities governing material stability. We investigate several models presented in the literature that cover various possible architectures and feature sets and find that they have succeeded in uncovering new DFT-stable compounds and directing materials synthesis. To expand access to machine learning models for synthetic solid-state chemists, we additionally present MatLearn. This web-based application is intended to guide the exploration of a composition diagram towards regions likely to contain thermodynamically accessible inorganic compounds. Finally, we discuss the future of machine-learned formation energy and highlight the opportunities for improved predictive power toward the synthetic realization of new energy-related materials.

     
  4.
    The application of machine learning models and algorithms towards describing atomic interactions has been a major area of interest in materials simulations in recent years, as machine learning interatomic potentials (MLIPs) are seen as being more flexible and accurate than their classical potential counterparts. This increase in accuracy of MLIPs over classical potentials has come at the cost of significantly increased complexity, leading to higher computational costs and lower physical interpretability and spurring research into improving the speeds and interpretability of MLIPs. As an alternative, in this work we leverage “machine learning” fitting databases and advanced optimization algorithms to fit a class of spline-based classical potentials, showing that they can be systematically improved in order to achieve accuracies comparable to those of low-complexity MLIPs. These results demonstrate that high model complexities may not be strictly necessary in order to achieve near-DFT accuracy in interatomic potentials and suggest an alternative route towards sampling the high accuracy, low complexity region of model space by starting with forms that promote simpler and more interpretable interatomic potentials.
  5.
    Machine learning algorithms can learn mechanisms of antimicrobial resistance (AMR) from DNA sequence data without any a priori information. Interpreting a trained machine learning algorithm can be used to validate the model and to obtain new information about resistance mechanisms. Different feature extraction methods, such as SNP calling and counting nucleotide k-mers, have been proposed for presenting DNA sequences to the model. However, there are trade-offs between interpretability, computational complexity and accuracy for different feature extraction methods. In this study, we have proposed a new feature extraction method, counting amino acid k-mers or oligopeptides, which provides easier model interpretation than counting nucleotide k-mers and reaches the same or even better accuracy than other methods. Additionally, we have trained machine learning algorithms using different feature extraction methods and compared the results in terms of accuracy, model interpretability and computational complexity. We have built a new feature selection pipeline for extracting important features so that new AMR determinants can be discovered by analyzing these features. This pipeline allows the construction of models that use only a small number of features and can predict resistance accurately.