skip to main content

This content will become publicly available on February 7, 2023

Title: DiSCoVeR: a materials discovery screening tool for high performance, unique chemical compositions
We present Descending from Stochastic Clustering Variance Regression (DiSCoVeR) (https://www.github.com/sparks-baird/mat_discover), a Python tool for identifying and assessing high-performing, chemically unique compositions relative to existing compounds using a combination of a chemical distance metric, density-aware dimensionality reduction, clustering, and a regression model. In this work, we create pairwise distance matrices between compounds via Element Mover's Distance (ElMD) and use these to create 2D density-aware embeddings for chemical compositions via Density-preserving Uniform Manifold Approximation and Projection (DensMAP). Because ElMD assigns distances between compounds that are more chemically intuitive than Euclidean-based distances, the compounds can then be clustered into chemically homogeneous clusters via Hierarchical Density-based Spatial Clustering of Applications with Noise (HDBSCAN*). In combination with performance predictions via Compositionally-Restricted Attention-Based Network (CrabNet), we introduce several new metrics for materials discovery and validate DiSCoVeR on Materials Project bulk moduli using compound-wise and cluster-wise validation methods. We visualize these via multi-objective Pareto front plots and assign a weighted score to each composition that encompasses the trade-off between performance and density-based chemical uniqueness. In addition to density-based metrics, we explore an additional uniqueness proxy related to property gradients in DensMAP space. As a validation study, we use DiSCoVeR to screen materials for both performance and uniqueness more » to extrapolate to new chemical spaces. Top-10 rankings are provided for the compound-wise density and property gradient uniqueness proxies. Top-ranked compounds can be further curated via literature searches, physics-based simulations, and/or experimental synthesis. Finally, we compare DiSCoVeR against the naive baseline of random search for several parameter combinations in an adaptive design scheme. To our knowledge, this is the first time automated screening has been performed with explicit emphasis on discovering high-performing, novel materials. « less
Authors:
; ;
Award ID(s):
1950589
Publication Date:
NSF-PAR ID:
10319410
Journal Name:
Digital Discovery
ISSN:
2635-098X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Accurate theoretical predictions of desired properties of materials play an important role in materials research and development. Machine learning (ML) can accelerate the materials design by building a model from input data. For complex datasets, such as those of crystalline compounds, a vital issue is how to construct low-dimensional representations for input crystal structures with chemical insights. In this work, we introduce an algebraic topology-based method, called atom-specific persistent homology (ASPH), as a unique representation of crystal structures. The ASPH can capture both pairwise and many-body interactions and reveal the topology-property relationship of a group of atoms at variousmore »scales. Combined with composition-based attributes, ASPH-based ML model provides a highly accurate prediction of the formation energy calculated by density functional theory (DFT). After training with more than 30,000 different structure types and compositions, our model achieves a mean absolute error of 61 meV/atom in cross-validation, which outperforms previous work such as Voronoi tessellations and Coulomb matrix method using the same ML algorithm and datasets. Our results indicate that the proposed topology-based method provides a powerful computational tool for predicting materials properties compared to previous works.« less
  2. Important data mining problems such as nearest-neighbor search and clustering admit theoretical guarantees when restricted to objects embedded in a metric space. Graphs are ubiquitous, and clustering and classification over graphs arise in diverse areas, including, e.g., image processing and social networks. Unfortunately, popular distance scores used in these applications, that scale over large graphs, are not metrics and thus come with no guarantees. Classic graph distances such as, e.g., the chemical and the CKS distance are arguably natural and intuitive, and are indeed also metrics, but they are intractable: as such, their computation does not scale to large graphs.more »We define a broad family of graph distances, that includes both the chemical and the CKS distance, and prove that these are all metrics. Crucially, we show that our family includes metrics that are tractable. Moreover, we extend these distances by incorporating auxiliary node attributes, which is important in practice, while maintaining both the metric property and tractability.« less
  3. Important data mining problems such as nearestneighbor search and clustering admit theoretical guarantees when restricted to objects embedded in a metric space. Graphs are ubiquitous, and clustering and classification over graphs arise in diverse areas, including, e.g., image processing and social networks. Unfortunately, popular distance scores used in these applications, that scale over large graphs, are not metrics and thus come with no guarantees. Classic graph distances such as, e.g., the chemical and the CKS distance are arguably natural and intuitive, and are indeed also metrics, but they are intractable: as such, their computation does not scale to large graphs.more »We define a broad family of graph distances, that includes both the chemical and the CKS distance, and prove that these are all metrics. Crucially, we show that our family includes metrics that are tractable. Moreover, we extend these distances by incorporating auxiliary node attributes, which is important in practice, while maintaining both the metric property and tractability.« less
  4. Martelli, Pier Luigi (Ed.)
    Abstract Motivation Accurate prediction of residue-residue distances is important for protein structure prediction. We developed several protein distance predictors based on a deep learning distance prediction method and blindly tested them in the 14th Critical Assessment of Protein Structure Prediction (CASP14). The prediction method uses deep residual neural networks with the channel-wise attention mechanism to classify the distance between every two residues into multiple distance intervals. The input features for the deep learning method include co-evolutionary features as well as other sequence-based features derived from multiple sequence alignments (MSAs). Three alignment methods are used with multiple protein sequence/profile databases tomore »generate MSAs for input feature generation. Based on different configurations and training strategies of the deep learning method, five MULTICOM distance predictors were created to participate in the CASP14 experiment. Results Benchmarked on 37 hard CASP14 domains, the best performing MULTICOM predictor is ranked 5th out of 30 automated CASP14 distance prediction servers in terms of precision of top L/5 long-range contact predictions (i.e. classifying distances between two residues into two categories: in contact (< 8 Angstrom) and not in contact otherwise) and performs better than the best CASP13 distance prediction method. The best performing MULTICOM predictor is also ranked 6th among automated server predictors in classifying inter-residue distances into 10 distance intervals defined by CASP14 according to the precision of distance classification. The results show that the quality and depth of MSAs depend on alignment methods and sequence databases and have a significant impact on the accuracy of distance prediction. Using larger training datasets and multiple complementary features improves prediction accuracy. However, the number of effective sequences in MSAs is only a weak indicator of the quality of MSAs and the accuracy of predicted distance maps. In contrast, there is a strong correlation between the accuracy of contact/distance predictions and the average probability of the predicted contacts, which can therefore be more effectively used to estimate the confidence of distance predictions and select predicted distance maps. Availability The software package, source code, and data of DeepDist2 are freely available at https://github.com/multicom-toolbox/deepdist and https://zenodo.org/record/4712084#.YIIM13VKhQM.« less
  5. Abstract

    We build random forests models to predict elastic properties and mechanical hardness of a compound, using only its chemical formula as input. The model training uses over 10,000 target compounds and 60 features based on stoichiometric attributes, elemental properties, orbital occupations, and ionic bonding levels. Using the models, we construct triangular graphs for B-C-N compounds to map out their bulk and shear moduli, as well as hardness values. The graphs indicate that a 1:1 B-N ratio can lead to various superhard compositions. We also validate the machine learning results by evolutionary structure prediction and density functional theory. Our studymore »shows that BC10N, B4C5N3, and B2C3N exhibit dynamically stable phases with hardness values >40 GPa, which are superhard materials that potentially could be synthesized by low-temperature plasma methods.

    « less