

Title: DiSCoVeR: a materials discovery screening tool for high performance, unique chemical compositions
We present Descending from Stochastic Clustering Variance Regression (DiSCoVeR) (https://www.github.com/sparks-baird/mat_discover), a Python tool for identifying and assessing high-performing, chemically unique compositions relative to existing compounds using a combination of a chemical distance metric, density-aware dimensionality reduction, clustering, and a regression model. In this work, we create pairwise distance matrices between compounds via Element Mover's Distance (ElMD) and use these to create 2D density-aware embeddings for chemical compositions via Density-preserving Uniform Manifold Approximation and Projection (DensMAP). Because ElMD assigns distances between compounds that are more chemically intuitive than Euclidean-based distances, the compounds can then be clustered into chemically homogeneous clusters via Hierarchical Density-based Spatial Clustering of Applications with Noise (HDBSCAN*). In combination with performance predictions via Compositionally-Restricted Attention-Based Network (CrabNet), we introduce several new metrics for materials discovery and validate DiSCoVeR on Materials Project bulk moduli using compound-wise and cluster-wise validation methods. We visualize these via multi-objective Pareto front plots and assign a weighted score to each composition that encompasses the trade-off between performance and density-based chemical uniqueness. In addition to density-based metrics, we explore an additional uniqueness proxy related to property gradients in DensMAP space. As a validation study, we use DiSCoVeR to screen materials for both performance and uniqueness to extrapolate to new chemical spaces. Top-10 rankings are provided for the compound-wise density and property gradient uniqueness proxies. Top-ranked compounds can be further curated via literature searches, physics-based simulations, and/or experimental synthesis. 
Finally, we compare DiSCoVeR against the naive baseline of random search for several parameter combinations in an adaptive design scheme. To our knowledge, this is the first time automated screening has been performed with explicit emphasis on discovering high-performing, novel materials.
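The trade-off between predicted performance and density-based chemical uniqueness described in the abstract can be illustrated with a small sketch. This is not the mat_discover implementation: the function names (`knn_mean_distance`, `minmax_scale`, `weighted_score`), the k-nearest-neighbor density proxy, the min-max scaling, and the toy data are all assumptions made for illustration. A precomputed pairwise distance matrix stands in for the ElMD distances, and a plain array stands in for the regression model's predictions.

```python
import numpy as np


def knn_mean_distance(dist_matrix, k=2):
    """Mean distance from each item to its k nearest neighbors (self excluded).

    A larger value means the item sits in a sparser region, which is used
    here as a hypothetical stand-in for a density-based uniqueness proxy.
    """
    d = np.asarray(dist_matrix, dtype=float)
    sorted_d = np.sort(d, axis=1)  # column 0 is the zero self-distance
    return sorted_d[:, 1:k + 1].mean(axis=1)


def minmax_scale(x):
    """Scale values to [0, 1]; return zeros if all values are equal."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return np.zeros_like(x) if span == 0 else (x - x.min()) / span


def weighted_score(pred, uniqueness, pred_weight=1.0, proxy_weight=1.0):
    """Weighted sum of scaled predicted performance and scaled uniqueness."""
    return pred_weight * minmax_scale(pred) + proxy_weight * minmax_scale(uniqueness)


# Toy data: five "compounds" placed on a line; the last one is far from the rest.
positions = np.array([0.0, 0.1, 0.2, 0.3, 5.0])
dist_matrix = np.abs(positions[:, None] - positions[None, :])  # stand-in for ElMD
predicted = np.array([100.0, 120.0, 110.0, 90.0, 115.0])       # stand-in for model output

uniqueness = knn_mean_distance(dist_matrix, k=2)
scores = weighted_score(predicted, uniqueness)
ranking = np.argsort(scores)[::-1]  # indices from highest to lowest score
```

With equal weights, the isolated compound ranks first even though its predicted value is not the highest. Raising `pred_weight` relative to `proxy_weight` shifts the ranking back toward pure performance.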
Award ID(s):
1950589 1651668
NSF-PAR ID:
10319410
Journal Name:
Digital Discovery
ISSN:
2635-098X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Accurate theoretical predictions of desired properties of materials play an important role in materials research and development. Machine learning (ML) can accelerate materials design by building a model from input data. For complex datasets, such as those of crystalline compounds, a vital issue is how to construct low-dimensional representations for input crystal structures with chemical insights. In this work, we introduce an algebraic topology-based method, called atom-specific persistent homology (ASPH), as a unique representation of crystal structures. The ASPH can capture both pairwise and many-body interactions and reveal the topology-property relationship of a group of atoms at various scales. Combined with composition-based attributes, an ASPH-based ML model provides a highly accurate prediction of the formation energy calculated by density functional theory (DFT). After training with more than 30,000 different structure types and compositions, our model achieves a mean absolute error of 61 meV/atom in cross-validation, which outperforms previous work such as Voronoi tessellations and the Coulomb matrix method using the same ML algorithm and datasets. Our results indicate that the proposed topology-based method provides a powerful computational tool for predicting materials properties compared to previous works.
  2. Emerging machine-learned models have enabled efficient and accurate prediction of compound formation energy, with the most prevalent models relying on graph structures for representing crystalline materials. Here, we introduce an alternative approach based on sparse voxel images of crystals. By developing a sophisticated network architecture, we showcase the ability to learn the underlying features of structural and chemical arrangements in inorganic compounds from visual image representations, subsequently correlating these features with the compounds’ formation energy. Our model achieves accurate formation energy prediction by utilizing skip connections in a deep convolutional network and incorporating augmentation of rotated crystal samples during training, performing on par with state-of-the-art methods. By adopting visual images as an alternative representation for crystal compounds and harnessing the capabilities of deep convolutional networks, this study extends the frontier of machine learning for accelerated materials discovery and optimization. In a comprehensive evaluation, we analyse the predicted convex hulls for 3115 binary systems and introduce error metrics beyond formation energy error. This evaluation offers valuable insights into the impact of formation energy error on the performance of the predicted convex hulls.
  3. Important data mining problems such as nearest-neighbor search and clustering admit theoretical guarantees when restricted to objects embedded in a metric space. Graphs are ubiquitous, and clustering and classification over graphs arise in diverse areas, including, e.g., image processing and social networks. Unfortunately, popular distance scores used in these applications, that scale over large graphs, are not metrics and thus come with no guarantees. Classic graph distances such as, e.g., the chemical and the CKS distance are arguably natural and intuitive, and are indeed also metrics, but they are intractable: as such, their computation does not scale to large graphs. We define a broad family of graph distances, that includes both the chemical and the CKS distance, and prove that these are all metrics. Crucially, we show that our family includes metrics that are tractable. Moreover, we extend these distances by incorporating auxiliary node attributes, which is important in practice, while maintaining both the metric property and tractability. 
  4. Important data mining problems such as nearest-neighbor search and clustering admit theoretical guarantees when restricted to objects embedded in a metric space. Graphs are ubiquitous, and clustering and classification over graphs arise in diverse areas, including, e.g., image processing and social networks. Unfortunately, popular distance scores used in these applications, that scale over large graphs, are not metrics and thus come with no guarantees. Classic graph distances such as, e.g., the chemical and the CKS distance are arguably natural and intuitive, and are indeed also metrics, but they are intractable: as such, their computation does not scale to large graphs. We define a broad family of graph distances, that includes both the chemical and the CKS distance, and prove that these are all metrics. Crucially, we show that our family includes metrics that are tractable. Moreover, we extend these distances by incorporating auxiliary node attributes, which is important in practice, while maintaining both the metric property and tractability.
  5. Largely due to superior properties compared to traditional materials, the use of polymer matrix composites (PMCs) has been expanding in several industries such as aerospace, transportation, defense, and marine. However, the anisotropy and nonhomogeneity of these structures contribute to the difficulty in evaluating structural integrity; damage sites can occur at multiple locations and length scales and are hard to track over time. This can lead to unpredictable and expensive failure of a safety-critical structure, thus creating a need for non-destructive evaluation (NDE) techniques that can detect and quantify small-scale damage sites and track their progression. Our research group has improved upon classical microwave techniques to address these needs, using a custom device to move a sample within a resonant cavity and create a spatial map of relative permittivity. We capitalize on the inevitable presence of moisture within the polymer network to detect damage. The differing migration inclinations of absorbed water molecules in a pristine versus a damaged composite alter the respective concentrations of the two chemical states of moisture. The greater concentration of free water molecules residing in the damage sites exhibits a markedly different relative permittivity when compared to the higher ratio of polymer-bound water molecules in the undamaged areas. Currently, the technique has shown the ability to detect impact damage across a range of damage levels and gravimetric moisture contents but is not able to specifically quantify damage extent with regard to impact energy level. The applicability of machine learning (ML) to composite materials is substantial, with uses in areas like manufacturing and design, prediction of structural properties, and damage detection. Using traditional NDE techniques in conjunction with supervised or unsupervised ML has been shown to improve the accuracy, reliability, or efficiency of the existing methods.
In this work, we explore the use of a combined unsupervised/supervised ML approach to determine a damage boundary and quantify damage in single-impact specimens. Dry composite specimens were damaged via a drop tower to induce one central impact site of 0, 2, or 3 joules. After moisture exposure, each specimen underwent dielectric mapping, and spatial permittivity maps were created at a variety of gravimetric moisture contents. An unsupervised K-means clustering algorithm was applied to the dielectric data to segment the levels of damage and define a damage boundary. Subsequently, supervised learning was used to quantify damage using features including but not limited to thickness, moisture content, permittivity values of each cluster, and average distance between points in each cluster. A regression model was trained on several samples with impact energy as the predicted variable. Evaluation was then performed based on prediction accuracy for samples in which the impact energies are not known to the model.
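The segmentation step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the permittivity map is synthetic, the helper `kmeans_1d` is a hypothetical plain-NumPy K-means on scalar values, and treating the higher-permittivity cluster as the damaged region is an assumption. The resulting per-cluster values are the kind of features that could later be passed to a regression model.

```python
import numpy as np


def kmeans_1d(values, k=2, n_iter=50):
    """Plain Lloyd's K-means on scalar values with evenly spaced initial centroids."""
    values = np.asarray(values, dtype=float).ravel()
    centroids = np.linspace(values.min(), values.max(), k)
    for _ in range(n_iter):
        # Assign every value to its nearest centroid.
        labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        # Recompute centroids; keep the old one if a cluster is empty.
        new_centroids = np.array([
            values[labels == j].mean() if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Synthetic 10x10 relative-permittivity map (illustrative values only):
# a uniform background with a 4x4 central region of elevated permittivity.
rng = np.random.default_rng(0)
perm_map = 3.0 + 0.01 * rng.standard_normal((10, 10))
perm_map[3:7, 3:7] += 1.0

labels, centroids = kmeans_1d(perm_map, k=2)
damage_label = int(np.argmax(centroids))  # assumption: damage = higher permittivity
damage_mask = (labels == damage_label).reshape(perm_map.shape)

# Example per-specimen features for a downstream regression model.
features = {
    "damage_mean_permittivity": float(perm_map[damage_mask].mean()),
    "background_mean_permittivity": float(perm_map[~damage_mask].mean()),
    "damage_area_fraction": float(damage_mask.mean()),
}
```

The boundary of `damage_mask` plays the role of the damage boundary. In practice the number of clusters and the features would depend on the measured maps.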

     