Title: Extracting structural motifs from pair distribution function data of nanostructures using explainable machine learning
Abstract

Characterization of material structure with X-ray or neutron scattering using, e.g., Pair Distribution Function (PDF) analysis most often relies on refining a structure model against an experimental dataset. However, identifying a suitable model is often a bottleneck. Recently, automated approaches have made it possible to test thousands of models for each dataset, but these methods are computationally expensive, and analysing the output, i.e. extracting structural information from the resulting fits in a meaningful way, is challenging. Our Machine Learning based Motif Extractor (ML-MotEx) trains an ML algorithm on thousands of fits and uses SHAP (SHapley Additive exPlanation) values to identify which model features are important for the fit quality. We apply the method to four different chemical systems, including disordered nanomaterials and clusters. ML-MotEx opens up a type of modelling in which each feature in a model is assigned an importance value for the fit quality based on explainable ML.
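As a rough illustration of the explainable-ML step described above, the sketch below trains a gradient-boosted regressor on a table of candidate-model features and fit qualities, then ranks the features by mean absolute SHAP value. The file name and column names are hypothetical placeholders, not the paper's actual pipeline.

```python
# Minimal sketch of the explainable-ML step: train a regressor on
# (structure-model features -> fit quality) and rank features by SHAP.
# "candidate_fits.csv" and its columns are hypothetical placeholders.
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingRegressor

fits = pd.read_csv("candidate_fits.csv")   # one row per tested structure model
X = fits.drop(columns=["rwp"])             # structural features (e.g. motif present/absent)
y = fits["rwp"]                            # goodness-of-fit of each model

model = GradientBoostingRegressor().fit(X, y)

# SHAP values attribute each model's predicted fit quality to its features;
# a large mean |SHAP| means the feature strongly influences fit quality.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
importance = pd.Series(abs(shap_values).mean(axis=0), index=X.columns)
print(importance.sort_values(ascending=False))
```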

 
Award ID(s):
1922234
NSF-PAR ID:
10474231
Publisher / Repository:
Nature
Date Published:
Journal Name:
npj Computational Materials
Volume:
8
Issue:
1
ISSN:
2057-3960
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like This
  1. Abstract

    Many scientific problems focus on observed patterns of change or on how to design a system to achieve particular dynamics. Those problems often require fitting differential equation models to target trajectories. Fitting such models can be difficult because each evaluation of the fit must calculate the distance between the model and target patterns at numerous points along a trajectory. The gradient of the fit with respect to the model parameters can be challenging to compute. Recent technical advances in automatic differentiation through numerical differential equation solvers potentially change the fitting process into a relatively easy problem, opening up new possibilities to study dynamics. However, application of the new tools to real data may fail to achieve a good fit. This article illustrates how to overcome a variety of common challenges, using the classic ecological data for oscillations in hare and lynx populations. Models include simple ordinary differential equations (ODEs) and neural ordinary differential equations (NODEs), which use artificial neural networks to estimate the derivatives of differential equation systems. Comparing the fits obtained with ODEs versus NODEs, representing small and large parameter spaces, and changing the number of variable dimensions provide insight into the geometry of the observed and model trajectories. To analyze the quality of the models for predicting future observations, a Bayesian‐inspired preconditioned stochastic gradient Langevin dynamics (pSGLD) calculation of the posterior distribution of predicted model trajectories clarifies the tendency for various models to underfit or overfit the data. Coupling fitted differential equation systems with pSGLD sampling provides a powerful way to study the properties of optimization surfaces, raising an analogy with mutation‐selection dynamics on fitness landscapes.
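    A minimal sketch of the core technique, gradient-based fitting through a differential equation solver via automatic differentiation, is shown below using JAX's odeint on a Lotka-Volterra system. The observation array is a placeholder rather than the actual hare/lynx data, and plain gradient descent stands in for a real optimizer.

    ```python
    # Sketch of fitting an ODE by differentiating through the solver with
    # automatic differentiation (JAX odeint); y_obs is placeholder data.
    import jax.numpy as jnp
    from jax import grad
    from jax.experimental.ode import odeint

    def lotka_volterra(y, t, theta):
        a, b, c, d = theta
        hare, lynx = y
        return jnp.array([a * hare - b * hare * lynx,
                          c * hare * lynx - d * lynx])

    t_obs = jnp.linspace(0.0, 20.0, 41)
    y_obs = jnp.ones((41, 2))            # placeholder for observed trajectories

    def loss(theta):
        # Distance between model and target trajectory at every observed time.
        y_hat = odeint(lotka_volterra, y_obs[0], t_obs, theta)
        return jnp.mean((y_hat - y_obs) ** 2)

    theta = jnp.array([0.5, 0.02, 0.01, 0.4])
    for _ in range(200):                 # plain gradient descent, for illustration
        theta = theta - 1e-3 * grad(loss)(theta)
    ```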

     
  2. SUMMARY

    Seismic tomography is a cornerstone of geophysics and has led to a number of important discoveries about the interior of the Earth. However, seismic tomography remains plagued by the large number of unknown parameters in most tomographic applications. This leads to the inverse problem being underdetermined and requiring significant non-geologically motivated smoothing in order to achieve unique answers. Although this solution is acceptable when using tomography as an explorative tool in discovery mode, it presents a significant problem for the use of tomography in distinguishing between acceptable geological models or in estimating geologically relevant parameters, since typically none of the geological models considered are fit by the tomographic results, even when uncertainties are accounted for. To address this challenge, when seismic tomography is to be used for geological model selection or parameter estimation purposes, we advocate that the tomography be explicitly parametrized in terms of the geological models being tested instead of using more mathematically convenient formulations like voxels, splines or spherical harmonics. Our proposition has a number of technical difficulties associated with it, some of the most important being the move from a linear to a non-linear inverse problem, the need to choose a geological parametrization that fits each specific problem and is commensurate with the expected data quality and structure, and the need to use a supporting framework to identify which model is preferred by the tomographic data. In this contribution, we introduce geological parametrization of tomography with a few simple synthetic examples applied to imaging sedimentary basins and subduction zones, and one real-world example of inferring basin and crustal properties across the continental United States. We explain the challenges in moving towards more realistic examples, and discuss the main technical difficulties and how they may be overcome. Although it may take a number of years for the scientific program suggested here to reach maturity, it is necessary to take steps in this direction if seismic tomography is to develop from a tool for discovering plausible structures to one in which distinct scientific inferences can be made regarding the presence or absence of structures and their physical characteristics.
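    The following toy sketch illustrates the proposed shift in parametrization: the unknowns are two geological quantities (basin depth and basin velocity) rather than voxel values, and the resulting non-linear inverse problem is solved with a standard least-squares routine. The forward model and all numbers are invented purely for illustration.

    ```python
    # Toy geologically parametrized inversion: the unknowns are geological
    # parameters (basin depth, basin velocity), not voxels. All values invented.
    import numpy as np
    from scipy.optimize import least_squares

    V_BASEMENT = 6.0                       # km/s, fixed background crust

    def travel_time(params, station_x):
        depth, v_basin = params
        # Vertical two-way time through a basin that thins away from x = 0
        # (toy forward model over a 10 km crustal column).
        thickness = np.maximum(depth - 0.05 * np.abs(station_x), 0.0)
        return 2.0 * (thickness / v_basin + (10.0 - thickness) / V_BASEMENT)

    stations = np.linspace(-40.0, 40.0, 25)
    observed = travel_time([3.0, 2.5], stations) + 0.01 * np.random.randn(25)

    def residuals(params):
        return travel_time(params, stations) - observed

    fit = least_squares(residuals, x0=[1.0, 3.5], bounds=([0.0, 1.0], [8.0, 5.0]))
    print("depth, v_basin =", fit.x)       # recovers the geological parameters
    ```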

     
  3. Abstract

    Remanufacturing sites often receive products with different brands, models, conditions, and quality levels. Proper sorting and classification of the waste stream is a primary step in efficiently recovering and handling used products. Correct classification is particularly crucial in future electronic waste (e-waste) management sites equipped with Artificial Intelligence (AI) and robotic technologies. Robots should be enabled with proper algorithms to recognize and classify products with different features and prepare them for assembly and disassembly tasks. In this study, two categories of Machine Learning (ML) and Deep Learning (DL) techniques are used to classify consumer electronics. The ML models include Naïve Bayes with Bernoulli, Gaussian, and Multinomial distributions, and Support Vector Machine (SVM) algorithms with four kernels: Linear, Radial Basis Function (RBF), Polynomial, and Sigmoid. The DL models include VGG-16, GoogLeNet, Inception-v3, Inception-v4, and ResNet-50. These models are used to classify three laptop brands: Apple, HP, and ThinkPad. First, Edge Histogram Descriptor (EHD) and Scale-Invariant Feature Transform (SIFT) features are extracted as inputs to the ML models; the DL models classify laptop images directly, without a separate feature-extraction step. The trained models slightly overfit due to the limited dataset and the complexity of the model parameters, but they can nevertheless identify each brand. The findings show that the DL models outperform the ML models, with GoogLeNet achieving the highest performance in identifying the laptop brands.
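    As a hedged sketch of the classical-ML branch of such a comparison, the code below trains an RBF-kernel SVM on stand-in feature vectors; real EHD/SIFT extraction is stubbed out with random placeholders rather than reproduced.

    ```python
    # Sketch of the classical-ML branch: an RBF-kernel SVM on hand-crafted
    # image features. EHD/SIFT extraction is stubbed with random placeholders.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 80))        # stand-in for EHD/SIFT feature vectors
    y = rng.integers(0, 3, size=300)      # 0=Apple, 1=HP, 2=ThinkPad

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_tr, y_tr)
    print("held-out accuracy:", clf.score(X_te, y_te))
    ```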

     
  4. Purpose

    Diffusion-weighted MRI (DWI) is often subject to low signal‐to‐noise ratios (SNRs) and artifacts. Recent work has produced software tools that can correct individual problems, but these tools have not been combined with each other and with quality assurance (QA). A single integrated pipeline is proposed to perform DWI preprocessing with a spectrum of tools and produce an intuitive QA document.

    Methods

    The proposed pipeline, built around the FSL, MRTrix3, and ANTs software packages, performs DWI denoising; inter‐scan intensity normalization; susceptibility‐, eddy current‐, and motion‐induced artifact correction; and slice‐wise signal drop‐out imputation. To perform QA on the raw and preprocessed data and each preprocessing operation, the pipeline documents qualitative visualizations, quantitative plots, gradient verifications, and tensor goodness‐of‐fit and fractional anisotropy analyses.
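    A schematic sketch of chaining two of the named steps through their MRTrix3 command-line tools is given below; the file names are placeholders, and the exact options (e.g. the phase-encoding setup for susceptibility correction) depend on the acquisition and the installed tool versions, so this is not the pipeline's actual configuration.

    ```python
    # Schematic: chain two preprocessing steps by calling MRTrix3 tools.
    # File names are placeholders; options depend on the acquisition.
    import subprocess

    # MP-PCA denoising (MRTrix3 dwidenoise: input image, output image).
    subprocess.run(["dwidenoise", "dwi.mif", "dwi_denoised.mif"], check=True)

    # Eddy current / motion correction via MRTrix3's wrapper around FSL's
    # topup/eddy, shown here for data without a reverse phase-encode scan.
    subprocess.run(["dwifslpreproc", "dwi_denoised.mif", "dwi_preproc.mif",
                    "-rpe_none", "-pe_dir", "AP"], check=True)
    ```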

    Results

    Raw DWI data were preprocessed and quality checked with the proposed pipeline and demonstrated improved SNRs; physiologic intensity ratios; corrected susceptibility‐, eddy current‐, and motion‐induced artifacts; imputed signal‐lost slices; and improved tensor fits. The pipeline identified incorrect gradient configurations and file‐type conversion errors and was shown to be effective on externally available datasets.

    Conclusions

    The proposed pipeline is a single integrated workflow that combines established diffusion preprocessing tools from major MRI‐focused software packages with intuitive QA.

     
  5. Abstract

    Hundreds or thousands of loci are now routinely used in modern phylogenomic studies. Concatenation approaches to tree inference assume that there is a single topology for the entire dataset, but different loci may have different evolutionary histories due to incomplete lineage sorting (ILS), introgression, and/or horizontal gene transfer; even single loci may not be treelike due to recombination. To overcome this shortcoming, we introduce an implementation of a multi-tree mixture model that we call mixtures across sites and trees (MAST). This model extends a prior implementation by Boussau et al. (2009) by allowing users to estimate the weight of each of a set of pre-specified bifurcating trees in a single alignment. The MAST model allows each tree to have its own weight, topology, branch lengths, substitution model, nucleotide or amino acid frequencies, and model of rate heterogeneity across sites. We implemented the MAST model in a maximum-likelihood framework in the popular phylogenetic software, IQ-TREE. Simulations show that we can accurately recover the true model parameters, including branch lengths and tree weights for a given set of tree topologies, under a wide range of biologically realistic scenarios. We also show that we can use standard statistical inference approaches to reject a single-tree model when data are simulated under multiple trees (and vice versa). We applied the MAST model to multiple primate datasets and found that it can recover the signal of ILS in the Great Apes, as well as the asymmetry in minor trees caused by introgression among several macaque species. When applied to a dataset of four Platyrrhine species for which standard concatenated maximum likelihood (ML) and gene tree approaches disagree, we observe that MAST gives the highest weight (i.e., the largest proportion of sites) to the tree also supported by gene tree approaches. These results suggest that the MAST model is able to analyze a concatenated alignment using ML while avoiding some of the biases that come with assuming there is only a single tree. We discuss how the MAST model can be extended in the future.
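    The core mixture idea can be illustrated numerically: given per-site likelihoods under each candidate tree, the likelihood of a site is a weighted sum over trees, and the weights can be estimated by maximum likelihood. The sketch below uses random placeholder likelihoods and a simple EM loop; it illustrates only the weight-estimation concept, not IQ-TREE's implementation.

    ```python
    # Toy mixture-across-trees weight estimation: site likelihood is a
    # weighted sum over K candidate trees; weights fit by EM. Placeholder data.
    import numpy as np

    rng = np.random.default_rng(1)
    site_lik = rng.uniform(0.1, 1.0, size=(1000, 3))   # L[i, k]: site i, tree k
    w = np.full(3, 1.0 / 3.0)                          # initial tree weights

    for _ in range(100):
        # E-step: posterior probability that each site came from tree k.
        resp = w * site_lik
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: new weight = average responsibility across sites.
        w = resp.mean(axis=0)

    print("estimated tree weights:", w)                # analogous to MAST weights
    ```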

     