Abstract Machine learning models are susceptible to being misled by biases in training data that emphasize incidental correlations over the intended learning task. In this study, we demonstrate the impact of data bias on the performance of a machine learning model designed to predict the likelihood of synthesizability of crystal compounds. The model performs a binary classification on labeled crystal samples. Despite using the same architecture for the machine learning model, we showcase how the model’s learning and prediction behavior differs once trained on distinct data. We use two data sets for illustration: a mixed-source data set that integrates experimental and computational crystal samples and a single-source data set consisting of data exclusively from one computational database. We present simple procedures to detect data bias and to evaluate its effect on the model’s performance and generalization. This study reveals how inconsistent, unbalanced data can propagate bias, undermining real-world applicability even for advanced machine learning techniques.
Automated classification of big X-ray diffraction data using deep learning models
Abstract In current in situ X-ray diffraction (XRD) techniques, data generation surpasses human analytical capabilities, potentially leading to the loss of insights. Automated techniques require human intervention, and lack the performance and adaptability required for material exploration. Given the critical need for high-throughput automated XRD pattern analysis, we present a generalized deep learning model to classify a diverse set of materials’ crystal systems and space groups. In our approach, we generate training data with a holistic representation of patterns that emerge from varying experimental conditions and crystal properties. We also employ an expedited learning technique to refine our model’s expertise to experimental conditions. In addition, we optimize model architecture to elicit classification based on Bragg’s Law and use evaluation data to interpret our model’s decision-making. We evaluate our models using experimental data, materials unseen in training, and altered cubic crystals, where we observe state-of-the-art performance and even greater advances in space group classification.
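For readers unfamiliar with the relation the abstract refers to, Bragg's law, n·λ = 2·d·sin(θ), ties the position of each diffraction peak to an interplanar spacing d. The sketch below is illustrative only and is not the authors' model or data pipeline. The function names are hypothetical, the Cu Kα1 wavelength (1.5406 Å) is an assumed radiation source that the abstract does not specify, and the cubic d-spacing formula d = a / sqrt(h² + k² + l²) is included only because the abstract mentions cubic crystals.

```python
import math

# Assumption: Cu K-alpha1 radiation; the abstract does not name a source.
CU_K_ALPHA1_ANGSTROM = 1.5406


def cubic_d_spacing(a, h, k, l):
    """Interplanar spacing (angstrom) of plane (hkl) in a cubic cell with edge a."""
    return a / math.sqrt(h * h + k * k + l * l)


def bragg_two_theta(d_spacing, wavelength=CU_K_ALPHA1_ANGSTROM, order=1):
    """Peak position 2-theta in degrees from Bragg's law, n*lambda = 2*d*sin(theta).

    Returns None when n*lambda / (2*d) exceeds 1, i.e. the reflection
    cannot be observed at this wavelength.
    """
    sin_theta = order * wavelength / (2.0 * d_spacing)
    if sin_theta > 1.0:
        return None
    return 2.0 * math.degrees(math.asin(sin_theta))


if __name__ == "__main__":
    # Hypothetical cubic cell with a = 4.0 angstrom, (200) reflection.
    d = cubic_d_spacing(4.0, 2, 0, 0)
    print(d, bragg_two_theta(d))
```

Because each peak position maps to a d-spacing in this way, the set of peak positions in a pattern carries the lattice information that a crystal system or space group classifier can draw on.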
- Award ID(s):
- 2202124
- PAR ID:
- 10477803
- Publisher / Repository:
- Nature Publishing Group
- Date Published:
- Journal Name:
- npj Computational Materials
- Volume:
- 9
- Issue:
- 1
- ISSN:
- 2057-3960
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Modern machine learning algorithms typically require large amounts of labeled training data to fit a reliable model. To minimize the cost of data collection, researchers often employ techniques such as crowdsourcing and web scraping. However, web data and human annotations are known to exhibit high margins of error, resulting in sizable amounts of incorrect labels. Poorly labeled training data can cause models to overfit to the noise distribution, crippling performance in real-world applications. In this work, we investigate the viability of using data augmentation in conjunction with semi-supervised learning to improve the label noise robustness of image classification models. We conduct several experiments using noisy variants of the CIFAR-10 image classification dataset to benchmark our method against existing algorithms. Experimental results show that our augmentative SSL approach improves upon the state-of-the-art.
-
A bottleneck in high-throughput nanomaterials discovery is the pace at which new materials can be structurally characterized. Although current machine learning (ML) methods show promise for the automated processing of electron diffraction patterns (DPs), they fail in high-throughput experiments where DPs are collected from crystals with random orientations. Inspired by the human decision-making process, a framework for automated crystal system classification from DPs with arbitrary orientations was developed. A convolutional neural network was trained using evidential deep learning, and the predictive uncertainties were quantified and leveraged to fuse multiview predictions. Using vector map representations of DPs, the framework achieves a testing accuracy of 0.94 in the examples considered, is robust to noise, and retains remarkable accuracy using experimental data. This work highlights the ability of ML to be used to accelerate experimental high-throughput materials data analytics.
-
Material characterization techniques are widely used to characterize the physical and chemical properties of materials at the nanoscale and, thus, play central roles in materials science discoveries. However, the large and complex datasets generated by these techniques often require significant human effort to interpret and extract meaningful physicochemical insights. Artificial intelligence (AI) techniques such as machine learning (ML) have the potential to improve the efficiency and accuracy of surface analysis by automating data analysis and interpretation. In this perspective paper, we review the current role of AI in surface analysis and discuss its future potential to accelerate discoveries in surface science, materials science, and interface science. We highlight several applications where AI has already been used to analyze surface analysis data, including the identification of crystal structures from XRD data, analysis of XPS spectra for surface composition, and the interpretation of TEM and SEM images for particle morphology and size. We also discuss the challenges and opportunities associated with the integration of AI into surface analysis workflows. These include the need for large and diverse datasets for training ML models, the importance of feature selection and representation, and the potential for ML to enable new insights and discoveries by identifying patterns and relationships in complex datasets. Most importantly, AI-analyzed data must not just find the best mathematical description of the data, but must also find the most physically and chemically meaningful results. In addition, the need for reproducibility in scientific research has become increasingly important in recent years. The advancement of AI, including both conventional methods and increasingly popular deep learning, is showing promise in addressing those challenges by enabling the execution and verification of scientific progress. By training models on large experimental datasets and providing automated analysis and data interpretation, AI can help to ensure that scientific results are reproducible and reliable. Although integration of knowledge and AI models must be considered for the transparency and interpretability of models, the incorporation of AI into the data collection and processing workflow will significantly enhance the efficiency and accuracy of various surface analysis techniques and deepen our understanding at an accelerated pace.
