Abstract: In current in situ X-ray diffraction (XRD) techniques, data generation surpasses human analytical capabilities, potentially leading to the loss of insights. Automated techniques still require human intervention and lack the performance and adaptability needed for materials exploration. Given the critical need for high-throughput automated XRD pattern analysis, we present a generalized deep learning model to classify a diverse set of materials’ crystal systems and space groups. In our approach, we generate training data with a holistic representation of patterns that emerge from varying experimental conditions and crystal properties. We also employ an expedited learning technique to refine our model’s expertise to experimental conditions. In addition, we optimize the model architecture to elicit classification based on Bragg’s Law and use evaluation data to interpret our model’s decision-making. We evaluate our models using experimental data, materials unseen in training, and altered cubic crystals, where we observe state-of-the-art performance and even greater advances in space group classification.
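For context, the physical relationship that the architectural choice above is built around is Bragg's Law, which ties diffraction peak positions to lattice spacing; the standard form is shown below (this equation is general background, not reproduced from the paper):

```latex
% Bragg's Law: constructive interference occurs when the path difference
% between X-rays scattered by adjacent lattice planes is a whole number
% of wavelengths. Peak positions therefore encode the interplanar
% spacings d that distinguish crystal systems and space groups.
n\lambda = 2d\sin\theta
% n: diffraction order, \lambda: X-ray wavelength,
% d: interplanar spacing, \theta: Bragg angle.
```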
Impact of data bias on machine learning for crystal compound synthesizability predictions
Abstract: Machine learning models are susceptible to being misled by biases in training data that emphasize incidental correlations over the intended learning task. In this study, we demonstrate the impact of data bias on the performance of a machine learning model designed to predict the likelihood of synthesizability of crystal compounds. The model performs a binary classification on labeled crystal samples. Despite using the same architecture for the machine learning model, we showcase how the model’s learning and prediction behavior differs once trained on distinct data. We use two data sets for illustration: a mixed-source data set that integrates experimental and computational crystal samples and a single-source data set consisting of data exclusively from one computational database. We present simple procedures to detect data bias and to evaluate its effect on the model’s performance and generalization. This study reveals how inconsistent, unbalanced data can propagate bias, undermining real-world applicability even for advanced machine learning techniques.
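One simple way to probe for the kind of data bias the abstract describes is to train identical models on each data set and compare in-source accuracy with cross-source accuracy. The sketch below is an illustration under assumed inputs (`X_mixed`/`y_mixed` and `X_single`/`y_single` are hypothetical feature and label arrays; the paper's actual model and data sets are not reproduced here):

```python
# Bias-detection sketch: fit the same architecture on two sources, then
# score each model on the other source. A large in-source vs. cross-source
# accuracy gap suggests the model latched onto source-specific correlations
# rather than a genuine synthesizability signal.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def cross_source_gap(X_a, y_a, X_b, y_b, seed=0):
    """Fit one model per source, then score each model on the other source."""
    Xa_tr, Xa_te, ya_tr, ya_te = train_test_split(X_a, y_a, random_state=seed)
    Xb_tr, Xb_te, yb_tr, yb_te = train_test_split(X_b, y_b, random_state=seed)

    model_a = RandomForestClassifier(random_state=seed).fit(Xa_tr, ya_tr)
    model_b = RandomForestClassifier(random_state=seed).fit(Xb_tr, yb_tr)

    return {
        "A": {"in_source": model_a.score(Xa_te, ya_te),
              "cross_source": model_a.score(Xb_te, yb_te)},
        "B": {"in_source": model_b.score(Xb_te, yb_te),
              "cross_source": model_b.score(Xa_te, ya_te)},
    }

# Usage (hypothetical arrays): cross_source_gap(X_mixed, y_mixed,
# X_single, y_single) returns per-source in/cross accuracies to compare.
```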
- PAR ID: 10558576
- Publisher / Repository: IOP Publishing Ltd
- Date Published:
- Journal Name: Machine Learning: Science and Technology
- Volume: 5
- Issue: 4
- ISSN: 2632-2153
- Page Range / eLocation ID: 040501
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Thenkabail, Prasad S. (Ed.) Physically based hydrologic models require significant effort and extensive information for development, calibration, and validation. This study explored the use of random forest regression (RFR), a supervised machine learning (ML) model, as an alternative to the physically based Soil and Water Assessment Tool (SWAT) for predicting streamflow in the Rio Grande Headwaters near Del Norte, a snowmelt-dominated mountainous watershed of the Upper Rio Grande Basin. Remotely sensed data were used for the random forest machine learning (RFML) analysis, with RStudio used for data processing and synthesis. The RFML model outperformed the SWAT model in accuracy and demonstrated its capability to predict streamflow in this region. We implemented a customized approach to the RFR model to assess its performance over three training periods (1991–2010, 1996–2010, and 2001–2010); the results indicated that accuracy improved with longer training periods, implying that a model trained on a more extended period better captures the predictors’ variability and reproduces streamflow more accurately. The variable importance measure of the RFML model (IncNodePurity) revealed that snow depth and minimum temperature were consistently the top two predictors across all training periods. The paper also evaluated how well the SWAT model reproduces streamflow data for the watershed with a conventional approach. The SWAT model required more time and data to set up and calibrate, delivering acceptable performance in annual mean streamflow simulation, with satisfactory index of agreement (d), coefficient of determination (R²), and percent bias (PBIAS) values, but monthly simulation warrants further exploration and model adjustment. The study recommends exploring snowmelt runoff hydrologic processes, dust-driven sublimation effects, and more detailed topographic input parameters to update the SWAT snowmelt routine for better monthly flow estimation. The results provide a critical analysis for enhancing streamflow prediction, which is valuable for further research and water resource management, including in snowmelt-driven semi-arid regions.
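As a rough illustration of this RFML workflow (the study itself used R/RStudio, where IncNodePurity is the impurity-based importance reported by R's randomForest package), a Python analogue with hypothetical column names might look like the following sketch:

```python
# Python analogue of the RFML workflow: fit a random forest on one training
# window and rank predictors by impurity-based importance (the counterpart
# of IncNodePurity). Column names such as "snow_depth" and "t_min" are
# illustrative, not the study's actual variable names.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def train_streamflow_rf(df, train_years):
    """Fit a random forest on the rows within train_years and rank predictors."""
    train = df[df["year"].between(*train_years)]
    X = train.drop(columns=["year", "streamflow"])
    y = train["streamflow"]

    rf = RandomForestRegressor(n_estimators=500, random_state=42).fit(X, y)
    importance = pd.Series(rf.feature_importances_, index=X.columns)
    return rf, importance.sort_values(ascending=False)

# Comparing windows such as (1991, 2010) vs. (2001, 2010) amounts to calling
# train_streamflow_rf once per window and inspecting the importance rankings.
```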
Fairness in Artificial Intelligence (AI) aims to identify and mitigate bias throughout the AI development process, spanning data collection, modeling, assessment, and deployment; this is a critical facet of establishing trustworthy AI systems. Tackling data bias through techniques such as reweighting samples proves effective for promoting fairness. This paper undertakes a systematic exploration of reweighting samples for conventional machine learning (ML) models, utilizing five models for binary classification on datasets such as Adult Income and COMPAS, incorporating various protected attributes. In particular, AI Fairness 360 (AIF360) from IBM, a versatile open-source library for identifying and mitigating bias in machine learning models throughout the AI application lifecycle, is employed as the foundation for this systematic exploration. The evaluation of prediction outcomes employs five fairness metrics from AIF360, elucidating the nuanced and model-specific efficacy of reweighting samples in fostering fairness within traditional ML frameworks. Experimental results illustrate that reweighting samples effectively reduces bias in traditional ML methods for classification tasks. For instance, after reweighting samples, the balanced accuracy of a Decision Tree (DT) improves to 100%, and its bias, as measured by fairness metrics such as Average Odds Difference (AOD), Equal Opportunity Difference (EOD), and Theil Index (TI), is mitigated to 0. However, reweighting samples does not effectively enhance the fairness performance of K Nearest Neighbors (KNN). This sheds light on the intricate dynamics of bias, underscoring the complexity involved in achieving fairness across different models and scenarios.
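A minimal sketch of the reweighting step, using AIF360's standard `Reweighing` preprocessor on the Adult Income dataset with `sex` as the protected attribute (illustrative settings, not necessarily the paper's exact configuration):

```python
# Dataset-level bias before and after AIF360's Reweighing preprocessor.
# Note: AdultDataset() expects the raw adult.* files to be available locally,
# per the aif360 setup instructions.
from aif360.datasets import AdultDataset
from aif360.algorithms.preprocessing import Reweighing
from aif360.metrics import BinaryLabelDatasetMetric

privileged = [{"sex": 1}]
unprivileged = [{"sex": 0}]

data = AdultDataset()

# Difference in favorable-outcome rates between groups before reweighting.
before = BinaryLabelDatasetMetric(
    data, unprivileged_groups=unprivileged, privileged_groups=privileged
)
print("mean difference before:", before.mean_difference())

# Reweighing assigns per-sample weights that balance group/label frequencies;
# downstream models (DT, KNN, ...) consume them via sample_weight where supported.
rw = Reweighing(unprivileged_groups=unprivileged, privileged_groups=privileged)
data_rw = rw.fit_transform(data)

after = BinaryLabelDatasetMetric(
    data_rw, unprivileged_groups=unprivileged, privileged_groups=privileged
)
print("mean difference after:", after.mean_difference())
```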
Active learning has long been a topic of study in machine learning. However, as increasingly complex and opaque models have become standard practice, the process of active learning, too, has become more opaque. There has been little investigation into interpreting which specific trends and patterns an active learning strategy may be exploring. This work expands on the Local Interpretable Model-agnostic Explanations (LIME) framework to provide explanations for active learning recommendations. We demonstrate how LIME can be used to generate locally faithful explanations for an active learning strategy, and how these explanations can be used to understand how different models and datasets explore a problem space over time. These explanations can also be used to generate batches based on common sources of uncertainty; these regions of common uncertainty can be useful for understanding a model’s current weaknesses. To quantify the per-subgroup differences in how an active learning strategy queries spatial regions, we introduce a notion of uncertainty bias (based on disparate impact) to measure the discrepancy in the confidence of a model’s predictions between one subgroup and another. Using the uncertainty bias measure, we show that our query explanations accurately reflect the subgroup focus of the active learning queries, allowing for an interpretable explanation of what is being learned as points with similar sources of uncertainty have their uncertainty bias resolved. We demonstrate that this technique can be applied to track uncertainty bias over user-defined clusters or automatically generated clusters based on the source of uncertainty. We also measure how the choice of initial labeled examples affects groups over time.
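A hedged sketch of how LIME can explain an uncertainty-sampling query (the model, pool, and feature names are placeholders, not the paper's implementation):

```python
# Pick the pool point a classifier is least confident about, then produce a
# locally faithful LIME explanation of the prediction at that point. Grouping
# queries with similar explanations yields batches with a common source of
# uncertainty, in the spirit of the approach described above.
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

def explain_next_query(model, X_train, X_pool, feature_names, class_names):
    """Return the index of the most uncertain pool point and its explanation."""
    proba = model.predict_proba(X_pool)
    margin = np.abs(proba[:, 1] - proba[:, 0])   # small margin = high uncertainty
    query_idx = int(np.argmin(margin))

    explainer = LimeTabularExplainer(
        X_train,
        feature_names=feature_names,
        class_names=class_names,
        discretize_continuous=True,
    )
    exp = explainer.explain_instance(
        X_pool[query_idx], model.predict_proba, num_features=5
    )
    # exp.as_list() gives (feature condition, weight) pairs indicating which
    # local trends drive the model's uncertainty at the queried point.
    return query_idx, exp.as_list()
```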
Purpose: The purpose of this study is to develop a deep learning framework for additive manufacturing (AM) that can detect different defect types without being trained on specific defect data sets and can be applied for real-time process control. Design/methodology/approach: This study develops an explainable artificial intelligence (AI) framework, a zero-bias deep neural network (DNN) model, for real-time defect detection during the AM process. In this method, the last dense layer of the DNN is replaced by two consecutive parts: a regular dense layer, denoted (L1), for dimensional reduction, and a similarity-matching layer (L2) for equal-weight, non-biased cosine similarity matching. Grayscale images of 3D-printed samples acquired during printing were used as the input to the zero-bias DNN. Findings: This study demonstrates that the approach is capable of successfully detecting multiple types of defects, such as cracks, stringing, and warping, without any prior training on defective data sets, achieving an accuracy of 99.5%. Practical implications: Once the model is set up, the computational time for anomaly detection is lower than the speed of image acquisition, indicating the potential for real-time process control. It can also be used to minimize manual processing in AI-enabled AM. Originality/value: To the best of the authors’ knowledge, this is the first study to use a zero-bias DNN, an explainable AI approach, for defect detection in AM.
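A speculative reconstruction of the head described above, with a dense reduction layer (L1) followed by a bias-free cosine-similarity matching layer (L2); the layer sizes and choice of PyTorch are assumptions, not the paper's implementation:

```python
# Zero-bias head sketch: L1 reduces dimensionality; L2 compares the normalized
# embedding against one learnable per-class "fingerprint" with no bias term,
# so every class is matched with equal weight via cosine similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZeroBiasHead(nn.Module):
    def __init__(self, in_features, reduced, num_classes):
        super().__init__()
        self.l1 = nn.Linear(in_features, reduced)            # L1: dimensional reduction
        self.class_vectors = nn.Parameter(torch.randn(num_classes, reduced))  # L2

    def forward(self, x):
        z = F.normalize(self.l1(x), dim=1)                   # unit-length embedding
        w = F.normalize(self.class_vectors, dim=1)           # unit-length fingerprints
        return z @ w.t()                                     # cosine similarities
```

Because the outputs are bounded cosine similarities against per-class fingerprints, an input unlike anything seen in training scores low against every class, which is plausibly what lets previously unseen defect types be flagged without defect-specific training data.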