Impact of data bias on machine learning for crystal compound synthesizability predictions

Davariashtiyani, Ali; Wang, Busheng; Hajinazar, Samad (ORCID:0000000272555932); Zurek, Eva (ORCID:000000030738867X); Kadkhodaei, Sara (ORCID:0000000342633191)

doi:10.1088/2632-2153/ad9378

Citation Details

Impact of data bias on machine learning for crystal compound synthesizability predictions

Abstract Machine learning models are susceptible to being misled by biases in training data that emphasize incidental correlations over the intended learning task. In this study, we demonstrate the impact of data bias on the performance of a machine learning model designed to predict the likelihood of synthesizability of crystal compounds. The model performs a binary classification on labeled crystal samples. Despite using the same architecture for the machine learning model, we showcase how the model’s learning and prediction behavior differs once trained on distinct data. We use two data sets for illustration: a mixed-source data set that integrates experimental and computational crystal samples and a single-source data set consisting of data exclusively from one computational database. We present simple procedures to detect data bias and to evaluate its effect on the model’s performance and generalization. This study reveals how inconsistent, unbalanced data can propagate bias, undermining real-world applicability even for advanced machine learning techniques. more »

Award ID(s):: 2119308 2119065

PAR ID:: 10556921

Author(s) / Creator(s):: Davariashtiyani, Ali; Wang, Busheng; Hajinazar, Samad; Zurek, Eva; Kadkhodaei, Sara

Publisher / Repository:: IOP Publishing

Date Published:: 2024-11-26

Journal Name:: Machine Learning: Science and Technology

Volume:: 5

Issue:: 4

ISSN:: 2632-2153

Format(s):: Medium: X Size: Article No. 040501

Size(s):: Article No. 040501

Sponsoring Org:: National Science Foundation

Journal Article:
https://doi.org/10.1088/2632-2153/ad9378

More Like this