Data Efficiency of Classification Strategies for Chemical and Materials Design

Gallagher, Quinn; Webb, Michael

doi:10.26434/chemrxiv-2024-1sspf-v2

Active learning and design-build-test-learn strategies are increasingly employed to accelerate materials discovery and characterization. Many data-driven materials design campaigns target solutions within constrained domains such as synthesizability, stability, solubility, recyclability, and toxicity. Lack of knowledge about these constraints can hinder design efficiency by producing samples that fail to meet required thresholds. Acquiring this knowledge during the design campaign is inefficient, and effective classification of common materials constraints transcends specific design objectives. However, there is no consensus on the most data-efficient algorithm for classifying whether a material satisfies a constraint. To address this gap, we comprehensively compare the performance of 100 strategies designed to classify chemical and materials behavior. Performance is assessed across 31 classification tasks sourced from the literature in chemical and materials science. From these results, we recommend best practices for building data-efficient classifiers, showing the neural network- and random forest-based active learning algorithms are most efficient across tasks. We also show that classification task complexity can be quantified based on task metafeatures, most notably the noise-to-signal ratio. Overall, this work provides a comprehensive survey of data-efficient classification strategies, identifies attributes of top-performing strategies, and suggests avenues for further study.

More Like this