Title: A data ecosystem to support machine learning in materials science
Facilitating the application of machine learning (ML) to materials science problems requires enhancing the data ecosystem to enable discovery and collection of data from many sources, automated dissemination of new data across the ecosystem, and the connecting of data with materials-specific ML models. Here, we present two projects, the Materials Data Facility (MDF) and the Data and Learning Hub for Science (DLHub), that address these needs. We use examples to show how MDF and DLHub capabilities can be leveraged to link data with ML models and how users can access those capabilities through web and programmatic interfaces.
Award ID(s):
1636950
PAR ID:
10134745
Journal Name:
MRS Communications
Volume:
9
Issue:
4
ISSN:
2159-6859
Page Range / eLocation ID:
1125 to 1133
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. ChemML is an open machine learning (ML) and informatics program suite that is designed to support and advance the data‐driven research paradigm that is currently emerging in the chemical and materials domain. ChemML allows its users to perform various data science tasks and execute ML workflows that are adapted specifically for the chemical and materials context. Key features are automation, general‐purpose utility, versatility, and user‐friendliness in order to make the application of modern data science a viable and widely accessible proposition in the broader chemistry and materials community. ChemML is also designed to facilitate methodological innovation, and it is one of the cornerstones of the software ecosystem for data‐driven in silico research. This article is categorized under: Software > Simulation Methods; Computer and Information Science > Chemoinformatics; Structure and Mechanism > Computational Materials Science; Software > Molecular Modeling.
  2. The predictive capabilities of computational materials science today derive from overlapping advances in simulation tools, modeling techniques, and best practices. We outline this ecosystem of molecular simulations by explaining how important contributions in each of these areas have fed into each other. The combined output of these tools, techniques, and practices is the ability for researchers to advance understanding by efficiently combining simple models with powerful software. As specific examples, we show how the prediction of organic photovoltaic morphologies has improved by orders of magnitude over the last decade, and how the processing of reacting epoxy thermosets can now be investigated with million-particle models. We discuss these two materials systems and the training of materials simulators through the lens of cognitive load theory. For students, the broad view of ecosystem components should facilitate understanding how the key parts relate to each other first, followed by targeted exploration. In this way, the paper is organized in loose analogy to a coarse-grained model: the main components provide basic framing and accelerated sampling from which deeper research is better contextualized. For mentors, this paper is organized to provide a snapshot in time of the current simulation ecosystem and an on-ramp for simulation experts into the literature on pedagogical practice.
  3. Rathje, E.; Montoya, B.; Wayne, M. (Ed.)
    The rise of data capture and storage capabilities has led to greater data granularity and sharing of data sets in geotechnical earthquake engineering. This broader shift to big data requires ways to process and extract value from it and is aided by the progress in methodologies from the computer science domain and advancements in computer hardware capabilities. General machine learning (ML) models typically receive a set of input parameters and run them through an algorithm to gain outputs with no constraints on the parameters or algorithm process. Three topic areas of ML applications in geotechnical earthquake engineering are reviewed and summarized in this paper: seismic response, liquefaction triggering analysis, and performance-based assessments (lateral displacements and settlement analysis). The current progress of ML is summarized, while the challenges and potential in adopting such approaches are addressed.
  4. Data leakage remains a pervasive issue in machine learning (ML), especially when applied to science, leading to overly optimistic performance estimates and irreproducible findings. Despite its prevalence, data leakage receives limited attention in ML education, in part due to the lack of accessible, hands-on teaching resources. To address this gap, we developed interactive learning modules in which students reproduce examples from academic publications that are affected by data leakage, then repeat the evaluation without the data leakage error to see how the finding is affected. These modules were deployed by the authors in two introductory machine learning courses, enabling students to explore common forms of leakage and their impact on model reliability. Following their engagement with these materials, student feedback highlighted increased awareness of subtle pitfalls that can compromise machine learning workflows. 
  5. Dynamical systems that evolve continuously over time are ubiquitous throughout science and engineering. Machine learning (ML) provides data-driven approaches to model and predict the dynamics of such systems. A core issue with this approach is that ML models are typically trained on discrete data, using ML methodologies that are not aware of underlying continuity properties. This results in models that often do not capture any underlying continuous dynamics—either of the system of interest, or indeed of any related system. To address this challenge, we develop a convergence test based on numerical analysis theory. Our test verifies whether a model has learned a function that accurately approximates an underlying continuous dynamics. Models that fail this test fail to capture relevant dynamics, rendering them of limited utility for many scientific prediction tasks; while models that pass this test enable both better interpolation and better extrapolation in multiple ways. Our results illustrate how principled numerical analysis methods can be coupled with existing ML training/testing methodologies to validate models for science and engineering applications. 