
Title: FAIR principles for AI models with a practical application for accelerated high energy diffraction microscopy

A concise and measurable set of FAIR (Findable, Accessible, Interoperable and Reusable) principles for scientific data is transforming the state of practice for data management and stewardship, supporting and enabling discovery and innovation. Learning from this initiative, and acknowledging the impact of artificial intelligence (AI) in the practice of science and engineering, we introduce a set of practical, concise, and measurable FAIR principles for AI models. We showcase how to create and share FAIR data and AI models within a unified computational framework combining the following elements: the Advanced Photon Source at Argonne National Laboratory; the Materials Data Facility; the Data and Learning Hub for Science; funcX; and the Argonne Leadership Computing Facility (ALCF), in particular the ThetaGPU supercomputer and the SambaNova DataScale® system at the ALCF AI Testbed. We describe how this domain-agnostic computational framework may be harnessed to enable autonomous AI-driven discovery.
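The FAIR workflow described above hinges on attaching rich, machine-readable metadata and persistent identifiers to each shared model. A minimal, stdlib-only sketch of such a metadata record (the field names and the checksum-as-identifier scheme are illustrative assumptions, not the actual Materials Data Facility schema):

```python
import hashlib
import json

def fair_model_record(model_bytes: bytes, title: str, authors: list[str],
                      license_id: str, access_url: str) -> dict:
    """Build a minimal machine-readable metadata record for an AI model.

    The SHA-256 digest stands in for a persistent identifier here; real
    services such as the Materials Data Facility mint DOIs instead.
    """
    return {
        "title": title,                        # Findable: descriptive metadata
        "authors": authors,
        "identifier": hashlib.sha256(model_bytes).hexdigest(),
        "access_url": access_url,              # Accessible: retrieval location
        "format": "application/octet-stream",  # Interoperable: declared format
        "license": license_id,                 # Reusable: explicit usage terms
    }

# Hypothetical model name and URL, for illustration only
record = fair_model_record(b"model-weights", "ExampleModel", ["A. Author"],
                           "CC-BY-4.0", "https://example.org/model")
print(json.dumps(record, indent=2))
```

The record serializes to JSON so that both humans and machines can index it, which is the core requirement the FAIR principles place on metadata.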

Publisher / Repository: Nature Publishing Group
Journal Name: Scientific Data
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract

    We introduce an end-to-end computational framework that allows for hyperparameter optimization using the DeepHyper library, accelerated model training, and interpretable AI inference. The framework is based on state-of-the-art AI models including CGCNN, PhysNet, SchNet, MPNN, MPNN-transformer, and TorchMD-NET. We employ these AI models along with the benchmark QM9, hMOF, and MD17 datasets to showcase how the models can predict user-specified material properties within modern computing environments. We demonstrate transferable applications in the modeling of small molecules, inorganic crystals, and nanoporous metal-organic frameworks with a unified, standalone framework. We have deployed and tested this framework on the ThetaGPU supercomputer at the Argonne Leadership Computing Facility and on the Delta supercomputer at the National Center for Supercomputing Applications to provide researchers with modern tools to conduct accelerated AI-driven discovery in leadership-class computing environments. We release these digital assets as open source scientific software on GitLab, and as ready-to-use Jupyter notebooks in Google Colab.
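    The hyperparameter-optimization step this abstract describes can be illustrated with a self-contained random-search sketch. Plain Python stands in for the DeepHyper library here, and the search space and objective are invented for illustration (DeepHyper would express the space with its own problem definition and use smarter search strategies):

```python
import random

def objective(config: dict) -> float:
    """Toy stand-in for a validation score returned by model training.
    Peaks at lr=0.01, batch_size=64 (an invented optimum)."""
    return -abs(config["lr"] - 0.01) - 0.001 * abs(config["batch_size"] - 64)

# Invented search space over two common hyperparameters
space = {
    "lr": [0.1, 0.01, 0.001],
    "batch_size": [32, 64, 128],
}

def random_search(n_trials: int, seed: int = 0) -> dict:
    """Sample configurations at random and keep the best-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = {k: rng.choice(v) for k, v in space.items()}
        score = objective(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config

best = random_search(50)
print(best)
```

    In a real run, `objective` would train one of the listed models (e.g. SchNet on QM9) and return a validation metric; the surrounding loop is what a hyperparameter-search library parallelizes across a supercomputer's nodes.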

  2. Abstract

    The findable, accessible, interoperable, and reusable (FAIR) data principles provide a framework for examining, evaluating, and improving how data is shared to facilitate scientific discovery. Generalizing these principles to research software and other digital products is an active area of research. Machine learning (ML) models (algorithms that have been trained on data without being explicitly programmed) and, more generally, artificial intelligence (AI) models are an important target for this effort because of the ever-increasing pace at which AI is transforming scientific domains such as experimental high energy physics (HEP). In this paper, we propose a practical definition of FAIR principles for AI models in HEP and describe a template for the application of these principles. We demonstrate the template's use with an example AI model applied to HEP, in which a graph neural network is used to identify Higgs bosons decaying to two bottom quarks. We report on the robustness of this FAIR AI model, its portability across hardware architectures and software frameworks, and its interpretability.
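    The jet-tagging model in this abstract is a graph neural network. As a rough, framework-free illustration (a toy graph with invented features, not the paper's architecture), one round of sum-aggregation message passing updates each node with its neighbors' features:

```python
def message_passing_step(features, edges):
    """One sum-aggregation message-passing round on node feature vectors.
    features: {node: [float, ...]}, edges: list of directed (src, dst) pairs."""
    messages = {n: [0.0] * len(v) for n, v in features.items()}
    for src, dst in edges:
        for i, x in enumerate(features[src]):
            messages[dst][i] += x
    # Update: new feature = old feature + aggregated incoming messages
    return {n: [a + b for a, b in zip(features[n], messages[n])]
            for n in features}

# Toy 3-node graph (e.g. particles in a jet, with invented 2-d features)
feats = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
edges = [(0, 1), (1, 2), (2, 0)]
print(message_passing_step(feats, edges))
```

    Real GNN taggers learn the message and update functions as neural networks and stack several such rounds before a graph-level readout produces the classification score.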

  3. Abstract

    Introduced in 2016, the FAIR Guiding Principles endeavour to significantly improve today's data-driven research. The Principles present a concise set of fundamental concepts that can facilitate the findability, accessibility, interoperability and reuse (FAIR) of digital research objects by both machines and human beings. The emergence of FAIR has initiated a flurry of activity within the broader data publication community, yet the principles are still not fully understood by many community stakeholders. This has led to challenges such as misinterpretation and co-opted use, along with persistent gaps in current data publication culture, practices and infrastructure that need to be addressed to achieve a FAIR data end-state. This paper presents an overview of the practices and perspectives related to the FAIR Principles within the Geosciences and offers discussion on the value of the principles in the larger context of what they are trying to achieve. The authors of this article recommend using the principles as a tool to bring awareness to the types of actions that can improve the practice of data publication to meet the needs of all data consumers. The FAIR Guiding Principles should be interpreted as an aspirational guide to focus behaviours that lead towards a more FAIR data environment. The intentional discussions and incremental changes that bring us closer to these aspirations provide the best value to our community as we build the capacity that will support and facilitate new discoveries about Earth systems.

  4. Abstract

    For over three decades, the materials tetrahedron has captured the essence of materials science and engineering with its interdependent elements of processing, structure, properties, and performance. As modern computational and statistical techniques usher in a new paradigm of data-intensive scientific research and discovery, the rate at which the field of materials science and engineering capitalizes on these advances hinges on collaboration between numerous stakeholders. Here, we provide a contemporary extension to the classic materials tetrahedron with a dual framework—adapted from the concept of a “digital twin”—which offers a nexus joining materials science and information science. We believe this high-level framework, the materials–information twin tetrahedra (MITT), will provide stakeholders with a platform to contextualize, translate, and direct efforts in the pursuit of propelling materials science and technology forward.

    Impact statement

    This article provides a contemporary reimagination of the classic materials tetrahedron by augmenting it with parallel notions from information science. Since the materials tetrahedron (processing, structure, properties, performance) made its debut, advances in computational and informational tools have transformed the landscape and outlook of materials research and development. Drawing inspiration from the notion of a digital twin, the materials–information twin tetrahedra (MITT) framework captures a holistic perspective of materials science and engineering in the presence of modern digital tools and infrastructures. This high-level framework incorporates sustainability and FAIR data principles (Findable, Accessible, Interoperable, Reusable)—factors that recognize how systems impact and interact with other systems—in addition to the data and information flows that play a pivotal role in knowledge generation. The goal of the MITT framework is to give stakeholders from academia, industry, and government a communication tool for focusing efforts around the design, development, and deployment of materials in the years ahead.

  5. Abstract

    Large-scale, reproducible manufacturing of therapeutic cells with consistently high quality is vital for translation to clinically effective and widely accessible cell therapies. However, the biological and logistical complexity of manufacturing a living product, including challenges associated with their inherent variability and uncertainties of process parameters, currently makes it difficult to achieve predictable cell-product quality. Using a degradable microscaffold-based T-cell process, we developed an artificial intelligence (AI)-driven experimental-computational platform to identify a set of critical process parameters and critical quality attributes from heterogeneous, high-dimensional, time-dependent multiomics data, measurable during early stages of manufacturing and predictive of end-of-manufacturing product quality. Sequential, design-of-experiment-based studies, coupled with an agnostic machine-learning framework, were used to extract feature combinations from early in-culture media assessment that were highly predictive of the end-product CD4/CD8 ratio and total live CD4+ and CD8+ naïve and central memory T cells (CD62L+CCR7+). Our results demonstrate a broadly applicable platform tool to predict end-product quality and composition from early time-point in-process measurements during therapeutic cell manufacturing.
