
Title: Applications and Techniques for Fast Machine Learning in Science
In this community review report, we discuss applications and techniques for fast machine learning (ML) in science—the concept of integrating powerful ML methods into the real-time experimental data processing loop to accelerate scientific discovery. The material for the report builds on two workshops held by the Fast ML for Science community and covers three main areas: applications for fast ML across a number of scientific domains; techniques for training and implementing performant and resource-efficient ML algorithms; and computing architectures, platforms, and technologies for deploying these algorithms. We also present overlapping challenges across the multiple scientific domains where common solutions can be found. This community report is intended to give plenty of examples and inspiration for scientific discovery through integrated and accelerated ML solutions. This is followed by a high-level overview and organization of technical advances, including an abundance of pointers to source material, which can enable these breakthroughs.
Award ID(s):
2132700 2003098 1934757 1934700 1836650
Publication Date:
Journal Name:
Frontiers in Big Data
Sponsoring Org:
National Science Foundation
More Like this
  1. The rapid development and application of machine learning (ML) techniques in materials science have led to new tools for machine-enabled and autonomous/high-throughput materials design and discovery. Alongside, efforts to extract data from traditional experiments in the published literature with natural language processing (NLP) algorithms provide opportunities to develop tremendous data troves for these in silico design and discovery endeavors. While NLP is used in all aspects of society, its application in materials science is still in the very early stages. This perspective provides a case study on the application of NLP to extract information related to the preparation of organic materials. We present the case study at a basic level with the aim to discuss these technologies and processes with researchers from diverse scientific backgrounds. We also discuss the challenges faced in the case study and provide an assessment to improve the accuracy of NLP techniques for materials science with the aid of community contributions.
  2. In 2020, the White House released the “Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset,” wherein artificial intelligence experts are asked to collect data and develop text mining techniques that can help the science community answer high-priority scientific questions related to COVID-19. The Allen Institute for AI and collaborators announced the availability of a rapidly growing open dataset of publications, the COVID-19 Open Research Dataset (CORD-19). As the pace of research accelerates, biomedical scientists struggle to stay current. To expedite their investigations, scientists leverage hypothesis generation systems, which can automatically inspect published papers to discover novel implicit connections. We present automated general purpose hypothesis generation systems AGATHA-C and AGATHA-GP for COVID-19 research. The systems are based on graph mining and transformer models. The systems are extensively validated using retrospective information rediscovery and proactive analysis involving human-in-the-loop expert analysis. Both systems achieve high-quality predictions across domains in fast computational time and are released to the broad scientific community to accelerate biomedical research. In addition, by performing a domain-expert-curated study, we show that the systems are able to discover ongoing research findings such as the relationship between COVID-19 and the hormone oxytocin. All code, details, and pre-trained models are available at
  3. Abstract

    Plants, and the biological systems around them, are key to the future health of the planet and its inhabitants. The Plant Science Decadal Vision 2020–2030 frames our ability to perform vital and far‐reaching research in plant systems sciences, essential to how we value participants and apply emerging technologies. We outline a comprehensive vision for addressing some of our most pressing global problems through discovery, practical applications, and education. The Decadal Vision was developed by the participants at the Plant Summit 2019, a community event organized by the Plant Science Research Network. The Decadal Vision describes a holistic vision for the next decade of plant science that blends recommendations for research, people, and technology. Going beyond discoveries and applications, we, the plant science community, must implement bold, innovative changes to research cultures and training paradigms in this era of automation, virtualization, and the looming shadow of climate change. Our vision and hopes for the next decade are encapsulated in the phrase reimagining the potential of plants for a healthy and sustainable future. The Decadal Vision recognizes the vital intersection of human and scientific elements and demands an integrated implementation of strategies for research (Goals 1–4), people (Goals 5 and 6), and technology (Goals 7 and 8). This report is intended to help inspire and guide the research community, scientific societies, federal funding agencies, private philanthropies, corporations, educators, entrepreneurs, and early career researchers over the next 10 years.
The research encompasses experimental and computational approaches to understanding and predicting ecosystem behavior; novel production systems for food, feed, and fiber with greater crop diversity, efficiency, productivity, and resilience that improve ecosystem health; and approaches to realize the potential for advances in nutrition, discovery and engineering of plant‐based medicines, and green infrastructure. Launching the Transparent Plant will use experimental and computational approaches to break down the phytobiome into a parts store that supports tinkering as well as query, prediction, and rapid‐response problem solving. Equity, diversity, and inclusion are indispensable cornerstones of realizing our vision. We make recommendations around funding and systems that support customized professional development. Plant systems are frequently taken for granted; therefore, we make recommendations to improve plant awareness and community science programs to increase understanding of scientific research. We prioritize emerging technologies, focusing on non‐invasive imaging, sensors, and plug‐and‐play portable lab technologies, coupled with enabling computational advances. Plant systems science will benefit from data management and future advances in automation, machine learning, natural language processing, and artificial intelligence‐assisted data integration, pattern identification, and decision making. Implementation of this vision will transform plant systems science and ripple outwards through society and across the globe. Beyond deepening our biological understanding, we envision entirely new applications. We further anticipate a wave of diversification of plant systems practitioners while stimulating community engagement, underpinning increasing entrepreneurship.
This surge of engagement and knowledge will help satisfy and stoke people's natural curiosity about the future, and their desire to prepare for it, as they seek fuller information about food, health, climate and ecological systems.

  4. Tarolli, P. ; Mudd, S. (Ed.)
    High-resolution topography (HRT) is a powerful observational tool for studying the Earth's surface, vegetation, and urban landscapes, with broad scientific, engineering, and education-based applications. Submeter resolution imaging is possible when collected with laser and photogrammetric techniques using ground-, air-, and space-based platforms. Open access to these data and a cyberinfrastructure platform that enables users to discover, manage, share, and process them increases the impact of investments in data collection and catalyzes scientific discovery. Furthermore, open and online access to data enables broad interdisciplinary use of HRT across academia and in communities such as education, public agencies, and the commercial sector. OpenTopography (OT), supported by the US National Science Foundation, aims to democratize access to Earth science-oriented HRT data and processing tools. We utilize cyberinfrastructure, including large-scale data management, high-performance computing, and service-oriented architectures, to provide efficient web-based visualization and access to large HRT datasets. OT colocates data with processing tools to enable users to quickly access custom data and derived products for their application, with the ultimate goal of making these powerful data easier to use. OT's rapidly growing data holdings currently include 283 lidar and photogrammetric point cloud datasets (>1.2 trillion points) covering 236,364 km². As a testament to OT's success, more than 86,000 users have processed over 5 trillion lidar points. This use has resulted in more than 290 peer-reviewed publications across numerous academic domains including Earth science, geography, computer science, and ecology.
  5. Abstract. Benchmark datasets and benchmark problems have been a key aspect for the success of modern machine learning applications in many scientific domains. Consequently, an active discussion about benchmarks for applications of machine learning has also started in the atmospheric sciences. Such benchmarks allow for the comparison of machine learning tools and approaches in a quantitative way and enable a separation of concerns for domain and machine learning scientists. However, a clear definition of benchmark datasets for weather and climate applications is missing, with the result that many domain scientists are confused. In this paper, we equip the domain of atmospheric sciences with a recipe for how to build proper benchmark datasets, present a (nonexclusive) list of domain-specific challenges for machine learning, and elaborate where and what benchmark datasets will be needed to tackle these challenges. We hope that the creation of benchmark datasets will help the machine learning efforts in atmospheric sciences to be more coherent and, at the same time, target the efforts of machine learning scientists and experts of high-performance computing to the most imminent challenges in atmospheric sciences. We focus on benchmarks for atmospheric sciences (weather, climate, and air-quality applications). However, many aspects of this paper will also hold for other aspects of the Earth system sciences or are at least transferable.
    Significance Statement. Machine learning is the study of computer algorithms that learn automatically from data. Atmospheric sciences have started to explore sophisticated machine learning techniques, and the community is making rapid progress on the uptake of new methods for a large number of application areas. This paper provides a clear definition of so-called benchmark datasets for weather and climate applications that help to share data and machine learning solutions between research groups to reduce time spent in data processing, to generate synergies between groups, and to make tool developments more targeted and comparable. Furthermore, a list of benchmark datasets that will be needed to tackle important challenges for the use of machine learning in atmospheric sciences is provided.