Title: Challenges and Benchmark Datasets for Machine Learning in the Atmospheric Sciences: Definition, Status, and Outlook
Abstract:
Benchmark datasets and benchmark problems have been a key aspect for the success of modern machine learning applications in many scientific domains. Consequently, an active discussion about benchmarks for applications of machine learning has also started in the atmospheric sciences. Such benchmarks allow for the comparison of machine learning tools and approaches in a quantitative way and enable a separation of concerns for domain and machine learning scientists. However, a clear definition of benchmark datasets for weather and climate applications is missing with the result that many domain scientists are confused. In this paper, we equip the domain of atmospheric sciences with a recipe for how to build proper benchmark datasets, a (nonexclusive) list of domain-specific challenges for machine learning is presented, and it is elaborated where and what benchmark datasets will be needed to tackle these challenges. We hope that the creation of benchmark datasets will help the machine learning efforts in atmospheric sciences to be more coherent, and, at the same time, target the efforts of machine learning scientists and experts of high-performance computing to the most imminent challenges in atmospheric sciences. We focus on benchmarks for atmospheric sciences (weather, climate, and air-quality applications). However, many aspects of this paper will also hold for other aspects of the Earth system sciences or are at least transferable.
Significance Statement:
Machine learning is the study of computer algorithms that learn automatically from data. Atmospheric sciences have started to explore sophisticated machine learning techniques and the community is making rapid progress on the uptake of new methods for a large number of application areas. This paper provides a clear definition of so-called benchmark datasets for weather and climate applications that help to share data and machine learning solutions between research groups to reduce time spent in data processing, to generate synergies between groups, and to make tool developments more targeted and comparable. Furthermore, a list of benchmark datasets that will be needed to tackle important challenges for the use of machine learning in atmospheric sciences is provided.
Award ID(s):
2019758
NSF-PAR ID:
10422671
Journal Name:
Artificial Intelligence for the Earth Systems
Volume:
1
Issue:
3
ISSN:
2769-7525
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Why the new findings matter

    The process of teaching and learning is complex, multifaceted and dynamic. This paper contributes a seminal resource to highlight the digitisation of the educational sciences by demonstrating how new machine learning methods can be effectively and reliably used in research, education and practical application.

    Implications for educational researchers and policy makers

    The progressing digitisation of societies around the globe and the impact of the SARS‐COV‐2 pandemic have highlighted the vulnerabilities and shortcomings of educational systems. These developments have shown the necessity to provide effective educational processes that can support sometimes overwhelmed teachers to digitally impart knowledge on the plan of many governments and policy makers. Educational scientists, corporate partners and stakeholders can make use of machine learning techniques to develop advanced, scalable educational processes that account for individual needs of learners and that can complement and support existing learning infrastructure. The proper use of machine learning methods can contribute essential applications to the educational sciences, such as (semi‐)automated assessments, algorithmic‐grading, personalised feedback and adaptive learning approaches. However, these promises are strongly tied to an at least basic understanding of the concepts of machine learning and a degree of data literacy, which has to become the standard in education and the educational sciences.

    Demonstrating both the promises and the challenges that are inherent to the collection and the analysis of large educational data with machine learning, this paper covers the essential topics that their application requires and provides easy‐to‐follow resources and code to facilitate the process of adoption.

     
  2. Our ability to project changes to the climate via anthropogenic forcing has steadily increased over the last five decades. Yet, biologists still lack accurate projections about climate change impacts. Despite recent advances, biologists still often rely on correlative approaches to make projections, ignore important mechanisms, develop models with limited coordination, and lack much of the data to inform projections and test them. In contrast, atmospheric scientists have incorporated mechanistic data, established a global network of weather stations, and apply multi‐model inference by comparing divergent model projections. I address the following questions: How have the two fields developed through time? To what degree does biological projection differ from climate projection? What is needed to make similar progress in biological projection? Although the challenges in biodiversity projections are great, I highlight how biology can make substantial progress in the coming years. Most obstacles are surmountable and relate to history, lag times, scientific culture, international organization, and finances. Just as climate change projections have improved, biological modeling can improve in accuracy by incorporating mechanistic understanding, employing multi‐model ensemble approaches, coordinating efforts worldwide, and validating projections against records from a well‐designed network of biotic stations. Now that climate scientists can make better projections of climate change, biologists need to project and prevent its impacts on biodiversity.

    This article is categorized under:

    Climate, Ecology, and Conservation > Modeling Species and Community Interactions

     
  3. Abstract Machine learning and artificial intelligence (ML/AI) methods have been used successfully in recent years to solve problems in many areas, including image recognition, unsupervised and supervised classification, game-playing, system identification and prediction, and autonomous vehicle control. Data-driven machine learning methods have also been applied to fusion energy research for over 2 decades, including significant advances in the areas of disruption prediction, surrogate model generation, and experimental planning. The advent of powerful and dedicated computers specialized for large-scale parallel computation, together with advances in statistical inference algorithms, has greatly enhanced the capabilities of these computational approaches to extract scientific knowledge and bridge gaps between theoretical models and practical implementations. The large-scale commercial success of various ML/AI applications in recent years, including robotics, industrial processes, online image recognition, financial system prediction, and autonomous vehicles, has further demonstrated the potential for data-driven methods to produce dramatic transformations in many fields. These advances, along with the urgent need to bridge key gaps in knowledge for design and operation of reactors such as ITER, have driven planned expansion of efforts in ML/AI within the US government and around the world. The Department of Energy (DOE) Office of Science programs in Fusion Energy Sciences (FES) and Advanced Scientific Computing Research (ASCR) have organized several activities to identify best strategies and approaches for applying ML/AI methods to fusion energy research. This paper describes the results of a joint FES/ASCR DOE-sponsored Research Needs Workshop on Advancing Fusion with Machine Learning, held April 30–May 2, 2019, in Gaithersburg, MD (full report available at https://science.osti.gov/-/media/fes/pdf/workshop-reports/FES_ASCR_Machine_Learning_Report.pdf ). The workshop drew on broad representation from both FES and ASCR scientific communities, and identified seven Priority Research Opportunities (PROs) with high potential for advancing fusion energy. In addition to the PRO topics themselves, the workshop identified research guidelines to maximize the effectiveness of ML/AI methods in fusion energy science, which include focusing on uncertainty quantification, methods for quantifying regions of validity of models and algorithms, and applying highly integrated teams of ML/AI mathematicians, computer scientists, and fusion energy scientists with domain expertise in the relevant areas.
  4. Multivariate time-series data are frequently observed in critical care settings and are typically characterized by sparsity (missing information) and irregular time intervals. Existing approaches for learning representations in this domain handle these challenges by either aggregation or imputation of values, which in turn suppresses the fine-grained information and adds undesirable noise/overhead into the machine learning model. To tackle this problem, we propose a Self-supervised Transformer for Time-Series (STraTS) model, which overcomes these pitfalls by treating time-series as a set of observation triplets instead of using the standard dense matrix representation. It employs a novel Continuous Value Embedding technique to encode continuous time and variable values without the need for discretization. It is composed of a Transformer component with multi-head attention layers, which enable it to learn contextual triplet embeddings while avoiding the problems of recurrence and vanishing gradients that occur in recurrent architectures. In addition, to tackle the problem of limited availability of labeled data (which is typically observed in many healthcare applications), STraTS utilizes self-supervision by leveraging unlabeled data to learn better representations by using time-series forecasting as an auxiliary proxy task. Experiments on real-world multivariate clinical time-series benchmark datasets demonstrate that STraTS has better prediction performance than state-of-the-art methods for mortality prediction, especially when labeled data is limited. Finally, we also present an interpretable version of STraTS, which can identify important measurements in the time-series data. Our data preprocessing and model implementation code is available at https://github.com/sindhura97/STraTS .
  5. Abstract

    Within the broad field of plant sciences, what are the most pressing challenges and opportunities to advance? Answers to this question usually include food and nutritional security, climate change mitigation, adaptation of plants to changing climates, preservation of biodiversity and ecosystem services, production of plant‐based proteins and products, and growth of the bioeconomy. Genes and the processes their products carry out create differences in how plants grow, develop, and behave, and thus, the key solutions to these challenges lie squarely in the space where plant genomics and physiology intersect. Advancements in genomics, phenomics, and analysis tools have generated massive datasets, but these data are complex and have not always generated scientific insights at the anticipated pace. Further, new tools may need to be created or adapted, and field‐relevant applications tested, to advance scientific discovery derived from such datasets. Meaningful, relevant conclusions and connections from genomics and plant physiological and biochemical data require both subject matter expertise and the collaborative skills needed to work together outside of specific disciplines. Bringing the best expertise to bear on complex problems in plant sciences requires enhanced, inclusive, and sustained collaboration across disciplines. However, despite significant efforts to enable and sustain collaborative research, a variety of challenges persist. Here, we present the outcomes and conclusions of two workshops convened to address the need for collaboration between scientists engaged in plant physiology, genetics, and genomics and to discuss the approaches that will create the necessary environments to support successful collaboration. We conclude with approaches to share and reward collaboration and the need to train inclusive scientists that will have the skills to thrive in interdisciplinary contexts.

     