
Title: Automated Integration of Continental-scale Observations in Near-Real Time for Simulation and Analysis of Biosphere–Atmosphere Interaction
The National Ecological Observatory Network (NEON) is a continental-scale observatory, designed to operate for multiple decades, with sites across the US collecting standardized ecological observations. To maximize the utility of NEON data, we envision edge computing systems that gather, calibrate, aggregate, and ingest measurements in an integrated fashion. Edge systems will employ machine learning methods to cross-calibrate, gap-fill, and provision data in near-real time to the NEON Data Portal and to High Performance Computing (HPC) systems running ensembles of Earth system models (ESMs) that assimilate the data. For the first time, gridded eddy-covariance (EC) data products and response functions promise to offset pervasive observational biases through evaluating, benchmarking, optimizing parameters, and training new machine learning parameterizations within ESMs, all at the same model-grid scale. Leveraging open-source software for EC data analysis, we are already building software infrastructure for integration of near-real time data streams into the International Land Model Benchmarking (ILAMB) package for use by the wider research community. We will present a perspective on the design and integration of end-to-end infrastructure for data acquisition, edge computing, HPC simulation, analysis, and validation, where Artificial Intelligence (AI) approaches are used throughout the distributed workflow to improve accuracy and computational performance.
Journal Name:
Driving Scientific and Engineering Discoveries Through the Convergence of HPC, Big Data and AI. SMC 2020. Communications in Computer and Information Science, vol 1315
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract The National Ecological Observatory Network (NEON) is a multidecadal and continental-scale observatory with sites across the United States. Having entered its operational phase in 2018, NEON data products, software, and services become available to facilitate research on the impacts of climate change, land-use change, and invasive species. An essential component of NEON are its 47 tower sites, where eddy-covariance (EC) sensors are operated to determine the surface–atmosphere exchange of momentum, heat, water, and CO2. EC tower networks such as AmeriFlux, the Integrated Carbon Observation System (ICOS), and NEON are vital for providing the distributed observations to address interactions at the soil–vegetation–atmosphere interface. NEON represents the largest single-provider EC network globally, with standardized observations and data processing explicitly designed for intersite comparability and analysis of feedbacks across multiple spatial and temporal scales. Furthermore, EC is tightly integrated with soil, meteorology, atmospheric chemistry, isotope, phenology, and rich contextual observations such as airborne remote sensing and in situ sampling bouts. Here, we present an overview of NEON’s observational design, field operation, and data processing that yield community resources for the study of surface–atmosphere interactions. Near-real-time data products become available from the NEON Data Portal, and EC and meteorological data are ingested into AmeriFlux and FLUXNET globally harmonized data releases. Open-source software for reproducible, extensible, and portable data analysis includes the eddy4R family of R packages underlying the EC data product generation. These resources strive to integrate with existing infrastructures and networks, to suggest novel systemic solutions, and to synergize ongoing research efforts across science communities.
  2. Scientific communities are increasingly adopting machine learning and deep learning models in their applications to accelerate scientific insights. High performance computing systems are pushing the frontiers of performance with a rich diversity of hardware resources and massive scale-out capabilities. There is a critical need to understand fair and effective benchmarking of machine learning applications that are representative of real-world scientific use cases. MLPerf™ is a community-driven standard to benchmark machine learning workloads, focusing on end-to-end performance metrics. In this paper, we introduce MLPerf HPC, a benchmark suite of large-scale scientific machine learning training applications, driven by the MLCommons™ Association. We present the results from the first submission round including a diverse set of some of the world’s largest HPC systems. We develop a systematic framework for their joint analysis and compare them in terms of data staging, algorithmic convergence and compute performance. As a result, we gain a quantitative understanding of optimizations on different subsystems such as staging and on-node loading of data, compute-unit utilization and communication scheduling enabling overall >10× (end-to-end) performance improvements through system scaling. Notably, our analysis shows a scale-dependent interplay between the dataset size, a system’s memory hierarchy and training convergence that underlines the importance of near-compute storage. To overcome the data-parallel scalability challenge at large batch-sizes, we discuss specific learning techniques and hybrid data-and-model parallelism that are effective on large systems. We conclude by characterizing each benchmark with respect to low-level memory, I/O and network behaviour to parameterize extended roofline performance models in future rounds.
  3. Abstract

    Carbon fluxes in terrestrial ecosystems and their response to environmental change are a major source of uncertainty in the modern carbon cycle. The National Ecological Observatory Network (NEON) presents the opportunity to merge eddy covariance (EC)‐derived fluxes with CO2 isotope ratio measurements to gain insights into carbon cycle processes. Collected continuously and consistently across >40 sites, NEON EC and isotope data facilitate novel integrative analyses. However, currently provisioned atmospheric isotope data are uncalibrated, greatly limiting the ability to perform cross‐site analyses. Here, we present two approaches to calibrating NEON CO2 isotope ratios, along with an R package to calibrate NEON data. We find that calibrating CO2 isotopologues independently yields a lower δ13C bias (<0.05‰) and higher precision (<0.40‰) than directly correcting δ13C with linear regression (bias: <0.11‰, precision: 0.42‰), but with slightly higher error and lower precision in calibrated CO2 mole fraction. The magnitude of the corrections to δ13C and CO2 mole fractions varies substantially by site, underscoring the need for users to apply a consistent calibration framework to data in the NEON archive. Post‐calibration data sets show that site mean annual δ13C correlates negatively with precipitation, temperature, and aridity, but positively with elevation. Forested and agricultural ecosystems exhibit larger gradients in CO2 and δ13C than other sites, particularly during the summer and at night. The overview and analysis tools developed here will facilitate cross‐site analysis using NEON data, provide a model for other continental‐scale observational networks, and enable new advances leveraging the isotope ratios of specific carbon fluxes.
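
The second approach named in this abstract, correcting δ13C directly with linear regression, can be illustrated with a minimal sketch. It is written in Python rather than R, is not the authors' package, and assumes a simple setup in which a few reference gases with known δ13C are measured and a straight line is fitted from measured to known values. The function names and the reference readings are hypothetical.

```python
import numpy as np

def fit_linear_calibration(measured_ref, known_ref):
    """Least-squares fit of: known = slope * measured + intercept."""
    slope, intercept = np.polyfit(measured_ref, known_ref, deg=1)
    return slope, intercept

def apply_calibration(measured, slope, intercept):
    """Map raw measurements onto the reference scale."""
    return slope * np.asarray(measured, dtype=float) + intercept

# Hypothetical reference-gas readings in per mil (illustrative, not NEON data).
measured_ref = np.array([-9.0, -11.0, -13.0])   # what the analyzer reported
known_ref = np.array([-8.5, -10.5, -12.5])      # assumed true values

slope, intercept = fit_linear_calibration(measured_ref, known_ref)
calibrated = apply_calibration([-10.0], slope, intercept)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, calibrated={calibrated[0]:.3f}")
```

In this toy case the analyzer reads a constant 0.5‰ too low, so the fit recovers a slope near 1 and an intercept near 0.5. Real calibrations would also need to handle drift over time and measurement noise, which this sketch ignores.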

  4. The small sizes of most marine plankton necessitate that plankton sampling occur on fine spatial scales, yet our questions often span large spatial areas. Underwater imaging can provide a solution to this sampling conundrum but collects large quantities of data that require an automated approach to image analysis. Machine learning for plankton classification, and high-performance computing (HPC) infrastructure, are critical to rapid image processing; however, these assets, especially HPC infrastructure, are only available post-cruise leading to an ‘after-the-fact’ view of plankton community structure. To be responsive to the often-ephemeral nature of oceanographic features and species assemblages in highly dynamic current systems, real-time data are key for adaptive oceanographic sampling. Here we used the new In-situ Ichthyoplankton Imaging System-3 (ISIIS-3) in the Northern California Current (NCC) in conjunction with an edge server to classify imaged plankton in real-time into 170 classes. This capability together with data visualization in a dashboard makes adaptive real-time decision-making and sampling at sea possible. Dual ISIIS-Deep-focus Particle Imager (DPI) cameras sample 180 L s⁻¹, leading to >10 GB of video per min. Imaged organisms are in the size range of 250 µm to 15 cm and include abundant crustaceans, fragile taxa (e.g., hydromedusae, salps), faster swimmers (e.g., krill), and rarer taxa (e.g., larval fishes). A deep learning pipeline deployed on the edge server used multithreaded CPU-based segmentation and GPU-based classification to process the imagery. AVI videos contain 50 sec of data and can contain between 23,000 and 225,000 particle and plankton segments. Processing one AVI through segmentation and classification takes on average 3.75 min, depending on biological productivity. A heavyDB database monitors for newly processed data and is linked to a dashboard for interactive data visualization.
We describe several examples where imaging, AI, and data visualization enable adaptive sampling that can have a transformative effect on oceanography. We envision AI-enabled adaptive sampling to have a high impact on our ability to resolve biological responses to important oceanographic features in the NCC, such as oxygen minimum zones, or harmful algal bloom thin layers, which affect the health of the ecosystem, fisheries, and local communities. 
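
The throughput figures quoted in this abstract (50 s of video per AVI, 3.75 min average processing time, 23,000 to 225,000 segments per AVI) can be put side by side with a short back-of-the-envelope calculation. The sketch below is only arithmetic on those numbers. The "parallel streams" figure assumes, hypothetically, that AVIs are processed independently at the average rate, which the abstract does not state.

```python
import math

# Figures taken from the abstract above.
VIDEO_SECONDS_PER_AVI = 50.0
PROCESSING_MINUTES_PER_AVI = 3.75
MIN_SEGMENTS_PER_AVI = 23_000
MAX_SEGMENTS_PER_AVI = 225_000

def processing_to_recording_ratio(video_seconds, processing_minutes):
    """Seconds of processing spent per second of recorded video."""
    return processing_minutes * 60.0 / video_seconds

def parallel_streams_to_keep_pace(ratio):
    """Hypothetical: independent streams needed so processing matches recording."""
    return math.ceil(ratio)

ratio = processing_to_recording_ratio(VIDEO_SECONDS_PER_AVI, PROCESSING_MINUTES_PER_AVI)
streams = parallel_streams_to_keep_pace(ratio)
min_rate = MIN_SEGMENTS_PER_AVI / VIDEO_SECONDS_PER_AVI  # segments per second of video
max_rate = MAX_SEGMENTS_PER_AVI / VIDEO_SECONDS_PER_AVI

print(f"processing/recording ratio: {ratio:.1f}")
print(f"streams to keep pace (assumption): {streams}")
print(f"segments per second of video: {min_rate:.0f} to {max_rate:.0f}")
```

Under these assumptions, one AVI takes about 4.5 times longer to process than to record, and the imagery contains roughly 460 to 4,500 segments per second of video.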
  5. Abstract

    Insect populations are changing rapidly, and monitoring these changes is essential for understanding the causes and consequences of such shifts. However, large‐scale insect identification projects are time‐consuming and expensive when done solely by human identifiers. Machine learning offers a possible solution to help collect insect data quickly and efficiently.

    Here, we outline a methodology for training classification models to identify pitfall trap‐collected insects from image data and then apply the method to identify ground beetles (Carabidae). All beetles were collected by the National Ecological Observatory Network (NEON), a continental scale ecological monitoring project with sites across the United States. We describe the procedures for image collection, image data extraction, data preparation, and model training, and compare the performance of five machine learning algorithms and two classification methods (hierarchical vs. single‐level) identifying ground beetles from the species to subfamily level. All models were trained using pre‐extracted feature vectors, not raw image data. Our methodology allows for data to be extracted from multiple individuals within the same image thus enhancing time efficiency, utilizes relatively simple models that allow for direct assessment of model performance, and can be performed on relatively small datasets.

    The best performing algorithm, linear discriminant analysis (LDA), reached an accuracy of 84.6% at the species level when naively identifying species, which was further increased to >95% when classifications were limited by known local species pools. Model performance was negatively correlated with taxonomic specificity, with the LDA model reaching an accuracy of ~99% at the subfamily level. When classifying carabid species not included in the training dataset at higher taxonomic levels, the models performed significantly better than if classifications were made randomly. We also observed greater performance when classifications were made using the hierarchical classification method compared to the single‐level classification method at higher taxonomic levels.

    The general methodology outlined here serves as a proof‐of‐concept for classifying pitfall trap‐collected organisms using machine learning algorithms, and the image data extraction methodology may be used for purposes other than machine learning. We propose that integration of machine learning in large‐scale identification pipelines will increase efficiency and lead to a greater flow of insect macroecological data, with the potential to be expanded for use with other non‐insect taxa.
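
The abstract reports that limiting classifications to known local species pools raised species-level accuracy. It does not say how the restriction was implemented. One plausible way, sketched here as an assumption, is to discard the scores of species not known from the site before picking the top class. The function name, species labels, and scores are hypothetical.

```python
import numpy as np

def predict_with_species_pool(scores, class_names, local_pool):
    """Return the highest-scoring class among those present in local_pool."""
    scores = np.asarray(scores, dtype=float)
    allowed = np.array([name in local_pool for name in class_names])
    if not allowed.any():
        raise ValueError("no candidate class is in the local species pool")
    # Classes outside the pool can never win the argmax.
    masked = np.where(allowed, scores, -np.inf)
    return class_names[int(np.argmax(masked))]

# Hypothetical classifier output for a single specimen.
class_names = ["species_a", "species_b", "species_c"]
scores = [0.5, 0.3, 0.2]

naive = class_names[int(np.argmax(scores))]
restricted = predict_with_species_pool(scores, class_names, {"species_b", "species_c"})
print(naive, restricted)
```

Here the naive prediction is `species_a`, while the restricted prediction is `species_b` because `species_a` is not in the assumed local pool.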
