Title: Ecological prediction at macroscales using big data: Does sampling design matter?
Abstract

Although ecosystems respond to global change at regional to continental scales (i.e., macroscales), model predictions of ecosystem responses often rely on data from targeted monitoring of a small proportion of sampled ecosystems within a particular geographic area. In this study, we examined how the sampling strategy used to collect data for such models influences predictive performance. We subsampled a large and spatially extensive data set to investigate how macroscale sampling strategy affects prediction of ecosystem characteristics in 6,784 lakes across a 1.8‐million‐km² area. We estimated model predictive performance for different subsets of the data set to mimic three common sampling strategies for collecting observations of ecosystem characteristics: random sampling design, stratified random sampling design, and targeted sampling. We found that sampling strategy influenced model predictive performance such that (1) stratified random sampling designs did not improve predictive performance compared to simple random sampling designs and (2) although one of the scenarios that mimicked targeted (non‐random) sampling had the poorest performing predictive models, the other targeted sampling scenarios resulted in models with similar predictive performance to that of the random sampling scenarios. Our results suggest that although potential biases in data sets from some forms of targeted sampling may limit predictive performance, compiling existing spatially extensive data sets can result in models with good predictive performance that may inform a wide range of science questions and policy goals related to global change.

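The subsampling idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' code or data: the lakes are synthetic, the single predictor `x`, the region labels, the linear model, and every function name below are hypothetical, and the "targeted" design simply assumes a bias toward lakes with large `x`. The sketch draws one subsample per design, fits a model to it, and scores predictions on the lakes that were not sampled.

```python
import math
import random


def make_lakes(n, n_regions, seed=0):
    """Generate hypothetical lakes: a region label, one predictor x, one response y."""
    rng = random.Random(seed)
    lakes = []
    for i in range(n):
        region = i % n_regions
        x = rng.uniform(0.0, 10.0)
        y = 2.0 + 0.5 * x + 0.3 * region + rng.gauss(0.0, 1.0)
        lakes.append({"id": i, "region": region, "x": x, "y": y})
    return lakes


def simple_random_sample(lakes, k, rng):
    """Draw k lakes uniformly at random, without replacement."""
    return rng.sample(lakes, k)


def stratified_random_sample(lakes, k, rng):
    """Draw an equal share of the k lakes at random from each region."""
    by_region = {}
    for lake in lakes:
        by_region.setdefault(lake["region"], []).append(lake)
    per_region = k // len(by_region)
    sample = []
    for region in sorted(by_region):
        sample.extend(rng.sample(by_region[region], per_region))
    return sample


def targeted_sample(lakes, k):
    """Non-random design: take the k lakes with the largest x (an assumed bias)."""
    return sorted(lakes, key=lambda lake: lake["x"], reverse=True)[:k]


def fit_line(sample):
    """Ordinary least-squares fit of y on x; returns (slope, intercept)."""
    n = len(sample)
    mean_x = sum(lake["x"] for lake in sample) / n
    mean_y = sum(lake["y"] for lake in sample) / n
    sxx = sum((lake["x"] - mean_x) ** 2 for lake in sample)
    sxy = sum((lake["x"] - mean_x) * (lake["y"] - mean_y) for lake in sample)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse_on_unsampled(lakes, sample):
    """Fit on the sample, then compute RMSE on every lake outside the sample."""
    slope, intercept = fit_line(sample)
    sampled_ids = {lake["id"] for lake in sample}
    held_out = [lake for lake in lakes if lake["id"] not in sampled_ids]
    squared = [(lake["y"] - (intercept + slope * lake["x"])) ** 2 for lake in held_out]
    return math.sqrt(sum(squared) / len(squared))


lakes = make_lakes(n=600, n_regions=6, seed=1)
rng = random.Random(2)
designs = {
    "random": simple_random_sample(lakes, 60, rng),
    "stratified": stratified_random_sample(lakes, 60, rng),
    "targeted": targeted_sample(lakes, 60),
}
results = {name: rmse_on_unsampled(lakes, sample) for name, sample in designs.items()}
for name, rmse in results.items():
    print(f"{name}: RMSE = {rmse:.2f}")
```

A single draw per design says little on its own; repeating the draws with different seeds and comparing the distributions of the error metric would be closer in spirit to the comparison the abstract describes.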
 
Award ID(s):
1638679 1638550 1638554 2306364
NSF-PAR ID:
10449303
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Journal Name:
Ecological Applications
Volume:
30
Issue:
6
ISSN:
1051-0761
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Common designs of model evaluation typically focus on monolingual settings, where different models are compared according to their performance on a single data set that is assumed to be representative of all possible data for the task at hand. While this may be reasonable for a large data set, this assumption is difficult to maintain in low-resource scenarios, where artifacts of the data collection can yield data sets that are outliers, potentially making conclusions about model performance coincidental. To address these concerns, we investigate model generalizability in crosslinguistic low-resource scenarios. Using morphological segmentation as the test case, we compare three broad classes of models with different parameterizations, taking data from 11 languages across 6 language families. In each experimental setting, we evaluate all models on a first data set, then examine their performance consistency when introducing new randomly sampled data sets with the same size and when applying the trained models to unseen test sets of varying sizes. The results demonstrate that the extent of model generalization depends on the characteristics of the data set, and does not necessarily rely heavily on the data set size. Among the characteristics that we studied, the ratio of morpheme overlap and that of the average number of morphemes per word between the training and test sets are the two most prominent factors. Our findings suggest that future work should adopt random sampling to construct data sets with different sizes in order to make more responsible claims about model evaluation. 
  2. Elizabeth Borer (Ed.)
    Understanding spatial and temporal variation in plant traits is needed to accurately predict how communities and ecosystems will respond to global change. The National Ecological Observatory Network’s (NEON’s) Airborne Observation Platform (AOP) provides hyperspectral images and associated data products at numerous field sites at 1 m spatial resolution, potentially allowing high-resolution trait mapping. We tested the accuracy of readily available data products of NEON’s AOP, such as Leaf Area Index (LAI), Total Biomass, Ecosystem Structure (Canopy height model [CHM]), and Canopy Nitrogen, by comparing them to spatially extensive field measurements from a mesic tallgrass prairie. AOP data products exhibited generally weak or no relationships with corresponding field measurements. The strongest relationships were between AOP LAI and ground-measured LAI (r = 0.32) and AOP Total Biomass and ground-measured biomass (r = 0.23). We also examined how well the full reflectance spectra (380–2,500 nm), as opposed to derived products, could predict vegetation traits using partial least-squares regression (PLSR) models. Among the eight traits examined, only Nitrogen had a validation R² of more than 0.25. For all vegetation traits, validation R² ranged from 0.08 to 0.29 and the range of the root mean square error of prediction (RMSEP) was 14–64%. Our results suggest that currently available AOP-derived data products should not be used without extensive ground-based validation. Relationships using the full reflectance spectra may be more promising, although careful consideration of field and AOP data mismatches in space and/or time, biases in field-based measurements or AOP algorithms, and model uncertainty is needed. Finally, grassland sites may be especially challenging for airborne spectroscopy because of their high species diversity within a small area, mixed functional types of plant communities, and heterogeneous mosaics of disturbance and resource availability. Remote sensing observations are one of the most promising approaches to understanding ecological patterns across space and time. But the opportunity to engage a diverse community of NEON data users will depend on establishing rigorous links with in-situ field measurements across a diversity of sites.
  3. Abstract

    Understanding spatial and temporal variation in plant traits is needed to accurately predict how communities and ecosystems will respond to global change. The National Ecological Observatory Network’s (NEON’s) Airborne Observation Platform (AOP) provides hyperspectral images and associated data products at numerous field sites at 1 m spatial resolution, potentially allowing high‐resolution trait mapping. We tested the accuracy of readily available data products of NEON’s AOP, such as Leaf Area Index (LAI), Total Biomass, Ecosystem Structure (Canopy height model [CHM]), and Canopy Nitrogen, by comparing them to spatially extensive field measurements from a mesic tallgrass prairie. AOP data products exhibited generally weak or no relationships with corresponding field measurements. The strongest relationships were between AOP LAI and ground‐measured LAI (r = 0.32) and AOP Total Biomass and ground‐measured biomass (r = 0.23). We also examined how well the full reflectance spectra (380–2,500 nm), as opposed to derived products, could predict vegetation traits using partial least‐squares regression (PLSR) models. Among the eight traits examined, only Nitrogen had a validation R² of more than 0.25. For all vegetation traits, validation R² ranged from 0.08 to 0.29 and the range of the root mean square error of prediction (RMSEP) was 14–64%. Our results suggest that currently available AOP‐derived data products should not be used without extensive ground‐based validation. Relationships using the full reflectance spectra may be more promising, although careful consideration of field and AOP data mismatches in space and/or time, biases in field‐based measurements or AOP algorithms, and model uncertainty is needed. Finally, grassland sites may be especially challenging for airborne spectroscopy because of their high species diversity within a small area, mixed functional types of plant communities, and heterogeneous mosaics of disturbance and resource availability. Remote sensing observations are one of the most promising approaches to understanding ecological patterns across space and time. But the opportunity to engage a diverse community of NEON data users will depend on establishing rigorous links with in‐situ field measurements across a diversity of sites.

     
  4. Abstract

    Nitrogen (N) is a key limiting nutrient in terrestrial ecosystems, but there remain critical gaps in our ability to predict and model controls on soil N cycling. This may be in part due to lack of standardized sampling across broad spatial–temporal scales. Here, we introduce a continentally distributed, publicly available data set collected by the National Ecological Observatory Network (NEON) that can help fill these gaps. First, we detail the sampling design and methods used to collect and analyze soil inorganic N pool and net flux rate data from 47 terrestrial sites. We address methodological challenges in generating a standardized data set, even for a network using uniform protocols. Then, we evaluate sources of variation within the sampling design and compare measured net N mineralization to simulated fluxes from the Community Earth System Model 2 (CESM2). We observed wide spatiotemporal variation in inorganic N pool sizes and net transformation rates. Site explained the most variation in NEON’s stratified sampling design, followed by plots within sites. Organic horizons had larger pools and net N transformation rates than mineral horizons on a sample weight basis. The majority of sites showed some degree of seasonality in N dynamics, but overall these temporal patterns were not matched by CESM2, leading to poor correspondence between observed and modeled data. Looking forward, these data can reveal new insights into controls on soil N cycling, especially in the context of other environmental data sets provided by NEON, and should be leveraged to improve predictive modeling of the soil N cycle.

     
  5. Increasing deoxygenation (loss of oxygen) of the ocean, including expansion of oxygen minimum zones (OMZs), is a potentially important consequence of global warming. We examined present-day variability of vertical distributions of 23 calanoid copepod species in the Eastern Tropical North Pacific (ETNP) living in locations with different water column oxygen profiles and OMZ intensity (lowest oxygen concentration and its vertical extent in a profile). Copepods and hydrographic data were collected in vertically stratified day and night MOCNESS (Multiple Opening/Closing Net and Environmental Sensing System) tows (0–1000 m) during four cruises over a decade (2007–2017) that sampled four ETNP locations: Costa Rica Dome, Tehuantepec Bowl, and two oceanic sites further north (21–22° N) off Mexico. The sites had different vertical oxygen profiles: some with a shallow mixed layer, abrupt thermocline, and extensive very low oxygen OMZ core; and others with a more gradual vertical development of the OMZ (broad mixed layer and upper oxycline zone) and a less extensive OMZ core where oxygen was not as low. Calanoid copepod species (including examples from the genera Eucalanus, Pleuromamma, and Lucicutia) demonstrated different distributional strategies (implying different physiological characteristics) associated with this variability. We identified sets of species that (1) changed their vertical distributions and depth of maximum abundance associated with the depth and intensity of the OMZ and its oxycline inflection points; (2) shifted their depth of diapause; (3) adjusted their diel vertical migration, especially the nighttime upper depth; or (4) expanded or contracted their depth range within the mixed layer and upper part of the thermocline in association with the thickness of the aerobic epipelagic zone (habitat compression concept). These distribution depths changed by tens to hundreds of meters depending on the species, oxygen profile, and phenomenon. For example, at the lower oxycline, the depth of maximum abundance for Lucicutia hulsemannae shifted from ~600 to ~800 m, and the depth of diapause for Eucalanus inermis shifted from ~500 to ~775 m, in an expanded OMZ compared to a thinner OMZ, but remained at similar low oxygen levels in both situations. These species or life stages are examples of “hypoxiphilic” taxa. For the migrating copepod Pleuromamma abdominalis, its nighttime depth was shallow (~20 m) when the aerobic mixed layer was thin and the low-oxygen OMZ broad, but it was much deeper (~100 m) when the mixed layer and higher oxygen extended deeper; daytime depth in both situations was ~300 m. Because temperature decreased with depth, these distributional depth shifts had metabolic implications. The upper ocean to mesopelagic depth range encompasses a complex interwoven ecosystem characterized by intricate relationships among its inhabitants and their environment. It is a critically important zone for oceanic biogeochemical and export processes and hosts key food web components for commercial fisheries. Among the zooplankton, there will likely be winners and losers with increasing ocean deoxygenation as species cope with environmental change. Changes in individual copepod species abundances, vertical distributions, and life history strategies may create potential perturbations to these intricate food webs and processes. Present-day variability provides a window into future scenarios and potential effects of deoxygenation.