Abstract Common designs of model evaluation typically focus on monolingual settings, where different models are compared according to their performance on a single data set that is assumed to be representative of all possible data for the task at hand. While this may be reasonable for a large data set, this assumption is difficult to maintain in low-resource scenarios, where artifacts of the data collection can yield data sets that are outliers, potentially making conclusions about model performance coincidental. To address these concerns, we investigate model generalizability in crosslinguistic low-resource scenarios. Using morphological segmentation as the test case, we compare three broad classes of models with different parameterizations, taking data from 11 languages across 6 language families. In each experimental setting, we evaluate all models on a first data set, then examine their performance consistency when introducing new randomly sampled data sets with the same size and when applying the trained models to unseen test sets of varying sizes. The results demonstrate that the extent of model generalization depends on the characteristics of the data set, and does not necessarily rely heavily on the data set size. Among the characteristics that we studied, the ratio of morpheme overlap and that of the average number of morphemes per word between the training and test sets are the two most prominent factors. Our findings suggest that future work should adopt random sampling to construct data sets with different sizes in order to make more responsible claims about model evaluation.
Ecological prediction at macroscales using big data: Does sampling design matter?
Abstract Although ecosystems respond to global change at regional to continental scales (i.e., macroscales), model predictions of ecosystem responses often rely on data from targeted monitoring of a small proportion of sampled ecosystems within a particular geographic area. In this study, we examined how the sampling strategy used to collect data for such models influences predictive performance. We subsampled a large and spatially extensive data set to investigate how macroscale sampling strategy affects prediction of ecosystem characteristics in 6,784 lakes across a 1.8‐million‐km² area. We estimated model predictive performance for different subsets of the data set to mimic three common sampling strategies for collecting observations of ecosystem characteristics: random sampling design, stratified random sampling design, and targeted sampling. We found that sampling strategy influenced model predictive performance such that (1) stratified random sampling designs did not improve predictive performance compared to simple random sampling designs and (2) although one of the scenarios that mimicked targeted (non‐random) sampling had the poorest performing predictive models, the other targeted sampling scenarios resulted in models with similar predictive performance to that of the random sampling scenarios. Our results suggest that although potential biases in data sets from some forms of targeted sampling may limit predictive performance, compiling existing spatially extensive data sets can result in models with good predictive performance that may inform a wide range of science questions and policy goals related to global change.
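The three sampling strategies named in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration only: the synthetic `lakes` records, the `region` and `area_ha` fields, the proportional allocation rule, and the choice of "largest lakes first" as the targeted scenario. The abstract does not specify how the study defined its strata or its targeted scenarios.

```python
import random
from collections import defaultdict


def simple_random_sample(units, n, rng):
    """Draw n units uniformly at random, without replacement."""
    return rng.sample(units, n)


def stratified_random_sample(units, n, key, rng):
    """Draw roughly n units, allocated to strata in proportion to stratum size."""
    strata = defaultdict(list)
    for unit in units:
        strata[key(unit)].append(unit)
    sample = []
    for members in strata.values():
        # Proportional allocation; rounding means the total may differ slightly from n.
        k = max(1, round(n * len(members) / len(units)))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample


def targeted_sample(units, n, score):
    """Non-random selection: take the n units that rank highest on `score`."""
    return sorted(units, key=score, reverse=True)[:n]


rng = random.Random(0)

# Hypothetical population: 1,000 synthetic lakes (not the study's data).
lakes = [
    {"id": i, "region": rng.choice("ABCD"), "area_ha": rng.lognormvariate(3, 1)}
    for i in range(1000)
]

srs = simple_random_sample(lakes, 100, rng)
strat = stratified_random_sample(lakes, 100, lambda lake: lake["region"], rng)
targ = targeted_sample(lakes, 100, lambda lake: lake["area_ha"])
```

Fitting the same model to each subset and scoring it on the lakes left out would then give one predictive-performance estimate per strategy, which is the kind of comparison the abstract describes.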
- PAR ID: 10449303
- Publisher / Repository: Wiley Blackwell (John Wiley & Sons)
- Date Published:
- Journal Name: Ecological Applications
- Volume: 30
- Issue: 6
- ISSN: 1051-0761
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Abstract The determinants of fire-driven changes in soil organic carbon (SOC) across broad environmental gradients remain unclear, especially in global drylands. Here we combined datasets and field sampling of fire-manipulation experiments to evaluate where and why fire changes SOC and compared our statistical model to simulations from ecosystem models. Drier ecosystems experienced larger relative changes in SOC than humid ecosystems—in some cases exceeding losses from plant biomass pools—primarily explained by high fire-driven declines in tree biomass inputs in dry ecosystems. Many ecosystem models underestimated the SOC changes in drier ecosystems. Upscaling our statistical model predicted that soils in savannah–grassland regions may have gained 0.64 PgC due to net-declines in burned area over the past approximately two decades. Consequently, ongoing declines in fire frequencies have probably created an extensive carbon sink in the soils of global drylands that may have been underestimated by ecosystem models.
Abstract Environmental disturbances may prevent ecosystems from consistently performing their critical ecological functions. Two important properties of ecosystems are their resistance and stability, which respectively reflect their capacities to withstand and recover from disturbance events (e.g. droughts, wildfires, pests, etc). Theory suggests that resistant and stable ecosystems possess opposing characteristics, but this has seldom been established across diverse ecosystem attributes or broad spatial scales. Here, we compare the resistance and stability of >1000 protected area ecosystems in Africa to disturbance-induced losses in primary productivity from 2000 to 2019. We quantitatively evaluated each ecosystem such that following disturbances, an ecosystem is more resistant if it experiences lower-magnitude losses in productivity, and more stable if it returns more rapidly to pre-disturbance productivity levels. To compare the characteristics of resistant versus stable ecosystems, we optimized random forest models that use ecosystem attributes (representing their climatic and environmental conditions, plant and faunal biodiversity, and exposure to human impacts) to predict their resistance and, separately, stability values. We visualized each attribute’s relationship with resistance and stability after accounting for all other attributes in the model framework. Ecosystems that are more resistant to disturbances are less stable, and vice versa. The ecosystem attributes with the most predictive power in our models all exhibit contrasting relationships with resistance versus stability. Notably, highly resistant ecosystems are generally more arid and exhibit high habitat heterogeneity and mammalian biodiversity, while highly stable ecosystems are the opposite. We discuss the underlying mechanisms through which these attributes engender resistance or, conversely, stability. Our findings suggest that resistance and stability are fundamentally opposing phenomena. 
A balance between the two must be struck if ecosystems are to maintain their identity, structure, and function in the face of environmental change.
Tanentzap, Andrew J (Ed.) The ecology of forest ecosystems depends on the composition of trees. Capturing fine-grained information on individual trees at broad scales provides a unique perspective on forest ecosystems, forest restoration, and responses to disturbance. Individual tree data at wide extents promises to increase the scale of forest analysis, biogeographic research, and ecosystem monitoring without losing details on individual species composition and abundance. Computer vision using deep neural networks can convert raw sensor data into predictions of individual canopy tree species through labeled data collected by field researchers. Using over 40,000 individual tree stems as training data, we create landscape-level species predictions for over 100 million individual trees across 24 sites in the National Ecological Observatory Network (NEON). Using hierarchical multi-temporal models fine-tuned for each geographic area, we produce open-source data available as 1 km² shapefiles with individual tree species prediction, as well as crown location, crown area, and height of 81 canopy tree species. Site-specific models had an average performance of 79% accuracy covering an average of 6 species per site, ranging from 3 to 15 species per site. All predictions are openly archived and have been uploaded to Google Earth Engine to benefit the ecology community and overlay with other remote sensing assets. We outline the potential utility and limitations of these data in ecology and computer vision research, as well as strategies for improving predictions using targeted data sampling.
Abstract Understanding the determinants of urban forest diversity and structure is important for preserving biodiversity and sustaining ecosystem services in cities. However, comprehensive field assessments are resource‐intensive, and landscape‐level approaches may overlook heterogeneity within urban regions. To address this challenge, we combined remote sensing with field inventories to comprehensively map and analyze urban forest attributes in forest patches across the Minneapolis‐St. Paul Metropolitan Area (MSPMA) in a multistep process. First, we developed predictive machine learning models of forest attributes by integrating data from forest inventories (from 40 12.5‐m‐radius plots) with Global Ecosystem Dynamics Investigation (GEDI) observations and Sentinel‐2‐derived land surface phenology (LSP). These models enabled accurate predictions of forest attributes, specifically nine metrics of plant diversity (tree species richness, tree abundance, and understory plant abundance), structure (average canopy height, dbh, and canopy density), and structural complexity (variability in canopy height, dbh, and canopy density) with relative errors ranging between 11% and 21%. Second, we applied these machine learning models to predict diversity metrics for 804 additional plots from GEDI and Sentinel‐2. Finally, we applied Bayesian multilevel models to the predicted diversity metrics to assess the influence of multiple factors—patch dimensions, landscape attributes, plot position, and jurisdictional agency—on these forest attributes across the 804 predicted plots. The models showed all predictors have some degree of effect on forest attributes, presenting varying explanatory power with R² values ranging from 0.071 to 0.405. Overall, plot characteristics (e.g., distance to nearest trail, proximity to forest edge) and jurisdictional agency explained a large portion of the variability across patches, whereas patch and landscape characteristics did not.
The relative effect of plot versus management sets of predictors on the marginal ΔR² was heterogeneous across metrics and ecological subsections (an ecological classification designation). The multiplicity of determinants influencing urban forests emphasizes the intricate nature of urban ecosystems and highlights nuanced, heterogeneous relationships between urban ecological and anthropogenic factors that determine forest properties. Effectively enhancing biodiversity in urban forests requires assessments, management, and conservation strategies tailored for context‐specific characteristics.
