Title: Three-quarter Sibling Regression for Denoising Observational Data
Many ecological studies and conservation policies are based on field observations of species, which can be affected by systematic variability introduced by the observation process. A recently introduced causal modeling technique called 'half-sibling regression' can detect and correct for systematic errors in measurements of multiple independent random variables. However, it will remove intrinsic variability if the variables are dependent, and therefore does not apply to many situations, including modeling of species counts that are controlled by common causes. We present a technique called 'three-quarter sibling regression' to partially overcome this limitation. It can filter the effect of systematic noise when the latent variables have observed common causes. We provide theoretical justification of this approach, demonstrate its effectiveness on synthetic data, and show that it reduces systematic detection variability due to moon brightness in moth surveys.
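The contrast drawn in the abstract can be illustrated with a small synthetic experiment. The sketch below is not the authors' implementation. It assumes linear relationships, additive Gaussian noise, and a single observed common cause, and it reads the three-quarter sibling idea as: residualize both observed series on the common cause, regress one residual on the other to estimate the shared noise, and subtract that estimate. The function names (`fit_linear`, `half_sibling`, `three_quarter_sibling`, `simulate`) and all coefficients are hypothetical choices made for illustration.

```python
import numpy as np


def fit_linear(features, target):
    """Ordinary least squares with an intercept; returns fitted values."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ coef


def half_sibling(y1, y2):
    """Subtract the part of y1 predictable from y2, then restore the mean."""
    return y1 - fit_linear(y2, y1) + y1.mean()


def three_quarter_sibling(y1, y2, x):
    """Assumed linear variant: remove only what y2 explains beyond x."""
    r1 = y1 - fit_linear(x, y1)      # residual of y1 given the common cause
    r2 = y2 - fit_linear(x, y2)      # residual of y2 given the common cause
    noise_hat = fit_linear(r2, r1)   # shared-noise estimate from residuals
    return y1 - noise_hat


def simulate(n=5000, seed=0):
    """Synthetic data: two latent signals share a cause x and a noise source."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                    # observed common cause
    noise = rng.normal(size=n)                # unobserved systematic noise
    z1 = 2.0 * x + 0.1 * rng.normal(size=n)   # latent quantity of interest
    z2 = 1.5 * x + 0.1 * rng.normal(size=n)   # latent sibling quantity
    y1 = z1 + 1.0 * noise                     # observed, contaminated
    y2 = z2 + 0.8 * noise                     # observed, contaminated

    def mse(estimate):
        return float(np.mean((estimate - z1) ** 2))

    return {
        "raw": mse(y1),
        "half_sibling": mse(half_sibling(y1, y2)),
        "three_quarter": mse(three_quarter_sibling(y1, y2, x)),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: mean squared error vs. latent signal = {value:.4f}")
```

Calling `simulate()` returns the mean squared error against the latent signal for the raw observation and for each correction. In this toy setup the half-sibling correction strips out most of the signal that the two series share through the common cause, while the three-quarter variant retains it because the common cause is regressed out before the shared noise is estimated.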
Award ID(s):
1749854
PAR ID:
10199761
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence
Page Range / eLocation ID:
5960 to 5966
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Neighborhood models have allowed us to test many hypotheses regarding the drivers of variation in tree growth, but require considerable computation due to the many empirically supported non-linear relationships they include. Regularized regression represents a far more efficient neighborhood modeling method, but it is unclear whether such an ecologically unrealistic model can provide accurate insights on tree growth. Rapid computation is becoming increasingly important as ecological datasets grow in size, and may be essential when using neighborhood models to predict tree growth beyond sample plots or into the future. We built a novel regularized regression model of tree growth and investigated whether it reached the same conclusions as a commonly used neighborhood model, regarding hypotheses of how tree growth is influenced by the species identity of neighboring trees. We also evaluated the ability of both models to interpolate the growth of trees not included in the model fitting dataset. Our regularized regression model replicated most of the classical model’s inferences in a fraction of the time without using high-performance computing resources. We found that both methods could interpolate out-of-sample tree growth, but the method making the most accurate predictions varied among focal species. Regularized regression is particularly efficient for comparing hypotheses because it automates the process of model selection and can handle correlated explanatory variables. This feature means that regularized regression could also be used to select among potential explanatory variables (e.g., climate variables) and thereby streamline the development of a classical neighborhood model. Both regularized regression and classical methods can interpolate out-of-sample tree growth, but future research must determine whether predictions can be extrapolated to trees experiencing novel conditions. 
Overall, we conclude that regularized regression methods can complement classical methods in the investigation of tree growth drivers and represent a valuable tool for advancing this field toward prediction. 
  2. Field observations form the basis of many scientific studies, especially in ecological and social sciences. Despite efforts to conduct such surveys in a standardized way, observations can be prone to systematic measurement errors. The removal of systematic variability introduced by the observation process, if possible, can greatly increase the value of this data. Existing non-parametric techniques for correcting such errors assume linear additive noise models. This leads to biased estimates when applied to generalized linear models (GLM). We present an approach based on residual functions to address this limitation. We then demonstrate its effectiveness on synthetic data and show it reduces systematic detection variability in moth surveys. 
  3. When forest conditions are mapped from empirical models, uncertainty in remotely sensed predictor variables can cause the systematic overestimation of low values, underestimation of high values, and suppression of variability. This regression dilution or attenuation bias is a well-recognized problem in remote sensing applications, with few practical solutions. Attenuation is of particular concern for applications that are responsive to prediction patterns at the high end of observed data ranges, where systematic error is typically greatest. We addressed attenuation bias in models of tree species relative abundance (percent of total aboveground live biomass) based on multitemporal Landsat and topoclimatic predictor data. We developed a multi-objective support vector regression (MOSVR) algorithm that simultaneously minimizes total prediction error and systematic error caused by attenuation bias. Applied to 13 tree species in the Acadian Forest Region of the northeastern U.S., MOSVR performed well compared to other prediction methods including single-objective SVR (SOSVR) minimizing total error, Random Forest (RF), gradient nearest neighbor (GNN), and Random Forest nearest neighbor (RFNN) algorithms. SOSVR and RF yielded the lowest total prediction error but produced the greatest systematic error, consistent with strong attenuation bias. Underestimation at high relative abundance caused strong deviations between predicted patterns of species dominance/codominance and those observed at field plots. In contrast, GNN and RFNN produced dominance/codominance patterns that deviated little from observed patterns, but predicted species relative abundance with lower accuracy and substantial systematic error. MOSVR produced the least systematic error for all species with total error often comparable to SOSVR or RF. Predicted patterns of dominance/codominance matched observations well, though not quite as well as GNN or RFNN.
Overall, MOSVR provides an effective machine learning approach to the reduction of systematic prediction error and should be fully generalizable to other remote sensing applications and prediction problems. 
  4. Spatial synchrony is defined by related fluctuations through time in population abundances measured at different locations. The degree of relatedness typically declines with increasing distance between sampling locations. Standard approaches for assessing synchrony assume isotropy in space and uniformity across timescales of analysis, but it is now known that spatial variability and timescale structure in population dynamics are common features. We tested for spatial and timescale structure in the patterns of synchrony of freshwater plankton in Kentucky Lake, U.S.A. We also evaluated whether different mechanisms may drive synchrony and its spatial structure on different timescales. Using wavelet techniques and matrix regression, we analyzed phytoplankton biomass and abundances of seven zooplankton taxa at 16 locations sampled from 1990 to 2015. We found that zooplankton abundances and phytoplankton biomass exhibited synchrony at multiple timescales. Timescale structure in the potential mechanisms of synchrony was revealed primarily through networks of relationships among zooplankton taxa, which differed by timescale. We found substantial interspecific variability in geographic structures of synchrony and their causes: all mechanisms we considered strongly explained geographic structure in synchrony for at least one species, while Euclidean distance between sampling locations was generally less well supported than more mechanistic explanations. Geographic structure in synchrony and its underlying mechanisms also depended on timescale for a minority of the taxa tested. Overall, our results show substantial and complex but interpretable variation in structures of synchrony across three variables: space, timescale, and taxon. It seems likely these aspects of synchrony are important general features of freshwater systems.
  5. This paper describes an implementation of the Combined Hybrid-Parallel Prediction (CHyPP) approach of Wikner et al. (2020) on a low-resolution atmospheric global circulation model (AGCM). This approach combines a physics-based numerical model of a dynamical system (e.g., the atmosphere) with a computationally efficient type of machine learning (ML) called reservoir computing (RC) to construct a hybrid model. This hybrid atmospheric model produces more accurate forecasts of most atmospheric state variables than the host AGCM for the first 7-8 forecast days, and for even longer times for the temperature and humidity near the earth's surface. It also produces more accurate forecasts than a purely ML-based model, or a model that combines linear regression, rather than ML, with the AGCM. The potential of the approach for climate research is demonstrated by a 10-year long hybrid model simulation of the atmospheric general circulation, which shows that the hybrid model can simulate the general circulation with substantially smaller systematic errors and more realistic variability than the host AGCM.