Title: Sibling Regression for Generalized Linear Models
Field observations form the basis of many scientific studies, especially in ecological and social sciences. Despite efforts to conduct such surveys in a standardized way, observations can be prone to systematic measurement errors. The removal of systematic variability introduced by the observation process, if possible, can greatly increase the value of this data. Existing non-parametric techniques for correcting such errors assume linear additive noise models. This leads to biased estimates when applied to generalized linear models (GLM). We present an approach based on residual functions to address this limitation. We then demonstrate its effectiveness on synthetic data and show it reduces systematic detection variability in moth surveys.
Award ID(s):
1749854
PAR ID:
10359644
Author(s) / Creator(s):
;
Date Published:
Journal Name:
Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Many ecological studies and conservation policies are based on field observations of species, which can be affected by systematic variability introduced by the observation process. A recently introduced causal modeling technique called 'half-sibling regression' can detect and correct for systematic errors in measurements of multiple independent random variables. However, it will remove intrinsic variability if the variables are dependent, and therefore does not apply to many situations, including modeling of species counts that are controlled by common causes. We present a technique called 'three-quarter sibling regression' to partially overcome this limitation. It can filter the effect of systematic noise when the latent variables have observed common causes. We provide theoretical justification of this approach, demonstrate its effectiveness on synthetic data, and show that it reduces systematic detection variability due to moon brightness in moth surveys.
  2. Upcoming cosmic shear analyses will precisely measure the cosmic matter distribution at low redshifts. At these redshifts, the matter distribution is affected by galaxy formation physics, primarily baryonic feedback from star formation and active galactic nuclei. Employing measurements from the Magneticum and IllustrisTNG simulations and a dark matter + baryon (DMB) halo model, this paper demonstrates that Sunyaev-Zel'dovich (SZ) effect observations of galaxy clusters, whose masses have been calibrated using weak gravitational lensing, can constrain the baryonic impact on cosmic shear with statistical and systematic errors subdominant to the measurement errors of DES-Y3 and LSST-Y1, with systematic errors on S8 and Ωm reaching 10% and 50% of the statistical errors, respectively. For LSST-Y6 and Roman surveys, these systematic errors increase to 150% and 100% of the statistical errors, indicating the necessity for further model developments for future surveys. We further dissect the contributions from different scales and halos with different masses to cosmic shear, highlighting the dominant role of SZ clusters at scales critical for cosmic shear analyses. These findings suggest a promising avenue for future joint analyses of Cosmic Microwave Background (CMB) and lensing surveys.
  3. Citizen science biodiversity data present great opportunities for ecology and conservation across vast spatial and temporal scales. However, the opportunistic nature of these data lacks the sampling structure required by modeling methodologies that address a pervasive challenge in ecological data collection: imperfect detection, i.e., the likelihood of under-observing species on field surveys. Occupancy modeling is an example of an approach that accounts for imperfect detection by explicitly modeling the observation process separately from the biological process of habitat selection. This produces species distribution models that speak to the pattern of the species on a landscape after accounting for imperfect detection in the data, rather than the pattern of species observations corrupted by errors. To achieve this benefit, occupancy models require multiple surveys of a site across which the site's status (i.e., occupied or not) is assumed constant. Since citizen science data are not collected under the required repeated-visit protocol, observations may be grouped into sites post hoc. Existing approaches for constructing sites discard some observations and/or consider only geographic distance and not environmental similarity. In this study, we compare ten approaches for site construction in terms of their impact on downstream species distribution models for 31 bird species in Oregon, using observations recorded in the eBird database. We find that occupancy models built on sites constructed by spatial clustering algorithms perform better than existing alternatives. 
  4. We investigate the potential of using a sample of very high-redshift (2 ≲ z ≲ 6) (VHZ) Type Ia supernovae (SNe Ia) attainable by JWST on constraining cosmological parameters. At such high redshifts, the age of the universe is young enough that the VHZ SN Ia sample comprises the very first SNe Ia of the universe, with progenitors among the very first generation of low-mass stars that the universe has made. We show that the VHZ SNe Ia can be used to disentangle systematic effects due to the luminosity distance evolution with redshifts intrinsic to SN Ia standardization. Assuming that the systematic evolution can be described by a linear or logarithmic formula, we found that the coefficients of this dependence can be determined accurately and decoupled from cosmological models. Systematic evolution as large as 0.15 mag and 0.45 mag out to z = 5 can be robustly separated from popular cosmological models for linear and logarithmic evolution, respectively. The VHZ SNe Ia will lay the foundation for quantifying the systematic redshift evolution of SN Ia luminosity distance scales. When combined with SN Ia surveys at comparatively lower redshifts, the VHZ SNe Ia allow for the precise measurement of the history of the expansion of the universe from z ∼ 0 to the epoch approaching reionization.
  5. In sequential estimation methods often used in oceanic and general climate calculations of the state and of forecasts, observations act mathematically and statistically as source or sink terms in conservation equations for heat, salt, mass, and momentum. These artificial terms obscure the inference of the system's variability or secular changes. Furthermore, for the purposes of calculating changes in important functions of state variables such as total mass and energy or volumetric current transports, results of both filter and smoother-based estimates are sensitive to misrepresentation of a large variety of parameters, including initial conditions, prior uncertainty covariances, and systematic and random errors in observations. Here, toy models of a coupled mass–spring oscillator system and of a barotropic Rossby wave system are used to demonstrate many of the issues that arise from such misrepresentations. Results from Kalman filter estimates and those from finite interval smoothing are analyzed. In the filter (and prediction) problem, entry of data leads to violation of conservation and other invariant rules. A finite interval smoothing method restores the conservation rules, but uncertainties in all such estimation results remain. Convincing trend and other time-dependent determinations in “reanalysis-like” estimates require a full understanding of models, observations, and underlying error structures. Application of smoother-type methods that are designed for optimal reconstruction purposes alleviates some of the issues.