Title: Deconvolution Methods for Non-Parametric Inference in Two-Level Mixed Models
Summary

We develop a general non-parametric approach to the analysis of clustered data via random effects. Assuming only that the link function is known, the regression functions and the distributions of both cluster means and observation errors are treated non-parametrically. Our argument proceeds by viewing the observation error at the cluster-mean level as though it were a measurement error in an errors-in-variables problem, and using a deconvolution argument to access the distribution of the cluster mean. A Fourier deconvolution approach could be used if the distribution of the errors-in-variables were known. In practice it is unknown, of course, but it can be estimated from repeated measurements, and in this way deconvolution can be achieved in an approximate sense. This argument might seem to imply that large numbers of replicates are necessary for each cluster-mean distribution, but that is not so; we avoid this requirement by incorporating statistical smoothing over values of nearby explanatory variables. Empirical rules are developed for the choice of smoothing parameter. Numerical simulations, and an application to real data, demonstrate the small-sample performance of this methodology. We also develop theory establishing statistical consistency.
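For intuition, the Python sketch below shows the classical Fourier step the summary alludes to: estimating the density of a latent variable X from contaminated observations W = X + U by dividing the empirical characteristic function of W by that of U and inverting. It assumes a known Gaussian error law for U, whereas the paper estimates the error distribution from replicate measurements and adds smoothing over nearby covariate values; the function and its band-limited damping kernel are illustrative choices, not the authors' implementation.

```python
import numpy as np

def deconvolve_density(w, x_grid, sigma_u, h):
    """Deconvolution density estimate of f_X from noisy data W = X + U.

    Assumes U ~ N(0, sigma_u^2) with sigma_u known -- an illustrative
    simplification; the paper estimates the error distribution from
    replicate measurements. The bandwidth h truncates the inversion at
    frequencies |t| <= 1/h (a band-limited damping kernel).
    """
    w = np.asarray(w, dtype=float)
    t = np.linspace(-1.0 / h, 1.0 / h, 512)            # frequency grid
    dt = t[1] - t[0]
    ecf = np.exp(1j * np.outer(t, w)).mean(axis=1)     # empirical CF of W
    phi_u = np.exp(-0.5 * (sigma_u * t) ** 2)          # CF of the error U
    phi_x = ecf / phi_u                                # deconvolved CF of X
    # Fourier inversion: f_X(x) = (1 / 2pi) * integral e^{-itx} phi_X(t) dt
    fx = (np.exp(-1j * np.outer(x_grid, t)) @ phi_x).real * dt / (2 * np.pi)
    return np.clip(fx, 0.0, None)                      # clip small negatives
```

With simulated data w = x + np.random.normal(0, sigma_u, size=n), the returned vector approximates the density of x over x_grid; the choice of h plays the role of the smoothing parameter discussed in the summary.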

 
NSF-PAR ID: 10404047
Author(s) / Creator(s):
Publisher / Repository: Oxford University Press
Date Published:
Journal Name: Journal of the Royal Statistical Society Series B: Statistical Methodology
Volume: 71
Issue: 3
ISSN: 1369-7412
Page Range / eLocation ID: p. 703-718
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract

    Phenology is one of the most immediate responses to global climate change, but data limitations have made examining phenology patterns across greater taxonomic, spatial and temporal scales challenging. One significant opportunity is leveraging rapidly increasing data resources from digitized museum specimens and community science platforms, but this assumes reliable statistical methods are available to estimate phenology using presence‐only data. Estimating the onset or offset of key events is especially difficult with incidental data, as lower data densities occur towards the tails of an abundance distribution.

    The Weibull distribution has been recognized as an appropriate distribution to estimate phenology based on presence-only data, but Weibull-informed estimators are only available for onset and offset. We describe the mathematical framework for a new Weibull-parameterized estimator of phenology appropriate for any percentile of a distribution and make it available in an R package, phenesse (a simplified code sketch follows this abstract). We use simulations and empirical data on open flower timing and first arrival of monarch butterflies to quantify the accuracy of our estimator and other commonly used phenological estimators for 10 phenological metrics: onset, mean and offset dates, as well as the 1st, 5th, 10th, 50th, 90th, 95th and 99th percentile dates. Root mean squared errors and mean bias of the phenological estimators were calculated for different patterns of abundance and observation processes.

    Results show a general pattern of decay in performance of estimates when moving from mean estimates towards the tails of the seasonal abundance curve, suggesting that onset and offset continue to be the most difficult phenometrics to estimate. However, with simple phenologies and enough observations, our newly developed estimator can provide useful onset and offset estimates. This is especially true for the start of the season, when incidental observations may be more common.

    Our simulation demonstrates the potential of generating accurate phenological estimates from presence‐only data and guides the best use of estimators. The estimator that we developed, phenesse, is the least biased and has the lowest estimation error for onset estimates under most simulated and empirical conditions examined, improving the robustness of these estimates for phenological research.
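    The simplified sketch referenced above: fit a Weibull distribution to presence-only day-of-year records and invert its CDF at the desired percentile. phenesse itself is an R package and additionally applies a Weibull-informed bias correction for extreme percentiles, which this Python analogue omits.

```python
import numpy as np
from scipy import stats

def weibull_percentile(doy, q):
    """Estimate the q-th percentile of a phenophase from presence-only
    day-of-year records by fitting a Weibull distribution and inverting
    its CDF. A simplified stand-in for the idea behind phenesse; the
    bias correction phenesse applies for extreme percentiles is omitted.
    """
    c, loc, scale = stats.weibull_min.fit(np.asarray(doy, dtype=float))
    return stats.weibull_min.ppf(q, c, loc=loc, scale=scale)

# e.g. rough onset and offset estimates from observation dates:
# onset = weibull_percentile(doy, 0.01)
# offset = weibull_percentile(doy, 0.99)
```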

     
  2. Abstract

    The use of hydro‐meteorological forecasts in water resources management holds great promise as a soft pathway to improve system performance. Methods for generating synthetic forecasts of hydro‐meteorological variables are crucial for robust validation of forecast use, as numerical weather prediction hindcasts are only available for a relatively short period (10–40 years) that is insufficient for assessing risk related to forecast‐informed decision‐making during extreme events. We develop a generalized error model for synthetic forecast generation that is applicable to a range of forecasted variables used in water resources management. The approach samples from the distribution of forecast errors over the available hindcast period and adds them to long records of observed data to generate synthetic forecasts. The approach utilizes the Skew Generalized Error Distribution (SGED) to model marginal distributions of forecast errors that can exhibit heteroskedastic, auto‐correlated, and non‐Gaussian behavior. An empirical copula is used to capture covariance between variables, forecast lead times, and across space. We demonstrate the method for medium‐range forecasts across Northern California in two case studies for (1) streamflow and (2) temperature and precipitation, which are based on hindcasts from the NOAA/NWS Hydrologic Ensemble Forecast System (HEFS) and the NCEP GEFS/R V2 climate model, respectively. The case studies highlight the flexibility of the model and its ability to emulate space‐time structures in forecasts at scales critical for water resources management. The proposed method is generalizable to other locations and computationally efficient, enabling fast generation of long synthetic forecast ensembles that are appropriate for risk analysis.
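    As a concrete, heavily simplified illustration of the generation step, the Python sketch below resamples hindcast errors and adds them to a long observed record. It swaps the paper's SGED marginals and empirical copula for direct resampling of whole lead-time error vectors, so it preserves cross-lead dependence but none of the conditional (heteroskedastic, auto-correlated) structure the paper models; all names are illustrative.

```python
import numpy as np

def synthetic_forecasts(obs_long, fcst_hind, obs_hind, n_samples, seed=None):
    """Generate synthetic forecast ensembles by adding resampled
    hindcast errors to a long observed record.

    Crude stand-in for the paper's scheme: whole error vectors are
    resampled across lead times, which is not the SGED-plus-copula
    model described in the abstract.
    """
    rng = np.random.default_rng(seed)
    errors = fcst_hind - obs_hind                      # (n_hind, n_leads)
    idx = rng.integers(0, errors.shape[0],
                       size=(n_samples, obs_long.shape[0]))
    # broadcast observations to (n_samples, n_times, n_leads)
    return obs_long[None, :, None] + errors[idx]
```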

     
  3. Abstract

    Particle filters avoid parametric estimates for Bayesian posterior densities, which alleviates Gaussian assumptions in nonlinear regimes. These methods, however, are more sensitive to sampling errors than Gaussian-based techniques such as ensemble Kalman filters. A recent study by the authors introduced an iterative strategy for particle filters that match posterior moments, in which iterations improve the filter's ability to draw samples from non-Gaussian posterior densities. The iterations follow from a factorization of particle weights, providing a natural framework for combining particle filters with alternative filters to mitigate the impact of sampling errors. The current study introduces a novel approach to forming an adaptive hybrid data assimilation methodology, exploiting the theoretical strengths of nonparametric and parametric filters. At each data assimilation cycle, the iterative particle filter performs a sequence of updates while the prior sample distribution is non-Gaussian; an ensemble Kalman filter then provides the final adjustment when Gaussian distributions for marginal quantities are detected. The method employs the Shapiro–Wilk test, which has outstanding power for detecting departures from normality, to determine when to make the transition between filter algorithms. Experiments using low-dimensional models demonstrate that the approach has significant value, especially for nonhomogeneous observation networks and unknown model process errors. Moreover, hybrid factors are extended to consider marginals of more than one collocated variable using a test for multivariate normality. Findings from this study motivate the use of the proposed method for geophysical problems characterized by diverse observation networks and various dynamic instabilities, such as numerical weather prediction models.

    Significance Statement

    Data assimilation statistically processes observation errors and model forecast errors to provide optimal initial conditions for the forecast, playing a critical role in numerical weather forecasting. The ensemble Kalman filter, which has been widely adopted and developed at many operational centers, assumes Gaussianity of the prior distribution and solves a linear system of equations, leading to bias in strongly nonlinear regimes. Particle filters, on the other hand, avoid many of those assumptions but are sensitive to sampling errors and are computationally expensive. We propose an adaptive hybrid strategy that combines the advantages and minimizes the disadvantages of the two methods. The hybrid particle filter–ensemble Kalman filter uses the Shapiro–Wilk test to detect the Gaussianity of the ensemble members and determine the timing of the transition between the filter updates. Demonstrations in this study show that the proposed method is advantageous when observations are heterogeneous and when the model has an unknown bias. Furthermore, by extending the statistical hypothesis test to a test for multivariate normality, we consider marginals of more than one collocated variable. These results encourage further testing on real geophysical problems characterized by various dynamic instabilities, such as operational numerical weather prediction models.
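    A schematic of the adaptive hand-off might look like the Python sketch below, where pf_update and enkf_update are placeholder callables standing in for the two filter updates (not the authors' code) and the Shapiro–Wilk test gates the transition:

```python
from scipy import stats

def hybrid_update(ensemble, obs, pf_update, enkf_update,
                  alpha=0.05, max_iters=10):
    """One adaptive hybrid assimilation cycle: iterate a particle-filter
    update while the prior ensemble looks non-Gaussian, then hand off to
    an ensemble Kalman filter once Gaussianity is detected.

    pf_update and enkf_update are user-supplied placeholders; the
    Shapiro-Wilk test is applied to each marginal of the ensemble,
    which has shape (n_members, n_state_vars).
    """
    def looks_gaussian(ens):
        # accept Gaussianity only if every marginal passes Shapiro-Wilk
        return all(stats.shapiro(ens[:, j]).pvalue >= alpha
                   for j in range(ens.shape[1]))

    for _ in range(max_iters):                 # guard against non-convergence
        if looks_gaussian(ensemble):
            break
        ensemble = pf_update(ensemble, obs)    # one particle-filter iteration
    return enkf_update(ensemble, obs)          # final Gaussian adjustment
```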
  4. Abstract

    Gridded monthly rainfall estimates can be used for a number of research applications, including hydrologic modeling and weather forecasting. Automated interpolation algorithms, such as the "autoKrige" function in R, can produce gridded rainfall estimates that validate well but produce unrealistic spatial patterns. In this work, an optimized geostatistical kriging approach is used to interpolate relative rainfall anomalies, which are then combined with long-term means to develop the gridded estimates. The optimization consists of the following: 1) determining the most appropriate offset (constant) to use when log-transforming data; 2) eliminating poor quality data prior to interpolation; 3) detecting erroneous maps using a machine learning algorithm; and 4) selecting the most appropriate parameterization scheme for fitting the model used in the interpolation. Results of this effort include a 30-yr (1990–2019), high-resolution (250-m) gridded monthly rainfall time series for the state of Hawai‘i. Leave-one-out cross validation (LOOCV) is performed using an extensive network of 622 observation stations. LOOCV results are in good agreement with observations (R² = 0.78; MAE = 55 mm month⁻¹; 1.4%); however, predictions can underestimate high rainfall observations (bias = 34 mm month⁻¹; −1%) due to a well-known smoothing effect that occurs with kriging. This research highlights the fact that validation statistics should not be the sole source of error assessment and that default parameterizations for automated interpolation may need to be modified to produce realistic gridded rainfall surfaces. Data products can be accessed through the Hawai‘i Climate Data Portal (HCDP; http://www.hawaii.edu/climate-data-portal). A sketch of the log-anomaly transform appears after the significance statement below.

    Significance Statement

    A new method is developed to map rainfall in Hawai‘i using an optimized geostatistical kriging approach. A machine learning technique is used to detect erroneous rainfall maps and several conditions are implemented to select the optimal parameterization scheme for fitting the model used in the kriging interpolation. A key finding is that optimization of the interpolation approach is necessary because maps may validate well but have unrealistic spatial patterns. This approach demonstrates how, with a moderate amount of data, a low-level machine learning algorithm can be trained to evaluate and classify an unrealistic map output.
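    As a small illustration of the anomaly step described in this abstract, the Python sketch below pairs a log-transform with offset and its inverse for recombining interpolated anomalies with long-term means. The abstract optimizes the offset but does not spell out the transform, so this particular form is an assumption for illustration only.

```python
import numpy as np

def to_log_anomaly(rain, mean, c):
    """Log-transformed relative rainfall anomaly with offset c.
    Assumed form: log((rain + c) / (mean + c)); the abstract does not
    give the exact transform it optimizes."""
    return np.log((rain + c) / (mean + c))

def from_log_anomaly(z, mean, c):
    """Invert the transform: combine an interpolated anomaly surface z
    with the long-term mean grid to recover rainfall estimates."""
    return (mean + c) * np.exp(z) - c
```

In this reading, station anomalies from to_log_anomaly would be kriged to the 250-m grid and then passed through from_log_anomaly with the gridded long-term means.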

     