

This content will become publicly available on September 15, 2024

Title: Smoothness-Penalized Deconvolution (SPeD) of a Density Estimate
This paper addresses the deconvolution problem of estimating a square-integrable probability density from observations contaminated with additive measurement errors having a known density. The estimator starts from a density estimate of the contaminated observations and minimizes a reconstruction error penalized by an integrated squared m-th derivative. Theory for deconvolution has mainly focused on kernel- and wavelet-based techniques, but other methods, including spline-based techniques and this smoothness-penalized estimator, have been found to outperform kernel methods in simulation studies. This paper fills in some of these gaps by establishing asymptotic guarantees for the smoothness-penalized approach. Consistency is established in mean integrated squared error, and rates of convergence are derived for Gaussian, Cauchy, and Laplace error densities, attaining some lower bounds already in the literature. The assumptions are weak for most results, and the estimator can be used with a broader class of error densities than the deconvoluting kernel. Our application estimates the density of the mean cytotoxicity of certain bacterial isolates under random sampling; this mean cytotoxicity can only be measured experimentally with additive error, leading to the deconvolution problem. We also describe a method for approximating the solution by a cubic spline, which reduces the optimization to a quadratic program.
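To make the construction concrete, here is a rough discretized numerical sketch of a smoothness-penalized deconvolution of this flavor, not the paper's cubic-spline quadratic program: the grid, Gaussian error density, penalty order m = 2, penalty weight, and the use of an exact contaminated density in place of a density estimate are all illustrative assumptions.

```python
import numpy as np

# Discretized sketch of smoothness-penalized deconvolution (illustrative;
# not the paper's cubic-spline quadratic program). Grid, Gaussian error
# density, penalty order m = 2, and lambda are assumptions.
x = np.linspace(-5, 5, 201)
h = x[1] - x[0]

def gauss(t, s):
    return np.exp(-t**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

# Convolution operator: (A f)(x_i) ~= sum_j e(x_i - x_j) f(x_j) h
A = gauss(x[:, None] - x[None, :], 0.5) * h

# Target density N(0, 1); the contaminated density g = f * e is N(0, 1.25),
# used here as a stand-in for a density estimate of the observations
f_true = gauss(x, 1.0)
g_tilde = gauss(x, np.sqrt(1.0 + 0.5**2))

# Second-difference matrix approximating the integrated squared 2nd derivative
D = np.diff(np.eye(len(x)), n=2, axis=0) / h**2
lam = 1e-6

# Minimize ||A f - g_tilde||^2 + lam ||D f||^2 via the normal equations
f_hat = np.linalg.solve(A.T @ A + lam * (D.T @ D), A.T @ g_tilde)
```

With exact input the penalized solve recovers the target density closely; in practice `g_tilde` would itself be estimated from data, and the penalty weight would be chosen data-adaptively.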
Award ID(s):
1814840
NSF-PAR ID:
10470132
Author(s) / Creator(s):
;
Publisher / Repository:
Taylor & Francis Online
Date Published:
Journal Name:
Journal of the American Statistical Association
ISSN:
0162-1459
Page Range / eLocation ID:
1 to 25
Subject(s) / Keyword(s):
ill-posed problem, measurement error, density estimation, regularization
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Summary

    We propose a fast penalized spline method for bivariate smoothing. Univariate P-spline smoothers are applied simultaneously along both co-ordinates. The new smoother has a sandwich form, which suggested the name ‘sandwich smoother’ to a referee. The sandwich smoother has a tensor product structure that simplifies asymptotic analysis and allows fast computation. We derive a local central limit theorem for the sandwich smoother, with simple expressions for the asymptotic bias and variance, by showing that the sandwich smoother is asymptotically equivalent to a bivariate kernel regression estimator with a product kernel. As far as we are aware, this is the first central limit theorem for a bivariate spline estimator of any type. Our simulation study shows that the sandwich smoother is orders of magnitude faster to compute than other bivariate spline smoothers, even when the latter are computed with a fast generalized linear array model algorithm, and comparable with them in terms of mean integrated squared error. We extend the sandwich smoother to array data of higher dimensions, where a generalized linear array model algorithm improves its computational speed. One important application of the sandwich smoother is estimating covariance functions in functional data analysis. In this application, our numerical results show that the sandwich smoother is orders of magnitude faster than local linear regression. The speed of the sandwich formula is important because functional data sets are becoming quite large.
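The sandwich form can be illustrated with a minimal numpy sketch: smooth the rows and columns of a data matrix in one step as S1 · Y · S2'. Whittaker-type smoother matrices (identity basis, second-difference penalty) stand in for the univariate P-spline smoothers here, and the signal, grid sizes, and penalty weights are all illustrative.

```python
import numpy as np

def smoother(n, lam):
    # Hat matrix of a Whittaker-type smoother (identity basis,
    # second-difference penalty) -- a simplified stand-in for a
    # univariate P-spline smoother matrix.
    D = np.diff(np.eye(n), n=2, axis=0)
    return np.linalg.inv(np.eye(n) + lam * D.T @ D)

n1, n2 = 40, 50
S1, S2 = smoother(n1, 10.0), smoother(n2, 10.0)

rng = np.random.default_rng(0)
t1 = np.linspace(0, 1, n1)[:, None]
t2 = np.linspace(0, 1, n2)[None, :]
F = np.sin(2 * np.pi * t1) * np.cos(2 * np.pi * t2)   # smooth truth
Y = F + 0.3 * rng.standard_normal((n1, n2))           # noisy observations

Y_hat = S1 @ Y @ S2.T   # the "sandwich": both coordinates smoothed at once
```

The sandwich requires only two small matrix products per fit, which is the source of the speed advantage the abstract describes.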

     
  2. The present paper studies density deconvolution in the presence of small Berkson errors, in particular when the variances of the errors tend to zero as the sample size grows. It is known that when Berkson errors are present, in some cases the unknown density can be estimated by simple averaging without using kernels. However, this may not be the case when the Berkson errors are asymptotically small. By treating the former case as a kernel estimator with zero bandwidth, we obtain optimal expressions for the bandwidth. We show that the density of the Berkson errors acts as a regularizer, so that the kernel estimator is unnecessary when the variance of the Berkson errors lies above a threshold that depends on the shapes of the densities in the model and the number of observations.
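The kernel-free "simple averaging" idea can be sketched directly: in a Berkson model X = Y + U with known error density f_U, the density of X is the convolution of the law of Y with f_U, so averaging f_U(x − Y_i) over the observed Y_i estimates it with no kernel or bandwidth. The specific densities and sample size below are illustrative assumptions.

```python
import numpy as np

# Kernel-free averaging estimator in a Berkson model X = Y + U:
# f_X(x) = E[f_U(x - Y)], estimated by the sample average of f_U(x - Y_i).
rng = np.random.default_rng(1)
n, sigma_u = 5000, 0.5
Y = rng.standard_normal(n)           # observed values, here N(0, 1)

def f_U(t):                          # known Berkson error density N(0, sigma_u^2)
    return np.exp(-t**2 / (2 * sigma_u**2)) / (sigma_u * np.sqrt(2 * np.pi))

x = np.linspace(-4, 4, 81)
f_X_hat = f_U(x[None, :] - Y[:, None]).mean(axis=0)

# For this setup, X = Y + U is exactly N(0, 1 + sigma_u^2)
s = np.sqrt(1 + sigma_u**2)
f_X_true = np.exp(-x**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
```

When sigma_u shrinks toward zero, f_U approaches a spike and stops regularizing, which is where the paper's bandwidth analysis takes over.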
  3. Summary

    The lasso has been studied extensively as a tool for estimating the coefficient vector in the high-dimensional linear model; however, considerably less is known about estimating the error variance in this context. In this paper, we propose the natural lasso estimator for the error variance, which maximizes a penalized likelihood objective. A key aspect of the natural lasso is that the likelihood is expressed in terms of the natural parameterization of the multi-parameter exponential family of a Gaussian with unknown mean and variance. The result is a remarkably simple estimator of the error variance with provably good performance in terms of mean squared error. These theoretical results require no assumptions on the design matrix or the true regression coefficients. We also propose a companion estimator, called the organic lasso, which theoretically does not require tuning of the regularization parameter. Both estimators perform well empirically compared with pre-existing methods, especially in settings where successful recovery of the true support of the coefficient vector is hard. Finally, we show that existing methods can do well under fewer assumptions than previously known, thus providing a fuller story about the problem of estimating the error variance in high-dimensional linear models.
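The natural lasso variance estimate is often described as the optimal value of a lasso objective, (1/n)·||y − Xb||² + 2·lam·||b||₁. The numpy-only sketch below should be read as an approximation of that description rather than the paper's exact estimator: the design, coefficients, choice of lam, and the hand-rolled coordinate descent are all illustrative assumptions.

```python
import numpy as np

# Sketch: fit the lasso by coordinate descent, then take the attained
# objective value as a natural-lasso-style error-variance estimate.
# All data-generating choices and lam are illustrative.
rng = np.random.default_rng(2)
n, p, sigma = 200, 50, 1.0
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [1.0, -1.0, 0.5]                   # sparse true coefficients
y = X @ beta + sigma * rng.standard_normal(n)

lam = np.sqrt(np.log(p) / n)                  # a common theoretical scaling
b = np.zeros(p)
col_sq = (X ** 2).sum(axis=0)
for _ in range(200):                          # coordinate descent for the lasso
    for j in range(p):
        z = X[:, j] @ (y - X @ b) + col_sq[j] * b[j]
        b[j] = np.sign(z) * max(abs(z) - n * lam, 0.0) / col_sq[j]

# Plug the minimizer back into the objective to estimate the error variance
resid = y - X @ b
sigma2_hat = resid @ resid / n + 2 * lam * np.abs(b).sum()
```

The appeal is that no separate tuning step for the variance is needed: the same fit that selects coefficients yields the variance estimate.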
  4. Abstract

    We consider estimation of the density of a multivariate response that is not observed directly but only through measurements contaminated by additive error. Our focus is on the realistic sampling case of bivariate panel data (repeated contaminated bivariate measurements on each sample unit) with an unknown error distribution. Several factors can affect the performance of kernel deconvolution density estimators, including the choice of the kernel and the approach used to estimate the unknown error distribution. Because the choice of kernel function is critically important, the class of flat-top kernels can have advantages over more commonly implemented alternatives. We describe different approaches for density estimation with multivariate panel responses and investigate their performance through simulation. We examine competing kernel functions and describe a flat-top kernel that has not previously been used in deconvolution problems. Moreover, we study several nonparametric options for estimating the unknown error distribution. Finally, we provide guidelines for the numerical implementation of kernel deconvolution in higher sampling dimensions.
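A one-dimensional sketch shows why flat-top kernels fit deconvolution naturally: the kernel is defined through its Fourier transform, equal to 1 near the origin so that low frequencies pass undistorted. The trapezoidal flat-top used here, the known Gaussian error density, the bandwidth, and all grids are illustrative assumptions and not the paper's specific kernel or multivariate panel setting.

```python
import numpy as np

# Deconvoluting-kernel sketch with a simple trapezoidal flat-top kernel:
# its Fourier transform is 1 on |w| <= 1 and decays linearly to 0 by |w| = 2.
rng = np.random.default_rng(3)
n, s = 2000, 0.3
X = rng.standard_normal(n)              # latent variable, N(0, 1)
Y = X + s * rng.standard_normal(n)      # contaminated observations

def flat_top(w):                        # Fourier transform of the kernel
    return np.clip(2.0 - np.abs(w), 0.0, 1.0)

h = 0.4                                 # bandwidth (illustrative)
w = np.linspace(-2 / h, 2 / h, 801)
dw = w[1] - w[0]
phi_Y = np.exp(1j * w[None, :] * Y[:, None]).mean(axis=0)  # empirical cf of Y
phi_eps = np.exp(-0.5 * (s * w) ** 2)   # cf of the known N(0, s^2) error

# Fourier inversion of flat_top(h w) * phi_Y(w) / phi_eps(w)
x = np.linspace(-3, 3, 61)
integrand = flat_top(h * w) * phi_Y / phi_eps
f_hat = (np.exp(-1j * np.outer(x, w)) * integrand).real.sum(axis=1) * dw / (2 * np.pi)
```

Dividing the empirical characteristic function by the error characteristic function undoes the contamination, and the flat-top factor caps the frequencies where that division would amplify noise.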

     
  5. Abstract

    Phenology is one of the most immediate responses to global climate change, but data limitations have made examining phenology patterns across greater taxonomic, spatial and temporal scales challenging. One significant opportunity is leveraging rapidly increasing data resources from digitized museum specimens and community science platforms, but this assumes reliable statistical methods are available to estimate phenology using presence‐only data. Estimating the onset or offset of key events is especially difficult with incidental data, as lower data densities occur towards the tails of an abundance distribution.

    The Weibull distribution has been recognized as an appropriate distribution for estimating phenology from presence-only data, but Weibull-informed estimators are only available for onset and offset. We describe the mathematical framework for a new Weibull-parameterized estimator of phenology appropriate for any percentile of a distribution and make it available in an R package, phenesse. We use simulations and empirical data on open flower timing and the first arrival of monarch butterflies to quantify the accuracy of our estimator and other commonly used phenological estimators for 10 phenological metrics: onset, mean and offset dates, as well as the 1st, 5th, 10th, 50th, 90th, 95th and 99th percentile dates. Root mean squared errors and mean bias of the phenological estimators were calculated for different patterns of abundance and observation processes.

    Results show a general pattern of decay in performance of estimates when moving from mean estimates towards the tails of the seasonal abundance curve, suggesting that onset and offset continue to be the most difficult phenometrics to estimate. However, with simple phenologies and enough observations, our newly developed estimator can provide useful onset and offset estimates. This is especially true for the start of the season, when incidental observations may be more common.

    Our simulation demonstrates the potential of generating accurate phenological estimates from presence‐only data and guides the best use of estimators. The estimator that we developed, phenesse, is the least biased and has the lowest estimation error for onset estimates under most simulated and empirical conditions examined, improving the robustness of these estimates for phenological research.
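The core idea of a Weibull-based percentile estimate can be sketched in numpy: fit a Weibull to presence-only dates by maximum likelihood, then read the desired percentile off the fitted distribution. This is a simplified analogue, not the phenesse algorithm (which additionally corrects for sampling effects); the two-parameter fit, the fixed-point iteration, and the simulated season (shape 2.5, scale 40 days) are illustrative assumptions.

```python
import numpy as np

# Simplified Weibull-percentile sketch for presence-only dates
# (days since an arbitrary season start); data are simulated.
rng = np.random.default_rng(4)
dates = 40.0 * rng.weibull(2.5, size=500)

# Two-parameter Weibull MLE: damped fixed-point iteration for the shape k
k, lx = 1.0, np.log(dates)
for _ in range(200):
    xk = dates ** k
    k_new = 1.0 / ((xk * lx).sum() / xk.sum() - lx.mean())
    k = 0.5 * (k + k_new)               # damping for stability
scale = (dates ** k).mean() ** (1.0 / k)

# Fitted Weibull quantile: q(p) = scale * (-log(1 - p)) ** (1 / k)
onset_5th = scale * (-np.log(0.95)) ** (1.0 / k)
```

Unlike the sample minimum or a raw empirical percentile, the fitted-distribution quantile can extrapolate slightly beyond the earliest observation, which is why distribution-based onset estimates tend to be less biased with sparse incidental data.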

     