

Title: Robust and efficient mean estimation: an approach based on the properties of self-normalized sums
Let X be a random variable with unknown mean and finite variance. We present a new estimator of the mean of X that is robust to the possible presence of outliers in the sample, provides tight sub-Gaussian deviation guarantees without any additional assumptions on the shape or tails of the distribution, and moreover is asymptotically efficient. This is the first estimator that provably combines all of these qualities. Our construction is inspired by the robustness properties of self-normalized sums. Finally, the theoretical findings are supplemented by numerical simulations highlighting the strong performance of the proposed estimator in comparison with previously known techniques.
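The paper's own estimator is not reproduced here, but the classical median-of-means construction gives a minimal, self-contained illustration of a mean estimate with sub-Gaussian deviation guarantees under only a finite-variance assumption (the block count and the simulated data are illustrative choices, not the paper's tuning):

```python
import numpy as np

def median_of_means(x, num_blocks):
    """Classical median-of-means: shuffle the sample, split it into
    near-equal blocks, average within each block, and return the median
    of the block means. Robust to a small number of gross outliers."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(0)
    x = rng.permutation(x)                  # break any ordering in the data
    blocks = np.array_split(x, num_blocks)  # near-equal-size blocks
    return float(np.median([b.mean() for b in blocks]))

# A standard normal sample contaminated with a few gross outliers.
rng = np.random.default_rng(1)
sample = np.concatenate([rng.normal(0.0, 1.0, 1000), [1e6, -1e6, 1e6]])
print(median_of_means(sample, num_blocks=20))  # close to 0, unlike the raw mean
```

The outliers can corrupt at most a few blocks, so the median over block means remains close to the true mean even though the raw sample mean is pulled far away.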
Award ID(s):
1712956 1908905
NSF-PAR ID:
10204545
Author(s) / Creator(s):
;
Date Published:
Journal Name:
ArXiv.org
ISSN:
2331-8422
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We consider the statistical connection between the quantized representation of a high dimensional signal X using a random spherical code and the observation of X under an additive white Gaussian noise (AWGN). We show that given X, the conditional Wasserstein distance between its bitrate-R quantized version and its observation under AWGN of signal-to-noise ratio 2^{2R - 1} is sub-linear in the problem dimension. We then utilize this fact to connect the mean squared error (MSE) attained by an estimator based on an AWGN-corrupted version of X to the MSE attained by the same estimator when fed with its bitrate-R quantized version. 
  2. ABSTRACT

    Observations of the cosmic 21-cm power spectrum (PS) are starting to enable precision Bayesian inference of galaxy properties and physical cosmology, during the first billion years of our Universe. Here we investigate the impact of common approximations about the likelihood used in such inferences, including: (i) assuming a Gaussian functional form; (ii) estimating the mean from a single realization; and (iii) estimating the (co)variance at a single point in parameter space. We compare ‘classical’ inference that uses an explicit likelihood with simulation-based inference (SBI) that estimates the likelihood from a training set. Our forward models include: (i) realizations of the cosmic 21-cm signal computed with 21cmFAST by varying ultraviolet (UV) and X-ray galaxy parameters together with the initial conditions; (ii) realizations of the telescope noise corresponding to a $1000 \, \mathrm{h}$ integration with the low-frequency component of the Square Kilometre Array (SKA1-Low); and (iii) the excision of Fourier modes corresponding to a foreground-dominated horizon ‘wedge’. We find that the 1D PS likelihood is well described by a Gaussian accounting for covariances between wave modes and redshift bins (higher order correlations are small). However, common approaches of estimating the forward-modelled mean and (co)variance from a random realization or at a single point in parameter space result in biased and overconstrained posteriors. Our best results come from using SBI to fit a non-Gaussian likelihood with a Gaussian mixture neural density estimator. Such SBI can be performed with up to an order of magnitude fewer simulations than classical, explicit likelihood inference. Thus SBI provides accurate posteriors at a comparably low computational cost.

     
  3. Abstract

    Phenology is one of the most immediate responses to global climate change, but data limitations have made examining phenology patterns across greater taxonomic, spatial and temporal scales challenging. One significant opportunity is leveraging rapidly increasing data resources from digitized museum specimens and community science platforms, but this assumes reliable statistical methods are available to estimate phenology using presence‐only data. Estimating the onset or offset of key events is especially difficult with incidental data, as lower data densities occur towards the tails of an abundance distribution.

The Weibull distribution has been recognized as an appropriate distribution to estimate phenology based on presence‐only data, but Weibull‐informed estimators are only available for onset and offset. We describe the mathematical framework for a new Weibull‐parameterized estimator of phenology appropriate for any percentile of a distribution and make it available in an R package, phenesse. We use simulations and empirical data on open flower timing and first arrival of monarch butterflies to quantify the accuracy of our estimator and other commonly used phenological estimators for 10 phenological metrics: onset, mean and offset dates, as well as the 1st, 5th, 10th, 50th, 90th, 95th and 99th percentile dates. Root mean squared errors and mean bias of the phenological estimators were calculated for different patterns of abundance and observation processes.

    Results show a general pattern of decay in performance of estimates when moving from mean estimates towards the tails of the seasonal abundance curve, suggesting that onset and offset continue to be the most difficult phenometrics to estimate. However, with simple phenologies and enough observations, our newly developed estimator can provide useful onset and offset estimates. This is especially true for the start of the season, when incidental observations may be more common.

    Our simulation demonstrates the potential of generating accurate phenological estimates from presence‐only data and guides the best use of estimators. The estimator that we developed, phenesse, is the least biased and has the lowest estimation error for onset estimates under most simulated and empirical conditions examined, improving the robustness of these estimates for phenological research.
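As a rough sketch of the Weibull-based idea (not phenesse's own bias-corrected procedure), one can fit a Weibull distribution to presence-only observation dates and read off any percentile of the fitted distribution. Pinning the location parameter just below the earliest observation is a pragmatic assumption made here for numerical stability, and the simulated flowering dates are illustrative:

```python
import numpy as np
from scipy import stats

def weibull_percentile(dates, q):
    """Illustrative Weibull-based phenology estimate: fit a Weibull
    distribution to observation dates and return its q-th percentile.
    The location is pinned just below the earliest observation
    (a pragmatic choice, not phenesse's bias-corrected estimator)."""
    d = np.sort(np.asarray(dates, dtype=float))
    c, loc, scale = stats.weibull_min.fit(d, floc=d[0] - 1e-3)
    return float(stats.weibull_min.ppf(q, c, loc=loc, scale=scale))

# Simulated flowering dates (day of year) with onset near day 120.
rng = np.random.default_rng(0)
obs = rng.weibull(3.0, size=200) * 40 + 120
print(weibull_percentile(obs, 0.05))  # approximate onset (5th percentile date)
print(weibull_percentile(obs, 0.95))  # approximate offset (95th percentile date)
```

The same call works for any percentile, which is the main point of the percentile-general estimator described above; the tail percentiles remain the hardest to estimate accurately from sparse incidental observations.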

     
  4. The weighted nearest neighbors (WNN) estimator is a popular, flexible and easy-to-implement nonparametric tool for mean regression estimation. The bagging technique is an elegant way to form WNN estimators whose weights are automatically assigned to the nearest neighbors (Steele, 2009; Biau et al., 2010); we name the resulting estimator the distributional nearest neighbors (DNN) estimator for easy reference. Yet there is a lack of distributional results for such an estimator, limiting its application to statistical inference. Moreover, when the mean regression function has higher-order smoothness, the DNN does not achieve the optimal nonparametric convergence rate, mainly because of its bias. In this work, we provide an in-depth technical analysis of the DNN, based on which we suggest a bias-reduction approach: linearly combining two DNN estimators with different subsampling scales yields the novel two-scale DNN (TDNN) estimator. The two-scale DNN estimator has an equivalent representation as a WNN estimator with weights that admit explicit forms, some of which are negative. We prove that, thanks to the use of negative weights, the two-scale DNN estimator enjoys the optimal nonparametric rate of convergence in estimating the regression function under a fourth-order smoothness condition. We further go beyond estimation and establish that the DNN and two-scale DNN are both asymptotically normal as the subsampling scales and sample size diverge to infinity. For practical implementation, we also provide variance estimators and a distribution estimator using the jackknife and bootstrap techniques for the two-scale DNN. These estimators can be exploited for constructing valid confidence intervals for nonparametric inference of the regression function. The theoretical results and appealing finite-sample performance of the suggested two-scale DNN method are illustrated with several simulation examples and a real data application.
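A minimal sketch of the construction, under the assumption that the bagged 1-NN (DNN) weight on the i-th nearest neighbor with subsampling scale s is C(n-i, s-1)/C(n, s) and that the leading bias scales as s^(-2/d); the two-scale weights below are obtained by Richardson-style cancellation of that term and are illustrative rather than the paper's exact tuning:

```python
import math
import numpy as np

def dnn_estimate(X, y, x0, s):
    """Distributional nearest neighbors (bagged 1-NN) at query x0 with
    subsampling scale s: the i-th nearest neighbor gets weight
    C(n-i, s-1) / C(n, s); the weights sum to one by the hockey-stick
    identity (math.comb returns 0 when n-i < s-1)."""
    n = len(y)
    order = np.argsort(np.linalg.norm(X - x0, axis=1))
    denom = math.comb(n, s)
    w = np.array([math.comb(n - i, s - 1) / denom for i in range(1, n + 1)])
    return float(w @ y[order])

def tdnn_estimate(X, y, x0, s1, s2):
    """Two-scale DNN sketch: combine two DNN estimates with weights that
    sum to one and cancel the assumed leading s^(-2/d) bias term.
    The smaller scale ends up with a negative weight (s1 != s2)."""
    d = X.shape[1]
    b1, b2 = s1 ** (-2 / d), s2 ** (-2 / d)
    w1 = -b2 / (b1 - b2)   # negative weight on the small-scale estimate
    w2 = 1.0 - w1
    return w1 * dnn_estimate(X, y, x0, s1) + w2 * dnn_estimate(X, y, x0, s2)

# Illustrative regression m(x) = x1^2 + x2^2, true value 0.5 at (0.5, 0.5).
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))
y = X[:, 0] ** 2 + X[:, 1] ** 2 + rng.normal(0.0, 0.05, 500)
print(tdnn_estimate(X, y, np.array([0.5, 0.5]), s1=5, s2=50))  # roughly 0.5
```

Allowing one of the two weights to be negative is exactly what lets the combination kill the leading bias term while keeping the weights summing to one, mirroring the role negative weights play in the TDNN theory above.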
  5. Summary

We establish the inferential properties of the mean‐difference estimator for the average treatment effect in randomised experiments where each unit in a population is randomised to one of two treatments and then units within treatment groups are randomly sampled. The properties of this estimator are well understood in the experimental design scenario where first units are randomly sampled and then treatment is randomly assigned, but not for the aforementioned scenario where the sampling and treatment assignment stages are reversed. We find that the inferential properties of the mean‐difference estimator under this experimental design scenario are identical to those under the more common sample‐first‐randomise‐second design. This finding clarifies the use of sampling‐based randomised designs for causal inference, particularly for settings where there is a finite super‐population. Finally, we explore to what extent pre‐treatment measurements can be used to improve upon the mean‐difference estimator for this randomise‐first‐sample‐second design. Unfortunately, we find that pre‐treatment measurements are often unhelpful in improving the precision of average treatment effect estimators under this design, unless a large number of pre‐treatment measurements that are highly associative with the post‐treatment measurements can be obtained. We confirm these results using a simulation study based on a real experiment in nanomaterials.
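A minimal sketch of the mean-difference estimator with the standard conservative (Neyman-style) variance estimate; the simulated outcomes, effect size and group sizes are illustrative assumptions, not the paper's nanomaterials data:

```python
import numpy as np

def mean_difference_ate(y_treat, y_control):
    """Mean-difference estimator of the average treatment effect with the
    conservative Neyman variance estimate s_t^2/n_t + s_c^2/n_c."""
    y_t = np.asarray(y_treat, dtype=float)
    y_c = np.asarray(y_control, dtype=float)
    ate = y_t.mean() - y_c.mean()
    var = y_t.var(ddof=1) / len(y_t) + y_c.var(ddof=1) / len(y_c)
    return float(ate), float(np.sqrt(var))

# Illustrative randomised experiment with a true effect of 2.
rng = np.random.default_rng(0)
control = rng.normal(10.0, 2.0, 200)
treated = rng.normal(12.0, 2.0, 200)
ate, se = mean_difference_ate(treated, control)
print(f"ATE = {ate:.2f}, 95% CI half-width = {1.96 * se:.2f}")
```

The result above is that this same estimator and standard error remain valid when the order of the sampling and randomisation stages is reversed, which is why no change to the formula is needed for the randomise-first-sample-second design.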

     