

Title: U-statistics of growing order and sub-Gaussian mean estimators with sharp constants
This paper addresses the following question: given a sample of i.i.d. random variables with finite variance, can one construct an estimator of the unknown mean that performs nearly as well as if the data were normally distributed? One of the most popular estimators achieving this goal is the median of means. However, it is inefficient in the sense that the constants in the resulting bounds are suboptimal. We show that a permutation-invariant modification of the median of means estimator admits deviation guarantees that are sharp up to a $1+o(1)$ factor if the underlying distribution possesses more than $\frac{3+\sqrt{5}}{2}\approx 2.62$ moments and is absolutely continuous with respect to the Lebesgue measure. This result yields potential improvements for a variety of algorithms that rely on the median of means estimator as a building block. At the core of our argument are new deviation inequalities for U-statistics whose order is allowed to grow with the sample size, a result that could be of independent interest.
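The classical estimator that the paper refines can be sketched in a few lines. Below is a minimal single-partition median-of-means implementation; the function name and the choice to drop the remainder so blocks are equal are illustrative, not from the paper. The permutation-invariant modification studied in the paper instead averages this statistic over orderings of the sample, which turns it into a U-statistic of growing order.

```python
import numpy as np

def median_of_means(x, k):
    """Classical median of means: split x into k equal blocks,
    average each block, and take the median of the block means."""
    x = np.asarray(x, dtype=float)
    n = len(x) - len(x) % k          # drop the remainder so blocks are equal
    blocks = x[:n].reshape(k, -1)
    return float(np.median(blocks.mean(axis=1)))
```

For heavy-tailed data this trades a little efficiency for robustness: a few extreme observations can corrupt at most a few block means, and the median of the block means ignores them.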
Award ID(s):
2045068 1908905
NSF-PAR ID:
10423177
Author(s) / Creator(s):
Date Published:
Journal Name:
Mathematical statistics and learning
ISSN:
2520-2324
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. This paper is devoted to the statistical properties of the geometric median, a robust measure of centrality for multivariate data, as well as its applications to the problem of mean estimation via the median of means principle. Our main theoretical results include (a) an upper bound for the distance between the mean and the median for general absolutely continuous distributions in $\mathbb R^d$, together with examples of specific classes of distributions for which these bounds do not depend on the ambient dimension $d$; and (b) exponential deviation inequalities for the distance between the sample and population versions of the geometric median, which again depend only on trace-type quantities and not on the ambient dimension. As a corollary, we deduce improved bounds for the multivariate median of means estimator that hold for large classes of heavy-tailed distributions.
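The sample geometric median discussed above has no closed form, but it is commonly computed with the Weiszfeld iteration, a reweighted-averaging fixed-point scheme. A minimal sketch (the stopping rule and the zero-distance guard are implementation choices, not from the paper):

```python
import numpy as np

def geometric_median(X, n_iter=100, tol=1e-8):
    """Weiszfeld iteration for the geometric median of the rows of X:
    the point minimizing the sum of Euclidean distances to the samples."""
    X = np.asarray(X, dtype=float)
    m = X.mean(axis=0)                  # start at the coordinate-wise mean
    for _ in range(n_iter):
        d = np.linalg.norm(X - m, axis=1)
        d = np.where(d < tol, tol, d)   # guard against division by zero
        w = 1.0 / d
        m_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(m_new - m) < tol:
            return m_new
        m = m_new
    return m
```

In the median-of-means application, `X` would hold the block means of a partitioned sample, and the geometric median of those block means serves as the multivariate mean estimator.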
  2. The Expectation-Maximization (EM) algorithm is a widely used method for maximum likelihood estimation in models with latent variables. For estimating mixtures of Gaussians, its iteration can be viewed as a soft version of the k-means clustering algorithm. Despite its wide use and applications, there are essentially no known convergence guarantees for this method. We provide global convergence guarantees for mixtures of two Gaussians with known covariance matrices. We show that the population version of EM, where the algorithm is given access to infinitely many samples from the mixture, converges geometrically to the correct mean vectors, and we provide simple, closed-form expressions for the convergence rate. As a simple illustration, we show that, in one dimension, ten steps of the EM algorithm initialized at infinity estimate the means with less than 1% error. In the finite-sample regime, we show that, under a random initialization, $\tilde O(d/\epsilon^2)$ samples suffice to compute the unknown vectors to within $\epsilon$ in Mahalanobis distance, where $d$ is the dimension. In particular, the error rate of the EM-based estimator is $\tilde O(\sqrt{d/n})$, where $n$ is the number of samples, which is optimal up to logarithmic factors.
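The one-dimensional case described in the abstract is easy to write out. The sketch below assumes a balanced symmetric mixture $0.5\,N(\mu,1)+0.5\,N(-\mu,1)$ with known unit variance, so only the scalar mean is estimated; the function name, iteration count, and data layout are illustrative.

```python
import numpy as np

def em_two_gaussians(x, mu0, n_iter=10):
    """EM for the balanced mixture 0.5*N(mu,1) + 0.5*N(-mu,1) in one
    dimension with known unit variance; only the mean mu is updated."""
    mu = float(mu0)
    for _ in range(n_iter):
        # E-step: posterior probability that each point came from N(mu, 1);
        # for this symmetric mixture it simplifies to a logistic function.
        w = 1.0 / (1.0 + np.exp(-2.0 * mu * x))
        # M-step: the symmetric-mixture update mu = mean((2w - 1) * x)
        mu = float(np.mean((2.0 * w - 1.0) * x))
    return mu
```

The soft-k-means analogy is visible here: the E-step assigns each point a soft responsibility `w` instead of a hard cluster label, and the M-step is a responsibility-weighted average.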
  3. Summary

    Variance estimation is a fundamental problem in statistical modelling. In ultrahigh dimensional linear regression, where the dimensionality is much larger than the sample size, traditional variance estimation techniques are not applicable. Recent advances in variable selection in ultrahigh dimensional linear regression make this problem accessible. One of the major problems in ultrahigh dimensional regression is the high spurious correlation between the unobserved realized noise and some of the predictors. As a result, the realized noises are actually predicted when extra irrelevant variables are selected, leading to a serious underestimate of the level of noise. We propose a two-stage refitted procedure via a data splitting technique, called refitted cross-validation, to attenuate the influence of irrelevant variables with high spurious correlations. Our asymptotic results show that the resulting procedure performs as well as the oracle estimator, which knows in advance the mean regression function. The simulation studies lend further support to our theoretical claims. The naive two-stage estimator and the plug-in one-stage estimators using the lasso and smoothly clipped absolute deviation are also studied and compared; their performance can be improved by the proposed refitted cross-validation method.

     
  4. The goal of this note is to present a modification of the popular median of means estimator that achieves sub-Gaussian deviation bounds with nearly optimal constants under minimal assumptions on the underlying distribution. We build on recent work on the topic and prove that the desired guarantees can be attained under weaker requirements.
  5. We consider the setting of an aggregate data meta-analysis of a continuous outcome of interest. When the distribution of the outcome is skewed, it is often the case that some primary studies report the sample mean and standard deviation of the outcome and other studies report the sample median along with the first and third quartiles and/or minimum and maximum values. To perform meta-analysis in this context, a number of approaches have recently been developed to impute the sample mean and standard deviation from studies reporting medians. Then, standard meta-analytic approaches with inverse-variance weighting are applied based on the (imputed) study-specific sample means and standard deviations. In this article, we illustrate how this common practice can severely underestimate the within-study standard errors, which results in poor coverage for the pooled mean in common effect meta-analyses and overestimation of between-study heterogeneity in random effects meta-analyses. We propose a straightforward bootstrap approach to estimate the standard errors of the imputed sample means. Our simulation study illustrates how the proposed approach can improve the estimation of the within-study standard errors and consequently improve coverage for the pooled mean in common effect meta-analyses and estimation of between-study heterogeneity in random effects meta-analyses. Moreover, we apply the proposed approach in a meta-analysis to identify risk factors of a severe course of COVID-19.
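The bootstrap idea can be illustrated as follows. The sketch uses a common three-point quantile-based imputation of the mean (a Wan et al.-style formula, chosen here for illustration; the paper's exact imputation method may differ) and, for simplicity, assumes access to raw data; in the actual meta-analysis setting only the summaries are reported, so the bootstrap would be run on pseudo-samples generated to be consistent with those summaries.

```python
import numpy as np

def imputed_mean(q1, med, q3):
    """Three-point quantile-based imputation of the sample mean
    (illustrative Wan et al.-style estimator)."""
    return (q1 + med + q3) / 3.0

def bootstrap_se_of_imputed_mean(x, n_boot=2000, seed=0):
    """Bootstrap the standard error of the imputed mean: resample the
    data, recompute the quartiles, and re-impute the mean each time."""
    rng = np.random.default_rng(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        xb = rng.choice(x, size=n, replace=True)
        q1, med, q3 = np.percentile(xb, [25, 50, 75])
        stats.append(imputed_mean(q1, med, q3))
    return float(np.std(stats, ddof=1))
```

The resulting standard error replaces the naive plug-in value in the inverse-variance weights, which is what corrects the undercoverage described above.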

     