
Title: Estimating location parameters in sample-heterogeneous distributions
Abstract:
Estimating the mean of a probability distribution using i.i.d. samples is a classical problem in statistics, wherein finite-sample optimal estimators are sought under various distributional assumptions. In this paper, we consider the problem of mean estimation when independent samples are drawn from $d$-dimensional non-identical distributions possessing a common mean. When the distributions are radially symmetric and unimodal, we propose a novel estimator, which is a hybrid of the modal interval, shorth and median estimators and whose performance adapts to the level of heterogeneity in the data. We show that our estimator is near optimal when data are i.i.d. and when the fraction of ‘low-noise’ distributions is as small as $\varOmega\left(\frac{d \log n}{n}\right)$, where $n$ is the number of samples. We also derive minimax lower bounds on the expected error of any estimator that is agnostic to the scales of individual data points. Finally, we extend our theory to linear regression. In both the mean estimation and regression settings, we present computationally feasible versions of our estimators that run in time polynomial in the number of data points. (A toy sketch of the shorth building block appears below, after the record metadata.)
Authors:
Award ID(s):
1749857 1907786
Publication Date:
NSF-PAR ID:
10289004
Journal Name:
Information and Inference: A Journal of the IMA
ISSN:
2049-8772
Sponsoring Org:
National Science Foundation
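To make the abstract's building blocks concrete, here is a toy one-dimensional sketch of the shorth estimator, i.e., the midpoint of the shortest interval containing a fixed fraction of the samples, which is one of the components the proposed hybrid adapts. The function name, the interval fraction, and the simulated heterogeneous data are illustrative assumptions, not the paper's actual $d$-dimensional estimator:

```python
import numpy as np

def shorth_estimate(x, alpha=0.5):
    """Midpoint of the shortest interval containing a ceil(alpha * n)
    fraction of the samples (the classical 'shorth')."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    k = max(int(np.ceil(alpha * n)), 1)          # window size in points
    widths = x[k - 1:] - x[:n - k + 1]           # width of every k-point window
    i = int(np.argmin(widths))                   # leftmost shortest window
    return 0.5 * (x[i] + x[i + k - 1])           # its midpoint

rng = np.random.default_rng(0)
# heterogeneous scales with a common mean of 2.0: a few low-noise
# samples (scale 0.1) hidden among many high-noise ones (scale 5.0)
scales = rng.choice([0.1, 5.0], size=1000, p=[0.2, 0.8])
samples = 2.0 + scales * rng.standard_normal(1000)
print(shorth_estimate(samples))                  # close to the common mean 2.0
```

Because the shortest window locks onto the tight cluster produced by the low-noise samples, the estimate is far less sensitive to the high-variance samples than the plain sample mean.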
More Like this
  1. We explore why many recently proposed robust estimation problems are efficiently solvable, even though the underlying optimization problems are non-convex. We study the loss landscape of these robust estimation problems and identify the existence of ‘generalized quasi-gradients’. Whenever these quasi-gradients exist, a large family of no-regret algorithms is guaranteed to approximate the global minimum; this includes the commonly used filtering algorithm (a simplified filtering sketch appears after this list). For robust mean estimation of distributions under bounded covariance, we show that any first-order stationary point of the associated optimization problem is an approximate global minimum if and only if the corruption level $\epsilon < 1/3$. Consequently, any optimization algorithm that approaches a stationary point yields an efficient robust estimator with breakdown point $1/3$. With carefully designed initialization and step size, we improve this to $1/2$, which is optimal. For other tasks, including linear regression and joint mean and covariance estimation, the loss landscape is more rugged: there are stationary points arbitrarily far from the global minimum. Nevertheless, we show that generalized quasi-gradients exist and construct efficient algorithms. These algorithms are simpler than previous ones in the literature, and for linear regression we improve the estimation error from $O(\sqrt{\epsilon})$ to the optimal rate of $O(\epsilon)$ for small $\epsilon$, assuming certified hypercontractivity. For mean estimation with near-identity covariance, we show that a simple gradient descent algorithm achieves breakdown point $1/3$ and iteration complexity $\tilde{O}(d/\epsilon^2)$.
  2. We study the fundamental problem of learning the parameters of a high-dimensional Gaussian in the presence of noise, where an ε-fraction of our samples were chosen by an adversary. We give robust estimators that achieve estimation error O(ε) in the total variation distance, which is optimal up to a universal constant that is independent of the dimension. In the case where just the mean is unknown, our robustness guarantee is optimal up to a constant factor and the running time is polynomial in d and 1/ε. When both the mean and covariance are unknown, the running time is polynomial in d and quasipolynomial in 1/ε. Moreover, all of our algorithms require only a polynomial number of samples. Our work shows that the same sorts of error guarantees that were established over fifty years ago in the one-dimensional setting can also be achieved by efficient algorithms in high-dimensional settings.
  3. Population pharmacokinetic (PK) modeling has become a cornerstone of drug development and optimal patient dosing. This approach offers great benefits for datasets with sparse sampling, such as in pediatric patients, and can describe between-patient variability. While most current algorithms assume normal or log-normal distributions for PK parameters, we present a mathematically consistent nonparametric maximum likelihood (NPML) method for estimating multivariate mixing distributions without any assumption about the shape of the distribution. This approach can handle distributions of any shape for all PK parameters. It is known from convexity theory that the NPML estimator is discrete, meaning that it has a finite number of points with nonzero probability. In fact, there are at most N points, where N is the number of observed subjects. The original infinite-dimensional NPML problem then becomes the finite-dimensional problem of finding the locations and probabilities of the support points. In the simplest case, each point essentially represents the set of PK parameters for one patient. The probability of the points is found by a primal-dual interior-point method; the location of the support points is found by an adaptive grid method (a simplified grid-based sketch appears after this list). Our method is able to handle high-dimensional and complex multivariate mixture models. An important application is discussed for the problem of population pharmacokinetics, and a nontrivial example is treated. Our algorithm has been successfully applied in hundreds of published pharmacometric studies. In addition to population pharmacokinetics, this research also applies to empirical Bayes estimation and many other areas of applied mathematics. This approach thereby presents an important addition to the pharmacometric toolbox for drug development and optimal patient dosing.
  4. We study the problem of learning Ising models in a setting where some of the samples from the underlying distribution can be arbitrarily corrupted. In such a setup, we aim to design statistically optimal estimators in a high-dimensional scaling in which the number of nodes p, the number of edges k, and the maximal node degree d are allowed to increase to infinity as a function of the sample size n. Our analysis is based on exploiting moments of the underlying distribution, coupled with novel reductions to univariate estimation. Our proposed estimators achieve an optimal dimension-independent dependence on the fraction of corrupted data in the contaminated setting, while also simultaneously achieving high-probability error guarantees with optimal sample complexity. We corroborate our theoretical results by simulations.
  5. In this paper, we propose an improved estimation method for logistic regression based on subsamples taken according to the optimal subsampling probabilities developed in Wang et al. (2018). Both asymptotic and numerical results show that the new estimator has higher estimation efficiency. We also develop a new algorithm based on Poisson subsampling, which does not require approximating the optimal subsampling probabilities all at once; this is computationally advantageous when the available random-access memory cannot hold the full data (a single-pass Poisson-subsampling sketch appears after this list). Interestingly, the asymptotic distributions also show that Poisson subsampling produces a more efficient estimator if the sampling ratio, i.e., the ratio of the subsample size to the full data sample size, does not converge to zero. We also obtain the unconditional asymptotic distribution for the estimator based on Poisson subsampling. Pilot estimators are required to calculate the subsampling probabilities and to correct biases in unweighted estimators; interestingly, even if the pilot estimators are inconsistent, the proposed method still produces consistent and asymptotically normal estimators.
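The following is a minimal sketch of the filtering idea from item 1: repeatedly remove the points with the largest squared projections onto the top eigenvector of the empirical covariance until the spectral norm falls below a threshold. The deterministic trimming rule, the stopping threshold, and all parameter names are simplifying assumptions for illustration, not the algorithm analyzed in that paper:

```python
import numpy as np

def filter_mean(X, eps, sigma2=1.0, max_iter=50):
    """Simplified filtering for robust mean estimation under bounded
    covariance: trim along the top principal direction until the
    empirical covariance has small spectral norm."""
    X = np.asarray(X, dtype=float)
    for _ in range(max_iter):
        mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False)
        vals, vecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
        if vals[-1] <= 9.0 * sigma2:             # illustrative stopping threshold
            break
        proj = (X - mu) @ vecs[:, -1]            # projections on top eigenvector
        keep = np.argsort(proj ** 2)[: int((1.0 - eps) * len(X))]
        X = X[keep]                              # drop the most extreme eps-fraction
    return X.mean(axis=0)

rng = np.random.default_rng(1)
inliers = rng.standard_normal((900, 10))         # true mean is 0
outliers = rng.standard_normal((100, 10)) + 8.0  # adversarial-looking cluster
print(filter_mean(np.vstack([inliers, outliers]), eps=0.1))
```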
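As intuition for the NPML estimator in item 3, the sketch below restricts the mixing distribution to a fixed grid of candidate support points and applies the classical multiplicative (EM-type) fixed-point update to the mixing weights. It is a simplified stand-in for the primal-dual interior-point and adaptive-grid machinery described in the abstract; the normal kernel, the grid, and the iteration count are assumptions:

```python
import numpy as np

def npml_weights_on_grid(y, grid, sd=1.0, n_iter=500):
    """Mixing weights maximizing the likelihood over a fixed grid of
    candidate support points, via the classical multiplicative
    (EM-type) fixed-point update."""
    L = np.exp(-0.5 * ((y[:, None] - grid[None, :]) / sd) ** 2)  # f(y_i | theta_k)
    w = np.full(len(grid), 1.0 / len(grid))       # start from uniform weights
    for _ in range(n_iter):
        mix = L @ w                               # marginal density at each y_i
        w = w * (L / mix[:, None]).mean(axis=0)   # update preserves sum(w) == 1
    return w

rng = np.random.default_rng(2)
y = np.concatenate([rng.normal(-2.0, 1.0, 60), rng.normal(3.0, 1.0, 40)])
grid = np.linspace(-6.0, 6.0, 121)
w = npml_weights_on_grid(y, grid)
print(grid[w > 1e-3])                             # support clusters near -2 and 3
```

At convergence most of the mass concentrates on a few grid points, matching the convexity-theoretic fact that the NPML estimator is supported on at most N points.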
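Finally, a sketch of the Poisson-subsampling scheme from item 5: each record is kept independently with probability min(1, r·π_i) in a single pass, and a logistic regression weighted by the inverse inclusion probabilities is then fitted by Newton-Raphson. The probabilities π_i stand in for the pilot-based optimal subsampling probabilities of Wang et al. (2018); the function name and solver details are illustrative assumptions:

```python
import numpy as np

def poisson_subsample_logistic(X, y, pi, r, n_newton=25, seed=3):
    """Single pass: keep record i with probability p_i = min(1, r * pi_i),
    then fit logistic regression with inverse-probability weights 1/p_i
    by Newton-Raphson."""
    rng = np.random.default_rng(seed)
    p = np.minimum(1.0, r * pi)
    keep = rng.random(len(y)) < p                 # independent inclusion decisions
    Xs, ys, ws = X[keep], y[keep], 1.0 / p[keep]
    beta = np.zeros(X.shape[1])
    for _ in range(n_newton):
        mu = 1.0 / (1.0 + np.exp(-Xs @ beta))     # fitted probabilities
        grad = Xs.T @ (ws * (ys - mu))
        H = (Xs * (ws * mu * (1.0 - mu))[:, None]).T @ Xs
        beta += np.linalg.solve(H, grad)
    return beta

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(5000), rng.standard_normal((5000, 2))])
beta_true = np.array([0.5, 1.0, -1.0])
y = (rng.random(5000) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)
pi = np.full(5000, 1.0 / 5000)                    # uniform stand-in for pilot pi_i
print(poisson_subsample_logistic(X, y, pi, r=1000))  # roughly recovers beta_true
```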