We provide an efficient algorithm for the classical problem, going back to Galton, Pearson, and Fisher, of estimating, with arbitrary accuracy the parameters of a multivariate normal distribution from truncated samples. Truncated samples from a dvariate normal (μ,Σ) means a samples is only revealed if it falls in some subset S⊆ℝd; otherwise the samples are hidden and their count in proportion to the revealed samples is also hidden. We show that the mean μ and covariance matrix Σ can be estimated with arbitrary accuracy in polynomialtime, as long as we have oracle access to S, and S has nontrivial measure under the unknown dvariate normal distribution. Additionally we show that without oracle access to S, any nontrivial estimation is impossible.
Efficient Statistics, in High Dimensions, from Truncated Samples
We provide an efficient algorithm for the classical problem, going back to Galton, Pearson, and Fisher, of estimating, with arbitrary accuracy the parameters of a multivariate normal distribution from truncated samples. Truncated samples from a dvariate normal N(μ,Σ) means a samples is only revealed if it falls in some subset S⊆Rd; otherwise the samples are hidden and their count in proportion to the revealed samples is also hidden. We show that the mean μ and covariance matrix Σ can be estimated with arbitrary accuracy in polynomialtime, as long as we have oracle access to S, and S has nontrivial measure under the unknown dvariate normal distribution. Additionally we show that without oracle access to S, any nontrivial estimation is impossible.
 Publication Date:
 NSFPAR ID:
 10078464
 Journal Name:
 Annual Symposium on Foundations of Computer Science
 ISSN:
 02725428
 Sponsoring Org:
 National Science Foundation
More Like this


We provide an efficient algorithm for the classical problem, going back to Galton, Pearson, and Fisher, of estimating, with arbitrary accuracy the parameters of a multivariate normal distribution from truncated samples. Truncated samples from a dvariate normal (μ,Σ) means a samples is only revealed if it falls in some subset S⊆ℝd; otherwise the samples are hidden and their count in proportion to the revealed samples is also hidden. We show that the mean μ and covariance matrix Σ can be estimated with arbitrary accuracy in polynomialtime, as long as we have oracle access to S, and S has nontrivial measure under the unknown dvariate normal distribution. Additionally we show that without oracle access to S, any nontrivial estimation is impossible.

We provide a computationally and statistically efficient estimator for the classical problem of truncated linear regression, where the dependent variabley=wTx+εand its corresponding vector ofcovariatesx∈Rkare only revealed if the dependent variable falls in some subsetS⊆R; otherwisethe existence of the pair(x,y)is hidden. This problem has remained a challenge since the earlyworks of Tobin (1958); Amemiya (1973); Hausman and Wise (1977); Breen et al. (1996), its applications are abundant, and its history dates back even further to the work of Galton, Pearson, Lee,and Fisher Galton (1897); Pearson and Lee (1908); Lee (1914); Fisher (1931). While consistent estimators of the regression coefficients have been identified, the error rates are not wellunderstood,especially in highdimensional settings.Under a “thickness assumption” about the covariance matrix of the covariates in the revealed sample, we provide a computationally efficient estimator for the coefficient vectorwfromnrevealed samples that attains`2errorO(√k/n), recovering the guarantees of least squares in thestandard (untruncated) linear regression setting. Our estimator uses Projected Stochastic Gradient Descent (PSGD) on the negative loglikelihood of the truncated sample, and only needs oracle access to the setS, which may otherwise be arbitrary, and in particular may be nonconvex.PSGD must be restricted to an appropriately defined convex cone to guarantee that the negativeloglikelihood is stronglymore »

Given data drawn from an unknown distribution, D, to what extent is it possible to amplify'' this dataset and faithfully output an even larger set of samples that appear to have been drawn from D? We formalize this question as follows: an (n,m) amplification procedure takes as input n independent draws from an unknown distribution D, and outputs a set of m > n samples'' which must be indistinguishable from m samples drawn iid from D. We consider this sample amplification problem in two fundamental settings: the case where D is an arbitrary discrete distribution supported on k elements, and the case where D is a ddimensional Gaussian with unknown mean, and fixed covariance matrix. Perhaps surprisingly, we show a valid amplification procedure exists for both of these settings, even in the regime where the size of the input dataset, n, is significantly less than what would be necessary to learn distribution D to nontrivial accuracy. We also show that our procedures are optimal up to constant factors. Beyond these results, we describe potential applications of such data amplification, and formalize a number of curious directions for future research along this vein.

Given data drawn from an unknown distribution, D, to what extent is it possible to ``amplify'' this dataset and faithfully output an even larger set of samples that appear to have been drawn from D? We formalize this question as follows: an (n,m) amplification procedure takes as input n independent draws from an unknown distribution D, and outputs a set of m > n ``samples'' which must be indistinguishable from m samples drawn iid from D. We consider this sample amplification problem in two fundamental settings: the case where D is an arbitrary discrete distribution supported on k elements, and the case where D is a ddimensional Gaussian with unknown mean, and fixed covariance matrix. Perhaps surprisingly, we show a valid amplification procedure exists for both of these settings, even in the regime where the size of the input dataset, n, is significantly less than what would be necessary to learn distribution D to nontrivial accuracy. We also show that our procedures are optimal up to constant factors. Beyond these results, we describe potential applications of sample amplification, and formalize a number of curious directions for future research.