Title: The EM Algorithm gives Sample Optimality for Learning Mixtures of Well-Separated Gaussians
We consider the problem of spherical Gaussian mixture models with k ≥ 3 components when the components are well separated. A fundamental previous result established that separation of Ω(√(log k)) is necessary and sufficient for identifiability of the parameters with polynomial sample complexity (Regev and Vijayaraghavan, 2017). In the same context, we show that Õ(kd/ε²) samples suffice for any ε ≲ 1/k, closing the gap from polynomial to linear, and thus giving the first optimal sample upper bound for parameter estimation of well-separated Gaussian mixtures. We accomplish this by proving a new result for the Expectation-Maximization (EM) algorithm: we show that EM converges locally under separation Ω(√(log k)). The previous best-known guarantee required Ω(√k) separation (Yan et al., 2017). Unlike prior work, our results do not assume or use prior knowledge of the (potentially different) mixing weights or variances of the Gaussian components. Furthermore, our results show that the finite-sample error of EM does not depend on non-universal quantities such as pairwise distances between the means of the Gaussian components.
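The abstract above concerns the EM algorithm for spherical Gaussian mixtures. As a point of reference, here is a minimal sketch of the standard EM updates for such a mixture, estimating per-component means, spherical variances, and mixing weights (the paper does not assume these are known). This is only an illustration of the algorithm being analyzed, not the paper's finite-sample analysis; the initialization and iteration count are placeholder choices.

```python
import numpy as np

def em_spherical_gmm(X, k, n_iters=100, seed=0):
    """Minimal EM for a spherical Gaussian mixture: each component j has a mean
    mu[j], a single variance sigma2[j], and a mixing weight w[j]."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=k, replace=False)]   # initialize means at random samples
    sigma2 = np.full(k, X.var())                   # per-component spherical variances
    w = np.full(k, 1.0 / k)                        # mixing weights

    for _ in range(n_iters):
        # E-step: responsibilities r[i, j] = p(z_i = j | x_i), computed in log space
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)          # (n, k) squared distances
        log_r = np.log(w) - 0.5 * d * np.log(2 * np.pi * sigma2) - sq / (2 * sigma2)
        log_r -= log_r.max(axis=1, keepdims=True)                      # numerical stabilization
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: re-estimate weights, means, and spherical variances
        nk = r.sum(axis=0)                                             # effective counts
        w = nk / n
        mu = (r.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        sigma2 = (r * sq).sum(axis=0) / (d * nk)
    return w, mu, sigma2
```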
Award ID(s):
1646522
PAR ID:
10196411
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of Thirty Third Conference on Learning Theory
Volume:
125
Page Range / eLocation ID:
2425-2487
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    We consider the sample complexity of learning with adversarial robustness. Most prior theoretical results for this problem have considered a setting where different classes in the data are close together or overlapping. We consider, in contrast, the well-separated case where there exists a classifier with perfect accuracy and robustness, and show that the sample complexity narrates an entirely different story. Specifically, for linear classifiers, we show a large class of well-separated distributions where the expected robust loss of any algorithm is at least Ω(d/n), whereas the max margin algorithm has expected standard loss O(1/n). This shows a gap between the standard and robust losses that cannot be obtained via prior techniques. Additionally, we present an algorithm that, given an instance where the robustness radius is much smaller than the gap between the classes, gives a solution with expected robust loss O(1/n). This shows that for very well-separated data, convergence rates of O(1/n) are achievable, which is not the case otherwise. Our results apply to robustness measured in any ℓp norm with p > 1 (including p = ∞).
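For the linear, well-separated setting in item 1, the worst-case effect of an ℓp-bounded perturbation of radius r on a linear classifier can be computed in closed form via the dual norm. The sketch below is an illustrative way to evaluate empirical standard and robust losses under this model; the function names and the choice of p are assumptions for illustration, not the paper's algorithm.

```python
import numpy as np

def dual_norm(w, p):
    """Dual norm ||w||_q with 1/p + 1/q = 1; for p = inf the dual is the l1 norm."""
    if np.isinf(p):
        return np.abs(w).sum()
    q = p / (p - 1.0)
    return np.linalg.norm(w, ord=q)

def robust_loss(w, b, X, y, r, p=2):
    """Fraction of points that some lp-bounded perturbation of radius r can misclassify.
    For a linear classifier, the worst-case margin drop is exactly r * ||w||_q."""
    margins = y * (X @ w + b)              # signed margins, y in {-1, +1}
    return float(np.mean(margins < r * dual_norm(w, p)))

def standard_loss(w, b, X, y):
    """Fraction of points misclassified with no perturbation."""
    return float(np.mean(y * (X @ w + b) <= 0))
```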
  2. We provide an end-to-end Renyi-DP-based framework for differentially private top-k selection. Unlike previous approaches, which require a data-independent choice of k, we propose to privately release a data-dependent choice of k such that the gap between the k-th and the (k+1)-st "quality" is large. This is achieved by an extension of the Report-Noisy-Max algorithm with a more concentrated Gaussian noise. Not only does this eliminate one hyperparameter, but the adaptive choice of k also certifies the stability of the top-k indices in the unordered set, so we can release them using a combination of the propose-test-release (PTR) framework and the Distance-to-Stability mechanism. We show that our construction improves the privacy-utility trade-offs compared to previous top-k selection algorithms, both theoretically and empirically. Additionally, we apply our algorithm to "Private Aggregation of Teacher Ensembles (PATE)" in multi-label classification tasks with a large number of labels and show that it leads to significant performance gains.
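Item 2 builds on Report-Noisy-Max with Gaussian noise to pick a data-dependent k at which the quality gap between the k-th and (k+1)-st items is large. The sketch below illustrates only that gap-based selection step; the Renyi DP accounting, propose-test-release, and Distance-to-Stability components described in the abstract are omitted, and sigma is a placeholder noise scale.

```python
import numpy as np

def noisy_gap_k(quality, sigma, k_max, rng=None):
    """Illustrative gap-based choice of k: add Gaussian noise to the gaps between
    consecutive sorted quality scores and report the k (up to k_max) with the
    largest noisy gap. This mirrors the Report-Noisy-Max idea with Gaussian noise;
    the actual privacy accounting from the paper is not implemented here."""
    rng = np.random.default_rng() if rng is None else rng
    q = np.sort(np.asarray(quality, dtype=float))[::-1]           # descending qualities
    k_max = min(k_max, len(q) - 1)
    gaps = q[:k_max] - q[1:k_max + 1]                             # gap between k-th and (k+1)-st
    noisy_gaps = gaps + rng.normal(scale=sigma, size=gaps.shape)  # Gaussian noise on each gap
    return int(np.argmax(noisy_gaps)) + 1                         # 1-indexed choice of k
```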
  3. We give efficient algorithms for finding power-sum decomposition of an input polynomial with component s. The case of linear s is equivalent to the well-studied tensor decomposition problem while the quadratic case occurs naturally in studying identifiability of non-spherical Gaussian mixtures from low-order moments. Unlike tensor decomposition, both the unique identifiability and algorithms for this problem are not well understood. For the simplest setting of quadratic s and , prior work of [GHK15] yields an algorithm only when . On the other hand, the more general recent result of [GKS20] builds an algebraic approach to handle any components but only when is large enough (while yielding no bounds for or even ) and only handles inverse exponential noise. Our results obtain a substantial quantitative improvement on both prior works above even in the base case of and quadratic s. Specifically, our algorithm succeeds in decomposing a sum of generic quadratic s for and more generally the th power-sum of generic degree- polynomials for any . Our algorithm relies only on basic numerical linear algebraic primitives, is exact (i.e., obtains arbitrarily tiny error up to numerical precision), and handles inverse polynomial noise when the s have random Gaussian coefficients.
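Item 3 concerns decomposing a polynomial given as a power-sum of lower-degree components. Since parts of the abstract's notation did not survive extraction, the sketch below only illustrates the input model (a sum of d-th powers of quadratics with random Gaussian coefficients); it does not implement the decomposition algorithm.

```python
import numpy as np

def power_sum_eval(x, quads, d):
    """Evaluate P(x) = sum_i q_i(x)^d where each quadratic component is given as
    (A_i, b_i, c_i) with q_i(x) = x^T A_i x + b_i^T x + c_i. This only illustrates
    the power-sum model from the abstract, not the decomposition algorithm."""
    return sum((x @ A @ x + b @ x + c) ** d for A, b, c in quads)

# Example: a power-sum of m generic quadratics in n variables with random Gaussian
# coefficients (the noise-robust regime mentioned in the abstract).
rng = np.random.default_rng(0)
n, m, d = 5, 3, 2
quads = [(rng.standard_normal((n, n)), rng.standard_normal(n), rng.standard_normal())
         for _ in range(m)]
x = rng.standard_normal(n)
print(power_sum_eval(x, quads, d))
```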
  4. We study the problem of learning nonparametric distributions in a finite mixture, and establish tight bounds on the sample complexity for learning the component distributions in such models. Namely, we are given i.i.d. samples from a pdf f where f = w1 f1 + w2 f2, with w1 + w2 = 1 and w1, w2 > 0, and we are interested in learning each component fi. Without any assumptions on fi, this problem is ill-posed. In order to identify the components fi, we assume that each fi can be written as a convolution of a Gaussian and a compactly supported density νi with supp(ν1) ∩ supp(ν2) = ∅. Our main result shows that (1/ε)^Ω(log log(1/ε)) samples are required for estimating each fi. The proof relies on a quantitative Tauberian theorem that yields a fast rate of approximation with Gaussians, which may be of independent interest. To show this is tight, we also propose an algorithm that uses (1/ε)^O(log log(1/ε)) samples to estimate each fi. Unlike existing approaches to learning latent variable models based on moment-matching and tensor methods, our proof instead involves a delicate analysis of an ill-conditioned linear system via orthogonal functions. Combining these bounds, we conclude that the optimal sample complexity of this problem properly lies in between polynomial and exponential, which is not common in learning theory.
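Item 4's model is a two-component mixture in which each component is a Gaussian convolved with a compactly supported density, and the two supports are disjoint. The sketch below generates samples from such a mixture; the uniform densities on [-2, -1] and [1, 2] stand in for ν1 and ν2 and are purely illustrative, as is the noise level sigma.

```python
import numpy as np

def sample_mixture(n, w1=0.6, sigma=1.0, rng=None):
    """Draw n samples from f = w1*f1 + w2*f2, where each f_i is a Gaussian
    convolved with a compactly supported density nu_i and supp(nu_1), supp(nu_2)
    are disjoint. Uniform densities on [-2, -1] and [1, 2] are placeholder choices
    for nu_1 and nu_2; sigma is the Gaussian noise level."""
    rng = np.random.default_rng() if rng is None else rng
    comp = rng.random(n) < w1                                   # latent component labels
    latent = np.where(comp, rng.uniform(-2.0, -1.0, n), rng.uniform(1.0, 2.0, n))
    return latent + sigma * rng.standard_normal(n)              # convolution = add Gaussian noise
```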
  5. We consider the problem of learning function classes computed by neural networks with various activations (e.g. ReLU or Sigmoid), a task believed to be computationally intractable in the worst case. A major open problem is to understand the minimal assumptions under which these classes admit provably efficient algorithms. In this work we show that a natural distributional assumption corresponding to eigenvalue decay of the Gram matrix yields polynomial-time algorithms in the non-realizable setting for expressive classes of networks (e.g. feed-forward networks of ReLUs). We make no assumptions on the structure of the network or the labels. Given sufficiently strong polynomial eigenvalue decay, we obtain fully polynomial-time algorithms in all the relevant parameters with respect to square loss. Milder decay assumptions also lead to improved algorithms. This is the first purely distributional assumption that leads to polynomial-time algorithms for networks of ReLUs, even with one hidden layer. Further, unlike prior distributional assumptions (e.g., the marginal distribution is Gaussian), eigenvalue decay has been observed in practice on common data sets.
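Item 5's distributional assumption is eigenvalue decay of the Gram matrix. The sketch below computes the normalized spectrum of a kernel Gram matrix on a sample so that the decay can be inspected; the RBF kernel and bandwidth are illustrative choices, not prescribed by the paper.

```python
import numpy as np

def gram_eigenvalue_decay(X, bandwidth=1.0):
    """Return the normalized eigenvalue spectrum of a Gram (kernel) matrix built
    on the sample X, largest eigenvalue first. An RBF kernel is used purely for
    illustration; the abstract's assumption concerns polynomial decay of these
    eigenvalues (lambda_i shrinking like i^{-a} for some a)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    K = np.exp(-sq / (2 * bandwidth ** 2))                # Gram matrix on the sample
    eigvals = np.linalg.eigvalsh(K)[::-1]                 # eigenvalues, descending
    return eigvals / eigvals[0]                           # normalized spectrum
```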