skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Toward Global Convergence of Gradient EM for Over-Parameterized Gaussian Mixture Models
We study the gradient Expectation-Maximization (EM) algorithm for Gaussian Mixture Models (GMM) in the over-parameterized setting, where a general GMM with n > 1 components learns from data that are generated by a single ground truth Gaussian distribution. While results for the special case of 2-Gaussian mixtures are well-known, a general global convergence analysis for arbitrary n remains unresolved and faces several new technical barriers since the convergence becomes sub-linear and non-monotonic. To address these challenges, we construct a novel likelihood-based convergence analysis framework and rigorously prove that gradient EM converges globally with a sublinear rate O(1/\sqrt{t}). This is the first global convergence result for Gaussian mixtures with more than 2 components. The sublinear convergence rate is due to the algorithmic nature of learning over- parameterized GMM with gradient EM. We also identify a new emerging technical challenge for learning general over-parameterized GMM: the existence of bad local regions that can trap gradient EM for an exponential number of steps.  more » « less
Award ID(s):
2023166
PAR ID:
10632234
Author(s) / Creator(s):
; ;
Publisher / Repository:
Advances in Neural Information Processing Systems
Date Published:
Volume:
37
ISBN:
9798331314385
Format(s):
Medium: X
Location:
Vancouver, Canada
Sponsoring Org:
National Science Foundation
More Like this
  1. Melo, S. F.; Fang. F. (Ed.)
    Existing risk-averse reinforcement learning approaches still face several challenges, including the lack of global optimality guarantee and the necessity of learning from long-term consecutive trajectories. Long-term consecutive trajectories are prone to involving visiting hazardous states, which is a major concern in the risk-averse setting. This paper proposes Transition-based vOlatility-controlled Policy Search (TOPS), a novel algorithm that solves risk-averse problems by learning from transitions. We prove that our algorithm—under the over-parameterized neural network regime—finds a globally optimal policy at a sublinear rate with proximal policy optimization and natural policy gradient. The convergence rate is comparable to the state-of-the-art risk-neutral policy-search methods. The algorithm is evaluated on challenging Mujoco robot simulation tasks under the mean-variance evaluation metric. Both theoretical analysis and experimental results demonstrate a state-of-the-art level of TOPS’ performance among existing risk-averse policy search methods. 
    more » « less
  2. In practical instances of nonconvex matrix factorization, the rank of the true solution r^{\star} is often unknown, so the rank rof the model can be over-specified as r>r^{\star}. This over-parameterized regime of matrix factorization significantly slows down the convergence of local search algorithms, from a linear rate with r=r^{\star} to a sublinear rate when r>r^{\star}. We propose an inexpensive preconditioner for the matrix sensing variant of nonconvex matrix factorization that restores the convergence rate of gradient descent back to linear, even in the over-parameterized case, while also making it agnostic to possible ill-conditioning in the ground truth. Classical gradient descent in a neighborhood of the solution slows down due to the need for the model matrix factor to become singular. Our key result is that this singularity can be corrected by \ell_{2} regularization with a specific range of values for the damping parameter. In fact, a good damping parameter can be inexpensively estimated from the current iterate. The resulting algorithm, which we call preconditioned gradient descent or PrecGD, is stable under noise, and converges linearly to an information theoretically optimal error bound. Our numerical experiments find that PrecGD works equally well in restoring the linear convergence of other variants of nonconvex matrix factorization in the over-parameterized regime. 
    more » « less
  3. Classical Mixtures of Experts (MoE) are Machine Learning models that involve partitioning the input space, with a separate "expert" model trained on each partition. Recently, MoE-based model architectures have become popular as a means to reduce training and inference costs. There, the partitioning function and the experts are both learnt jointly via gradient descent-type methods on the log-likelihood. In this paper we study theoretical guarantees of the Expectation Maximization (EM) algorithm for the training of MoE models. We first rigorously analyze EM for MoE where the conditional distribution of the target and latent variable conditioned on the feature variable belongs to an exponential family of distributions and show its equivalence to projected Mirror Descent with unit step size and a Kullback-Leibler Divergence regularizer. This perspective allows us to derive new convergence results and identify conditions for local linear convergence; In the special case of mixture of 2 linear or logistic experts, we additionally provide guarantees for linear convergence based on the signal-to-noise ratio. Experiments on synthetic and (small-scale) real-world data supports that EM outperforms the gradient descent algorithm both in terms of convergence rate and the achieved accuracy. 
    more » « less
  4. The Expectation-Maximization (EM) algorithm is a widely used method for maximum likelihood estimation in models with latent variables. For estimating mixtures of Gaussians, its iteration can be viewed as a soft version of the k-means clustering algorithm. Despite its wide use and applications, there are essentially no known convergence guarantees for this method. We provide global convergence guarantees for mixtures of two Gaussians with known covariance matrices. We show that the population version of EM, where the algorithm is given access to infinitely many samples from the mixture, converges geometrically to the correct mean vectors, and provide simple, closed-form expressions for the convergence rate. As a simple illustration, we show that, in one dimension, ten steps of the EM algorithm initialized at infinity result in less than 1\% error estimation of the means. In the finite sample regime, we show that, under a random initialization, Õ (d/ϵ2) samples suffice to compute the unknown vectors to within ϵ in Mahalanobis distance, where d is the dimension. In particular, the error rate of the EM based estimator is Õ (dn‾‾√) where n is the number of samples, which is optimal up to logarithmic factors. 
    more » « less
  5. We consider using gradient descent to minimize the nonconvex function $$f(X)=\phi(XX^{T})$$ over an $$n\times r$$ factor matrix $$X$$, in which $$\phi$$ is an underlying smooth convex cost function defined over $$n\times n$$ matrices. While only a second-order stationary point $$X$$ can be provably found in reasonable time, if $$X$$ is additionally \emph{rank deficient}, then its rank deficiency certifies it as being globally optimal. This way of certifying global optimality necessarily requires the search rank $$r$$ of the current iterate $$X$$ to be \emph{overparameterized} with respect to the rank $$r^{\star}$ of the global minimizer $$X^{\star}$$. Unfortunately, overparameterization significantly slows down the convergence of gradient descent, from a linear rate with $$r=r^{\star}$$ to a sublinear rate when $$r>r^{\star}$$, even when $$\phi$$ is strongly convex. In this paper, we propose an inexpensive preconditioner that restores the convergence rate of gradient descent back to linear in the overparameterized case, while also making it agnostic to possible ill-conditioning in the global minimizer $$X^{\star}$$. 
    more » « less