The Expectation-Maximization (EM) algorithm is a widely used method for maximum likelihood estimation in models with latent variables. For estimating mixtures of Gaussians, its iteration can be viewed as a soft version of the k-means clustering algorithm. Despite its wide use, there are essentially no known convergence guarantees for this method. We provide global convergence guarantees for mixtures of two Gaussians with known covariance matrices. We show that the population version of EM, where the algorithm is given access to infinitely many samples from the mixture, converges geometrically to the correct mean vectors, and provide simple, closed-form expressions for the convergence rate. As a simple illustration, we show that, in one dimension, ten steps of the EM algorithm initialized at infinity estimate the means with less than 1% error. In the finite-sample regime, we show that, under a random initialization, $\tilde{O}(d/\epsilon^2)$ samples suffice to compute the unknown vectors to within $\epsilon$ in Mahalanobis distance, where $d$ is the dimension. In particular, the error rate of the EM-based estimator is $\tilde{O}(\sqrt{d/n})$, where $n$ is the number of samples, which is optimal up to logarithmic factors.
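To make the iteration concrete, here is a minimal sketch of EM for this model in Python, specialized to a balanced mixture $\frac{1}{2}\mathcal{N}(\mu, I) + \frac{1}{2}\mathcal{N}(-\mu, I)$ with known identity covariance; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def em_two_gaussians(X, iters=10, rng=None):
    """EM for the balanced mixture 0.5*N(mu, I) + 0.5*N(-mu, I).

    A sketch of the iteration analyzed above: known identity covariance,
    symmetric means, equal weights. Names are illustrative.
    """
    rng = rng or np.random.default_rng()
    mu = rng.standard_normal(X.shape[1])        # random initialization
    for _ in range(iters):
        # E-step: P(component +mu | x) = sigmoid(2 <x, mu>)
        w = 1.0 / (1.0 + np.exp(-2.0 * (X @ mu)))
        # M-step: signed weighted average of the samples
        mu = ((2.0 * w - 1.0)[:, None] * X).mean(axis=0)
    return mu

# Usage: recovers the means up to the inherent sign ambiguity (mu vs. -mu).
rng = np.random.default_rng(0)
true_mu = np.array([2.0, -1.0])
signs = rng.choice([-1.0, 1.0], size=5000)
X = signs[:, None] * true_mu + rng.standard_normal((5000, 2))
print(em_two_gaussians(X, rng=rng))             # approximately +/- true_mu
```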
Learning mixtures of Gaussians with censored data
We study the problem of learning mixtures of Gaussians with censored data. Statistical learning with censored data is a classical problem with numerous practical applications; however, finite-sample guarantees are missing even for simple latent variable models such as Gaussian mixtures. Formally, we are given censored data from a mixture of univariate Gaussians $\sum_{i=1}^k w_i \mathcal{N}(\mu_i,\sigma^2)$, i.e., a sample is observed only if it lies inside a set $S$. The goal is to learn the weights $w_i$ and the means $\mu_i$. We propose an algorithm that takes only $\frac{1}{\varepsilon^{O(k)}}$ samples to estimate the weights $w_i$ and the means $\mu_i$ within $\varepsilon$ error.
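The abstract does not spell out the estimator; the sketch below only sets up the censored data model it assumes, with samples from the mixture observed only when they land in the set $S$. The rejection-sampling mechanism and all names here are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def censored_mixture_samples(n, weights, means, sigma, S, rng=None):
    """Draw n observed samples from sum_i w_i * N(mu_i, sigma^2),
    keeping only draws that fall inside the censoring set S (a predicate).

    Illustrative data model only; the estimation procedure that achieves
    the 1/eps^{O(k)} sample bound is not shown.
    """
    rng = rng or np.random.default_rng()
    out = []
    while len(out) < n:
        i = rng.choice(len(weights), p=weights)   # pick a mixture component
        x = rng.normal(means[i], sigma)           # draw from that component
        if S(x):                                  # censoring: observe only inside S
            out.append(x)
    return np.array(out)

# e.g. samples from 0.6*N(0,1) + 0.4*N(3,1), observed only on [1, infinity)
xs = censored_mixture_samples(1000, [0.6, 0.4], [0.0, 3.0], 1.0, lambda x: x >= 1.0)
```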
- Award ID(s):
- 1956330
- PAR ID:
- 10542252
- Publisher / Repository:
- Proceedings of the 40th International Conference on Machine Learning
- Date Published:
- Volume:
- 202
- Page Range / eLocation ID:
- 33396-33415
- Subject(s) / Keyword(s):
- learning theory; mixture models; sample complexity; censored data
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
We consider the problem of learning the qualities $w_1, \ldots, w_n$ of a collection of items by performing noisy comparisons among them. We assume there is a fixed “comparison graph” and every neighboring pair of items is compared $k$ times. We study the popular Bradley-Terry-Luce model, where the probability that item $i$ wins a comparison against $j$ equals $w_i/(w_i + w_j)$. We are interested in how the expected error in estimating the vector $w = (w_1, \ldots, w_n)$ behaves in the regime when the number of comparisons $k$ is large. Our contribution is the determination of the minimax rate up to a constant factor. We show that this rate is achieved by a simple algorithm based on weighted least squares, with weights determined from the empirical outcomes of the comparisons. This algorithm can be implemented in nearly linear time in the total number of comparisons.
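One plausible reconstruction of such a weighted least-squares estimator (a sketch under our own assumptions, not necessarily the paper's exact algorithm): regress the logit of each edge's empirical win rate on the difference of log-qualities, weight each edge by the inverse asymptotic variance of that logit, and solve the resulting graph-Laplacian normal equations.

```python
import numpy as np

def btl_weighted_least_squares(edges, wins, k, n):
    """Estimate BTL log-qualities theta, where w_i = exp(theta_i).

    edges: list of (i, j) pairs, each compared k times;
    wins[e]: number of times i beat j on edge e.
    Since P(i beats j) = sigmoid(theta_i - theta_j), the logit of the
    empirical win rate is a noisy measurement of theta_i - theta_j.
    """
    L = np.zeros((n, n))                      # weighted graph Laplacian
    b = np.zeros(n)
    for (i, j), w in zip(edges, wins):
        p = np.clip(w / k, 1e-3, 1 - 1e-3)    # empirical win rate, clipped
        y = np.log(p / (1 - p))               # logit: estimates theta_i - theta_j
        c = k * p * (1 - p)                   # inverse asymptotic variance of y
        L[i, i] += c; L[j, j] += c
        L[i, j] -= c; L[j, i] -= c
        b[i] += c * y; b[j] -= c * y
    theta = np.linalg.lstsq(L, b, rcond=None)[0]   # min-norm solution
    return theta - theta.mean()               # theta is defined up to a shift
```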
Recent works have shown that diffusion models can learn essentially any distribution provided one can perform score estimation. Yet it remains poorly understood under what settings score estimation is possible, let alone when practical gradient-based algorithms for this task can provably succeed. In this work, we give the first provably efficient results along these lines for one of the most fundamental distribution families, Gaussian mixture models. We prove that gradient descent on the denoising diffusion probabilistic model (DDPM) objective can efficiently recover the ground-truth parameters of the mixture model in the following two settings: 1) we show gradient descent with random initialization learns mixtures of two spherical Gaussians in $d$ dimensions with $1/\mathrm{poly}(d)$-separated centers; 2) we show gradient descent with a warm start learns mixtures of $K$ spherical Gaussians with $\Omega(\log(\min(K,d)))$-separated centers. A key ingredient in our proofs is a new connection between score-based methods and two other approaches to distribution learning, the EM algorithm and spectral methods.
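As a hedged, single-noise-level simplification of setting 1) (our own reduction, not the paper's exact objective), one can run gradient descent on a denoising loss for the symmetric mixture $\frac{1}{2}\mathcal{N}(\theta, I) + \frac{1}{2}\mathcal{N}(-\theta, I)$: the score of the $\sigma$-smoothed mixture is $s(x) = (\tanh(\langle x,\theta\rangle/v)\,\theta - x)/v$ with $v = 1+\sigma^2$, and the denoising target at $\tilde{x} = x + \sigma z$ is $-z/\sigma$.

```python
import numpy as np

def ddpm_step(theta, X, sigma, rng, lr=0.1):
    """One stochastic gradient step on the single-level denoising loss
    E || s_theta(x + sigma*z) + z/sigma ||^2 for the symmetric mixture
    0.5*N(theta, I) + 0.5*N(-theta, I). A simplified sketch, not the
    paper's full DDPM objective.
    """
    v = 1.0 + sigma ** 2
    z = rng.standard_normal(X.shape)
    Xn = X + sigma * z                                # noised samples
    u = np.tanh(Xn @ theta / v)
    r = (u[:, None] * theta - Xn) / v + z / sigma     # score minus target
    # analytic gradient of the mean squared residual w.r.t. theta
    g = 2.0 * (u[:, None] * r
               + ((r @ theta) * (1.0 - u ** 2) / v)[:, None] * Xn).mean(axis=0) / v
    return theta - lr * g

# Usage: iterate from a random initialization.
rng = np.random.default_rng(1)
X = rng.choice([-1.0, 1.0], size=2000)[:, None] * np.array([1.5, 0.5]) \
    + rng.standard_normal((2000, 2))
theta = rng.standard_normal(2)
for _ in range(500):
    theta = ddpm_step(theta, X, sigma=0.5, rng=rng)
print(theta)          # should approach +/- [1.5, 0.5], up to sign and noise
```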
We show how to construct a $(1+\varepsilon)$-spanner over a set $P$ of $n$ points in $\mathbb{R}^d$ that is resilient to a catastrophic failure of nodes. Specifically, for prescribed parameters $\vartheta,\varepsilon \in (0,1)$, the computed spanner $G$ has $\mathcal{O}\bigl(\varepsilon^{-O(d)} \vartheta^{-6} n (\log\log n)^6 \log n\bigr)$ edges. Furthermore, for any $k$, and any deleted set $B \subseteq P$ of $k$ points, the residual graph $G \setminus B$ is a $(1+\varepsilon)$-spanner for all the points of $P$ except for $(1+\vartheta)k$ of them. No previous constructions, beyond the trivial clique with $\mathcal{O}(n^2)$ edges, were known with this resilience property (i.e., only a tiny additional fraction of vertices, $\vartheta|B|$, lose their distance-preserving connectivity). Our construction works by first solving the exact problem in one dimension, and then showing a surprisingly simple and elegant construction in higher dimensions that uses the one-dimensional construction in a black-box fashion.
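The fault-tolerant construction itself is involved; for intuition only, the sketch below implements the classic greedy $(1+\varepsilon)$-spanner (with no resilience guarantees), which illustrates the spanner property that the theorem preserves under deletions. Names and structure are our own.

```python
import heapq
import numpy as np

def graph_dist(adj, s, t):
    """Dijkstra distance from s to t in an adjacency dict {u: [(v, w), ...]}."""
    dist = {s: 0.0}
    pq = [(0.0, s)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == t:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return float("inf")

def greedy_spanner(P, eps):
    """Classic greedy (1+eps)-spanner on a point set P (n x d array):
    scan pairs by increasing Euclidean distance and add an edge only if
    the current graph distance exceeds (1+eps) times that distance."""
    n = len(P)
    pairs = sorted((np.linalg.norm(P[i] - P[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    adj = {i: [] for i in range(n)}
    for d, i, j in pairs:
        if graph_dist(adj, i, j) > (1 + eps) * d:
            adj[i].append((j, d))
            adj[j].append((i, d))
    return adj
```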
We study a fundamental online job admission problem where jobs with deadlines arrive online over time at their release dates, and the task is to determine a preemptive single-server schedule which maximizes the number of jobs that complete on time. To circumvent known impossibility results, we make a standard slackness assumption by which the feasible time window for scheduling a job is at least $1+\varepsilon$ times its processing time, for some $\varepsilon>0$. We quantify the impact that different provider commitment requirements have on the performance of online algorithms. Our main contribution is one universal algorithmic framework for online job admission both with and without commitments. Without commitment, our algorithm with a competitive ratio of $\mathcal{O}(1/\varepsilon)$ is the best possible (deterministic) for this problem. For commitment models, we give the first non-trivial performance bounds. If the commitment decisions must be made before a job's slack becomes less than a $\delta$-fraction of its size, we prove a competitive ratio of $\mathcal{O}(\varepsilon/((\varepsilon-\delta)\delta^2))$, for $0<\delta<\varepsilon$. When a provider must commit upon starting a job, our bound is $\mathcal{O}(1/\varepsilon^2)$. Finally, we observe that for scheduling with commitment the restriction to the “unweighted” throughput model is essential; if jobs have individual weights, we rule out competitive deterministic algorithms.
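For intuition about the model only (this is not the paper's competitive algorithm), the sketch below simulates a naive preemptive earliest-deadline-first server over unit time steps and counts on-time completions; the slackness assumption would require deadline - release >= (1 + eps) * processing for every job.

```python
import heapq

def edf_on_time(jobs):
    """Count jobs finished by their deadlines under preemptive EDF.

    jobs: (release, processing, deadline) triples with integer times.
    A naive baseline for the throughput model above.
    """
    jobs = sorted(jobs)                      # by release date
    pq, done, t, i = [], 0, 0, 0             # pq holds (deadline, remaining work)
    while i < len(jobs) or pq:
        if not pq and jobs[i][0] > t:
            t = jobs[i][0]                   # idle until the next release
        while i < len(jobs) and jobs[i][0] <= t:
            release, p, d = jobs[i]
            heapq.heappush(pq, (d, p))
            i += 1
        while pq and pq[0][0] <= t:          # too late to finish even one unit
            heapq.heappop(pq)                # drop the job
        if pq:
            d, p = heapq.heappop(pq)         # run the earliest-deadline job
            if p == 1:
                done += 1                    # finishes at t+1 <= d
            else:
                heapq.heappush(pq, (d, p - 1))
        t += 1
    return done

# e.g. three jobs; the middle one violates slackness and is dropped
print(edf_on_time([(0, 2, 3), (1, 5, 4), (2, 1, 6)]))  # -> 2
```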