skip to main content


Title: Critical points and convergence analysis of generative deep linear networks trained with Bures-Wasserstein loss
We consider a deep matrix factorization model of covariance matrices trained with the Bures-Wasserstein distance. While recent works have made advances in the study of the optimization problem for overparametrized low-rank matrix approximation, much emphasis has been placed on discriminative settings and the square loss. In contrast, our model considers another type of loss and connects with the generative setting. We characterize the critical points and minimizers of the Bures-Wasserstein distance over the space of rank-bounded matrices. The Hessian of this loss at low-rank matrices can theoretically blow up, which creates challenges to analyze convergence of gradient optimization methods. We establish convergence results for gradient flow using a smooth perturbative version of the loss as well as convergence results for finite step size gradient descent under certain assumptions on the initial weights.  more » « less
Award ID(s):
2212520 2145630
NSF-PAR ID:
10463369
Author(s) / Creator(s):
; ; ;
Editor(s):
Krause, Andreas; Brunskill, Emma; Cho, Kyunghyun; Engelhardt, Barbara; Sabato, Sivan; Scarlett, Jonathan
Date Published:
Journal Name:
Proceedings of the 40th International Conference on Machine Learning
Volume:
PMLR 202
Page Range / eLocation ID:
3106-3147
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We consider a deep matrix factorization model of covariance matrices trained with the Bures-Wasserstein distance. While recent works have made advances in the study of the optimization problem for overparametrized low-rank matrix approximation, much emphasis has been placed on discriminative settings and the square loss. In contrast, our model considers another type of loss and connects with the generative setting. We characterize the critical points and minimizers of the Bures-Wasserstein distance over the space of rank- bounded matrices. The Hessian of this loss at low-rank matrices can theoretically blow up, which creates challenges to analyze convergence of gradient optimization methods. We establish convergence results for gradient flow using a smooth perturbative version of the loss as well as convergence results for finite step size gradient descent under certain assumptions on the initial weights. 
    more » « less
  2. Abstract

    We study projection-free methods for constrained Riemannian optimization. In particular, we propose a Riemannian Frank-Wolfe (RFW) method that handles constraints directly, in contrast to prior methods that rely on (potentially costly) projections. We analyze non-asymptotic convergence rates ofRFWto an optimum for geodesically convex problems, and to a critical point for nonconvex objectives. We also present a practical setting under whichRFWcan attain a linear convergence rate. As a concrete example, we specializeRFWto the manifold of positive definite matrices and apply it to two tasks: (i) computing the matrix geometric mean (Riemannian centroid); and (ii) computing the Bures-Wasserstein barycenter. Both tasks involve geodesically convex interval constraints, for which we show that the Riemannian “linear” oracle required byRFWadmits a closed form solution; this result may be of independent interest. We complement our theoretical results with an empirical comparison ofRFWagainst state-of-the-art Riemannian optimization methods, and observe thatRFWperforms competitively on the task of computing Riemannian centroids.

     
    more » « less
  3. Low-rank matrix recovery is a fundamental problem in machine learning with numerous applications. In practice, the problem can be solved by convex optimization namely nuclear norm minimization, or by non-convex optimization as it is well-known that for low-rank matrix problems like matrix sensing and matrix completion, all local optima of the natural non-convex objectives are also globally optimal under certain ideal assumptions. In this paper, we study new approaches for matrix sensing in a semi-random model where an adversary can add any number of arbitrary sensing matrices. More precisely, the problem is to recover a low-rank matrix $X^\star$ from linear measurements $b_i = \langle A_i, X^\star \rangle$, where an unknown subset of the sensing matrices satisfies the Restricted Isometry Property (RIP) and the rest of the $A_i$'s are chosen adversarially. It is known that in the semi-random model, existing non-convex objectives can have bad local optima. To fix this, we present a descent-style algorithm that provably recovers the ground-truth matrix $X^\star$. For the closely-related problem of semi-random matrix completion, prior work [CG18] showed that all bad local optima can be eliminated by reweighting the input data. However, the analogous approach for matrix sensing requires reweighting a set of matrices to satisfy RIP, which is a condition that is NP-hard to check. Instead, we build on the framework proposed in [KLL$^+$23] for semi-random sparse linear regression, where the algorithm in each iteration reweights the input based on the current solution, and then takes a weighted gradient step that is guaranteed to work well locally. Our analysis crucially exploits the connection between sparsity in vector problems and low-rankness in matrix problems, which may have other applications in obtaining robust algorithms for sparse and low-rank problems. 
    more » « less
  4. We propose a unified framework to solve general low-rank plus sparse matrix recovery problems based on matrix factorization, which covers a broad family of objective functions satisfying the restricted strong convexity and smoothness conditions. Based on projected gradient descent and the double thresholding operator, our proposed generic algorithm is guaranteed to converge to the unknown low-rank and sparse matrices at a locally linear rate, while matching the best-known robustness guarantee (i.e., tolerance for sparsity). At the core of our theory is a novel structural Lipschitz gradient condition for low-rank plus sparse matrices, which is essential for proving the linear convergence rate of our algorithm, and we believe is of independent interest to prove fast rates for general superposition-structured models. We illustrate the application of our framework through two concrete examples: robust matrix sensing and robust PCA. Empirical experiments corroborate our theory. 
    more » « less
  5. We propose unbalanced versions of the quantum mechanical version of optimal mass transport that is based on the Lindblad equation describing open quantum systems. One of them is a natural interpolation framework between matrices and matrix-valued measures via a quantum mechanical formulation of Fisher-Rao information and the matricial Wasserstein distance, and the second is an interpolation between Wasserstein distance and Frobenius norm. We also give analogous results for the matrix-valued density measures, i.e., we add a spatial dependency on the density matrices. This might extend the applications of the framework to interpolating matrix-valued densities/images with unequal masses. 
    more » « less