We consider using gradient descent to minimize the nonconvex function
$f(X)=\phi(XX^{T})$ over an $n\times r$ factor matrix $X$, in which
$\phi$ is an underlying smooth convex cost function defined over
$n\times n$ matrices. While only a secondorder stationary point
$X$ can be provably found in reasonable time, if $X$ is additionally
\emph{rank deficient}, then its rank deficiency certifies it as being
globally optimal. This way of certifying global optimality necessarily
requires the search rank $r$ of the current iterate $X$ to be \emph{overparameterized}
with respect to the rank $r^{\star}$ of the global minimizer $X^{\star}$.
Unfortunately, overparameterization significantly slows down the convergence
of gradient descent, from a linear rate with $r=r^{\star}$ to a sublinear
rate when $r>r^{\star}$, even when $\phi$ is strongly convex. In
this paper, we propose an inexpensive preconditioner that restores
the convergence rate of gradient descent back to linear in the
overparameterized case, while also making it agnostic to possible
illconditioning in the global minimizer $X^{\star}$.
more »
« less
Preconditioned Gradient Descent for OverParameterized Nonconvex Matrix Factorization
In practical instances of nonconvex matrix factorization, the rank of the true solution r^{\star} is often unknown, so the rank rof the model can be overspecified as r>r^{\star}. This overparameterized regime of matrix factorization significantly slows down the convergence of local search algorithms, from a linear rate with r=r^{\star} to a sublinear rate when r>r^{\star}. We propose an inexpensive preconditioner for the matrix sensing variant of nonconvex matrix factorization that restores the convergence rate of gradient descent back to linear, even in the overparameterized case, while also making it agnostic to possible illconditioning in the ground truth. Classical gradient descent in a neighborhood of the solution slows down due to the need for the model matrix factor to become singular. Our key result is that this singularity can be corrected by \ell_{2} regularization with a specific range of values for the damping parameter. In fact, a good damping parameter can be inexpensively estimated from the current iterate. The resulting algorithm, which we call preconditioned gradient descent or PrecGD, is stable under noise, and converges linearly to an information theoretically optimal error bound. Our numerical experiments find that PrecGD works equally well in restoring the linear convergence of other variants of nonconvex matrix factorization in the overparameterized regime.
more »
« less
 Award ID(s):
 2047462
 NSFPAR ID:
 10314750
 Date Published:
 Journal Name:
 Advances in Neural Information Processing Systems
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


We propose a unified framework to solve general lowrank plus sparse matrix recovery problems based on matrix factorization, which covers a broad family of objective functions satisfying the restricted strong convexity and smoothness conditions. Based on projected gradient descent and the double thresholding operator, our proposed generic algorithm is guaranteed to converge to the unknown lowrank and sparse matrices at a locally linear rate, while matching the bestknown robustness guarantee (i.e., tolerance for sparsity). At the core of our theory is a novel structural Lipschitz gradient condition for lowrank plus sparse matrices, which is essential for proving the linear convergence rate of our algorithm, and we believe is of independent interest to prove fast rates for general superpositionstructured models. We illustrate the application of our framework through two concrete examples: robust matrix sensing and robust PCA. Empirical experiments corroborate our theory.more » « less

We propose a nonconvex estimator for the covariate adjusted precision matrix estimation problem in the high dimensional regime, under sparsity constraints. To solve this estimator, we propose an alternating gradient descent algorithm with hard thresholding. Compared with existing methods along this line of research, which lack theoretical guarantees in optimization error and/or statistical error, the proposed algorithm not only is computationally much more efficient with a linear rate of convergence, but also attains the optimal statistical rate up to a logarithmic factor. Thorough experiments on both synthetic and real data support our theory.more » « less

We consider a deep matrix factorization model of covariance matrices trained with the BuresWasserstein distance. While recent works have made advances in the study of the optimization problem for overparametrized lowrank matrix approximation, much emphasis has been placed on discriminative settings and the square loss. In contrast, our model considers another type of loss and connects with the generative setting. We characterize the critical points and minimizers of the BuresWasserstein distance over the space of rank bounded matrices. The Hessian of this loss at lowrank matrices can theoretically blow up, which creates challenges to analyze convergence of gradient optimization methods. We establish convergence results for gradient flow using a smooth perturbative version of the loss as well as convergence results for finite step size gradient descent under certain assumptions on the initial weights.more » « less

Krause, Andreas ; Brunskill, Emma ; Cho, Kyunghyun ; Engelhardt, Barbara ; Sabato, Sivan ; Scarlett, Jonathan (Ed.)We consider a deep matrix factorization model of covariance matrices trained with the BuresWasserstein distance. While recent works have made advances in the study of the optimization problem for overparametrized lowrank matrix approximation, much emphasis has been placed on discriminative settings and the square loss. In contrast, our model considers another type of loss and connects with the generative setting. We characterize the critical points and minimizers of the BuresWasserstein distance over the space of rankbounded matrices. The Hessian of this loss at lowrank matrices can theoretically blow up, which creates challenges to analyze convergence of gradient optimization methods. We establish convergence results for gradient flow using a smooth perturbative version of the loss as well as convergence results for finite step size gradient descent under certain assumptions on the initial weights.more » « less