In practical instances of nonconvex matrix factorization, the rank of the true solution r^{\star} is often unknown, so the rank r of the model can be overspecified as r>r^{\star}. This overparameterized regime significantly slows down the convergence of local search algorithms, from a linear rate when r=r^{\star} to a sublinear rate when r>r^{\star}. We propose an inexpensive preconditioner for the matrix sensing variant of nonconvex matrix factorization that restores the convergence rate of gradient descent back to linear, even in the overparameterized case, while also making it agnostic to possible ill-conditioning in the ground truth. Classical gradient descent slows down in a neighborhood of the solution because the overparameterized factor matrix must become singular there. Our key result is that this singularity can be corrected by \ell_{2} regularization with a specific range of values for the damping parameter; in fact, a good damping parameter can be inexpensively estimated from the current iterate. The resulting algorithm, which we call preconditioned gradient descent or PrecGD, is stable under noise and converges linearly to an information-theoretically optimal error bound. Our numerical experiments find that PrecGD works equally well in restoring the linear convergence of other variants of nonconvex matrix factorization in the overparameterized regime.
Preconditioned Gradient Descent for Overparameterized Nonconvex Burer-Monteiro Factorization with Global Optimality Certification
We consider using gradient descent to minimize the nonconvex function
$f(X)=\phi(XX^{T})$ over an $n\times r$ factor matrix $X$, in which
$\phi$ is an underlying smooth convex cost function defined over
$n\times n$ matrices. While only a second-order stationary point
$X$ can be provably found in reasonable time, if $X$ is additionally
\emph{rank deficient}, then its rank deficiency certifies it as being
globally optimal. This way of certifying global optimality necessarily
requires the search rank $r$ of the current iterate $X$ to be \emph{overparameterized}
with respect to the rank $r^{\star}$ of the global minimizer $X^{\star}$.
Unfortunately, overparameterization significantly slows down the convergence
of gradient descent, from a linear rate with $r=r^{\star}$ to a sublinear
rate when $r>r^{\star}$, even when $\phi$ is strongly convex. In
this paper, we propose an inexpensive preconditioner that restores
the convergence rate of gradient descent back to linear in the
overparameterized case, while also making it agnostic to possible
ill-conditioning in the global minimizer $X^{\star}$.
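The certification step described above can be sketched directly: given an (approximately) second-order stationary $X$, rank deficiency certifies global optimality. In the sketch below, second-order stationarity is assumed to have been established separately, and the tolerances are illustrative.

```python
import numpy as np

def certified_globally_optimal(X, grad_norm, tol=1e-6):
    """If X is second-order stationary for f(X) = phi(X X^T) with phi
    convex, and X is rank deficient, then X is globally optimal.
    Here we check only the first-order residual and the smallest
    singular value; the second-order condition is assumed verified
    elsewhere, and the tolerance is illustrative."""
    sigma_min = np.linalg.svd(X, compute_uv=False).min()
    return grad_norm <= tol and sigma_min <= tol

# An overparameterized factor with an identically zero column is rank
# deficient, so a vanishing gradient would certify it as globally optimal.
X = np.column_stack([np.ones(4), np.zeros(4)])
print(certified_globally_optimal(X, grad_norm=0.0))
```

Note that the check can only succeed when the search rank $r$ exceeds the rank of the iterate, which is why the certificate requires overparameterization.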
Award ID(s): 2047462
NSF-PAR ID: 10488545
Publisher / Repository: Journal of Machine Learning Research
Journal Name: Journal of Machine Learning Research
ISSN: 1533-7928
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this


Recently, there has been significant progress in understanding the convergence and generalization properties of gradient-based methods for training overparameterized learning models. However, many aspects, including the role of small random initialization and how the various parameters of the model are coupled during gradient-based updates to facilitate good generalization, remain largely mysterious. A series of recent papers has begun to study this role for nonconvex formulations of symmetric Positive Semi-Definite (PSD) matrix sensing problems, which involve reconstructing a low-rank PSD matrix from a few linear measurements. The underlying symmetry/PSDness is crucial to existing convergence and generalization guarantees for this problem. In this paper, we study a general overparameterized low-rank matrix sensing problem in which one wishes to reconstruct an asymmetric rectangular low-rank matrix from a few linear measurements. We prove that an overparameterized model trained via factorized gradient descent converges to the low-rank matrix generating the measurements. We show that in this setting, factorized gradient descent enjoys two implicit properties: (1) coupling of the trajectory of gradient descent, where the factors are coupled in various ways throughout the gradient update trajectory, and (2) an algorithmic regularization property, where the iterates show a propensity towards low-rank models despite the overparameterized nature of the factorized model. These two implicit properties in turn allow us to show that the gradient descent trajectory from small random initialization moves towards solutions that are both globally optimal and generalize well.
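The training procedure analyzed above can be sketched as plain factorized gradient descent from small random initialization on a synthetic instance. The problem sizes, Gaussian measurement operator, step size, and initialization scale below are illustrative choices, not the paper's exact setting.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, r_star, r = 15, 12, 2, 6      # search rank r overparameterizes r_star
k = 800                             # number of linear measurements

# Planted low-rank rectangular matrix and Gaussian measurements y = A(M*)
Mstar = rng.standard_normal((n, r_star)) @ rng.standard_normal((r_star, m))
A = rng.standard_normal((k, n * m)) / np.sqrt(k)
y = A @ Mstar.ravel()

def fgd(alpha=1e-3, eta=0.02, iters=2000):
    """Factorized gradient descent on 0.5*||A(U V^T) - y||^2 from a small
    random initialization of scale alpha (illustrative constants)."""
    U = alpha * rng.standard_normal((n, r))
    V = alpha * rng.standard_normal((m, r))
    for _ in range(iters):
        R = (A.T @ (A @ (U @ V.T).ravel() - y)).reshape(n, m)  # grad in M
        U, V = U - eta * R @ V, V - eta * R.T @ U              # simultaneous step
    return U, V

U, V = fgd()
rel_err = np.linalg.norm(U @ V.T - Mstar) / np.linalg.norm(Mstar)
print(rel_err)  # small: the overparameterized factors recover M*
```

Despite the search rank exceeding the true rank, the iterates stay close to low-rank throughout, which is the algorithmic-regularization phenomenon the abstract describes.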

A broad class of unsupervised deep learning methods, such as Generative Adversarial Networks (GANs), involves training overparameterized models in which the number of parameters of the model exceeds a certain threshold. Indeed, most successful GANs used in practice are trained using overparameterized generator and discriminator networks, both in terms of depth and width. A large body of work in supervised learning has shown the importance of model overparameterization in the convergence of gradient descent (GD) to globally optimal solutions. In contrast, the unsupervised setting and GANs in particular involve nonconvex-concave minimax optimization problems that are often trained using Gradient Descent/Ascent (GDA). The role and benefits of model overparameterization in the convergence of GDA to a global saddle point in nonconvex-concave problems are far less understood. In this work, we present a comprehensive analysis of the importance of model overparameterization in GANs, both theoretically and empirically. We theoretically show that in an overparameterized GAN model with a one-layer neural network generator and a linear discriminator, GDA converges to a global saddle point of the underlying nonconvex-concave minimax problem. To the best of our knowledge, this is the first result for global convergence of GDA in such settings. Our theory is based on a more general result that holds for a broader class of nonlinear generators and discriminators that obey certain assumptions (including deeper generators and random-feature discriminators). Our theory utilizes and builds upon a novel connection with the convergence analysis of linear time-varying dynamical systems, which may have broader implications for understanding the convergence behavior of GDA for nonconvex-concave problems involving overparameterized models. We also empirically study the role of model overparameterization in GANs using several large-scale experiments on the CIFAR-10 and CelebA datasets. Our experiments show that overparameterization improves the quality of generated samples across various model architectures and datasets. Remarkably, we observe that overparameterization leads to faster and more stable convergence behavior of GDA across the board.
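The GDA dynamics discussed above can be illustrated on a toy minimax problem with a linear discriminator. The objective, step sizes, and the regularization that keeps the inner maximization strongly concave are all illustrative assumptions, far simpler than the GAN models the paper analyzes.

```python
import numpy as np

mu_real = np.array([2.0, -1.0])   # mean of the "real" data the generator must match
lam = 1.0                         # discriminator regularization; keeps the
                                  # inner maximization strongly concave

def gda(eta_g=0.1, eta_d=0.5, iters=300):
    """Simultaneous gradient descent/ascent on the toy objective
    V(theta, w) = w.(mu_real - theta) - (lam/2)*||w||^2,
    with a linear discriminator w and generator parameters theta."""
    theta = np.zeros(2)           # generator parameters (the fake mean)
    w = np.zeros(2)               # linear discriminator
    for _ in range(iters):
        grad_theta = -w                        # dV/dtheta (descent step)
        grad_w = (mu_real - theta) - lam * w   # dV/dw (ascent step)
        theta, w = theta - eta_g * grad_theta, w + eta_d * grad_w
    return theta, w

theta, w = gda()
print(theta, w)  # theta -> mu_real, w -> 0 at the global saddle point
```

Without the concave regularization term the same simultaneous updates can cycle rather than converge, which is one reason the convergence of GDA is delicate in the general nonconvex-concave setting.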

This paper considers the minimization of a general objective function f(X) over the set of non-square n × m matrices where the optimal solution X* is low-rank. To reduce the computational burden, we factorize the variable X into a product of two smaller matrices and optimize over these two matrices instead of X. We analyze the global geometry for a general yet well-conditioned objective function f(X) whose restricted strong convexity and restricted strong smoothness constants are comparable. In particular, we show that the reformulated objective function has no spurious local minima and obeys the strict saddle property. These geometric properties imply that a number of iterative optimization algorithms (such as gradient descent) can provably solve the factored problem with global convergence.
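One concrete instance of the factored approach above, using the simplest well-conditioned objective f(X) = ½‖X − X*‖_F². The balancing regularizer ‖UᵀU − VᵀV‖_F², commonly added in this line of work to keep the two factors comparable, together with the step size and initialization, are illustrative choices rather than this paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, r = 12, 10, 3
Xstar = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))  # low-rank target
beta = 0.5    # weight of the balancing term (illustrative)
eta = 0.02

# Factored objective g(U, V) = f(U V^T) + (beta/4)*||U^T U - V^T V||_F^2
# with f(X) = 0.5*||X - Xstar||_F^2 as the simplest well-conditioned f.
U = 0.5 * rng.standard_normal((n, r))
V = 0.5 * rng.standard_normal((m, r))
for _ in range(3000):
    R = U @ V.T - Xstar            # gradient of f at X = U V^T
    D = U.T @ U - V.T @ V          # imbalance between the two factors
    U, V = U - eta * (R @ V + beta * U @ D), V - eta * (R.T @ U - beta * V @ D)

rel_err = np.linalg.norm(U @ V.T - Xstar) / np.linalg.norm(Xstar)
print(rel_err)  # small: gradient descent on the factors recovers X*
```

Because the factored landscape has no spurious local minima and satisfies the strict saddle property, plain gradient descent from a generic random initialization reaches a global minimizer here.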

We study a deep learning inspired formulation for the blind demodulation problem, which is the task of recovering two unknown vectors from their entrywise multiplication. We consider the case where the unknown vectors lie in the ranges of known deep generative models, G(1): R^n → R^l and G(2): R^p → R^l. When the networks corresponding to the generative models are expansive, the weight matrices are random, and the dimension of the unknown vectors satisfies l = Omega(n^2 + p^2), up to log factors, we show that the empirical risk objective has a favorable landscape for optimization. That is, the objective function has a descent direction at every point outside of a small neighborhood around four hyperbolic curves. We also characterize the local maximizers of the empirical risk objective and hence show that no stationary points exist outside of these neighborhoods around the four hyperbolic curves and the set of local maximizers. We also implement a gradient descent scheme inspired by the geometry of the landscape of the objective function. In order to converge to a global minimizer, this gradient descent scheme exploits the fact that exactly one of the hyperbolic curves corresponds to the global minimizer, so points near this hyperbolic curve have a lower objective value than points close to the other, spurious hyperbolic curves. We show that this gradient descent scheme can effectively remove distortions synthetically introduced to the MNIST dataset.
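A small numerical sketch of the formulation above: two random expansive one-layer ReLU generative networks, a planted pair of codes, and plain gradient descent on the empirical risk ½‖G(1)(h) ⊙ G(2)(m) − y‖². Initializing near the planted codes (rather than running the paper's full landscape-guided scheme), and all constants below, are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, l = 3, 3, 200               # expansive networks: l >> n, p

# One-layer ReLU generative networks with random weights
W1 = rng.standard_normal((l, n)) / np.sqrt(l)
W2 = rng.standard_normal((l, p)) / np.sqrt(l)
G1 = lambda h: np.maximum(W1 @ h, 0.0)
G2 = lambda m: np.maximum(W2 @ m, 0.0)

# Planted codes and observed entrywise product
h_true, m_true = rng.standard_normal(n), rng.standard_normal(p)
y = G1(h_true) * G2(m_true)

def grads(h, m):
    """Gradients of 0.5*||G1(h) * G2(m) - y||^2 via the chain rule,
    masking through the ReLU activations."""
    a, b = G1(h), G2(m)
    resid = a * b - y
    gh = W1.T @ ((resid * b) * (W1 @ h > 0))
    gm = W2.T @ ((resid * a) * (W2 @ m > 0))
    return gh, gm

# Gradient descent from a small perturbation of the planted codes
h = h_true + 0.05 * rng.standard_normal(n)
m = m_true + 0.05 * rng.standard_normal(p)
eta = 50.0
for _ in range(1000):
    gh, gm = grads(h, m)
    h, m = h - eta * gh, m - eta * gm

rel_resid = np.linalg.norm(G1(h) * G2(m) - y) / np.linalg.norm(y)
print(rel_resid)  # small: the product of the recovered outputs matches y
```

The residual, not the codes themselves, is the right success measure here: the product is unchanged along the scaling curve (c·h, m/c) for c > 0, which is the hyperbolic ambiguity the abstract describes.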