<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Smoothed analysis for low-rank solutions to semidefinite programs in quadratic penalty form</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2018</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10098817</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of the 31st Conference On Learning Theory, PMLR</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Srinadh Bhojanapalli</author><author>Nicolas Boumal</author><author>Prateek Jain</author><author>Praneeth Netrapalli</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Semidefinite programs (SDP) are important in learning and combinatorial optimization with numerous applications. In pursuit of low-rank solutions and low complexity algorithms, we consider the Burer–Monteiro factorization approach for solving SDPs. For a large class of SDPs, upon random perturbation of the cost matrix, with high probability, we show that all approximate second-order stationary points are approximate global optima for the penalty formulation of appropriately rank-constrained SDPs, as long as the number of constraints scales sub-quadratically with the desired rank. Our result is based on a simple penalty function formulation of the rank-constrained SDP along with a smoothed analysis to avoid worst-case cost matrices. We particularize our results to two applications, namely, Max-Cut and matrix completion.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Semidefinite programs (SDP) are an important class of optimization problems <ref type="bibr">(Vandenberghe and Boyd, 1996)</ref>, and are critical to several learning-related tasks, including clustering <ref type="bibr">(Shi and Malik, 2000;</ref><ref type="bibr">Abbe, 2017)</ref>, matrix completion and regression <ref type="bibr">(Recht et al., 2010;</ref><ref type="bibr">Cand&#232;s and Recht, 2009)</ref>, kernel learning <ref type="bibr">(Lanckriet et al., 2004)</ref>, and sum-of-squares relaxations <ref type="bibr">(Barak et al., 2015)</ref>.</p><p>However, solving SDPs in practice is a challenging task. Consider the following canonical SDP:</p><p>where C, A 1 , . . . , A m &#8712; R n&#215;n are symmetric matrices, A, B = Tr A T B , and X is positive semidefinite. Such problems are convex and can be solved in polynomial time using classical iterative algorithms such as ellipsoid and interior-point methods <ref type="bibr">(Nesterov et al., 1994)</ref>. However, these algorithms have super-linear complexity (in input size) and tend to scale poorly in practice. As such, they are not well suited for typical learning tasks where both m and n can be fairly large. The two key challenges for these algorithms are: (a) a search space of high dimension on the order of n 2 ; and (b) the need to maintain positive semidefiniteness of the variable matrix X throughout the iterations.</p><p>In response to these challenges, <ref type="bibr">Burer and</ref><ref type="bibr">Monteiro (2003, 2005)</ref> suggested solving (1) by constraining the search space to matrices of rank at most k, using a parameterization of the form X = U U T where U &#8712; R n&#215;k . This reduces the number of variables from O(n 2 ) to O(nk), and mechanically enforces positive semidefiniteness:</p><p>This is equivalent to (1) with the additional constraint rank(X) &#8804; k. This rank constraint is fairly natural, as several SDPs of interest are themselves relaxations of rank-constrained problems. Moreover, <ref type="bibr">Barvinok (1995)</ref>; <ref type="bibr">Pataki (1998)</ref> showed that for every feasible compact SDP, there exists a rank O( &#8730; m) solution that is also globally optimal. While this ensures that the global optimum of the factored SDP problem (with k = &#8486;( &#8730; m)) maps to a global optimum of the original SDP, it is not immediately clear how to solve the factorized problem.</p><p>In fact, the factorized problem is a non-convex quadratically constrained quadratic program which in general can be NP-hard. The challenge in solving the problem arises due to the non-convexity as well as due to constraints. In this work, we consider a simple penalty method approach that replaces the constraints with a quadratic penalty in the objective function. In settings where it is not easy to project onto the constraint set (as in problem (1)), it is natural for optimization algorithms to resort to penalty formulations. This is the strategy of popular methods including the interior point, augmented Lagrangian and primal-dual methods. As a first step we consider the simplest such formulation, namely the quadratic penalty method.</p><p>The proposed penalty formulation is given by:</p><p>where &#181; is generally a large positive constant. Notice that this is a convex problem. Intuitively, for increasingly large &#181;, solutions of (3) converge to solutions of (1).</p><p>Combining that formulation with the Burer-Monteiro factorization we get:</p><p>The cost function L &#181; is non-convex, and generic optimization algorithms can only guarantee computation of an approximate second-order stationary point (SOSP) <ref type="bibr">(Cartis et al., 2012;</ref><ref type="bibr">Ge et al., 2015)</ref>.</p><p>That is, such algorithms converge to a point U where the gradient of L &#181; is small and the Hessian of L &#181; is almost positive semidefinite. Such approximate SOSPs need not be close to optimal in general.</p><p>As a first contribution, we construct an explicit SDP where a suboptimal SOSP exists even for k as large as n -1. However, we show that such bad SDPs have measure zero among all SDPs when k is large enough. Specifically, we show that if the cost matrix has a small amount of randomness and</p><p>We next address the question of approximate optimality for approximate SOSPs, as optimization algorithms can only recover approximate SOSPs in polynomial time. Since there is a measure zero set of SDPs with bad SOSPs, there can be a non-zero (but likely small) measure set of SDPs with bad approximate SOSPs. We use smoothed analysis to avoid these bad SDPs, by perturbing the objective matrix. We show that for k = &#937;( &#8730; m) and with high probability upon an appropriately scaled perturbation, any approximate SOSP of L &#181; with a perturbed objective and bounded residues is approximately optimal for the perturbed and penalized objective (3). We further discuss settings under which all SOSPs of the penalized problem have bounded norm (and hence bounded residues).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1.">Main results</head><p>The main contributions of this work as follows.</p><p>&#8226; We propose a simple penalty version of the factored SDP (2) and show that, for almost all cost matrices C, any exact SOSP of the rank-constrained formulation ( <ref type="formula">4</ref>) is a global optimum for rank &#8486;( &#8730; m)-see Corollary 4. This result removes the smooth manifold requirement of <ref type="bibr">(Boumal et al., 2016)</ref>, though it applies to (3), not (1).</p><p>&#8226; We show that there indeed exists a compact, feasible SDP with a worst-case C for which the penalized, factorized problem admits a suboptimal SOSP (see Theorem 5), even for rank almost as big as the dimension.</p><p>&#8226; We show that by perturbing the objective function slightly and by performing a smoothed analysis on the resulting problem, we can guarantee with high probability that every approximate SOSP of the perturbed problem is an approximate global optimum of the perturbed and penalized SDP. Hence, we can use standard techniques <ref type="bibr">(Cartis et al., 2012;</ref><ref type="bibr">Ge et al., 2015)</ref> to find approximate SOSPs and bound the optimality gap-see Theorem 13.</p><p>In summary, we show that for a class of SDPs with bounded solutions, we can find a low-rank solution that is close to global optimality for the penalized objective. Note that finding the smallest rank matrix satisfying a set of linear equations is NP hard <ref type="bibr">(Natarajan, 1995)</ref>. Our results show how increasing the number of parameters (rank) makes the optimization of this non-convex problem easier. While the extreme case setting k = n makes the constraint trivial, our results show optimality for a non-trivial rank ( &#937;( &#8730; m)), and it is an interesting question to understand this trade-off in more detail. We believe that the factorization technique can be leveraged to design faster SDP solvers, and any looseness in the current bounds is an artifact of our proof, which hopefully can be tightened in future works.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2.">Prior work</head><p>Fast solvers for SDPs have garnered interest in the optimization and in the theoretical computer science communities for a long time. Most of the existing results for SDP solvers can be categorized into direct (convex) methods and factorization methods.</p><p>Convex methods: Classical techniques such as interior point methods <ref type="bibr">(Nesterov and</ref><ref type="bibr">Nemirovski, 1989, 1988;</ref><ref type="bibr">Alizadeh, 1995)</ref> and cutting plane methods <ref type="bibr">(Anstreicher, 2000;</ref><ref type="bibr">Krishnan and Mitchell, 2003)</ref> enjoy geometric convergence, but their computational complexity per iteration is high. As a result, it is hard to scale these methods to SDPs with a large number of variables.</p><p>With the goal of speeding up the computation, many works have considered: i) a specific and important class of SDPs, namely, SDPs with a trace constraint (Tr (X) = 1), and ii) methods with sub-linear convergence. For these SDPs, <ref type="bibr">Arora et al. (2005)</ref> proposed a multiplicative weights method which provides faster techniques for some graph problems, with running time depending on O( <ref type="formula">1</ref>2 ) and the width of the problem. <ref type="bibr">Hazan (2008)</ref> proposed a Frank-Wolfe-type algorithm with a complexity of &#213;( Z 3.5 ) where Z is the sparsity of C and the A i 's. <ref type="bibr">Garber and Hazan (2016)</ref>; <ref type="bibr">Garber (2016)</ref> proposed faster methods that either remove the dependence on Z (sub-linear time), or improve the dependence on . While these methods improve the per iteration complexity, they still need significant memory as the rank of solutions for these methods is not bounded, and scales at least at the rate of O( <ref type="formula">1</ref>). An exception to this is the work by <ref type="bibr">Yurtsever et al. (2017)</ref>, which uses sketching techniques in combination with conditional gradient method to maintain a low rank representation. However this method is guaranteed to find a low rank optimum only if the conditional gradient method converges to a low rank solution.</p><p>Factorization methods: <ref type="bibr">Burer and</ref><ref type="bibr">Monteiro (2003, 2005)</ref> proposed a different approach to speed up computations, namely by searching for solutions with smaller rank. Even though all feasible compact SDPs have at least one solution of rank O( &#8730; m) <ref type="bibr">(Barvinok, 1995;</ref><ref type="bibr">Pataki, 1998)</ref>, it is not an easy task to optimize directly on the rank-constrained space because of non-convexity. However, <ref type="bibr">Burer and</ref><ref type="bibr">Monteiro (2003, 2005)</ref> showed that any rank-deficient local minimum is optimal for the SDP. <ref type="bibr">Journ&#233;e et al. (2010)</ref> extended this result to any rank-deficient SOSP under restrictive conditions on the SDP. However, these results cannot guarantee that SOSPs are rank deficient, or at least that rank-deficient SOSPs can be computed efficiently (or even exist). <ref type="bibr">Boumal et al. (2016</ref><ref type="bibr">Boumal et al. ( , 2018b) )</ref> address this issue by showing that for a particular class of SDPs satisfying some regularity conditions, and for almost all cost matrices C, any SOSP of the rank-constrained problem with k = &#8486;( &#8730; m) is a global optimum. Later, <ref type="bibr">Mei et al. (2017)</ref> showed that for SDPs with elliptic constraints (similar to the Max-Cut SDP), any rank-k SOSP gives a (1 -1 k-1 ) approximation to the optimum value. Both these results are specific to particular classes of SDPs and do not extend to general problems.</p><p>In a related setup, <ref type="bibr">Keshavan et al. (2010)</ref>; <ref type="bibr">Jain et al. (2013)</ref> have showed that rank-constrained matrix completion problems can be solved using smart initialization strategies followed by local search methods. Following this, many works have identified interesting statistical conditions under which certain rank-constrained matrix problems have no spurious local minima <ref type="bibr">(Sun et al., 2016;</ref><ref type="bibr">Bandeira et al., 2016;</ref><ref type="bibr">Ge et al., 2016;</ref><ref type="bibr">Bhojanapalli et al., 2016b;</ref><ref type="bibr">Park et al., 2017;</ref><ref type="bibr">Ge et al., 2017;</ref><ref type="bibr">Zhu et al., 2017;</ref><ref type="bibr">Ge and Ma, 2017)</ref>. These results are again for specific problems and do not extend to general SDPs.</p><p>In contrast, our result holds for a large class of SDPs in penalty form, without strong assumptions on the constraint matrices A i and for a large class of cost matrices C. We avoid degenerate SDPs with spurious local minima by perturbing the problem and then using a smoothed analysis, which is one of the main contribution of the work.</p><p>After the initial preprint of this work became available, we learned about work by Du and Lee (2018), who in a parallel have shown optimality of exact SOSPs using similar techniques as we used in Section 2, but for single hidden layer neural networks with quadratic activations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Notation</head><p>For a smooth function f (X), we refer to first-order stationary points X as FOSPs. Such points satisfy &#8711;f (X) = 0 (zero gradient). We refer to second-order stationary points at SOSPs. Such points are FOSPs and furthermore satisfy &#8711; 2 f (X) 0, i.e., the Hessian is positive semidefinite. The set of symmetric matrices of size n is S n&#215;n . &#963; i () and &#955; i () denote the ith singular-and eigenvalues respectively, in decreasing order.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Exact second-order points typically are optimal</head><p>In this section, we study the second-order stationary points of our penalty formulation (4) and show that for "typical" cost matrices C, exact SOSPs are optimal for (4) as long as k = &#8486;( &#8730; m). Our result is based on a simple but powerful argument that has appeared in various forms before, notably in <ref type="bibr">(Burer and Monteiro, 2005)</ref>. The argument claims that any rank-deficient local optimum of (4) (which is really a parameterized version of (3) with a rank constraint) should map to a local optimum of (3) as the constraint rank(X) &#8804; k is not active. Since (3) is convex, every local optimum is a global optimum, hence a rank-deficient local optimum of (4) maps to a global optimum of (3). Interestingly, the result holds even if U is just an SOSP rather than a local optimum, something that is readily apparent from the proofs in <ref type="bibr">(Journ&#233;e et al., 2010)</ref>, albeit in a restricted setting.</p><p>(5)</p><p>Now consider the rank-constrained factorized version of the problem:</p><p>If U is an SOSP of (6) with rank(U ) &lt; k, then U is a global minimum of (6) and U U T is a global minimum of (5). (Notice that such a point may not exist in general.)</p><p>See Appendix D for a detailed proof. Thus, (column) rank-deficient SOSPs of (4) are globally optimal and map to global optima of (3). A direct corollary states that non-convexity is benign if k = n + 1.</p><p>Corollary 2 Given an SDP in penalized and factorized form (4) with k &gt; n, deterministically, any SOSP U is a global optimum, and U U T is a global optimum for (3).</p><p>Yet, the main goal is to make a statement for small k, so as to reduce the dimensionality of the search space. Unfortunately, in general, SOSPs of non-convex cost functions need not have rank less than k for arbitrary k.</p><p>However, the following lemma asserts that, for almost all cost matrices C, provided k grows like &#8730; m, all FOSPs (a fortiori, all SOSPs) are rank deficient. Our proof is the same as that of <ref type="bibr">(Boumal et al., 2016, Lemma 9)</ref> but the main statement as well as the cost function and conditions on constraints are stated more generally. In particular, unlike the statement in that reference, we do not require that the feasible set of (2) form a smooth manifold.</p><p>See Appendix D for a detailed proof.</p><p>These two lemmas lead to an important corollary regarding the factorization approach.</p><p>Corollary 4 Given an SDP in penalized and factorized form (4) with k such that k(k+1)</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>2</head><p>&gt; m, for almost any cost matrix C, deterministically, any SOSP U is a global optimum, and U U T is a global optimum for (3).</p><p>To ensure existence of such solutions, it is necessary to include additional conditions (for example, on the constraints of the SDP.) From <ref type="bibr">(Pataki, 1998;</ref><ref type="bibr">Barvinok, 1995)</ref>, it is known that SDPs with non-empty, compact search spaces can have a unique solution of rank up to the maximal k such that</p><p>&#8804; m. This indicates that, in general, the condition on k cannot be improved. These observations lead to the following two natural questions:</p><p>1. This result holds only for "typical" C. Is this an artifact of our proof technique, or is it necessary to exclude a zero-measure set of cost matrices C?</p><p>2. This result holds only for exact SOSPs, which in general are hard to compute. Numerical methods tend to provide approximate SOSPs only. Can we extend the results to approximate SOSPs as well?</p><p>The next section answers the first question in the affirmative: there do exist "bad" matrices C for some SDPs, so that any result of the type of Corollary 4 must exclude at least some SDPs. To address the second question, we resort to smoothed analysis, that is, for large classes of SDPs in penalty form, upon perturbing the cost matrix randomly, we show that, with high probability, approximate SOSPs are also good enough to obtain approximately globally optimal solutions of the perturbed problem.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Exact second-order points sometimes are suboptimal</head><p>Below, we construct an SDP which confirms that it is indeed necessary (in full generality) to exclude some SDPs in Corollary 4, even if k is allowed to grow large.</p><p>Pick n &#8805; 3 and set = 6 n-1 . Consider the following m = n + 1 constraint matrices in S n&#215;n :</p><p>where e i &#8712; R n is the ith standard basis vector (the ith column of I n ). In words, for i = 1, . . . , n -1, each A i has only two non-zero entries-both equal to one-located in row i of the last column and symmetrically in column i of the last row. Pair these matrices with the right-hand side vector b &#8712; R m defined by</p><p>Finally, set the cost matrix C to be zero. (A distinct advantage of picking C = 0 is that it makes the choice of &#181; &gt; 0 irrelevant in defining (4).) These prescriptions fully define the SDP (1) and its associated factorized and penalized problem (4), which we can write here as:</p><p>Theorem 5 The SDP defined above admits a global optimum of rank 1. Furthermore, for k = n -1,</p><p>See Appendix E for a detailed proof of the theorem.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Approximate second-order points: smoothed analysis</head><p>Recall that Corollary 4 shows that exact SOSPs of (4) are optimal for almost all cost matrices C. However, obtaining exact SOSPs is challenging in practice. Standard optimization algorithms such as the trust-region method and the cubic regularization method <ref type="bibr">(Nesterov and Polyak, 2006;</ref><ref type="bibr">Cartis et al., 2012)</ref>, when run for finitely many iterations, converge to an approximate SOSP only, as defined below. All proofs for this section are in Appendix F.</p><p>As an extension to Lemma 1-which states rank-deficient exact SOSPs are optimal-we now show that approximate SOSPs which are also approximately rank deficient are indeed approximately optimal. To this end, we define the linear operator A : S n&#215;n &#8594; R m with A(X) i = A i , X . We use the following notion of norm for A:</p><p>Furthermore, we define the residue at a point U to be the vector of constraint violations:</p><p>Furthermore, if a global optimum X for (3) exists, then the optimality gap obeys:</p><p>(Once again, we stress that U and X as prescribed may not exist in general.)</p><p>To reach a statement about approximate optimality of approximate SOSPs, it remains to show that approximate FOSPs are approximately rank deficient. Such a result would constitute a generalization of Lemma 3. In that lemma, we had to exclude a pathological set of "bad" matrices C. Hence, here too, we expect to encounter difficulties with some C's.</p><p>For this reason, we resort to a smoothed analysis. That is: on the off-chance that the cost matrix C is "bad", we perturb it with a random Gaussian matrix. We further assume that (a) k is large enough, and (b) approximate FOSPs have bounded residues r. That residues are indeed bounded is established under special conditions in later subsections.</p><p>Theorem 9 Draw a random matrix G with G ij &#8764; N (0, &#963; 2 G ) i.i.d. for i &#8804; j and G = G T . Let U &#8712; R n&#215;k be an -FOSP of (4) with perturbed cost matrix C + G. Assume there exists a constant B which only depends on the problem parameters A, b, C and &#181; such that:</p><p>1. With probability at least 1 -&#948; on the choice of G, all -FOSPs of the perturbed problem have bounded residue: r 2 &#8804; B, and</p><p>for some &#948; &#8712; (0, 1), where c 0 is a universal constant.</p><p>Then, with probability at least 1 -&#948; -&#948; ,</p><p>Crucially, notice that rank(A) &#8804; m, so (up to log factors) k is required to grow like &#8730; m, as desired.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Compact SDPs</head><p>To leverage Lemma 8 and Theorem 9, we must control the residues at approximate FOSPs of (4). This is delicate in general. In this part, we make the following assumption.</p><p>Assumption 10 The search space C = {X 0 : A(X) = b} of the SDP (1) is non-empty and compact, where A : S n&#215;n &#8594; R m is the linear operator defined by A(X) i = A i , X .</p><p>When this is the case, standard results from <ref type="bibr">(Barvinok, 1995;</ref><ref type="bibr">Pataki, 1998)</ref> guarantee the existence of a global optimum of rank r where r(r+1) 2 &#8804; m for the SDP (1)-always. It is reasonable to expect such low-rank solutions might also exist for the penalized problem (3), and that one should be able to compute these by solving the factorized problem (4)-at least, generically. This section is about making these expectations precise in the soft case, where one only computes approximate SOSPs.</p><p>A technical necessity in our proofs is to show that FOSPs of (4) have bounded norm. To do this, we need a technical modification of (4). Specifically, consider the following geometric fact.</p><p>Proposition 11 For a given SDP (1), assume C is non-empty. Then, C is compact if and only if there exists a positive definite matrix A 0 and a nonnegative real b 0 such that A 0 , X = b 0 for all X &#8712; C. Furthermore, unless C = {0}, b 0 &gt; 0.</p><p>Thus, under Assumption 10, we can rewrite (1) with an explicit redundant constraint involving A 0 0: minimize</p><p>Accordingly, we define &#195; : S n&#215;n &#8594; R m+1 and b &#8712; R m+1 such that &#195;(X) i = A i , X for i = 0, . . . , m, and C = {X 0 : &#195;(X) = b}. With the extended residue definition</p><p>the associated penalty formulations are:</p><p>We note that, in full generality, finding (A 0 , b 0 ) as in Proposition 11 may be as hard as solving an SDP, but in practical applications (A 0 , b 0 ) may be easy to determine. In particular, SDPs with a trace constraint satisfy this with A 0 = I n . For example, for the Max-Cut SDP, feasible matrices have constant trace n, so that A 0 = I n and b 0 = n are suitable. For this modified formulation, approximate FOSPs have bounded norm and bounded residues.</p><p>Lemma 12 Consider problem (13) with A 0 0 and b 0 &#8805; 0. For any U ,</p><p>.</p><p>We are now ready to state the main result by connecting Lemma 8 and Theorem 9 via Lemma 12.</p><p>Let B 0 &#195; max</p><p>Theorem 13 (Global optimality.) Let X be a global optimum of (12). Let &#948; &#8712; (0, 1) and c 0 be a universal constant. Draw a random matrix G with G ij &#8764; N (0, &#963; 2 G ) i.i.d. for i &#8804; j and G = G T . Let U &#8712; R n&#215;k be an ( , &#947;)-SOSP of (13) with perturbed cost matrix C + G and:</p><p>Then, with probability at least 1 -O(&#948;) the optimality gap obeys:</p><p>This result shows that for compact SDPs (10), for k = &#937;( &#8730; m), with high probability upon the perturbation, all approximate SOSPs of the perturbed factorized problem are approximately optimal.</p><p>Notice that the result requires smaller than &#963; G , which is limiting but unavoidable as there can be SDPs with bad approximate SOSPs. Hence, if we perturb by only a small amount (small &#963; G ), then we need to find highly accurate SOSPs to avoid these bad approximate SOSPs. Another way to look at the result is to see &#963; G as a tentative distance from bad SDPs. Hence, for SDPs far away from these bad problems (higher &#963; G ), even high solutions are approximately ( -) optimal.</p><p>Setting &#181; : From equation ( <ref type="formula">14</ref>), to get residues of magnitude less than some (approximate feasibility) at an approximate SOSP U , we need &#181; to be O</p><p>Using penalty formulations such as ALM can help us in getting similar approximate feasiblity for much smaller values of &#181;. For general penalty problems we refer to Proposition 2.4 of <ref type="bibr">(Bertsekas, 2014)</ref> which gives the relationship between the value of &#181; and the feasibility of a solution.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">SDPs with positive definite cost</head><p>We now consider a second class of SDPs: ones where the cost matrix C is positive definite. The feasible set of these SDPs need not be compact. However, FOSPs for these SDPs are bounded, hence we will be able to show similar results as in Section 4.1. Consider the penalty formulation of the perturbed problem, minimize</p><p>where G is a symmetric random matrix with G ij i.i.d.</p><p>&#8764; N (0, &#963; 2 G ) for i &#8804; j. Let F &#181; (U U T ) = L &#181; (U ). To prove an optimality result for this problem, we first show a residue bound for any -FOSP of L &#181; (U ). . Then, with probability at least 1 -&#948;, at any -FOSP U of (15), the residue obeys:</p><p>Using this, we get the following result from Lemma 8 and Theorem 9 along same lines as that of Theorem 13. Let</p><p>Theorem 15 (Global optimality.) Let &#948; &#8712; (0, 1) and c 0 be a universal constant. Given an SDP (1) with positive definite objective matrix C, let X be a global optimum of the perturbed problem (15), and let</p><p>. Let U be an ( , &#947;)-SOSP of the perturbed problem (15) with:</p><p>Then, with probability at least 1 -O(&#948;),</p><p>This result shows that even though the feasible set of SDP is not compact, as long as the objective is positive definite, all approximate SOSPs of the perturbed objective are approximately optimal (with high probability upon the perturbation). Without the positive definite condition, SDPs can have unbounded solutions (see Section 2.4 of <ref type="bibr">G&#228;rtner and Matousek (2012)</ref>). We also require a bound on the magnitude of the perturbation (&#963; G ), as otherwise the objective (C + G) can be indefinite with (too) high probability, which may result in unbounded solutions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Applications</head><p>We present applications of our results to two SDPs: Max-Cut (deferred to Section A) and matrix completion, both of which are important problems in the learning domain and have been studied extensively. Interest has grown to develop efficient solvers for these SDPs <ref type="bibr">(Arora and Kale, 2007;</ref><ref type="bibr">Mei et al., 2017;</ref><ref type="bibr">Hardt, 2013;</ref><ref type="bibr">Bandeira et al., 2016)</ref>.</p><p>This work differs from previous efforts in at least two ways. First, we aim to demonstrate that Burer-Monteiro-style approaches, which are often used in practice, can indeed lead to provably efficient algorithms for general SDPs. We believe that building upon this work, it should be possible to improve the time-complexity guarantees of such factorization-based algorithms. Second, we note that several problems formulated as SDPs in fact necessitate low-rank solutions, for example because of memory concerns (as is the case in matrix completion), and factorization approaches provide a natural means to control rank.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Matrix Completion</head><p>In this section we specialize our results for the matrix completion problem Cand&#232;s and <ref type="bibr">Recht (2009)</ref>. The goal of a matrix completion problem is to find a low-rank matrix M using only a small number of its entries, with applications in recommender systems. To ensure that the computed matrix is low-rank and generalizes well, one typically imposes nuclear-norm regularization which leads to the following SDP:</p><p>Here S is the set of observed indices of M and Z</p><p>be the corresponding penalty objective. Let F &#181; (U U T ) = L &#181; (U ). The objective is positive definite with &#955; 1 (C) = &#955; n (C) = 1. Also, since A is a sub-sampling operator, A &#8804; 1. Finally, for 2 (i,j)&#8712;S M 2 ij , the residues are bounded by:</p><p>Applying Theorem 15 for this setting gives the following corollary.</p><p>Corollary 16 There exists an absolute numerical constant c 2 such that the following holds. With probability greater than 1 -&#948;, every ( , &#947;)-SOSP U of the perturbed matrix completion problem L &#181; (U ) (16) with:</p><p>, and</p><p>, where X * is a global optimum of F &#181; (X). This result shows that for the matrix completion problem with m observations, for rank &#937;( &#8730; m), with high probability upon perturbation, any approximate local minimum of the factorized and penalized problem is an approximate global minimum.</p><p>Most of the existing results on matrix completion either require strong distribution assumptions on S and incoherence assumptions on M to recover a low-rank solution (Cand&#232;s and <ref type="bibr">Recht, 2009;</ref><ref type="bibr">Jain et al., 2013)</ref>. The standard nuclear norm minimization algorithms are not guaranteed to converge to low-rank solutions without these assumptions, which implies that the entire matrix would need to be stored for prediction which is infeasible in practice. Similarly, generalization error bounds <ref type="bibr">(Foygel and Srebro, 2011)</ref> as well as differential privacy guarantees depend on recovery of a low-rank solution.</p><p>Our result guarantees finding a solution of rank &#937;( &#8730; m) without any statistical assumptions on the sampling or the matrix, unlike existing results <ref type="bibr">(Ge et al., 2016)</ref>, though it involves a random perturbation. The tradeoff is our results does not guarantee finding a lower (potentially a constant) rank solution, even if one exists for a given problem.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusions and perspectives</head><p>In this paper we considered the Burer-Monteiro factorization to solve SDPs (2) in the penalty form which allows us to find solutions of rank at most k without doing any eigen-decomposition. We established that with a small perturbation to the objective C, with high probability, all approximate local minima are approximately globally optimal for the SDP, provided k = &#937;( &#8730; m) (which is the right order though constants and dependence on other parameters could certainly be improved). This is achieved through smoothed analysis, which we believe is an appropriate tool to deal with the pathological cases exhibited in Section 3. A natural direction of improvement for the present work is to tackle the more complex ALM formulations, which is expected to help in satisfying constraints more accurately.</p><p>Finally, we studied the applicability of our results to two applications: Max-Cut (in the appendix) and matrix completion. While these particularizations do not always improve over the specialized solvers for these problems, we believe that the work done here in studying low-rank parameterization of SDPs will be a helpful step towards building up to faster methods. Zhihui Zhu, Qiuwei Li, Gongguo Tang, and Michael B Wakin. Global optimality in low-rank matrix optimization. arXiv preprint arXiv:1702.07945, 2017.</p><p>We can use the above corollary to prove Lemma 21. Proof In our case, entries of G have variance &#963; 2 G . Thus, set G = &#963; G G , and set M = &#963; G M . From Corollary 20, we get</p><p>2 , and = 1 2c . Substituting this we get with probability at least 1-exp</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Appendix D. Proofs for Section 2</head><p>Proof [Proof of Lemma 1] Necessary and sufficient optimality conditions for (5) are: &#8711;f (X) 0 and &#8711;f (X)X = 0. Let U be an SOSP for (6) with rank(U ) &lt; k and define X = U U T . Then, &#8711;g(U ) = 2&#8711;f (U U T )U = 0 and &#8711; 2 g(U ) 0. The first statement readily shows that &#8711;f (X)X = 0. The Hessians of f and g are related by:</p><p>Since rank(U ) &lt; k, there exists a vector z &#8712; R k such that U z = 0 and z 2 = 1. For any x &#8712; R n , set U = xz T so that U U T + U U T = 0. Using second-order stationarity of U , we find:</p><p>This holds for all x &#8712; R n , hence &#8711;f (U U T ) 0 and X = U U T is optimal for (5). Since ( <ref type="formula">5</ref>) is a relaxation of (6), it follows that U is optimal for (6).</p><p>Proof [Proof of Lemma 3] Let U be any FOSP of (4) and consider the linear operator A : S n&#215;n &#8594; R m defined by A(X) i = A i , X . By first-order stationarity, we have:</p><p>Hence, the nullity of C + 2&#181;A * (A(U U T ) -b) (the dimension of its kernel) satisfies:</p><p>The maximum over y is indeed attained since the function null takes integer values in 0, . . . , n. Say the maximum evaluates to . Then, for some y, M C + A * (y) has nullity . Hence,</p><p>Thus, the trace of X 0 is bounded, and it follows that C is compact. Furthermore: if b 0 = 0, then C = {0}; and if b 0 &gt; 0, then 0 / &#8712; C. To prove the other direction, assume C is non-empty and compact. If C = {0}, let A 0 = I n , b 0 = 0. Now assume C = {0}. The SDP comes in a primal-dual pair:</p><p>It is well known that if (D) is infeasible, then (P) is unbounded or infeasible <ref type="bibr">(Wolkowicz, 1981, Thm. 4.1(a)</ref>). Since we assume C is non-empty, this simplifies to: if (D) is infeasible, then (P) is unbounded. The contrapositive states: if (P) is bounded, then (D) is feasible. By our compactness assumption on C, we know that (P) is bounded for all C &#8712; S n&#215;n . Thus, (D) is feasible for any C. In particular, take C = -I n : there exists -y &#8712; R m such that A 0 A * (y) I n . Furthermore,</p><p>Since there exists X = 0 in C, it follows that b 0 &gt; 0.</p><p>Proof [Proof of Theorem 9] Using ( <ref type="formula">19</ref>), U is an -FOSP of the perturbed problem if and only if</p><p>Hence, we control the smallest singular value of U in terms of and the k smallest singular values of M + G:</p><p>The next lemma helps lower-bound the denominator-it follows from Theorem 1.16 and Corollary 1.17 in (Nguyen, 2017); see proof in Appendix C.</p><p>Lemma 21 Let M be a fixed symmetric matrix of size n. Let G be a symmetric Gaussian matrix of size n, independent of M , with diagonal and upper-triangular entries sampled independently from N (0, &#963; 2 G ). There exists an absolute constant c 0 such that:</p><p>We cannot use Lemma 21 directly, as in our case M is not statistically independent of G. Indeed, M depends on U through the residue r = r(U ) and U is an -FOSP: a feature that depends on G. To resolve this, we cover the set of possible M 's with a net, under the assumption that r is bounded.</p><p>Lemma 21 provides a bound for each M in this net. This can be extended to hold for all M 's in the net simultaneously via a union bound. By taking a sufficiently dense net, we can then infer that M is necessarily close to one of these M 's, and conclude.</p><p>Let E be the event (on G) that r 2 &#8804; B for all -FOSPs of the perturbed problem. Conditioned on E, we have</p><p>where A is defined in (8). As a result, M lies in a ball of center C and radius 2&#181;B A in an affine subspace of dimension rank(A). A unit-ball in Frobenius norm in d dimensions admits an &#949;-net of (1 + 2/&#949;) d points <ref type="bibr">(Vershynin, 2016, Cor. 4.2.13</ref>). Thus, we can pick a net with</p><p>points in such a way that, independently of r, there exists a point M in the net satisfying:</p><p>Let T : S n&#215;n &#8594; R k be defined by T q (A) = (&#963; n-q+1 (A), . . . , &#963; n (A)) T , that is: T extracts the q smallest singular values of A, in order. Then,</p><p>where the first inequality follows from <ref type="bibr">(Bhatia, 2013, Ex. IV.3.5)</ref>. Hence,</p><p>Now, taking a union bound for E and for Lemma 21 over each M in the net, we get (24) and</p><p>with probability at least</p><p>Inside the log, we can safely replace k with 1, as this only hurts the probability. Then, the result holds with probability at least</p><p>We aim to pick k so as to ensure</p><p>This is a quadratic condition of the form</p><p>, which is satisfied for,</p><p>Combining ( <ref type="formula">23</ref>), ( <ref type="formula">24</ref>), ( <ref type="formula">25</ref>) and ( <ref type="formula">26</ref>), we find:</p><p>Proof [Proof of Lemma 12] If U = 0, the bounds clearly hold: assume U = 0 in what follows.</p><p>Using &#8711; L&#181; (U ) = 2(C + 2&#181; &#195; * (r))U , the definition of -FOSP reads:</p><p>Combining this with A F &#8805; 1 B F A, B for B = 0 (Cauchy-Schwarz) gives:</p><p>This can be further developed as:</p><p>At this point, we separate the constraint (A 0 , b 0 ) from the rest, using the usual definition for (A, b) which capture constraints 1, . . . , m:</p><p>Let y = A(U U T ) 2 . Then the above inequality holds when</p><p>For this to happen we need the above quadratic to have real roots. This requires:</p><p>where we used that for any two matrices A and B, it holds that AB F &#8804; A 2 B F . Focus on the last two terms of the last inequality. We distinguish two cases. Either</p><p>Or the opposite holds, and:</p><p>This is a quadratic inequality in y = U 2 F of the form ay 2 -by -c &#8804; 0 with coefficients a &gt; 0 and b, c &#8805; 0. Such a quadratic always has at least one real root, so that y &#8804; b+ &#8730; b 2 +4ac 2a</p><p>. Furthermore,</p><p>Hence, y &#8804; b a + c a , which means:</p><p>.</p><p>Accounting for the two distinguished cases, we find:</p><p>.</p><p>We now bound the residues (generically) in terms of U F , using submultiplicativity for U U T F &#8804; U 2 F and the definition of A (8):</p><p>Evidently, the same bound holds for &#195;, b, r.</p><p>Proof [Proof of Theorem 13] By Lemma 12, for a problem perturbed with G, the residues of all -FOSPs, r 2 , are bounded as:</p><p>With probability at least 1 -&#948;, C + G 2 &#8804; C 2 + 3&#963; G &#8730; n + 2 log(1/&#948;) . Hence, Theorem 9 applies with this &#948; and</p><p>Hence, with k as prescribed in that theorem for a given &#948; = &#948; &#8712; (0, 1), with probability at least 1 -2&#948;, it holds that Hence,</p><p>The above inequality is a quadratic in y = A(U U T ) 2 : y 2 -y b 2 + 1 2&#181;</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>4</head><p>U 2 F , then U F &#8804; 2 &#955;n(C) . Else, for the above inequality to hold we need the quadratic to have real roots.</p><p>We now show that the Hessian is &#961;-Lipschitz continuous in operator norm, that is, we must show that for any U 1 and U 2 with norms bounded by &#964; , max</p><p>Recall from (21) that</p><p>Hence,</p><p>On one hand, following the same reasoning as in (28), we have</p><p>On the other hand, using that for any two vectors u, v we have</p><p>we can find:</p><p>For this, we used A(U U T + U U T ) 2 &#8804; A U U T + U U T F &#8804; &#964; A U F when U F &#8804; &#964; and</p><p>Overall, this shows &#961; = 16&#181;&#964; A 2 is an appropriate Lipschitz constant.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>&#169; 2018 S. Bhojanapalli, N. Boumal, P. Jain &amp; P. Netrapalli.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_1"><p>We pick &#963;G first and then B0, k and .</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_2"><p>This is vanilla gradient descent but with additional random noise added to the updates when the gradient magnitude becomes smaller than a threshold.</p></note>
		</body>
		</text>
</TEI>
