<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Manifold Proximal Point Algorithms for Dual Principal Component Pursuit and Orthogonal Dictionary Learning</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>01/01/2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10294679</idno>
					<idno type="doi">10.1109/TSP.2021.3099643</idno>
					<title level='j'>IEEE Transactions on Signal Processing</title>
<idno>1053-587X</idno>
<biblScope unit="volume">69</biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Shixiang Chen</author><author>Zengde Deng</author><author>Shiqian Ma</author><author>Anthony Man-Cho So</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We consider the problem of minimizing the 1 norm of a linear map over the sphere, which arises in various machine learning applications such as orthogonal dictionary learning (ODL) and robust subspace recovery (RSR). The problem is numerically challenging due to its nonsmooth objective and nonconvex constraint, and its algorithmic aspects have not been well explored. In this paper, we show how the manifold structure of the sphere can be exploited to design fast algorithms with provable guarantees for tackling this problem. Specifically, our contribution is fourfold. First, we present a manifold proximal point algorithm (ManPPA) for the problem and show that it converges at a global sublinear rate. Furthermore, we show that ManPPA can achieve a local quadratic convergence rate when applied to sharp instances of the problem. Second, we develop a semismooth Newton-based inexact augmented Lagrangian method for computing the search direction in each iteration of ManPPA and show that it has an asymptotic superlinear convergence rate. Third, we propose a stochastic variant of ManPPA called StManPPA, which is well suited for large-scale computation, and establish its sublinear convergence rate. Both ManPPA and StManPPA have provably faster convergence rates than existing subgradient-type methods. Fourth, using ManPPA as a building block, we propose a new heuristic method for solving a matrix analog of the problem, in which the sphere is replaced by the Stiefel manifold. The results from our extensive numerical experiments on the ODL and RSR problems demonstrate the efficiency and efficacy of our proposed methods.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>The problem of finding a subspace that captures the features of a given dataset and possesses certain properties is at the heart of many machine learning applications. One commonly encountered formulation of the problem, which is motivated largely by sparsity or robustness considerations, is given by</p><p>where Y &#8712; R n&#215;p is a given matrix and &#8226; r denotes the r norm of a vector. To better understand how problem <ref type="bibr">(1)</ref> arises in applications, let us consider two representative examples.</p><p>&#8226; Orthogonal Dictionary Learning (ODL). The goal of ODL is to find an orthonormal basis that can compactly represent a given set of p (p n) data points y 1 , . . . , y p &#8712; R n . Such a problem arises in many signal and image processing applications; see, e.g., <ref type="bibr">[1]</ref>, <ref type="bibr">[2]</ref> and the references therein. By letting Y = [y 1 , . . . , y p ] &#8712; R n&#215;p , the problem can be understood as finding an orthogonal matrix X &#8712; R n&#215;n and a sparse matrix A &#8712; R n&#215;p such that Y &#8776; XA. Noting that this means X Y &#8776; A should be sparse, one approach is to find a collection of sparse vectors from the row space of Y and apply some post-processing procedure to the collection to form the orthogonal matrix X. This has been pursued in various works; see, e.g., <ref type="bibr">[3]</ref>- <ref type="bibr">[6]</ref>. In particular, the work <ref type="bibr">[6]</ref> considers the formulation <ref type="bibr">(1)</ref> and shows that under a standard generative model of the data, one can recover X from certain local minimizers of problem <ref type="bibr">(1)</ref>.</p><p>&#8226; Robust Subspace Recovery (RSR). RSR is a fundamental problem in machine learning and data mining <ref type="bibr">[7]</ref>. It is concerned with fitting a linear subspace to a dataset corrupted by outliers. Specifically, given a dataset Y = [X, O]&#915; &#8712; R n&#215;(p1+p2) , where the columns of X &#8712; R n&#215;p1 are the inlier points spanning a d-dimensional subspace S of R n (d &lt; p 1 ), the columns of O &#8712; R n&#215;p2 are outlier points without a linear structure, and &#915; &#8712; R (p1+p2)&#215;(p1+p2) is an unknown permutation, the goal is to recover the inlier subspace S, or equivalently, to cluster the points into inliers and outliers. One recently proposed approach for solving this problem is the so-called dual principal component pursuit (DPCP) <ref type="bibr">[8]</ref>, <ref type="bibr">[9]</ref>. A key task in DPCP is to find a hyperplane that contains all the inliers. Such a task can be tackled by solving problem <ref type="bibr">(1)</ref>. In fact, it has been shown in <ref type="bibr">[8]</ref>, <ref type="bibr">[9]</ref> that under certain conditions on the inliers and outliers, any global minimizer of problem (1) yields a vector that is orthogonal to the inlier subspace S.</p><p>Despite its attractive theoretical properties in various applications, problem <ref type="bibr">(1)</ref> is numerically challenging to solve due to its nonsmooth objective and nonconvex constraint. Nevertheless, the manifold structure of the constraint set M := {x &#8712; R n | x 2 = 1} suggests that problem (1) could be amenable to manifold optimization techniques <ref type="bibr">[10]</ref>. One approach is to apply smoothing to the nonsmooth objective in <ref type="bibr">(1)</ref> and use existing algorithms for Riemannian smooth optimization to solve the resulting problem. For instance, when tackling the ODL problem, Sun et al. <ref type="bibr">[5]</ref>, <ref type="bibr">[11]</ref> and Gilboa et al. <ref type="bibr">[12]</ref> proposed to replace the absolute value function t &#8594; |t| by the smooth surrogate t &#8594; h &#181; (t) = &#181; log(cosh(t/&#181;)) with &#181; &gt; 0 being a smoothing parameter, while Qu et al. <ref type="bibr">[13]</ref> proposed to replace the 1 norm with the 4 norm. They then solve the resulting smoothed problems by either the Riemannian trust-region method <ref type="bibr">[5]</ref>, <ref type="bibr">[11]</ref> or the Riemannian gradient descent method <ref type="bibr">[12]</ref>, <ref type="bibr">[13]</ref>. Although it can be shown that these methods will yield the desired orthonormal basis under a standard generative model of the data, the smoothing approach can introduce significant analytic and computational difficulties <ref type="bibr">[6]</ref>. Another approach, which avoids smoothing the objective, is to solve (1) directly using Riemannian nonsmooth optimization techniques. For instance, in the recent work <ref type="bibr">[6]</ref>, <ref type="bibr">Bai et al.</ref> proposed to solve (1) using the Riemannian subgradient method (RSGM), which generates the iterates via</p><p>Here, &#951; k &gt; 0 is the step size; &#8706; R f (&#8226;) denotes the Riemannian subdifferential of f and is given by</p><p>where I n is the n &#215; n identity matrix and &#8706;f (&#8226;) is the usual subdifferential of the convex function f [14, <ref type="bibr">Section 5]</ref>. Bai et al. <ref type="bibr">[6]</ref> showed that for the ODL problem, RSGM with a suitable initialization will converge at a sublinear rate to a basis vector with high probability under a standard generative model of the data. Moreover, by running RSGM O(n log n) times, each time with an independent random initialization, one can recover the entire orthonormal basis with high probability. Around the same time, Zhu et al. <ref type="bibr">[9]</ref> proposed a projected subgradient method (PSGM) for solving <ref type="bibr">(1)</ref>. The method generates the iterates via</p><p>The updates (2) and (3) differ in the choice of the direction v k -the former uses a Riemannian subgradient of f at x k , while the latter uses a usual Euclidean subgradient. For the DPCP formulation of the RSR problem, Zhu et al. <ref type="bibr">[9]</ref> showed that under certain assumptions on the data, PSGM with suitable initialization and piecewise geometrically diminishing step sizes will converge at a linear rate to a vector that is orthogonal to the inlier subspace S. The step sizes take the form &#951; k = &#951; (k-K0)/K +1 , where &#951; &#8712; (0, 1) and K 0 , K &#8805; 1 satisfy certain conditions. In practice, however, the parameters &#951;, K 0 , K are difficult to determine. Therefore, Zhu et al. <ref type="bibr">[9]</ref> also proposed a PSGM with modified backtracking line search (PSGM-MBLS), which works well in practice but has no convergence guarantee.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Motivations for this Work</head><p>Although the results in <ref type="bibr">[6]</ref>, <ref type="bibr">[9]</ref> demonstrate, both theoretically and computationally, the efficacy of RSGM and PSGM for solving instances of (1) that arise from the ODL and RSR problems, respectively, two fundamental questions remain. First, while PSGM can be shown to achieve a linear convergence rate on the DPCP formulation of the RSR problem <ref type="bibr">[9]</ref>, only a sublinear convergence rate has been established for RSGM on the ODL problem <ref type="bibr">[6]</ref>. Given the similarity of the updates (2) and (3), it is natural to ask whether the slower convergence rate of RSGM is an artifact of the analysis or due to the inherent structure of the ODL problem. Second, the convergence analyses in <ref type="bibr">[6]</ref>, <ref type="bibr">[9]</ref> focus only on the ODL and RSR problems. In particular, they do not shed light on the performance of RSGM or PSGM when tackling general instances of problem <ref type="bibr">(1)</ref>. It would be of interest to fill this gap by identifying or developing practically fast methods that have more general convergence guarantees, especially since different applications may give rise to instances of problem <ref type="bibr">(1)</ref> with different structures. In a recent attempt to address these questions, Li et al. <ref type="bibr">[15]</ref> showed, among other things, that RSGM will converge at the sublinear rate of O(k -1/4 ) (here, k is the iteration counter) when applied to a general instance of problem <ref type="bibr">(1)</ref> and at a linear rate when applied to a so-called sharp instance of problem <ref type="bibr">(1)</ref>. Informally, an optimization problem is said to possess the sharpness property if the objective function grows linearly with the distance to a set of local minima <ref type="bibr">[16]</ref>. Such a property plays a crucial role in establishing fast convergence guarantees for a host of iterative methods; see, e.g., <ref type="bibr">[15]</ref>- <ref type="bibr">[17]</ref> and also <ref type="bibr">[18]</ref>- <ref type="bibr">[20]</ref> for related results. Since the ODL problem and the DPCP formulation of the RSR problem are known to possess the sharpness property under certain assumptions on the data <ref type="bibr">[6]</ref>, <ref type="bibr">[9]</ref>, the results in <ref type="bibr">[15]</ref> imply that RSGM will converge linearly on these problems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Our Contributions</head><p>In this paper, we depart from the subgradient-type approaches (such as RSGM (2) and PSGM (3)) and present another method called the manifold proximal point algorithm (ManPPA) to tackle problem <ref type="bibr">(1)</ref>. At each iterate x k , ManPPA computes a search directon by minimizing the sum of f and a proximal term defined in terms of the Euclidean distance over the tangent space to M at x k . This should be contrasted with other existing PPAs on manifolds (see, e.g., <ref type="bibr">[21]</ref>, <ref type="bibr">[22]</ref>), in which the proximal term is defined in terms of the Riemannian distance. Such a difference is important. Indeed, although the search direction defined in ManPPA does not admit a closed-form formula, it can be computed in a highly efficient manner by exploiting the structure of problem <ref type="bibr">(1)</ref>; see Section II-B. However, the search direction defined in the existing PPAs on manifolds can be as difficult to compute as a solution to the original problem. Consequently, the applicability of those methods is rather limited.</p><p>We now summarize our contributions as follows:</p><p>1) We show that ManPPA has a global sublinear convergence rate of O(k -1/2 ) when applied to a general instance of problem <ref type="bibr">(1)</ref>. Moreover, we show that if the instance has the sharpness property, then the local convergence rate of ManPPA is at least quadratic. Although the sublinear rate result follows from the results in <ref type="bibr">[23]</ref>, the quadratic rate result is new. Moreover, both rates are superior to those of RSGM established in <ref type="bibr">[15]</ref>. Key to </p><p>Our interest in problem (4) stems from the observation that it provides alternative formulations of the ODL and RSR problems. Indeed, for the ODL problem, one can recover the entire orthonormal basis all at once by solving problem (4) with q = n. For the RSR problem, if one knows the dimension d of the inlier subspace S, then one can recover it by solving problem (4) with q = n-d. We show that a good feasible solution to problem (4) can be found in a column-by-column manner by suitably modifying ManPPA. Although the proposed method is only a heuristic, our extensive numerical experiments show that it yields solutions of comparable quality to but is significantly faster than existing methods on the ODL and RSR problems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Organization and Notation</head><p>The rest of the paper is organized as follows. In Section II, we present ManPPA for solving problem (1) and describe a highly efficient method for solving the subproblem that arises in each iteration of ManPPA. We also analyze the convergence behavior of ManPPA. In Section III, we propose StManPPA, a stochastic version of ManPPA that is well suited for large-scale computation, and analyze its convergence behavior. In Section IV we discuss an extension of ManPPA for solving the matrix analog (4) of problem <ref type="bibr">(1)</ref>. In Section V, we apply ManPPA to solve the ODL problem and the DPCP formulation of the RSR problem and compare its performance with some existing methods. We draw our conclusions in Section VI.</p><p>Besides the notation introduced earlier, we use L to denote the Lipschitz constant of f ; i.e., |f (x)-f (y)| &#8804; L x-y 2 for all x, y &#8712; R n (note that L &#8804; &#8730; n max u&#8712;R n Y u 1 / u 1 ). Given a closed set C &#8838; R n , we use Proj C (x) to denote the projection of x onto C and dist(x, C) := inf y&#8712;C yx 2 to denote the distance between x and C. Given a proper lower semicontinuous function h : R n &#8594; R &#8746; {+&#8734;}, its proximal mapping is given by prox h (x) = argmin w&#8712;R n h(w)</p><p>2 . Given two vectors x, y &#8712; R n , we use x, y or x y to denote their usual inner product. Other notation is standard.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. A MANIFOLD PROXIMAL POINT ALGORITHM</head><p>Since problem (1) is nonconvex, our goal is to compute a stationary point of (1), which is a point x &#8712; M that satisfies the first-order optimality condition</p><p>(see <ref type="bibr">[14]</ref>). In the recent work <ref type="bibr">[23]</ref>, Chen et al. considered the more general problem of minimizing the sum of a smooth function and a nonsmooth convex function over the Stiefel manifold and developed a manifold proximal gradient method (ManPG) for finding a stationary point of it. When specialized to solve problem (1), the method generates the iterates via</p><p>where the search direction d k is given by</p><p>and &#945; k &gt; 0, t &gt; 0 are the step sizes. As the reader may readily recognize, without the constraint d x k = 0, the subproblem ( <ref type="formula">6</ref>) is simply computing the proximal mapping of f at x k and coincides with the update of the classic proximal point algorithm (PPA) <ref type="bibr">[24]</ref>. The constraint d x k = 0 in <ref type="bibr">(6)</ref>, which states that the search direction d should lie on the tangent space to M at x k , is introduced to account for the manifold constraint in problem <ref type="bibr">(1)</ref> and ensures that the next iterate x k+1 achieves sufficient decrease in objective value. Motivated by the above discussion, we call the method obtained by specializing ManPG to the setting of problem (1) ManPPA and present its details below:</p><p>Algorithm 1 ManPPA for Solving Problem (1)</p><p>1: Input: x 0 &#8712; M, &#946; &#8712; (0, 1), t &gt; 0.</p><p>2: for k = 0, 1, 2, . . . do 3:</p><p>Solve the subproblem (6) to obtain d k .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>4:</head><p>Let j k be the smallest nonnegative integer such that</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>5:</head><p>Set x k+1 according to <ref type="bibr">(5)</ref> with &#945; k = &#946; j k . 6: end for Naturally, ManPPA inherits the properties of ManPG established in <ref type="bibr">[23]</ref>. However, due to the structure of problem (1), many of the developments in <ref type="bibr">[23]</ref> have to be refined when designing ManPPA. In particular, the SSN method used by ManPG for finding the search direction in each iteration requires the computation of the proximal mapping of the nonsmooth part of the objective function. However, due to the presence of the matrix Y , the objective function f of problem (1) does not have an easily computable proximal mapping. As such, the SSN method proposed in <ref type="bibr">[23]</ref> cannot efficiently solve the subproblem <ref type="bibr">(6)</ref>. To circumvent this difficulty, we propose to use an inexact ALM, which can efficiently compute an accurate solution to <ref type="bibr">(6)</ref>; see Section II-B. Now, let us state the following result, which shows that the line search step in line 4 of Algorithm 1 is well defined. It simplifies <ref type="bibr">[23,</ref><ref type="bibr">Lemma 5.2]</ref> and yields sharper constants. The proof can be found in Appendix B.</p><p>As a result, we have &#945; k = &#946; j k &gt; &#946; &#8113; for any k &#8805; 0 in Algorithm 1, which implies that the line search step terminates after at most log &#946; &#8113; + 1 iterations. In particular, if t &#8804; 1/L, then we have &#8113; = 1, which implies that we can take j k = 0 in line 4 of Algorithm 1; i.e., no line search is needed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Convergence Analysis of ManPPA</head><p>In this subsection, we study the convergence behavior of ManPPA. Recall from <ref type="bibr">[23,</ref><ref type="bibr">Lemma 5.3</ref>] that if d k = 0 in (6), then x k &#8712; M is a stationary point of problem <ref type="bibr">(1)</ref>. This motivates us to call x k &#8712; M an -stationary point of problem <ref type="bibr">(1)</ref> with &#8805; 0 if the solution d k to (6) satisfies d k 2 &#8804; . By specializing the convergence results in <ref type="bibr">[23,</ref><ref type="bibr">Theorem 5.5]</ref> for ManPG to ManPPA, we obtain the following theorem:</p><p>Theorem 1 shows that ManPPA has an iteration complexity of O( -2 ), which is superior to the O( -4 ) bound established for RSGM in <ref type="bibr">[15]</ref>. Now, let us analyze the convergence rate of ManPPA in the setting where problem (1) possesses the sharpness property. Such a setting is highly relevant in applications, as both the ODL problem and DPCP formulation of the RSR problem give rise to sharp instances of problem (1) under certain assumptions on the data; see <ref type="bibr">[6,</ref><ref type="bibr">Proposition C.8</ref>] and <ref type="bibr">[15,</ref><ref type="bibr">Proposition 4]</ref>. To proceed, we first introduce the notion of sharpness. Definition 1 (Sharpness; see, e.g., <ref type="bibr">[16]</ref>). We say that X &#8838; M is a set of weak sharp minima for the function f with parameters</p><p>From the definition, we see that if X is a set of weak sharp minima of f , then it is the set of minimizers of f over B(&#948;). Moreover, the function value grows linearly with the distance to X . In the presence of such a regularity property, ManPPA can be shown to converge at a much faster rate. The following result, which has not appeared in the literature before and is thus new, constitutes the first main contribution of this paper.</p><p>Theorem 2. Suppose that X &#8838; M is a set of weak sharp minima for the function f with parameters (&#945;, &#948;). Let {x k } k be the sequence generated by Algorithm 1 with dist(x 0 , X ) &lt; &#948; := min &#948;, &#945; L and t &#8804; min</p><p>Theorem 2 establishes the quadratic convergence rate of ManPPA when applied to a sharp instance of problem <ref type="bibr">(1)</ref>. Again, this is superior to the linear convergence rate of RSGM established in <ref type="bibr">[15]</ref> for this setting. The proof of Theorem 2 can be found in Appendix C. Note that since M is nonconvex, one cannot directly apply standard convergence analysis techniques for PPA (see, e.g., <ref type="bibr">[24]</ref>) to obtain Theorem 2. The key to overcoming this diffculty is the new Riemannian subgradient inequality we establish in Proposition 5 (see Appendix C), which provides a path for extending the convergence analysis of PPA to that of ManPPA.</p><p>It should be pointed out that the results in Theorems 1 and 2 do not assume any generative model of the data matrix Y . By contrast, the results developed in, e.g., <ref type="bibr">[6]</ref> for the ODL problem and <ref type="bibr">[9]</ref> for the DPCP formulation of the RSR problem do assume certain generative models of the data. Although the latter results may yield qualitatively sharper convergence guarantees for instances of (1) that arise from the ODL problem or the DPCP formulation of the RSR problem, the former apply to arbitrary instances of (1).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Solving the Subproblem (6)</head><p>Observe that each iteration of ManPPA requires solving the subproblem (6) to obtain the search direction. Thus, the efficiency of ManPPA depends not only on its convergence rate (which has already been studied in Theorems 1 and 2) but also on how fast the subproblem (6) can be solved. To address the latter, we note that ( <ref type="formula">6</ref>) is a linearly constrained strongly convex quadratic minimization problem. This motivates us to adopt the SSN-based ALM originally developed in <ref type="bibr">[25]</ref> for LASSO-type problems to solve it. As we shall see, such an approach yields a highly efficient method for solving the subproblem <ref type="bibr">(6)</ref>.</p><p>To set the stage for our development, let us drop the index k from (6) for simplicity and set c = Y x. Then, the subproblem (6) can be equivalently written as</p><p>At this point, one may be tempted to use ADMM to solve problem <ref type="bibr">(11)</ref>. However, from a practical point of view, ADMM is often unable to return a high-accuracy solution in an efficient manner. Since ( <ref type="formula">11</ref>) is a subproblem in ManPPA, a low-accuracy solution will adversely affect the convergence rate of ManPPA. In fact, this has been observed in our numerical experiments. Therefore, we propose to use an inexact ALM, which can solve problem <ref type="bibr">(11)</ref> efficiently and accurately. This makes it possible for ManPPA to achieve fast convergence. To describe the algorithm, let us first write down the augmented Lagrangian function corresponding to <ref type="bibr">(11)</ref>:</p><p>Here, y &#8712; R and z &#8712; R p are Lagrange multipliers (dual variables) associated with the constraints in <ref type="bibr">(11)</ref> and &#963; &gt; 0 is a penalty parameter. Then, a generic iteration of the inexact ALM is given by</p><p>where the penalty parameters {&#963; j } j are chosen such that 0 &lt; &#963; j &#963; &#8734; &#8804; +&#8734; <ref type="bibr">[25]</ref>, <ref type="bibr">[26]</ref>. Since the subproblem (13a) can only be solved inexactly in general, we adopt the following stopping criteria, which are standard in the literature (see <ref type="bibr">[25]</ref>, <ref type="bibr">[26]</ref>):</p><p>Here, &#936; * j is the optimal value of (13a). Equipped with conditions (14a)-(14c), we can show that starting from any initial point (d 0 , u 0 ; y 0 , z 0 ), the inexact ALM (13) will converge at an asymptotic superlinear rate to an optimal solution to problem <ref type="bibr">(11)</ref>. This result, which constitutes the second main contribution of this paper, is obtained from a new perturbation analysis of the solution set of the subproblem ( <ref type="formula">6</ref>) and its dual. The details can be found in the companion technical report of this paper <ref type="bibr">[27]</ref>. Now, it remains to discuss how to solve the subproblem (13a) in an efficient manner. Again, let us drop the index j in (13a) for simplicity. By simple manipulation, we have</p><p>.</p><p>Consider the function d &#8594; &#968;(d) := inf u&#8712;R p &#936;(d, u). Upon letting w = Y d + c + z &#963; &#8712; R p and using the definition of the proximal mapping of u &#8594; h(u) := t u 1 , we have</p><p>It follows that ( d, &#363;) = argmin d&#8712;R n ,u&#8712;R p &#936;(d, u) if and only if</p><p>Using </p><p>Thus, we can find d by solving the nonsmooth equation</p><p>Towards that end, we apply an SSN method, which finds the solution by successive linearization of the map &#8711;&#968;. To implement the method, we first need to compute the generalized Jacobian of &#8711;&#968; [29, Definition 2.6.1], denoted by &#8706;(&#8711;&#968;). By the chain rule [29, Corollary of Theorem 2.6.6] and the Moreau decomposition, each element V &#8712; &#8706;(&#8711;&#968;) takes the form</p><p>where Q &#8712; &#8706;prox h/&#963; (w). Using the definition of h, it can be shown that the diagonal matrix Q = Diag(q) with</p><p>is an element of &#8706;prox h/&#963; (w) [25, Section 3.3] and hence can be used to define an element V &#8712; &#8706;(&#8711;&#968;) via <ref type="bibr">(16)</ref>. Note that the matrix V so defined is positive definite. As such, the following generic iteration of the SSN method for solving <ref type="bibr">(15)</ref> is well defined:</p><p>Here, &#961; j &gt; 0 is the step size. Moreover, since I p -Q is a diagonal matrix whose entries are either 0 or 1, the matrix V can be assembled in a very efficient manner; again, see <ref type="bibr">[25]</ref>.</p><p>In the companion technical report <ref type="bibr">[27]</ref>, we present a detailed implementation of the SSN method ( <ref type="formula">17</ref>) and show that it will converge at a superlinear rate to the unique solution d to <ref type="bibr">(15)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. STOCHASTIC MANIFOLD PROXIMAL POINT ALGORITHM</head><p>In this section, we propose, for the first time, a stochastic ManPPA (StManPPA) for solving problem <ref type="bibr">(1)</ref>, which is well suited for the setting where p (typically representing the number of data points) is much larger than n (typically representing the ambient dimension of the data points). To begin, observe that problem (1) has the finite-sum structure</p><p>where y j &#8712; R n is the j-th column of Y . When p is extremely large, computing the matrix-vector product Y x can be expensive. To circumvent this difficulty, in each iteration of StManPPA, a column of Y , say y j , is randomly chosen and the search direction d k is given by</p><p>The key advantage of StManPPA is that the subproblem ( <ref type="formula">18</ref>) admits a closed-form solution that is very easy to compute.</p><p>Proposition 2. Let &#181; = t(y j x k ). Then, the solution to <ref type="bibr">(18)</ref> is given by</p><p>Proof. The first-order optimality conditions of (18) are</p><p>Suppose that d is a solution to <ref type="bibr">(20)</ref>. If</p><p>, and (20a) implies that 0 = d/t + y j&#955;x k . This, together with (20b) and the fact that</p><p>This establishes the first case in <ref type="bibr">(19)</ref>. The other two cases in <ref type="bibr">(19)</ref>, which correspond to y j (x k + d) &lt; 0 and y j (x k + d) = 0, can be derived using a similar argument.</p><p>We now present the details of StManPPA in Algorithm 2. It is worth noting that our proposed StManPPA is different from the ones developed in the recent work <ref type="bibr">[30]</ref>. Indeed, in each iteration, the former only needs to solve a subproblem that involves a single component of the objective function, while the latter need to compute the proximal mapping of the entire objective function.</p><p>Algorithm 2 StManPPA for Solving Problem (1) 1: Input: x 0 &#8712; M, t 0 , t 1 , . . . , t T &gt; 0. 2: for k = 0, 1, . . . , T do 3:</p><p>Select j k &#8712; {1, . . . , p} uniformly at random and solve the subproblem <ref type="bibr">(18)</ref> with j = j k , t = t k to obtain d k .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>4:</head><p>Set x k+1 = Proj M (x k + d k ). 5: end for 6: Output: x = x k with probability t k / T k=0 t k .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Convergence Analysis of StManPPA</head><p>In this section, we present our convergence results for StManPPA. Let us begin with some preparations. Define f j : R n &#8594; R to be the function f j (x) = y j x and let L j &gt; 0 denote the Lipschitz constant of f j , where j = 1, . . . , p. Set L := max j&#8712;{1,...,p} L j . Furthermore, define the Moreau envelope and proximal mapping on M by</p><p>respectively. The proximal mapping mprox is well defined since the constraint set M is compact. As it turns out, the proximal mapping mprox can be used to define an alternative notion of stationarity for problem <ref type="bibr">(1)</ref>. Indeed, for any &#955; &gt; 0 and x &#8712; M, if we denote x = mprox &#955;f (x), then the optimality condition of (21) yields</p><p>Since I n -x x is a projection operator and hence nonexpansive, we obtain</p><p>In particular, if 1 &#955; xx 2 &#8804; , then (i) x is -stationary in the sense that dist(0, &#8706; R f ( x)) &#8804; and (ii) x is close to the -stationary point x. This motivates us to use</p><p>as a stationarity measure for problem <ref type="bibr">(1)</ref>. We call x &#8712; M an -nearly stationary point of problem (1) if &#920; &#955; (x) &#8804; . It is worth noting that such a notion has also been used in <ref type="bibr">[15]</ref> to study the stochastic RSGM.</p><p>We are now ready to establish the following convergence rate result for StManPPA, which constitutes the third main contribution of this paper. Theorem 3. For any &#955; &#8712; (0, 1/(p L)), the point x output by Algorithm 2 satisfies</p><p>where the expectation is taken over all random choices made by the algorithm. In particular, if the step sizes {t k } k satisfy</p><p>The proof of Theorem 3 can be found in Appendix D. Again, it makes crucial use of our newly established Riemannian subgradient inequality (see Appendix C, Proposition 5), which allows StManPPA to be analyzed in a similar way as various Euclidean stochastic methods. We remark that the iteration complexity bound O( -4 ) of StManPPA established in Theorem 3 is comparable to that of RSGM established in <ref type="bibr">[15]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. EXTENSION TO STIEFEL MANIFOLD CONSTRAINT</head><p>In this section, we consider the matrix analog (4) of problem <ref type="bibr">(1)</ref>, which also arises in many applications such as certain "one-shot" formulations of the ODL and RSR problems (see Section I-B). Currently, there are two existing approaches for solving problem (4), namely a sequential linear programming (SLP) approach and an iteratively reweighted least squares (IRLS) approach <ref type="bibr">[8]</ref>. In the SLP approach, the columns of X are extracted one at a time. Suppose that we have already obtained the first columns of X ( = 0, 1, . . . , q -1) and arrange them in the matrix X &#8712; R n&#215; (with X 0 = 0). Then, the ( + 1)-st column of X is obtained by solving</p><p>This is achieved by the alternating linearization and projection (ALP) method, which generates the iterates via</p><p>Note that the z-subproblem is a linear program, which can be efficiently solved by off-the-shelf solvers.</p><p>In the IRLS approach, the following variant of problem (4), which favors row-wise sparsity of Y X and has been studied by Lerman et al. in <ref type="bibr">[31]</ref>, is considered:</p><p>Here, Y X 1,2 denotes the sum of the Euclidean norms of the rows of Y X. The IRLS method for solving <ref type="bibr">(22)</ref> generates the iterates via</p><p>where w j,k = 1/ max{&#948;, X k-1 y j 2 } and &#948; &gt; 0 is a perturbation parameter to prevent the denominator from being 0. The solution X k can be obtained via an SVD and is thus easy to implement. However, there has been no convergence guarantee for the IRLS method so far. Recently, Wang et al. <ref type="bibr">[32]</ref> proposed a proximal alternating maximization method for solving a maximization version of (4), which arises in the so-called 1 -PCA problem (see <ref type="bibr">[7]</ref>). However, the method cannot be easily adapted to solve problem <ref type="bibr">(4)</ref>.</p><p>The similarity between problems (1) and ( <ref type="formula">4</ref>) suggests that the latter can also be solved by ManPPA, which is indeed the case. However, the SSN method for solving the resulting nonsmooth equation (i.e., the matrix analog of ( <ref type="formula">15</ref>)) can be slow, as the dimension of the linear system (17) is high. Here, we propose an alternative method called sequential ManPPA, which solves (4) in a column-by-column manner and constitutes the fourth main contribution of this paper. The method is based on the observation that the objective function in (4) is separable in the columns of X = [x 1 , x 2 , . . . , x n ]. To find x 1 , we simply solve min</p><p>using ManPPA as it is an instance of problem <ref type="bibr">(1)</ref>. Suppose that we have found the first columns of X ( = 0, 1, . . . , q -1) and arrange them in the matrix Q &#8712; R n&#215; (with Q 0 = 0). Then, we find the ( + 1)-st column x +1 by solving</p><p>(23) As it turns out, problem ( <ref type="formula">23</ref>) is equivalent to the deflation process discussed in <ref type="bibr">[11]</ref> for sequentially recovering the columns of an orthogonal dictionary. Specifically, let V be an orthonormal basis of the orthogonal complement of Q . We can then find x +1 by solving</p><p>Note that ( <ref type="formula">24</ref>) has the same form as (1) and thus can be solved by RSGM or PSGM. By contrast, problem <ref type="bibr">(23)</ref> has an extra linear constraint and cannot be solved by RSGM or PSGM directly. Nevertheless, our ManPPA can solve both ( <ref type="formula">23</ref>) and <ref type="bibr">(24)</ref>. Let us now briefly discuss how to use ManPPA to solve the former. To simplify notation, let us drop the index and denote x = x +1 , Q = Q . In the k-th iteration of ManPPA, we update the iterate by <ref type="bibr">(5)</ref>, where the search direction d k is computed by</p><p>This subproblem can be solved using the SSN-based inexact ALM framework in Section II. We omit the details for succinctness. The sequential ManPPA is guaranteed to find a feasible solution to problem (4). Moreover, as we shall see in Section V, it is computationally much more efficient than the ALP and IRLS methods on the ODL problem and the DPCP formulation of the RSR problem. However, it remains open whether the solution found by sequential ManPPA is a stationary point of problem (4). We leave this question for future research.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. NUMERICAL EXPERIMENTS</head><p>In this section, we compare our proposed ManPPA, StManPPA, and sequential ManPPA with the existing methods PSGM-MBLS, ALP, and IRLS. We do not include RSGM with diminishing step sizes in our comparison, as the numerical results in <ref type="bibr">[9]</ref> show that they are slower than PSGM-MBLS and cannot achieve high accuracy. In all the tests, we used the step size t = 0.1 for ManPPA and set the maximum number of iterations of ManPPA, inexact ALM, and SSN to 100, 30, and 20, respectively. We stopped ManPPA if the relative change in the objective values satisfies |f (x k ) -f (x k-1 )|/f (x k-1 ) &#8804; 10 -9 . In the k-th iteration of ManPPA, we stopped the inexact ALM if the primal and dual feasibility of problem <ref type="bibr">(11)</ref> </p><p>For SSN, we used the same termination criteria as the ones given in <ref type="bibr">[25]</ref> and solved the linear equation (17a) by Cholesky decomposition. We refer the reader to the companion technical report <ref type="bibr">[27]</ref> for the setting of the parameters in the inexact ALM and SSN.</p><p>For StManPPA, we used the piecewise geometrically diminishing step sizes t k = &#946; k/p t 0 for k = 0, 1, . . . , pT with t 0 = 0.6 and T = 500. Such step sizes are motivated by those used in PSGM <ref type="bibr">[9]</ref>. We use StManPPA-&#946; to specify the parameter &#946; used in the algorithm. We stopped the algorithm if the relative change in the objective values satisfies |f (x k ) -f (x k-1 )|/f (x k-1 ) &#8804; 10 -12 . For PSGM-MBLS, ALP, and IRLS, we used their default settings of the parameters. We stopped ALP if the change in the objective values satisfies |f (x k ) -f (x k-1 )| &#8804; 10 -6 , while we stopped IRLS if the change in the objective values satisfies |f (x k ) -f (x k-1 )| &#8804; 10 -11 . With these stopping criteria, the solutions returned by these algorithms usually achieve the same accuracy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. DPCP Formulations of the RSR Problem</head><p>In this section, we consider the DPCP formulations of the RSR problem. For the recovery of a vector that is orthogonal to the inlier subspace (the vector case), we compared the performance of ManPPA, StManPPA, PSGM-MBLS, ALP, and IRLS on problem <ref type="bibr">(1)</ref>. For the recovery of the entire inlier subspace of known dimension d (the matrix case), we compared the performance of sequential ManPPA, PSGM-MBLS, ALP, and IRLS on problem (4) with q = n -d.</p><p>1) Synthetic Data: We first test the algorithms on synthetic data. The data matrix takes the form Y = [X, O] &#8712; R n&#215;(p1+p2) . We generated the inlier data X as X = QC, where Q &#8712; R n&#215;d is an orthonormal matrix and C &#8712; R d&#215;p1 is a coefficient matrix. The matrix Q was generated by orthonormalizing an n &#215; d standard Gaussian random matrix via QR decomposition, while the matrix C was generated as a d &#215; p 1 standard Gaussian random matrix. The outlier data O &#8712; R n&#215;p2 was generated as a standard Gaussian random matrix. Finally, the columns of the matrix Y were normalized. The d-dimensional subspace spanned by X is denoted by S and its orthogonal complement is denoted by S &#8869; .</p><p>Vector case. We set the initial point x 0 of all algorithms to be the eigenvector of Y Y corresponding to the eigenvalue with minimum magnitude. We compared the performance of the algorithms on problem (1) with different dimension n, number of inliers p 1 , and number of outliers p 2 in Figure <ref type="figure">1</ref>. The first row of Figure <ref type="figure">1</ref> reports the principal angle 1 1 The principal angle is the distance between x k and S &#8869; . Any optimal solution x * to problem (1) is orthogonal to the inlier subspace S [9, <ref type="bibr">Theorem 1]</ref>. Using the Lipschitz continuity of f , we know that &#952; also measures the function value gap f (x) -f (x * ).</p><p>&#952; between x k and S &#8869; versus the iteration number. The second row reports &#952; versus CPU time. Note that x k = sin(&#952;)n + cos(&#952;)s, where n = Proj S (x k )/ Proj S (x k ) 2 and s = Proj S &#8869; (x k )/ Proj S &#8869; (x k ) 2 . From Figure <ref type="figure">1</ref>, we see that PSGM-MBLS is the fastest, while ManPPA is slightly slower. However, they are both much faster than other compared methods. Moreover, the principal angle &#952; of ManPPA decreases much faster than PSGM-MBLS in terms of iteration number. This can be attributed to the quadratic convergence rate of ManPPA (Theorem 2). In Figure <ref type="figure">2</ref> we report the quadratic fitting curves of the different algorithms. As shown in <ref type="bibr">[9]</ref>, the DPCP formulation can tolerate O((#inliers) 2 ) outliers; i.e., p 2 = O(p 2 1 ). For different p 2 &#8712; {40, 80, 120, . . . , 600}, we find the smallest p 1 &#8712; {60, 70, 80, . . . , 260} such that &#952; &lt; 10 -1 . Here, the principal angle &#952; is the mean value of 10 trials; i.e., we find pairs (p 1 , p 2 ) such that for a fixed p 1 , p 2 is the largest number of outliers that can be tolerated. We then use a quadratic function to fit these pairs (p 1 , p 2 ). A higher curve indicates that more outliers can be tolerated and hence the algorithm is more robust. From Figure <ref type="figure">2</ref>, we see that the curve corresponding to PSGM-MBLS is the lowest one and thus the least robust, while ManPPA and StManPPA are more robust. In Figure <ref type="figure">3</ref> we report the CPU time versus p 1 and p 2 . For each algorithm, the shadow area represents the standard deviation (std) of 10 random trials, while the line within the shadow is the mean of those trials. From the left two subfigures of Figure <ref type="figure">3</ref>, we see that the stds of IRLS and ALP are quite significant. In particular, they are usually more than ten times larger than the stds of other compared algorithms. To better illustrate the stds of the other four algorithms, we plot their CPU times in the right two subfigures of Figure <ref type="figure">3</ref>. From these two figures, we see that ManPPA has a larger std than those of the other three algorithms. Overall, we see that PSGM-MBLS is the fastest and ManPPA is second, and they are both much faster than the other compared algorithms. Figures <ref type="figure">2</ref> and<ref type="figure">3</ref> suggest that ManPPA is slightly slower than PSGM-MBLS but is more robust. Moreover, the choice of the parameter &#946; for StManPPA is crucial and challenging, as StManPPA-0.9 is faster but less robust than StManPPA-0.8. We leave the determination of the best parameter &#946; for StManPPA as a future work.  Matrix case. We solved problem (4) using sequential ManPPA and compared its performance with PSGM-MBLS (applied to (24)), ALP, and IRLS. We report the results for q = 2 and q = 4 in Figures <ref type="figure">4</ref> and<ref type="figure">5</ref>, respectively. Since ALP has a very high std, for better illustration of the CPU time comparison, we exclude it from the right two subfigures of Figures <ref type="figure">4</ref> and<ref type="figure">5</ref>. The results suggest that sequential ManPPA is not as robust as IRLS but is the second most efficient one among the four compared algorithms. Moreover, we find that although PSGM-MBLS is very efficient, its fitting curve is not very good. This is due to the fact that PSGM-MBLS is very sensitive to the choice of step size.</p><p>2) Real 3D Point Cloud Road Data: Next, we compared ManPPA with PSGM-MBLS on the road detection challenge of the KITTI dataset <ref type="bibr">[33]</ref>. This dataset contains image data together with the corresponding 3D points collected by a rotating 3D laser scanner. Similar to <ref type="bibr">[9]</ref>, we only used the 360&#176;3 D point clouds to determine which points lie on the road plane (inliers) and which do not (outliers). By using homogeneous coordinates, this can be cast as a robust hyperplane learning problem (1) in R 4 . As reported in <ref type="bibr">[9]</ref>, PSGM-MBLS is the fastest algorithm when compared with other state-of-the-art methods. Thus, we only compared the performance of ManPPA and PSGM-MBLS on problem <ref type="bibr">(1)</ref>. Table I reports the area under the Receiver Operator Curve (ROC) and the CPU time. We see that all ROC values of ManPPA are better than those of PSGM-MBLS, with some sacrifice on the CPU time.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. ODL Problem</head><p>We generated instances of the ODL problem by first randomly generating an orthogonal matrix X &#8712; R n&#215;n and a Bernoulli-Gaussian matrix &#194; &#8712; R n&#215;p with parameter &#947; (see, e.g., <ref type="bibr">[3]</ref>), then setting Y = X &#194;.</p><p>Vector case. We first compared the performance of ManPPA and StManPPA with ALP, IRLS, and PSGM-MBLS on problem (1) with n = 30, p = 10n 1.5 . Figures 6 and 7  report the iteration numbers and CPU times of the compared algorithms. The quantity &#952; is the angle between x k returned by the algorithm and its nearest column in X. From Figures <ref type="figure">6</ref> and<ref type="figure">7</ref>, we see that PSGM-MBLS is the fastest algorithm in terms of CPU time, while ManPPA is slightly slower. However, they are both much faster than the other compared algorithms. Moreover, we see that ManPPA is much faster than PSGM-MBLS in terms of iteration number. This again can be attributed to the quadratic convergence rate of ManPPA (Theorem 2).</p><p>In Figures <ref type="figure">8</ref> and<ref type="figure">9</ref> we report the linear fitting curves for log(n) and log(p) and CPU times of ManPPA, StManPPA-0.8, StManPPA-0.9, IRLS, ALP, and PSGM-MBLS. Note that for the ODL problem, it has been found empirically in <ref type="bibr">[6]</ref> that the sample size p and dimension n should satisfy Here the principal angle &#952; is the mean value of 10 trials. We then use a linear function to fit the points {(log(n), log(p))} n,p . From Figures <ref type="figure">8</ref> and<ref type="figure">9</ref>, we find that PSGM-MBLS is the fastest but its fitting curve is high, which suggests that it is not robust. This is because PSGM-MBLS is very sensitive to the choice of step size. ManPPA appears to be the second fastest but is very robust based on the fitting curve. Matrix case. To find the entire orthogonal basis, we use sequential ManPPA to solve problem (4) with q = n. We compared sequential ManPPA with PSGM-MBLS (applied to <ref type="bibr">(24)</ref>) and SLP based on ALP, and report the results in Figures <ref type="figure">10</ref> and<ref type="figure">11</ref>. We see that sequential ManPPA is not as robust as ALP, but it is much faster than ALP. Note that there is nothing for IRLS to do, as it tackles the objective function Y X 1,2 , which is a constant when X X = I n . We also find that the fitting curve of PSGM-MBLS is very high, which indicates that it fails to recover the dictionary in many cases. Again, this is due to the high sensitivity of PSGM-MBLS to the choice of step size.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. CONCLUSIONS</head><p>In this paper, we presented ManPPA and its stochastic variant StManPPA for solving problem <ref type="bibr">(1)</ref>. By exploiting the manifold structure of the constraint set M, these methods not only are practically efficient but also possess convergence guarantees that are provably superior to those of existing subgradient-type methods. Using ManPPA as a building block, we also proposed a new sequential approach to solving the matrix analog (4) of problem (1). We conducted extensive numerical experiments to compare the performance of our proposed algorithms with existing ones on the ODL problem and DPCP formulation of the RSR problem. The results demonstrated the efficiency and efficacy of our proposed methods.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>APPENDIX</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Useful Properties of Proj M</head><p>In this section, we collect some useful properties of the projector Proj M . Proposition 3. For any x &#8712; M and d &#8712; R n satisfying d x = 0 (i.e., d is a tangent vector at x), we have Moreover, if d 2 &#8804; D for some D &#8712; (0, +&#8734;), then</p><p>Proof. It is straightforward to verify that</p><p>We then have (25) by using the fact that</p><p>we get <ref type="bibr">(26)</ref>. Proposition 4. For any x, z &#8712; M and d &#8712; R n satisfying d x = 0, we have Proof. We compute</p><p>where <ref type="bibr">(27)</ref> follows from the fact that x + d</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Proof of Proposition 1</head><p>Since f is Lipschitz with constant L and d k x k = 0, we have by Proposition 3. Hence, for any &#945; &#8712; (0, &#8113;], we have</p><p>where (28a) follows from the convexity of f , (28b) holds because the strong convexity of the objective function in subproblem ( <ref type="formula">6</ref>), together with the optimality of d = d k and feasibility of d = 0 for (6), implies that f <ref type="figure"/>and<ref type="figure">(28c</ref>) is due to &#945; &#8804; 1/(tL). If t &#8804; 1/L, then &#8113; = 1. This completes the proof.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Proof of Theorem 2</head><p>We begin with two preparatory results. The first states that the restriction of the objective function f in (1) on the nonconvex constraint set M satisfies a Riemannian subgradient inequality, which means that with respect to its Riemannian subgradient, the function f behaves almost like a convex function on M. Proposition 5. Let x &#8712; M and d &#8712; R n be such that d x = 0. Define x + = x + d. Then, for any z &#8712; M and s &#8712; &#8706;f (x + ), we have Proof. Since f is convex on R n , we have</p><p>where the second equality is due to x x = 1 and d x = 0, and the inequality follows from the fact that |x z -1| = The second establishes a key recursion for the iterates generated by ManPPA. Proposition 6. Let {x k } k be the sequence generated by Algorithm 1 with t &#8804; 1/L. For any x &#8712; M, we have</p><p>Proof. Since t &#8804; 1/L, we have x k+1 = Proj M (x k + d k ) by from Proposition 1. From the optimality condition of the subproblem (6), there exists an</p><p>Denoting x k + = x k + d k , we have We are now ready to prove Theorem 2. We first prove (9) by induction. Let x * &#8712; X be such that dist(x k , X ) = x k -x * 2 . By invoking Proposition 6 with x = x * , we have</p><p>where the last inequality follows from <ref type="bibr">(8)</ref>. Consider the function [0, &#948;] s &#8594; &#966;(s) = (1 + tL)s 2 -2t&#945;s + t 2 L 2 . Observe that &#966; attains its maximum at s = &#948; if &#948; &#8805; 2t&#945; 1+tL . Given that t &#8804; min In particular, we have dist(x k+1 , X ) &#8804; &#948; whenever dist(x k , X ) &#8804; &#948;. This establishes <ref type="bibr">(9)</ref>. Next, we prove <ref type="bibr">(10)</ref>. Again, let x * &#8712; X be such that dist(x k , X ) = x kx * 2 . Since &#945; &#8804; L by (8), we have &#948; &#8804; 1. This implies that x k x * = 2-x k -x * 2 2 2 &#8805; 1 2 . Hence, the vector dk = x *</p><p>x k x * -x k is well defined and satisfies dk x k = 0 (i.e., dk is a tangent vector at x k ), Proj M (x k + dk ) = x * , and dk 2 &#8804; &#8730; 3. By the strong convexity of the objective function in subproblem <ref type="bibr">(6)</ref> and noting the optimality of d k and feasibility of dk for (6), we have</p><p>Furthermore, by the Lipschitz continuity of f and Proposition 3, we get</p><p>Combining ( <ref type="formula">31</ref>), (32a), and (32b), we have</p><p>Since t &#8804; &#948; 2&#945;-L&#948; &#8804; 1/L, we have 1 2t -L 2 &#8805; 0. Moreover, since dist(x k+1 , X ) &#8804; &#948; &#8804; &#948;, we have f (x k+1 ) -f (x * ) &#8805; &#945; &#8226; dist(x k+1 , X ) by <ref type="bibr">(8)</ref>. It then follows from <ref type="bibr">(33)</ref> and Proposition 3 that</p><p>This completes the proof. (34)</p><p>From the optimality condition of (18), we get</p><p>Hence, we compute</p><p>where (35a) follows from Proposition 5 and (35b) is due to the Lipschitz continuity of f j k . Upon taking the expectation on both sides of (35) with respect to j k conditioned on x k , we obtain</p><p>where (36a) follows from the fact that Upon summing the above inequality over k = 0, 1, . . . T and noting that &#955; &lt; 1/(p L) and e &#955; (z) &#8805; 0 for any z &#8712; R n , we obtain</p><p>Upon dividing both sides of the above inequality by T k=0 t k and noting that the left-hand side becomes E &#920; &#955; ( x) 2 , the proof is complete.</p></div></body>
		</text>
</TEI>
