<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Outlier-Robust Sparse Estimation via Non-Convex Optimization</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10398963</idno>
					<idno type="doi"></idno>
					<title level='j'>Advances in Neural Information Processing Systems</title>
<idno type="issn">1049-5258</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Yu Cheng</author><author>Ilias Diakonikolas</author><author>Rong Ge</author><author>Shivam Gupta</author><author>Daniel Kane</author><author>Mahdi Soltanolkotabi</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We explore the connection between outlier-robust high-dimensional statistics and non-convex optimization in the presence of sparsity constraints, with a focus on the fundamental tasks of robust sparse mean estimation and robust sparse PCA. We develop novel and simple optimization formulations for these problems such that any approximate stationary point of the associated optimization problem yields a near-optimal solution for the underlying robust estimation task. As a corollary, we obtain that any first-order method that efficiently converges to stationarity yields an efficient algorithm for these tasks. The obtained algorithms are simple, practical, and succeed under broader distributional assumptions compared to prior work.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>In several modern machine learning (ML) applications, such as ML security [BNJT10, BNL12, SKL17, DKK + 19a] and exploratory analysis of real datasets, e.g., in population genetics [RPW + 02, PLJD10, LAT + 08, DKK + 17], typical datasets contain a non-trivial fraction of arbitrary (or even adversarial) outliers. Robust statistics [HRRS86, HR09] is the subfield of statistics aiming to design estimators that are tolerant to a constant fraction of outliers, independent of the dimensionality of the data. Early work in this field, see, e.g., <ref type="bibr">[Tuk60,</ref><ref type="bibr">Hub64,</ref><ref type="bibr">Tuk75]</ref>, developed sample-efficient robust estimators for various basic tasks, alas with runtime exponential in the dimension.</p><p>During the past five years, a line of work in computer science, starting with [DKK + 16, LRV16], has developed the first computationally efficient robust high-dimensional estimators for a range of tasks. This progress has led to a revival of robust statistics from an algorithmic perspective (see, e.g., [DK19, DKK + 21] for surveys on the topic). In this work, we focus on high-dimensional estimation tasks in the presence of sparsity constraints. To rigorously study these problems, we need to formally define the model of data corruption. Throughout this work, we use the following standard contamination model.</p><p>Definition 1.1 (Strong Contamination Model, see [DKK + 16]). Given a parameter 0 &lt; ε &lt; 1/2 and a distribution family D on R^d, the adversary operates as follows: The algorithm specifies a number of samples n, and n samples are drawn from some unknown D ∈ D. The adversary is allowed to inspect the samples, remove up to εn of them, and replace them with arbitrary points. This modified set of n points is then given as input to the algorithm. 
We say that a set of samples is ε-corrupted if it is generated by the above process.</p><p>High-dimensional robust statistics is algorithmically challenging because the natural optimization formulations of such tasks are typically non-convex. The recent line of work on algorithmic robust statistics has led to a range of sophisticated algorithms. In some cases, such algorithms require solving large convex relaxations, rendering them computationally prohibitive for large-scale problems. In other cases, they involve a number of hyper-parameters that may require careful tuning. Motivated by these shortcomings of known algorithms, recent work [CDGS20, ZJS20] established an intriguing connection between high-dimensional robust estimation and non-convex optimization. The high-level idea is quite simple: Even though typical robust statistics tasks lead to non-convex formulations, it may still be possible to leverage the underlying structure to show that standard first-order methods provably and efficiently reach near-optimal solutions. Indeed, [CDGS20, ZJS20] were able to prove such statements for robust mean estimation under natural distributional assumptions. Specifically, these works established that any (approximate) stationary point of a well-studied non-convex formulation for robust mean estimation yields a near-optimal solution for the underlying robust estimation task.</p><p>In this work, we continue this line of inquiry with a focus on sparse estimation tasks. Leveraging sparsity in high-dimensional datasets is a fundamental problem of significant practical importance. Various formalizations of this problem have been investigated in statistics and machine learning for at least the past two decades (see, e.g., [HTW15] for a textbook on the topic). We focus on robust sparse mean estimation and robust sparse PCA. 
Sparse mean estimation is arguably one of the most fundamental sparse estimation tasks and is closely related to the Gaussian sequence model <ref type="bibr">[Tsy08,</ref><ref type="bibr">Joh17]</ref>. The task of sparse PCA in the spiked covariance model, initiated in [Joh01], has been extensively investigated (see Chapter 8 of [HTW15] and references therein).</p><p>In the context of robust sparse mean estimation, we are given an ε-corrupted set of samples from a distribution with unknown mean µ ∈ R^d, where µ is k-sparse, and we want to compute an estimate µ̂ close to µ. In the context of robust sparse PCA (in the spiked covariance model), we are given an ε-corrupted set of samples from a distribution with covariance matrix I + ρvv^⊤, where v ∈ R^d is k-sparse, and the goal is to approximate v. It is worth noting that for both problems, we have access to far fewer samples compared to the non-sparse case (roughly O(k² log d) instead of Ω(d)). Consequently, the design and analysis of optimization formulations for robust sparse estimation requires new ideas and techniques that significantly deviate from the standard (non-sparse) case.</p></div>
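To make the contamination model concrete, here is a minimal NumPy sketch (the helper names and the particular "adversary" are our own illustration, not from the paper or its released code) that produces an ε-corrupted sample set from a Gaussian with k-sparse mean:

```python
import numpy as np

def sparse_mean(d, k, value=1.0):
    """A k-sparse mean vector: `value` on the first k coordinates."""
    mu = np.zeros(d)
    mu[:k] = value
    return mu

def corrupt(samples, eps, rng):
    """Replace an eps-fraction of rows with arbitrary points.
    This toy 'adversary' plants constant outliers; the model allows it to
    inspect the clean samples and choose the replacements arbitrarily."""
    n, d = samples.shape
    m = int(eps * n)
    idx = rng.choice(n, size=m, replace=False)
    samples = samples.copy()
    samples[idx] = 10.0  # arbitrary replacement points
    return samples

rng = np.random.default_rng(0)
d, k, n, eps = 100, 5, 2000, 0.1
mu = sparse_mean(d, k)
clean = rng.standard_normal((n, d)) + mu   # n samples from N(mu, I)
corrupted = corrupt(clean, eps, rng)       # eps-corrupted version
```

A real adversary is of course far more powerful than this planted-outlier example; the algorithms in the paper must work against any such modification.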
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Our Results and Contributions</head><p>We show that standard first-order methods lead to robust and efficient algorithms for sparse mean estimation and sparse PCA. Our main contribution is to propose novel (non-convex) formulations for these robust estimation tasks, and to show that approximate stationarity suffices for near-optimality. We establish landscape results showing that any approximate stationary point of our objective function yields a near-optimal solution for the underlying robust estimation task. Consequently, gradient descent (or any other method converging to stationarity) can solve these problems.</p><p>Our results provide new insights and techniques for designing and analyzing (non-convex) optimization formulations of robust estimation tasks. Our formulations and structural results immediately lead to simple and practical algorithms for robust sparse estimation. Importantly, the gradient of our objectives can be computed efficiently via a small number of basic matrix operations. In addition to their simplicity and practicality, our methods provably succeed under more general distributional assumptions compared to prior work.</p><p>For robust sparse mean estimation and robust sparse PCA, our landscape results require deterministic conditions on the original set of good samples. We refer to these conditions as stability conditions (Definitions 2.1 and 2.2, formally defined in Section 2). At a high level, they state that the first and second moments of a set of samples are stable when any ε-fraction of the samples is removed. These stability conditions hold with high probability for a set of clean samples drawn from natural families of distributions (e.g., subgaussian).</p><p>For robust sparse mean estimation, we establish the following result.</p><p>Theorem 1.2 (Robust Sparse Mean Estimation). Let 0 &lt; ε &lt; ε₀ for some universal constant ε₀, and let δ &gt; ε. 
Let G be a set of n samples that is (k, ε, δ)-stable (per Definition 2.1) w.r.t. a distribution with unknown k-sparse mean µ ∈ R^d, and let S be an ε-corrupted version of G. There is an algorithm that on inputs S, k, ε, and δ, runs in polynomial time and returns a k-sparse vector µ̂ with ‖µ̂ − µ‖₂ = O(δ).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We emphasize that a key novelty of Theorem 1.2 is that the underlying algorithm is a first-order method applied to a novel non-convex formulation of the problem. The major advantage of our algorithm over prior work [BDLS17, DKK + 19b] is its simplicity, practicality, and the fact that it seamlessly applies to a wider class of distributions on the clean data.</p><p>As we will discuss in Section 3, when the ground-truth distribution D is subgaussian with unknown k-sparse mean µ ∈ R^d and identity covariance, a set of n = Ω(k² log d/ε²) samples drawn from D is (k, ε, δ)-stable (Definition 2.1) with high probability for δ = O(ε√(log(1/ε))). It follows as an immediate corollary of Theorem 1.2 that, given an ε-corrupted<ref type="foot">foot_2</ref> set of samples, we can compute a vector µ̂ that is O(δ) = O(ε√(log(1/ε))) close to the true mean µ. This sample complexity matches the known computational-statistical lower bounds [DKS17, BB20]. More generally, one can relax the concentration assumption on the clean data and obtain qualitatively similar error guarantees.</p><p>Next we state our main result for robust sparse PCA.</p><p>Theorem 1.3 (Robust Sparse PCA). Let 0 &lt; ρ ≤ 1 and 0 &lt; ε &lt; ε₀ for some universal constant ε₀. Let G be a set of n samples that is (k, ε, δ)-stable (as in Definition 2.2) w.r.t. a centered distribution with covariance Σ = I + ρvv^⊤, for an unknown k-sparse unit vector v ∈ R^d, and let S be an ε-corrupted version of G. There is an algorithm that on inputs S, k, and ε, runs in polynomial time and returns a unit vector u with ‖uu^⊤ − vv^⊤‖_F = O(√δ/ρ).</p><p>Interestingly, our algorithm for robust sparse PCA is a first-order method applied to a simple convex formulation of the problem. 
We view the existence of a convex formulation as an intriguing fact that, surprisingly, was not observed in prior work. As we will discuss in Section 4, when the ground-truth distribution D is centered subgaussian with covariance Σ = I + ρvv^⊤, for an unknown k-sparse unit vector v ∈ R^d, a set of n = Ω(k² log d/ε²) samples drawn from D is (k, ε, δ)-stable (Definition 2.2) with high probability for δ = O(ε log(1/ε)). Therefore, our algorithm outputs a vector that is O(√(ε log(1/ε))/ρ) close to the true direction v. The sample complexity in this case nearly matches the computational-statistical lower bound of Ω(k² log d/ε²) [BR13], which holds even without corruptions. While the error guarantee of our algorithm is slightly worse compared to prior work [BDLS17, DKK + 19b] for Gaussian data (we get O(√δ/ρ) rather than O(δ/ρ)), we note that our algorithm works for a broader family of distributions.</p><p>Prior Work on Robust Sparse Estimation. We provide a detailed summary of prior work for comparison. [BDLS17] obtained the first sample-efficient and polynomial-time algorithms for robust sparse mean estimation and robust sparse PCA. These algorithms succeed for Gaussian inliers and inherently use the ellipsoid method. The separation oracle required for the ellipsoid algorithm turns out to be another convex program, namely an SDP for solving sparse PCA. As a consequence, the running time of these algorithms, while polynomially bounded, is impractically high. [LLC19] proposed an algorithm for robust sparse mean estimation via iterative trimmed hard thresholding, which can only tolerate a sub-constant fraction of corruptions. [DKK + 19b] gave iterative spectral robust algorithms for sparse mean estimation and sparse PCA. These algorithms are still quite complex and are only shown to succeed under Gaussian inliers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">Overview of Our Approach</head><p>In this section, we give an overview of our approach for robust sparse mean estimation. At a very high level, we assign a nonnegative weight to each data point and try to find a good (fractional) set of (1 − ε)n samples. Formally, given n datapoints (X_i)_{i=1}^n, the goal is to find a weight vector w ∈ R^n such that µ_w = Σ_i w_i X_i is close to the true mean µ. The constraint on w is that it belongs to the set Δ_{n,ε}, the convex hull of all uniform distributions over subsets S ⊆ [n] of size |S| = (1 − ε)n; in other words, w must represent a (fractional) subset containing at least a (1 − ε)-fraction of the input dataset.</p><p>Let Σ_w = Σ_i w_i (X_i − µ_w)(X_i − µ_w)^⊤ denote the weighted empirical covariance matrix. It is well known that if one can find w ∈ Δ_{n,ε} that minimizes the weighted empirical variance v^⊤Σ_w v over all k-sparse unit vectors v, then µ_w must be close to µ. Unfortunately, it is NP-hard to find the sparse direction v with the largest variance. To get around this issue, [BDLS17] considered the following convex relaxation, minimizing the variance over convex combinations of sparse directions:</p><p>Given w, the optimal A can be found using semidefinite programming (SDP). [ZJS20] observed that any stationary point w of (1) gives a good solution for robust sparse mean estimation. However, solving (1) requires convex programming to compute the gradient in each iteration. 
As we explain below, our approach circumvents this shortcoming, leading to a formulation in which each gradient can be computed using only basic matrix operations.</p><p>In this work, we propose and analyze the following optimization formulation:</p><p>where ‖A‖_{F,k,k} is the Frobenius norm of the k² entries of A with largest magnitude, under the additional constraint that these k² entries are chosen from k rows with k entries in each row.</p><p>We prove that any stationary point of f(w) yields a good solution for robust sparse mean estimation.</p><p>Here we provide a brief overview of our proof (see Section 3 for more details). Given a weight vector w, we show that if w is not a good solution, then moving toward w* (the weight vector corresponding to the uniform distribution on the clean input samples) will decrease the objective value. Formally, we will show that, for any 0 &lt; η &lt; 1,</p><p>We can then take the ‖·‖_{F,k,k} norm on both sides (after subtracting I) and show that the third term can essentially be ignored. If the third term were not there, we would have</p><p>Therefore, if w is a bad solution with f(w) much larger than f(w*), then w cannot be a stationary point, because f decreases as we move from w to (1 − η)w + ηw*.</p><p>Remark 1.4. The technical overview for robust sparse PCA follows a similar high-level approach, but is somewhat more technical. It is deferred to Section 4.</p></div>
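The ‖·‖_{F,k,k} norm described above can be computed exactly by a greedy rule, because the selection decomposes row by row: within each row the best choice is that row's k largest-magnitude entries, and one then keeps the k rows with the most selected squared mass. A small sketch (the helper is our own, written directly from the definition):

```python
import numpy as np

def norm_F_k_k(A, k):
    """||A||_{F,k,k}: the largest Frobenius norm over k^2 entries of A
    chosen from k rows with k entries per row.  Greedy is exact: the
    row choices are independent, so take each row's top-k squared
    entries, then keep the k rows with the largest selected mass."""
    sq = np.square(A)
    # sum of the k largest squared entries within each row
    row_mass = np.sort(sq, axis=1)[:, -k:].sum(axis=1)
    # keep the k best rows
    top_rows = np.sort(row_mass)[-k:]
    return float(np.sqrt(top_rows.sum()))
```

For example, on the 5×5 identity with k = 2 the selected entries are two diagonal ones, giving the value √2.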
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Roadmap.</head><p>In Section 2, we introduce basic notation and the deterministic stability conditions that we require of the good samples. We present our algorithms and analysis for robust sparse mean estimation in Section 3 and robust sparse PCA in Section 4. In Section 5, we evaluate our algorithm on synthetic datasets and show that it achieves good statistical accuracy under various noise models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Preliminaries and Background</head><p>Notation. For a positive integer n, let [n] = {1, . . . , n}. For a vector v, we use ‖v‖₀, ‖v‖₁, ‖v‖₂, and ‖v‖_∞ for the number of non-zeros and the ℓ₁, ℓ₂, and ℓ_∞ norms of v, respectively. Let I be the identity matrix. For a matrix A, we use ‖A‖₂, ‖A‖_F, and tr(A) for the spectral norm, Frobenius norm, and trace of A, respectively. For two vectors x, y, let x^⊤y denote their inner product. For two matrices A, B, we use A • B = tr(A^⊤B) for their entrywise inner product. A matrix A is said to be positive semidefinite (PSD) if x^⊤Ax ≥ 0 for all x. We write A ⪯ B iff (B − A) is PSD.</p><p>For a vector w ∈ R^n, let diag(w) ∈ R^{n×n} denote the diagonal matrix with w on the diagonal. For a matrix A ∈ R^{n×n}, let diag(A) ∈ R^n denote the column vector formed by the diagonal of A. For a vector v ∈ R^d and a set S ⊆ [d], we write v_S ∈ R^d for the vector that is equal to v on S and zero everywhere else. Similarly, for a matrix A ∈ R^{d×d} and a set S ⊆ ([d] × [d]), we write A_S for the matrix that is equal to A on S and zero everywhere else.</p><p>For a vector v, we define ‖v‖_{2,k} = max_{|S|=k} ‖v_S‖₂ to be the maximum ℓ₂-norm of any k entries of v. For a matrix A, we define ‖A‖_{F,k²} to be the maximum Frobenius norm of any k² entries of A. Moreover, we define ‖A‖_{F,k,k} to be the maximum Frobenius norm of any k² entries with the extra requirement that these entries must be chosen from k rows with k entries in each row. Formally,</p><p>Sample Reweighting Framework. We use n for the number of samples, d for the dimension, and ε for the fraction of corrupted samples. For sparse estimation, we use k for the sparsity of the ground-truth parameters. We use G for the original set of n good samples. 
We use S for the input samples after the adversary has replaced an ε-fraction of the original good samples: S = G ∪ B, where G is the set of remaining good samples and B is the set of bad samples (outliers) added by the adversary. Note that |G| = (1 − ε)n and |B| = εn.</p><p>Given n samples X_1, . . . , X_n, we write X ∈ R^{d×n} for the sample matrix whose i-th column is X_i. For a weight vector w ∈ R^n, we use µ_w = Xw = Σ_i w_i X_i for the weighted empirical mean and</p><p>for the weighted empirical covariance. Let Δ_{n,ε} be the convex hull of all uniform distributions over subsets S ⊆ [n] of size |S| = (1 − ε)n.</p><p>In other words, every w ∈ Δ_{n,ε} corresponds to a fractional set of (1 − ε)n samples. We use w* to denote the uniform distribution on G (the remaining good samples in S).</p></div>
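The weighted empirical mean and covariance above reduce to two matrix products. A minimal sketch (the helper name is ours, following the definitions in this section):

```python
import numpy as np

def weighted_moments(X, w):
    """Weighted empirical mean and covariance.
    X: d x n sample matrix (column i is sample X_i); w: weights summing to 1."""
    mu_w = X @ w                            # mu_w = sum_i w_i X_i
    centered = X - mu_w[:, None]
    # sigma_w = sum_i w_i (X_i - mu_w)(X_i - mu_w)^T
    sigma_w = (centered * w) @ centered.T
    return mu_w, sigma_w
```

For the uniform weight vector w = (1/n, …, 1/n) this recovers the usual empirical mean and (uncentered-by-n) empirical covariance.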
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Deterministic Stability Conditions.</head><p>For robust sparse mean estimation and robust sparse PCA, we require the following conditions, respectively. Definition 2.1 (Stability Conditions for Sparse Mean). A set of n samples G = (X_i)_{i=1}^n is said to be (k, ε, δ)-stable (w.r.t. a distribution with mean µ) iff for any weight vector w ∈ Δ_{n,2ε}, we have ‖µ_w − µ‖_{2,k} ≤ δ and ‖Σ_w − I‖_{F,k,k} ≤ δ²/ε, where µ_w and Σ_w are the weighted empirical mean and covariance matrix, respectively, and the ‖·‖_{F,k,k} norm is defined in Equation (2).</p></div>
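The first stability quantity of Definition 2.1 is easy to check numerically for any particular weight vector, e.g., the uniform distribution on a candidate subset. A small sketch (helper names are ours; only the ‖µ_w − µ‖_{2,k} part of the condition is illustrated):

```python
import numpy as np

def norm_2k(v, k):
    """||v||_{2,k} = max_{|S|=k} ||v_S||_2: l2 mass of the k
    largest-magnitude entries of v."""
    return float(np.sqrt(np.sort(np.square(v))[-k:].sum()))

def mean_stability_gap(X, mu, k, subset):
    """||mu_w - mu||_{2,k} for the uniform weight vector on `subset`.
    X: d x n sample matrix; mu: reference mean of the distribution."""
    mu_w = X[:, subset].mean(axis=1)
    return norm_2k(mu_w - mu, k)
```

Scanning this gap over many (1 − 2ε)n-sized subsets gives an empirical feel for the δ in the definition, though the definition quantifies over all of Δ_{n,2ε}.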
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition 2.2 (Stability Conditions for Sparse PCA). A set of</head><p>First-Order Stationary Points. We give a formal definition of the notion of (approximate) first-order stationary point that we use in this paper. Definition 2.3 (Approximate Stationary Points). Fix a convex set K and a differentiable function f . For γ ≥ 0, we say that x ∈ K is a γ-stationary point of f iff the following condition holds: For any unit vector u where x + αu ∈ K for some α &gt; 0, we have u^⊤∇f(x) ≥ −γ.</p><p>We note that the objective functions studied in this paper are not everywhere differentiable. This is because, taking the ‖·‖_{F,k,k} norm as an example, there can be ties in choosing the largest k² entries. When the function f is not differentiable, we use ∇f informally to denote an element of the subdifferential. We will show in Appendix C that, while f is not differentiable, it does have a nonempty subdifferential, as it can be written as the pointwise maximum of differentiable functions.</p></div>
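First-order methods over Δ_{n,ε} need a Euclidean projection onto that set, which is a "capped simplex": 0 ≤ w_i ≤ 1/((1 − ε)n) with Σ_i w_i = 1. A standard bisection on the dual shift gives a simple projection routine (this is our own implementation choice; the paper does not prescribe a specific projection subroutine here):

```python
import numpy as np

def project_capped_simplex(v, cap):
    """Euclidean projection of v onto {w : 0 <= w_i <= cap, sum_i w_i = 1}.
    Delta_{n,eps} is this set with cap = 1/((1-eps)*n).  The KKT conditions
    give w = clip(v - t, 0, cap) for a scalar shift t; we bisect on t so
    that the coordinates sum to 1 (feasible whenever cap * len(v) >= 1)."""
    lo, hi = v.min() - cap, v.max()
    for _ in range(100):  # bisection; sum is monotone decreasing in t
        t = (lo + hi) / 2.0
        if np.clip(v - t, 0.0, cap).sum() > 1.0:
            lo = t
        else:
            hi = t
    return np.clip(v - (lo + hi) / 2.0, 0.0, cap)
```

For instance, projecting (10, 0, 0, 0) with cap = 1/3 caps the first coordinate at 1/3 and spreads the remaining 2/3 mass evenly over the other three.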
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Robust Sparse Mean Estimation</head><p>In this section, we present our non-convex approach for robust sparse mean estimation. We will optimize the following objective, where ‖·‖_{F,k,k} is defined in Equation (2):</p><p>We will show that the objective function (3) has no bad stationary points (Theorem 3.1). In other words, every first-order stationary point of f yields a good solution for robust sparse mean estimation.</p><p>Our algorithm is stated in Algorithm 1. As a consequence of our landscape result (Theorem 3.1), we know that Algorithm 1 works no matter how we find a stationary point of f (because any stationary point works), so we intentionally do not specify how to find such a point. As a simple illustration, we show that (projected) gradient descent can be used to minimize f . The convergence analysis and iteration complexity are provided in Appendix C.</p><p>Algorithm 1: Robust sparse mean estimation.</p><p>Input: k &gt; 0, 0 &lt; ε &lt; ε₀, and an ε-corrupted set of samples (X_i)_{i=1}^n drawn from a distribution with k-sparse mean µ.<ref type="foot">foot_3</ref> Output: a vector µ̂ that is close to µ.</p><p>1: Find a first-order stationary point w ∈ Δ_{n,ε} of the objective min_w f(w) = ‖Σ_w − I‖_{F,k,k}. 2: Return µ̂ = (µ_w)_Q, where Q is a set of k entries of µ_w with largest magnitude.</p><p>Formally, we first prove that Algorithm 1 can output a vector µ_w ∈ R^d that is close to µ in the ‖·‖_{2,k} norm, as long as the good samples satisfy the stability condition in Definition 2.1. Theorem 3.1. Fix k &gt; 0, 0 &lt; ε &lt; ε₀, and δ &gt; ε. Let G be a set of n samples that is (k, ε, δ)-stable (as in Definition 2.1) w.r.t. 
a distribution with unknown k-sparse mean µ ∈ R^d.</p><p>Once we have a vector µ_w that is O(δ)-close to µ in the ‖·‖_{2,k} norm, we can guarantee that a truncated version of µ_w (the output µ̂ of Algorithm 1) is O(δ)-close to µ in the ℓ₂-norm: Lemma 3.2. Fix two vectors x, y with ‖x‖₀ ≤ k and ‖x − y‖_{2,k} ≤ δ. Let z be the vector that keeps the k entries of y with largest absolute values and sets the rest to 0. We have ‖x − z‖₂ ≤ √5 · δ.</p><p>Theorem 1.2 follows immediately from Theorem 3.1 and Lemma 3.2.</p><p>We can apply Theorem 1.2 to get an end-to-end result for subgaussian distributions. We show that the required stability conditions are satisfied with a small number of samples. Lemma 3.3. Fix k &gt; 0 and 0 &lt; ε &lt; ε₀. Let G be a set of n samples that are drawn i.i.d. from a subgaussian distribution with mean µ and covariance I. If n = Ω(k² log d/ε²), then with probability at least 1 − exp(−Ω(k² log d)), G is (k, ε, δ)-stable (as in Definition 2.1) for δ = O(ε√(log(1/ε))).</p><p>Combining Theorem 1.2 and Lemma 3.3, we know that given an ε-corrupted set of O(k² log d/ε²) samples drawn from a subgaussian distribution with k-sparse mean µ, the output of Algorithm 1 is O(ε√(log(1/ε)))-close to µ in ℓ₂-norm.</p><p>In the rest of this section, we will prove Theorem 3.1. Omitted proofs in this section are in Appendix A.</p><p>We start with some intuition on why we choose our objective function (3). We would like to design f(w) = g(Σ_w − I) to satisfy the following properties:</p><p>1. g(Σ_w − I) is an upper bound on v^⊤(Σ_w − I)v for all k-sparse unit vectors v ∈ R^d. This way, a small objective value implies that ‖µ_w − µ‖_{2,k} is small.</p><p>2. g(Σ_w − I) is small for w* (the uniform distribution on G). This guarantees that a good w exists.</p><p>3. Triangle inequality on g. 
This allows us to upper bound the objective value, when we move w toward w*, by the sum of g(·) applied to each term on the right-hand side:</p><p>4. g(uu^⊤) is close to g(vv^⊤), where v keeps only the k largest entries of u. We want to approximate µ in the ‖·‖_{2,k} norm, so intuitively g(Σ_w − I) should depend only on the largest k entries of (µ_w − µ).</p><p>The last step uses (1/2 − 4ε)‖Σ_w − I‖₂ ≥ (4c₁ + 1)δ², which holds if ε ≤ 1/10 and c₂² ≥ 164c₁ + 40.</p><p>It follows immediately that w cannot be a stationary point. Let u = (w* − w)/‖w* − w‖₂ and h = η‖w* − w‖₂.</p><p>We have</p><p>By Definition 2.3, we know w cannot be a γ-stationary point of f for some γ = O(n^{1/2} δ² ε^{−3/2}).</p></div>
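Putting the pieces together, Algorithm 1 can be sketched as projected subgradient descent on f(w) = ‖Σ_w − I‖_{F,k,k} followed by the truncation of Lemma 3.2. The step size, iteration count, and the simplified subgradient below are our own illustrative choices, not the tuned procedure analyzed in Appendix C:

```python
import numpy as np

def project_capped_simplex(v, cap):
    """Euclidean projection onto {w : 0 <= w_i <= cap, sum w_i = 1}
    via bisection on the dual shift (standard capped-simplex routine)."""
    lo, hi = v.min() - cap, v.max()
    for _ in range(100):
        t = (lo + hi) / 2.0
        if np.clip(v - t, 0.0, cap).sum() > 1.0:
            lo = t
        else:
            hi = t
    return np.clip(v - (lo + hi) / 2.0, 0.0, cap)

def topk_rows_support(M, k):
    """Support mask realizing ||M||_{F,k,k}: top-k entries per row,
    then the k rows with the largest selected squared mass."""
    sq = np.square(M)
    mask = np.zeros_like(M, dtype=bool)
    col_idx = np.argsort(sq, axis=1)[:, -k:]
    row_mass = np.take_along_axis(sq, col_idx, axis=1).sum(axis=1)
    for r in np.argsort(row_mass)[-k:]:
        mask[r, col_idx[r]] = True
    return mask

def robust_sparse_mean(X, k, eps, iters=300, lr=0.05):
    """Illustrative projected subgradient sketch of Algorithm 1.
    X: d x n sample matrix.  Returns a k-sparse mean estimate."""
    d, n = X.shape
    cap = 1.0 / ((1.0 - eps) * n)
    w = np.full(n, 1.0 / n)
    for _ in range(iters):
        mu_w = X @ w
        C = X - mu_w[:, None]
        sigma = (C * w) @ C.T
        M = sigma - np.eye(d)
        B = np.where(topk_rows_support(M, k), M, 0.0)
        nrm = np.linalg.norm(B)
        if nrm < 1e-12:
            break
        B /= nrm
        # g_i = (X_i - mu_w)^T B (X_i - mu_w); this differs from the exact
        # subgradient only by a shift constant across i, which the
        # projection onto the simplex slice {sum w = 1} absorbs.
        grad = np.einsum('ji,jk,ki->i', C, B, C)
        w = project_capped_simplex(w - lr * grad, cap)
    mu_w = X @ w
    out = np.zeros(d)
    top = np.argsort(np.abs(mu_w))[-k:]
    out[top] = mu_w[top]  # hard-threshold to k entries (Lemma 3.2)
    return out
```

Intuitively, points with a large quadratic form in the selected sparse directions (the outliers inflating Σ_w) receive a large gradient and lose weight at each step.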
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Robust Sparse PCA</head><p>We consider a spiked covariance model for sparse PCA. In this model, there is a direction v ∈ R^d with at most k nonzero entries. The good samples are drawn from a ground-truth distribution with covariance Σ = I + ρvv^⊤, where ρ &gt; 0 is a parameter that intuitively measures the strength of the signal. We consider the more interesting case when ρ ≤ 1 (if ρ is larger, the problem becomes easier).</p><p>To solve the sparse PCA problem, we consider the following optimization problem, where</p><p>The objective function minimizes the Frobenius norm of the largest 2k² entries of a reweighted second-moment matrix M_w. Note that f(w) is actually convex in w, because the matrix M_w is linear in w and the ‖·‖_{F,2k²} norm is convex.</p><p>Let R be the support of vv^⊤. Intuitively, the k² entries in R could be large due to the spiked covariance. By minimizing the norm of the largest 2k² entries, we hope to make the entries outside of R very small. Our algorithm is given in Algorithm 2.</p><p>Algorithm 2: Robust sparse PCA.</p><p>Input: k &gt; 0, 0 &lt; ε &lt; ε₀, and an ε-corrupted set of samples (X_i)_{i=1}^n drawn from a distribution with covariance I + ρvv^⊤ for a k-sparse unit vector v. Output: a vector u that is close to v.</p><p>1: Find a first-order stationary point w ∈ Δ_{n,ε} of the objective min_w f(w) = ‖M_w − I‖_{F,2k²}. 2: Let A = M_w − I. Let Q be the k² entries of A with largest magnitude.</p><p>3: Return u = the top eigenvector of (A_Q + A_Q^⊤).</p><p>Theorem 4.1. Let 0 &lt; ρ ≤ 1, 0 &lt; ε &lt; ε₀, and δ &gt; ε. Let G be a set of n samples that is (k, ε, δ)-stable (as in Definition 2.2) w.r.t. a centered distribution with covariance I + ρvv^⊤ for an unknown k-sparse unit vector v ∈ R^d. Let S = (X_i)_{i=1}^n be an ε-corrupted version of G. 
Algorithm 2 outputs a vector u such that ‖uu^⊤ − vv^⊤‖_F = O(√δ/ρ).</p><p>Theorem 1.3 is an immediate corollary of Theorem 4.1.</p><p>We can apply Theorem 4.1 to get an end-to-end result for subgaussian distributions. Algorithm 2 requires the stability conditions (Definition 2.2) to hold for the original good samples G. We show that these conditions are satisfied with a small number of samples. Lemma 4.2. Let 0 &lt; ρ ≤ 1 and 0 &lt; ε &lt; ε₀. Let D be a centered subgaussian distribution with covariance I + ρvv^⊤ for a k-sparse unit vector v ∈ R^d. Let G be a set of n = Ω(k² log d/δ²) samples drawn from D. Then with probability at least 1 − exp(−Ω(k² log d)), G is (k, ε, δ)-stable (as in Definition 2.2) w.r.t. D for δ = O(ε log(1/ε)).</p><p>Combining Theorem 4.1 and Lemma 4.2, given as input an ε-corrupted set of n = Ω(k² log d/ε²) samples drawn from a centered subgaussian distribution with covariance I + ρvv^⊤, Algorithm 2 returns a vector u with ‖uu^⊤ − vv^⊤‖_F = O(√(ε log(1/ε))/ρ).</p><p>We defer the proofs of Lemma 4.2 and Theorem 4.1 to Appendix B and give an overview of the proof of Theorem 4.1.</p></div>			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>An implementation of our algorithms is available at https://github.com/guptashvm/Sparse-GD.</p><p>36th Conference on Neural Information Processing Systems (NeurIPS 2022).</p></note>
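The spectral rounding in steps 2 and 3 of Algorithm 2 is straightforward to implement. A minimal sketch (the helper name is ours; we take the eigenvector of the largest algebraic eigenvalue, which is the relevant one since ρ &gt; 0):

```python
import numpy as np

def pca_direction_from(A, k):
    """Steps 2-3 of Algorithm 2: keep the k^2 largest-magnitude entries
    of A = M_w - I, symmetrize, and return the top eigenvector."""
    d = A.shape[0]
    keep = np.argsort(np.abs(A).ravel())[-k * k:]  # indices Q
    Q = np.zeros(d * d, dtype=bool)
    Q[keep] = True
    A_Q = np.where(Q.reshape(d, d), A, 0.0)
    vals, vecs = np.linalg.eigh(A_Q + A_Q.T)
    return vecs[:, -1]  # eigenvector of the largest eigenvalue
```

When A is close to ρvv^⊤, the retained block is essentially supported on R (the support of vv^⊤), so the top eigenvector of the symmetrization aligns with v up to sign.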
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_2"><p>For two sets of samples S and T , we say S is an ε-corrupted version of T if |S| = |T | and |S \ T | ≤ ε|S|.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_3"><p>Without loss of generality, we can assume that ε is given to the algorithm. This is because we can run a binary search to determine ε: if our guess of ε is too small, then the algorithm will output a w whose objective value f(w) is much larger than it should be.</p></note>
		</body>
		</text>
</TEI>
