<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Distributed stochastic algorithms for high-rate streaming principal component analysis</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date when="2022-10-30">10/30/2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10401063</idno>
					<idno type="doi"></idno>
					<title level='j'>Transactions on machine learning research</title>
<idno type="ISSN">2835-8856</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Haroon Raja</author><author>Waheed U. Bajwa</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[This paper considers the problem of estimating the principal eigenvector of a covariance matrix from independent and identically distributed data samples in streaming settings. The streaming rate of data in many contemporary applications can be high enough that a single processor cannot finish an iteration of existing methods for eigenvector estimation before a new sample arrives. This paper formulates and analyzes a distributed variant of the classical Krasulina's method (D-Krasulina) that can keep up with the high streaming rate of data by distributing the computational load across multiple processing nodes. The analysis improves upon the one in (Balsubramani et al., 2013) for the original Krasulina's method and shows that, under appropriate conditions, D-Krasulina converges to the principal eigenvector in an order-wise optimal manner; i.e., after receiving M samples across all nodes, its estimation error can be O(1/M). In order to reduce the network communication overhead, the paper also develops and analyzes a mini-batch extension of D-Krasulina, which is termed DM-Krasulina. The analysis of DM-Krasulina shows that it can also achieve order-optimal estimation error rates under appropriate conditions, even when some samples have to be discarded within the network due to communication latency. Finally, experiments are performed over synthetic and real-world data to validate the convergence behaviors of D-Krasulina and DM-Krasulina in high-rate streaming settings. Note that it is possible to improve the probability of success for Oja's rule, as derived in (Jain et al., 2016; Yang et al., 2018), to 1 − δ for δ ∈ (0, 1), by running O(log(1/δ)) instances of the algorithm and combining their outcomes. But such a strategy, which also adds to the computational and storage overhead because of the 'combine' step, is impractical for the streaming settings considered in this paper. 
Additional differences between the results of this paper pertaining to the centralized PCA problem for streaming data and those of (Jain et al., 2016; Yang et al., 2018) are discussed in Section 1.3. It is, however, important to point out that this work and (Jain et al., 2016; Yang et al., 2018) are analyzing two related, yet different, algorithmic approaches that are based on Krasulina's method and Oja's rule, respectively, and their results are therefore complementary in nature.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Dimensionality reduction and feature learning methods such as principal component analysis (PCA), sparse PCA, independent component analysis, and autoencoders form an important component of any machine learning pipeline. For data lying in a d-dimensional space, such methods try to find the k &#8810; d variables/features that are most relevant for solving an application-specific task (e.g., classification, regression, estimation, data compression, etc.). The focus of this work is on PCA, where the objective is to compute k features that capture most of the variance in the data. The proliferation of big data (both in terms of dimensionality and number of samples) has resulted in an increased interest in developing new algorithms for PCA because classical numerical solutions (e.g., the power iteration and Lanczos method <ref type="bibr">(Golub &amp; Van Loan, 2012)</ref>) for computing eigenvectors of symmetric matrices do not scale well with high dimensionality and large sample sizes. The main interest in this regard has been in developing algorithms that are cheap in terms of both memory and computational requirements as a function of the dimensionality and the number of data samples.</p><p>In addition to high dimensionality and a large number of samples, another defining characteristic of modern data is their streaming nature in many applications; examples of such applications include the internet-of-things, high-frequency trading, meteorology, video surveillance, autonomous vehicles, social media analytics, etc. Several stochastic methods have been developed in the literature to solve the PCA problem in streaming settings <ref type="bibr">(Krasulina, 1969;</ref><ref type="bibr">Oja &amp; Karhunen, 1985;</ref><ref type="bibr">Sanger, 1989;</ref><ref type="bibr">Warmuth &amp; Kuzmin, 2007;</ref><ref type="bibr">Zhang &amp; Balzano, 2016)</ref>. 
These methods operate under the implicit assumption that the data arrival rate is slow enough that each sample can be processed before the arrival of the next one. But this may not be true for many modern applications involving high-rate streaming data. To overcome this obstacle posed by high-rate streaming data, this paper proposes and analyzes distributed and distributed mini-batch variants of the classical Krasulina's method <ref type="bibr">(Krasulina, 1969)</ref>. Before providing details of the proposed methods and their relationship to prior work, we provide a brief overview of the streaming PCA problem.</p></div>
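As a concrete illustration of the per-sample, O(d) updates performed by such stochastic methods, the following is a sketch of Oja's rule on a synthetic stream. The data model, seed, and step-size schedule γ_t = 1/t here are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def oja_step(v, x, gamma):
    """One Oja's-rule update from a single sample x, in O(d) time and memory.
    The scalar x @ v is computed first, so the d x d matrix x x^T is never formed."""
    v = v + gamma * x * (x @ v)
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
d, T = 50, 2000
true_v = np.eye(d)[0]                    # top eigenvector of the stream's covariance
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
for t in range(1, T + 1):
    x = rng.standard_normal(d)
    x[0] *= 4.0                          # inflate variance along e_1
    v = oja_step(v, x, gamma=1.0 / t)
print(abs(v @ true_v))                   # alignment with e_1; approaches 1
```

Each iteration touches only the d entries of the current sample, which is why such methods avoid both the O(d^2) memory of the sample covariance and the O(d^2 T) cost of batch methods.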
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Principal Component Analysis (PCA) from Streaming Data</head><p>For data lying in R^d, PCA learns a k-dimensional subspace with maximum data variance. Let x &#8712; R^d be a random vector that is drawn from some unknown distribution P_x with zero mean and covariance matrix &#931;. For the constraint set V := {V &#8712; R^{d&#215;k} : V^T V = I}, we can pose PCA as the following constrained optimization problem:</p><formula>max_{V &#8712; V} Tr(V^T &#931; V), (1)</formula><p>where Tr(.) denotes the trace operator. The solution for the statistical risk maximization problem (1) is the matrix Q* whose columns are the top k eigenvectors of &#931;. In practice, however, (1) cannot be solved in its current form since P_x is unknown. But if we have T data samples, {x_t}_{t=1}^T, drawn independently from P_x, then we can accumulate these data samples to calculate the sample covariance matrix as:</p><formula>&#256;_T := (1/T) &#8721;_{t=1}^T A_t, (2)</formula><p>where A_t := x_t x_t^T. Instead of solving (1), we can now solve an empirical risk maximization problem:</p><formula>max_{V &#8712; V} (1/T) &#8721;_{t=1}^T Tr(V^T A_t V). (3)</formula><p>In principle, we can solve (3) by computing the singular value decomposition (SVD) of the sample covariance &#256;_T. But this is a computationally intensive task that requires O(d^3) multiplications and that has a memory overhead of O(d^2). In contrast, the goal in high-dimensional PCA problems is often to have O(d^2 k) computational complexity and O(dk) memory complexity <ref type="bibr">(Li et al., 2016)</ref>.</p><p>More efficient (and hence popular) approaches for PCA use methods such as the power/orthogonal iteration and Lanczos method <ref type="bibr">(Golub &amp; Van Loan, 2012, Chapter 8)</ref>. Although these methods improve the overall computational complexity of PCA to O(d^2 k), they still have memory requirements on the order of O(d^2). In addition, these are batch methods that require computing the sample covariance matrix &#256;_T, which results in O(d^2 T) multiplication operations. 
Further, in streaming settings where the goal is real-time decision making from data, it is infeasible to compute &#256;_T. For these reasons, stochastic approximation methods such as Krasulina's method <ref type="bibr">(Krasulina, 1969)</ref> and Oja's rule <ref type="bibr">(Oja &amp; Karhunen, 1985)</ref> are often favored for the PCA problem. Both of these are simple and extremely efficient algorithms, achieving O(d) computational and memory complexity per iteration, for computing the principal eigenvector (i.e., k = 1) of a covariance matrix in streaming settings. Recent years in particular have seen an increased popularity of these algorithms, and we discuss these recent advances in Section 1.3. Oja's rule and Krasulina's method share many similarities. In this paper, we focus on Krasulina's method with the understanding that our findings can be mapped to Oja's rule through some tedious but straightforward calculations. Using t to index algorithmic iterations, Krasulina's method estimates the top eigenvector by processing one data sample in each iteration as follows:</p><formula>v_{t+1} = v_t + &#947;_t (x_t x_t^T v_t &#8722; (v_t^T x_t x_t^T v_t / &#8741;v_t&#8741;_2^2) v_t), (4)</formula><p>where &#947;_t denotes the step size. Going forward, we will use A_t in place of x_t x_t^T in expressions such as (4) for notational compactness. In practice, however, one should neither explicitly store A_t nor explicitly use it for calculation purposes.</p><p>Note that one can interpret Krasulina's method as a solution to an optimization problem. 
Using the Courant-Fischer minimax theorem <ref type="bibr">(Golub &amp; Van Loan, 2012, Theorem 8.1.2)</ref>, the top eigenvector computation (i.e., 1-PCA, which is the k = 1 version of (1)) can be posed as the following optimization problem:</p><formula>max_{v &#8800; 0} f(v) := v^T &#931; v / &#8741;v&#8741;_2^2. (5)</formula><p>In addition, the gradient of the function f(v) defined in ( <ref type="formula">5</ref>) is:</p><formula>&#8711;f(v) = (2/&#8741;v&#8741;_2^2)(&#931;v &#8722; (v^T &#931; v / &#8741;v&#8741;_2^2) v). (6)</formula><p>Looking at (4)-( <ref type="formula">6</ref>), we see that ( <ref type="formula">4</ref>) is very similar to applying stochastic gradient descent (SGD) to the nonconvex problem (5), with the only difference being the scaling factor of 1/&#8741;v&#8741;_2^2. Nonetheless, since (5) is a nonconvex problem and we are interested in the global convergence behavior of Krasulina's method, existing tools for the analysis of the standard SGD problem <ref type="bibr">(Bottou, 2010;</ref><ref type="bibr">Recht et al., 2011;</ref><ref type="bibr">Dekel et al., 2012;</ref><ref type="bibr">Reddi et al., 2016b;</ref><ref type="bibr">c)</ref> do not lend themselves to the fastest convergence rates for Krasulina's method. Despite its nonconvexity, however, (5) has a somewhat benign optimization landscape, and a whole host of algorithmic techniques and analytical tools have been developed for such structured nonconvex problems in recent years that guarantee fast convergence to a global solution. In this paper, we leverage some of these recent developments to guarantee near-optimal global convergence of two variants of Krasulina's method in the case of high-rate streaming data. Before proceeding further, it is worth noting that while Krasulina's method primarily focuses on the 1-PCA problem, it can be used to solve the k-PCA problem. But such an indirect approach, which involves repeated use of Krasulina's method k times, can be inefficient in terms of sample complexity <ref type="bibr">(Allen-Zhu &amp; Li, 2017a</ref>, Section 1). 
We leave the investigation of a near-optimal direct method for the k-PCA problem with high-rate streaming data to future work.</p></div>
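The update (4) can be sketched in a few lines; the point of the sketch is that A_t = x_t x_t^T is applied implicitly, so no d &#215; d matrix is ever stored. The synthetic stream, seed, and step-size schedule below are illustrative assumptions, not choices made in the paper:

```python
import numpy as np

def krasulina_step(v, x, gamma):
    """One iteration of Krasulina's method, eq. (4):
    v <- v + gamma * (A_t v - (v^T A_t v / ||v||^2) v), with A_t = x x^T.
    The scalar x @ v is computed first, so A_t is never formed explicitly."""
    xv = x @ v                       # x_t^T v_t (scalar)
    Av = x * xv                      # A_t v_t = x (x^T v), an O(d) operation
    rayleigh = (v @ Av) / (v @ v)    # v^T A_t v / ||v||_2^2
    return v + gamma * (Av - rayleigh * v)

rng = np.random.default_rng(1)
d = 30
v = rng.standard_normal(d)
for t in range(1, 5001):
    x = rng.standard_normal(d)
    x[0] *= 3.0                      # top eigenvector of the covariance is ~ e_1
    v = krasulina_step(v, x, gamma=1.0 / (t + 10))
v_hat = v / np.linalg.norm(v)
print(abs(v_hat[0]))                 # alignment with e_1
```

Note that the update direction is orthogonal to the current iterate (its inner product with v is zero by construction), so unlike Oja's rule no explicit renormalization is performed inside the loop.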
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">Our Contributions</head><p>In this paper, we propose and analyze two distributed variants of Krasulina's method for estimating the top eigenvector of a covariance matrix from fast streaming, independent and identically distributed (i.i.d.) data samples. Our theoretical analysis, as well as numerical experiments on synthetic and real data, establish near-optimality of the proposed algorithms. In particular, our analysis shows that the proposed algorithms can achieve the optimal convergence rate of O(1/M) for 1-PCA after processing a total of O(M) data samples (see <ref type="bibr">(Jain et al., 2016, Theorem 1.1)</ref> and <ref type="bibr">(Allen-Zhu &amp; Li, 2017a, Theorem 6)</ref>). In terms of details, the following are our key contributions: 1. Our first contribution corresponds to the scenario in which there is a mismatch factor of N &#8712; Z_+, N &gt; 1, between the data streaming rate and the processing capability of a single processor, i.e., one iteration of Krasulina's method on one processor takes as long as N data arrival epochs. Our solution to this problem, which avoids discarding of samples, involves splitting the data stream into N parallel streams that are then input to N interconnected processors. Note that this splitting effectively reduces the streaming rate at each processor by a factor of N. We then propose and analyze a distributed variant of Krasulina's method-termed D-Krasulina-that solves the 1-PCA problem for this distributed setup consisting of N processing nodes. Our analysis substantially improves the one in <ref type="bibr">(Balsubramani et al., 2013)</ref> for Krasulina's method and shows that D-Krasulina can result in an improved convergence rate of O(1/Nt) after t iterations (Theorem 1), as opposed to the O(1/t) rate for the classical Krasulina's method at any one of the nodes seen in isolation. 
Establishing this result involves a novel analysis of Krasulina's method that brings out the dependence of its convergence rate on the variance of the sample covariance matrix; this analysis, coupled with a variance reduction argument, leads to the convergence rate of O(1/Nt) for D-Krasulina under appropriate conditions.</p><p>2. Mini-batching of data samples has long been promoted as a strategy in stochastic methods to reduce the wall-clock time. Too large of a mini-batch, however, can have an adverse effect on the algorithmic performance; see, e.g., <ref type="bibr">(Shamir &amp; Srebro, 2014, Sec. VIII)</ref>. One of the challenges in mini-batched stochastic methods, therefore, is characterizing the mini-batch size that leads to near-optimal convergence rates in terms of the number of processed samples. In <ref type="bibr">(Agarwal &amp; Duchi, 2011;</ref><ref type="bibr">Cotter et al., 2011;</ref><ref type="bibr">Dekel et al., 2012;</ref><ref type="bibr">Shamir &amp; Srebro, 2014;</ref><ref type="bibr">Ruder, 2016;</ref><ref type="bibr">Golmant et al., 2018;</ref><ref type="bibr">Goyal et al., 2017)</ref>, for example, the authors have focused on this challenge for the case of mini-batch SGD for convex and nonconvex problems. In the case of nonconvex problems, however, the guarantees only hold for convergence to first-order stationary points. In contrast, our second contribution is providing a comprehensive understanding of the global convergence behavior of mini-batched Krasulina's method. In fact, our analysis of D-Krasulina is equivalent to that of a mini-batch (centralized) Krasulina's method that uses a mini-batch of N samples in each iteration. This analysis, therefore, guarantees a near-optimal convergence rate with arbitrarily high probability for an appropriately mini-batched Krasulina's method in a centralized setting. 
This is in contrast to <ref type="bibr">(Jain et al., 2016;</ref><ref type="bibr">Yang et al., 2018)</ref>, where the focus is on Oja's rule and the probability of success is upper bounded by 3/4 for a single algorithmic run. 2 In addition, in the case of high-rate streaming data that requires splitting the data stream into N parallel ones, we characterize the global convergence behavior of a mini-batch generalization of D-Krasulina-termed DM-Krasulina-in terms of the mini-batch size. This involves specifying the conditions under which mini-batches of size B/N per node can lead to a near-optimal convergence rate of O(1/Bt) after t iterations of DM-Krasulina (Theorem 2). An implication of this analysis is that for a fixed (network-wide) sample budget of T samples, DM-Krasulina can achieve the O(1/T) rate after t := T/B iterations provided the (network-wide) mini-batch size B satisfies B = O(T^{1-2/c_0}) for some constant c_0 &gt; 2 (Corollary 1).</p><p>3. Our next contribution is an extended analysis of DM-Krasulina that concerns the scenario where (computational and/or communication) resource constraints translate into individual nodes still receiving more data samples than they can process in one iteration of DM-Krasulina. This resource-constrained setting necessitates DM-Krasulina dropping &#181; &#8712; Z_+ samples across the network in each iteration. Our analysis in this setting shows that such loss of samples need not result in sub-optimal performance. In particular, DM-Krasulina can still achieve a near-optimal convergence rate as a function of the number of samples arriving in the network-for both infinite-sample and finite-sample regimes-as long as &#181; = O(B) (Corollary 2). 4. We provide numerical results involving both synthetic and real-world data to establish the usefulness of the proposed algorithms, validate our theoretical analysis, and understand the impact of the number of dropped samples per iteration of DM-Krasulina on the convergence rate. 
These results in particular corroborate our findings that increasing the mini-batch size improves the performance of DM-Krasulina up to a certain point, after which the convergence rate starts to decrease. Since the focus of this work is on systems theory, the reported results do not focus on some of the large-scale system implementation issues such as unexpected processor failures, high network coordination costs, etc. Such large-scale implementation issues, while relevant from a practical perspective, are beyond the scope of this paper and provide interesting research directions for future work.</p></div>
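The network-wide effect of the mini-batching described above can be sketched serially: each iteration averages the action of B sample covariances (B/N samples per node across N nodes) before performing a single Krasulina update. This is a schematic stand-in for the distributed protocol, not the authors' implementation; the data model, dimensions, and step sizes are illustrative assumptions:

```python
import numpy as np

def minibatch_krasulina_step(v, X, gamma):
    """One mini-batch Krasulina iteration: A_t is replaced by the average
    (1/B) sum_i x_i x_i^T of B sample covariances, applied to v without
    ever forming a d x d matrix. X has shape (B, d)."""
    Av = X.T @ (X @ v) / X.shape[0]      # (1/B) sum_i x_i (x_i^T v)
    rayleigh = (v @ Av) / (v @ v)
    return v + gamma * (Av - rayleigh * v)

rng = np.random.default_rng(2)
d, N, B_per_node, iters = 40, 4, 8, 400
B = N * B_per_node                       # network-wide mini-batch size
v = rng.standard_normal(d)
for t in range(1, iters + 1):
    # Serial stand-in for N nodes each drawing B/N samples and then
    # averaging their local stochastic terms over the network.
    X = rng.standard_normal((B, d))
    X[:, 0] *= 3.0                       # top eigenvector ~ e_1
    v = minibatch_krasulina_step(v, X, gamma=1.0 / t)
v_hat = v / np.linalg.norm(v)
print(abs(v_hat[0]))                     # alignment with e_1
```

Averaging over B samples reduces the variance of each stochastic update by a factor of B, which is the mechanism behind the improved per-iteration rates discussed above.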
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.3">Related Work</head><p>Solving the PCA problem efficiently in a number of settings has been an active area of research for decades. <ref type="bibr">(Krasulina, 1969;</ref><ref type="bibr">Oja &amp; Karhunen, 1985)</ref> are among the earliest and most popular methods to solve PCA in streaming data settings. Several variants of these methods have been proposed over the years, including <ref type="bibr">(Bin Yang, 1995;</ref><ref type="bibr">Chatterjee, 2005;</ref><ref type="bibr">Doukopoulos &amp; Moustakides, 2008)</ref>. Like earlier developments in stochastic approximation methods <ref type="bibr">(Robbins &amp; Monro, 1951)</ref>, such variants were typically shown to converge asymptotically. Convergence rate analyses for stochastic optimization in finite-sample settings <ref type="bibr">(Shapiro &amp; Homem-de Mello, 2000;</ref><ref type="bibr">Linderoth et al., 2006)</ref> paved the way for non-asymptotic convergence analysis of different variants of the stochastic PCA problem, which is fundamentally a nonconvex optimization problem. Within the context of such works, the results that are the most relevant to the algorithmic strategy devised in this paper can be found in <ref type="bibr">(Jain et al., 2016)</ref> and <ref type="bibr">(Yang et al., 2018)</ref>. The authors in these two papers provide variance-dependent convergence guarantees for Oja's rule in the finite-sample regime, thereby making their results also translatable to the algorithmic framework being considered in this paper for high-rate streaming PCA. However, as noted earlier, the results derived in <ref type="bibr">(Jain et al., 2016;</ref><ref type="bibr">Yang et al., 2018)</ref> only hold with probability 3/4, which is in contrast to the high-probability results of this paper. 
And while one could increase the probability of success in <ref type="bibr">(Jain et al., 2016;</ref><ref type="bibr">Yang et al., 2018)</ref> through multiple algorithmic runs, this is not a feasible strategy in the streaming settings.</p><p>In order to complement the pioneering results of <ref type="bibr">(Jain et al., 2016;</ref><ref type="bibr">Yang et al., 2018)</ref> in streaming settings, we shift our focus away from Oja's rule as the base algorithm and develop a different proof strategy that substantially extends and generalizes the analysis in <ref type="bibr">(Balsubramani et al., 2013)</ref> for Krasulina's method. The analysis in <ref type="bibr">(Balsubramani et al., 2013)</ref> assumes that the &#8467; 2 norm of the data samples is bounded by a positive constant, but it does not take into account the variance of the sample covariance matrices. Such an analysis leads to convergence results that are independent of the variance and hence are unable to capture any improvement in the convergence rate due to mini-batching and/or distributed implementations for computing the top eigenvector of a covariance matrix. We overcome this limitation of the analysis in <ref type="bibr">(Balsubramani et al., 2013)</ref> by developing a proof in this work that explicitly accounts for the variance of the sample covariance matrices. Note that this extension/generalization of the analysis in <ref type="bibr">(Balsubramani et al., 2013)</ref>-despite the intuitive nature of our final set of results-is a nontrivial task. 
There are in particular two main technical challenges that are addressed in this paper: (i) Using concentration of measure results that allow for incorporation of the variance within the analysis, as opposed to the Hoeffding inequality <ref type="bibr">(Boucheron et al., 2013)</ref> utilized in <ref type="bibr">(Balsubramani et al., 2013)</ref>; and (ii) Utilizing the variance-dependent concentration guarantees within two terms in Lemma 1, one of which depends on the norm bound on data samples and the other on the variance-a careful decoupling of these two quantities being critical to obtain convergence results in which the dominant term depends only on the variance. Finally, another aspect that distinguishes this paper from prior works such as <ref type="bibr">(Balsubramani et al., 2013;</ref><ref type="bibr">Jain et al., 2016;</ref><ref type="bibr">Yang et al., 2018)</ref> is that it provides a formal framework for studying the communications and computation tradeoffs involved in solving the 1-PCA problem in distributed streaming settings. This framework is described in detail in Section 3.2, with theoretical characterization of the proposed framework in terms of the communications and computation costs described in Section 4.2.</p><p>Because of the vastness of literature on (stochastic) PCA, this work is also tangentially or directly related to a number of additional prior works. We review some of these works in the following under the umbrellas of different problem setups, with the understanding that the resulting lists of works are necessarily incomplete. Much of our discussion in the following focuses on solving the PCA problem in (fast) streaming and distributed data settings, which is the main theme in this paper.</p><p>Sketching for PCA. Sketching methods have long been studied in the literature for solving problems involving matrix computations; see <ref type="bibr">(Woodruff, 2014)</ref> for a review of such methods. 
The main idea behind these methods is to compress data using either randomized or deterministic sketches and then perform computations on the resulting low-dimensional data. While sketching has been used as a tool to solve the PCA problem in an efficient manner (see, e.g., <ref type="bibr">(Warmuth &amp; Kuzmin, 2007;</ref><ref type="bibr">Halko et al., 2011;</ref><ref type="bibr">Liberty, 2013;</ref><ref type="bibr">Leng et al., 2015;</ref><ref type="bibr">Karnin &amp; Liberty, 2015)</ref>), the resulting methods cannot be used to exactly solve (1) in the fast streaming settings of this paper.</p><p>Online PCA. The PCA problem has also been extensively studied in online settings. While such settings also involve streaming data, the main goal in online PCA is to minimize the cumulative subspace estimation error over the entire time horizon of the algorithm. The online PCA framework, therefore, is especially useful in situations where either the underlying subspace changes over time or there is some adversarial noise in the sampling process. Some of the recent works in this direction include <ref type="bibr">(Garber et al., 2015;</ref><ref type="bibr">Allen-Zhu &amp; Li, 2017b;</ref><ref type="bibr">Garber, 2018;</ref><ref type="bibr">Marinov et al., 2018;</ref><ref type="bibr">Kot&#322;owski &amp; Neu, 2019)</ref>.</p><p>Stochastic convex optimization for PCA. One approach towards solving (3) in streaming settings is to relax the PCA problem to a convex optimization problem and then use SGD to solve the resulting stochastic convex optimization problem <ref type="bibr">(Arora et al., 2013;</ref><ref type="bibr">Garber &amp; Hazan, 2015;</ref><ref type="bibr">Nie et al., 2016)</ref>. The benefit of this approach is that now one can rely on rich literature for solving stochastic convex problems using SGD. 
But the tradeoff is that one now needs to store an iterate of dimension R^{d&#215;d}, as opposed to an iterate of dimension R^{d&#215;k} when we solve the PCA problem in its original nonconvex form. Due to these high memory requirements of O(d^2), we limit ourselves to solving PCA in the nonconvex form.</p><p>Streaming PCA and nonconvex optimization. The PCA problem in the presence of streaming data can also be tackled as an explicit constrained nonconvex optimization program <ref type="bibr">(Zhang &amp; Balzano, 2016;</ref><ref type="bibr">De Sa et al., 2015)</ref>. In <ref type="bibr">(Zhang &amp; Balzano, 2016)</ref>, for instance, the problem is solved as an optimization program over the Grassmannian manifold. The resulting analysis, however, relies on the availability of a good initial guess. In contrast, the authors in <ref type="bibr">(De Sa et al., 2015)</ref> analyze the use of SGD for solving certain nonconvex problems that include PCA. The resulting approach, however, requires the step size to be a very small constant for eventual convergence (e.g., 10^{-12} for the Netflix Prize dataset); this translates into slower convergence in practice.</p><p>Classical stochastic approximation methods for PCA. Recent years have seen an increased interest in understanding the global convergence behavior of classical stochastic approximation methods such as Krasulina's method <ref type="bibr">(Krasulina, 1969</ref>) and Oja's rule <ref type="bibr">(Oja &amp; Karhunen, 1985)</ref> for the PCA problem in nonasymptotic settings <ref type="bibr">(Allen-Zhu &amp; Li, 2017a;</ref><ref type="bibr">Chatterjee, 2005;</ref><ref type="bibr">Hardt &amp; Price, 2014;</ref><ref type="bibr">Shamir, 2015;</ref><ref type="bibr">2016;</ref><ref type="bibr">Jain et al., 2016;</ref><ref type="bibr">Li et al., 2016;</ref><ref type="bibr">Tang, 2019;</ref><ref type="bibr">Henriksen &amp; Ward, 2019;</ref><ref type="bibr">Amid &amp; Warmuth, 2019)</ref>. 
Some of these works, such as <ref type="bibr">(Shamir, 2015)</ref> and <ref type="bibr">(Shamir, 2016)</ref>, use variance reduction techniques to speed up the algorithmic convergence. Such works, however, require multiple passes over the data, which makes them ill-suited for fast streaming settings. The analysis in <ref type="bibr">(Shamir, 2015)</ref> and <ref type="bibr">(Shamir, 2016)</ref> also requires an initialization close to the true subspace, which is somewhat unlikely in practice. Among other works, the authors in <ref type="bibr">(Allen-Zhu &amp; Li, 2017a)</ref> provide eigengap-free convergence guarantees for Oja's rule. Since the results in this work do not take into account the variance of data samples, they do not generalize to mini-batch/distributed streaming settings. As stated earlier, the authors in <ref type="bibr">(Jain et al., 2016)</ref> do provide variance-dependent convergence guarantees for Oja's rule, which makes this work the most relevant to ours.</p><p>In particular, the authors in <ref type="bibr">(Yang et al., 2018)</ref> have extended the initial analysis in <ref type="bibr">(Jain et al., 2016)</ref> to mini-batch settings. But a limitation of the analysis in <ref type="bibr">(Jain et al., 2016;</ref><ref type="bibr">Yang et al., 2018)</ref> is that it guarantees convergence of Oja's rule only up to a probability of 3/4. And while the probability of success can be increased to 1 &#8722; &#948; by running and combining the outcomes of O(log(1/&#948;)) instances of Oja's rule, such an approach has two major drawbacks in streaming settings. First, since new data samples arrive continuously in a streaming setting, multiple runs of an algorithm in this case can only be achieved through multiple replicas of the processing system. Such a strategy, therefore, leads to a substantial increase in system costs. Second, the outcomes of the multiple runs need to be appropriately combined. 
In <ref type="bibr">(Jain et al., 2016)</ref>, it is suggested this be done by computing the geometric median of the multiple outcomes, which requires solving an additional optimization problem. This then adds to the computational and storage overhead for the PCA problem. We conclude by remarking on two key distinctions between our results and those in <ref type="bibr">(Jain et al., 2016;</ref><ref type="bibr">Yang et al., 2018)</ref>. First, the arbitrarily high probability of success in our analysis requires the initial step size in Krasulina's method to decrease with an increase in the probability; we refer the reader to the discussion following Theorem 1 in this paper for further details on this point. Second, our convergence guarantees have the flavor of 'convergence in mean' as opposed to the 'convergence in probability' nature of the results in <ref type="bibr">(Jain et al., 2016;</ref><ref type="bibr">Yang et al., 2018)</ref>. A straightforward application of Markov's inequality, however, leads to results that are directly comparable to the ones in <ref type="bibr">(Jain et al., 2016;</ref><ref type="bibr">Yang et al., 2018)</ref>; we refer the reader to Corollary 3 in Appendix D as an illustrative example of this.</p><p>Distributed PCA and streaming data. Several recent works such as <ref type="bibr">(Balcan et al., 2016;</ref><ref type="bibr">Boutsidis et al., 2016;</ref><ref type="bibr">Garber et al., 2017;</ref><ref type="bibr">De Sa et al., 2018)</ref> have focused on the PCA problem in distributed settings. Among these works, the main focus in <ref type="bibr">(Balcan et al., 2016;</ref><ref type="bibr">Boutsidis et al., 2016;</ref><ref type="bibr">Garber et al., 2017)</ref> is on improving the communications efficiency. 
This is accomplished in <ref type="bibr">(Balcan et al., 2016;</ref><ref type="bibr">Boutsidis et al., 2016)</ref> by sketching the local iterates and communicating the resulting compressed iterates to a central server in each iteration. In contrast, <ref type="bibr">(Garber et al., 2017)</ref> provides a batch solution in which every node in the network first computes the top eigenvector of its local (batch) covariance matrix and then, as a last step of the algorithm, all the local eigenvector estimates are summed up at a central server to provide an eigenvector estimate for the global covariance matrix. In contrast to these works, our focus in this paper is on establishing that distributed (mini-batch) variants of stochastic approximation methods such as Oja's rule and Krasulina's method can lead to improved convergence rates, as a function of the number of samples, for the PCA problem in fast streaming settings. In this regard, our work is more closely related to <ref type="bibr">(De Sa et al., 2018)</ref>, where the authors use the momentum method to accelerate convergence of the power method and further extend their work to stochastic settings. However, the approach of <ref type="bibr">(De Sa et al., 2018)</ref> relies on a variance reduction technique that requires a pass over the complete dataset every once in a while; this is impractical in streaming settings. In addition, theoretical guarantees in <ref type="bibr">(De Sa et al., 2018)</ref> are based on the assumption of a "good" initialization; further, an implicit assumption in <ref type="bibr">(De Sa et al., 2018)</ref> is that inter-node communication is fast enough that there are no communication delays.</p><p>Decentralized PCA. Another important practical setting under which the PCA problem has been studied is when the data are distributed across an interconnected set of nodes that form a connected graph, but not necessarily a complete (fully connected) one. 
We refer the reader to <ref type="bibr">(Blondel et al., 2005;</ref><ref type="bibr">Aysal et al., 2009;</ref><ref type="bibr">Khan et al., 2009;</ref><ref type="bibr">Dimakis et al., 2010)</ref> for a general understanding of this decentralized setting. Some contributions to the PCA problem in this setting are <ref type="bibr">(Kempe et al., 2008;</ref><ref type="bibr">Li et al., 2011;</ref><ref type="bibr">Korada et al., 2011;</ref><ref type="bibr">Wu et al., 2017;</ref><ref type="bibr">2018;</ref><ref type="bibr">Gang et al., 2021;</ref><ref type="bibr">Gang &amp; Bajwa, 2021;</ref><ref type="bibr">2022)</ref>. Most of these contributions consider batch data settings and their extensions to the streaming setting are not evident. And while contributions such as <ref type="bibr">(Li et al., 2011)</ref> do consider the streaming data setting, the convergence rates established in such works are asymptotic in nature and the theoretical analyses cannot be used to derive conditions for a linear speed-up in convergence rates in the mini-batch setting.</p><p>Connections to low-rank matrix approximation. The low-rank matrix approximation problem <ref type="bibr">(Frieze et al., 2004;</ref><ref type="bibr">Clarkson &amp; Woodruff, 2017)</ref>, which involves computing a low-rank approximation of a given matrix, is closely related to the PCA problem. The overarching goal in both problems is the same: find a subspace that best approximates the data samples / given matrix. In the case of PCA, however, the focus is fundamentally on finding an orthogonal basis for the subspace that is precisely given by the top eigenvectors of the data covariance matrix. Notwithstanding this difference between the two problems, <ref type="bibr">(Yun et al., 2015;</ref><ref type="bibr">Tropp et al., 2019)</ref> are two works within the low-rank matrix approximation literature that are the most related to this paper. 
The setting in <ref type="bibr">(Yun et al., 2015)</ref> corresponds to a large-scale but fixed matrix whose columns are presented to a low-rank approximation algorithm in a streaming manner. This is in contrast to the streaming setting of this work, which is akin to having a matrix with infinitely many columns. In addition, the algorithm being studied in <ref type="bibr">(Yun et al., 2015)</ref> requires computing the top principal component directions of a random submatrix of the larger matrix in a batch setting as its first step, which is again a departure from the streaming data setting of this work. Next, the mathematical model in <ref type="bibr">(Tropp et al., 2019)</ref> corresponds to a matrix that is given by the sum of a long sequence of low-rank and sparse matrices that are presented to a low-rank approximation algorithm in a streaming manner. This summation aspect of the data model in <ref type="bibr">(Tropp et al., 2019)</ref> does not coincide with the mathematical model of the streaming data samples in this work. Finally, and in stark contrast to this paper, neither of these works is concerned with the interplay between the streaming data rate, data processing rate, and the number of interconnected processors.</p><p>Connections to stochastic nonconvex optimization. Recent years have also seen an increased focus on understanding (variants of) SGD for general (typically unconstrained) stochastic nonconvex optimization problems.
Among such works, some have focused on classical SGD <ref type="bibr">(Ge et al., 2015;</ref><ref type="bibr">Hazan et al., 2016;</ref><ref type="bibr">2017;</ref><ref type="bibr">Li et al., 2021;</ref><ref type="bibr">Mokhtari et al., 2017)</ref>, some have studied variance-reduction variants of SGD <ref type="bibr">(Reddi et al., 2016a;</ref><ref type="bibr">b;</ref><ref type="bibr">Qureshi et al., 2021)</ref>, and some have investigated accelerated variants of stochastic nonconvex optimization <ref type="bibr">(Allen-Zhu, 2018b;</ref><ref type="bibr">a)</ref>. In particular, works such as <ref type="bibr">(Reddi et al., 2016a;</ref><ref type="bibr">Allen-Zhu &amp; Hazan, 2016)</ref> are directly relevant to this paper since these works also use mini-batches to reduce sample variance and improve on SGD convergence rates. While (implicit, through the distributed framework, and explicit) mini-batching is one of the key ingredients of our work also, this paper differs from such related works because of its ability to prove convergence to a global optimum of the 1-PCA problem. In contrast, aforementioned works only provide guarantees for convergence to first-order stationary points of (typically unconstrained) stochastic nonconvex optimization problems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.4">Notational Convention and Paper Organization</head><p>We use lower-case (a), bold-faced lower-case (a), and bold-faced upper-case (A) letters to represent scalars, vectors, and matrices, respectively. Given a scalar a and a vector a, &#8968;a&#8969; denotes the smallest integer greater than or equal to a, while &#8741;a&#8741; 2 denotes the &#8467; 2 -norm of a. Given a matrix A, &#8741;A&#8741; 2 denotes its spectral norm and &#8741;A&#8741; F denotes its Frobenius norm. In addition, assuming A &#8712; R d&#215;d to be a positive semi-definite matrix, &#955; i (A) denotes its i-th largest eigenvalue, i.e., &#8741;A&#8741; 2 = &#955; 1 (A) &#8805; &#955; 2 (A) &#8805; . . . &#8805; &#955; d (A) &#8805; 0.</p><p>Whenever obvious from the context, we drop A from &#955; i (A) for notational compactness. Finally, E{&#8226;} denotes the expectation operator, where the underlying probability space (&#8486;, F, P) is either implicit from the context or is explicitly pointed out in the body.</p><p>The rest of this paper is organized as follows. We first provide a formal description of the problem and the system model in Section 2. The two proposed variants of Krasulina's method that can be used to solve the 1-PCA problem in fast streaming settings are then presented in Section 3. In Section 4, we provide theoretical guarantees for the proposed algorithms, while proofs / outlines of the proofs of the main theoretical results are provided in Section 5. Finally, numerical results using both synthetic and real-world data are presented in Section 6, while appendices are used for detailed proofs of some of the theoretical results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Problem Formulation and System Model</head><p>Our goal is to use some variants of Krasulina's method (cf. ( <ref type="formula">4</ref>)) in order to obtain an estimate of the top eigenvector of a covariance matrix from independent and identically distributed (i.i.d.) data samples that are fast streaming into a system. The algorithms proposed in this regard and their convergence analysis rely on the following sets of assumptions concerning the data and the system.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Data Model</head><p>We consider a streaming data setting where a new data sample x t &#8242; &#8712; R d independently drawn from an unknown distribution P x arrives at a system at each sampling time instance t &#8242; . We assume a uniform data arrival rate of R s samples per second and, without loss of generality, take the data arrival index t &#8242; &#8805; 1 to be an integer. We also make the following assumptions concerning our data, which aid in our convergence analysis.</p><p>[A1] (Zero-mean, norm-bounded samples) Without loss of generality, the data samples have zero mean, i.e., E Px {x t &#8242; } = 0. In addition, the data samples are almost surely bounded in norm, i.e., &#8741;x t &#8242; &#8741; 2 &#8804; r, where we let the bound r &#8805; 1 without loss of generality.</p><p>[A2] (Spectral gap of the covariance matrix) The largest eigenvalue &#955; 1 of the population covariance matrix &#931; := E Px {x t &#8242; x T t &#8242; } is strictly greater than its second-largest eigenvalue &#955; 2 , i.e., the eigengap &#955; 1 -&#955; 2 &gt; 0.</p><p>Note that both Assumptions [A1] and [A2] are standard in the literature for convergence analysis of Krasulina's method and Oja's rule (cf. <ref type="bibr">(Balsubramani et al., 2013;</ref><ref type="bibr">Oja &amp; Karhunen, 1985;</ref><ref type="bibr">Allen-Zhu &amp; Li, 2017a;</ref><ref type="bibr">Jain et al., 2016)</ref>).</p><p>We also associate with each data sample x t &#8242; a rank-one random matrix A t &#8242; := x t &#8242; x T t &#8242; , which is a trivial unbiased estimate of the population covariance matrix &#931;. We then define the variance of this unbiased estimate as follows.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition 1 (Variance of sample covariance matrix). We define the variance of the sample covariance matrix A t &#8242; as &#963; 2 := E Px {&#8741;A t &#8242; -&#931;&#8741; 2 F }.</head><p>Note that all moments of the probability distribution P x exist by virtue of the norm boundedness of x t &#8242; (cf. Assumption [A1]). The variance &#963; 2 of the sample covariance matrix A t &#8242; as defined above, therefore, exists and is finite.</p></div>
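The variance reduction that underlies the later convergence results can be checked numerically. Below is a minimal sketch (not from the paper), assuming the variance in Definition 1 is measured in the Frobenius norm and using an illustrative Gaussian data model; it estimates &#963; 2 empirically and verifies that averaging N i.i.d. rank-one matrices shrinks this variance roughly by a factor of N, which is the identity &#963; 2 N = &#963; 2 /N used in Section 4.1. All dimensions and eigenvalues are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
# Population covariance with an explicit spectral gap lambda_1 - lambda_2 > 0.
eigvals = np.array([3.0, 1.5, 1.0, 0.5, 0.25])
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
Sigma = Q @ np.diag(eigvals) @ Q.T

# Zero-mean samples (norm-boundedness [A1] would require truncation, omitted here).
X = rng.multivariate_normal(np.zeros(d), Sigma, size=20000)

# sigma^2 := E ||x x^T - Sigma||_F^2, estimated by a sample average.
devs = [np.linalg.norm(np.outer(x, x) - Sigma, "fro") ** 2 for x in X]
sigma2 = float(np.mean(devs))

# Averaging N i.i.d. rank-one matrices reduces the variance by a factor of N.
N = 8
batches = X.reshape(-1, N, d)
devs_N = [np.linalg.norm(sum(np.outer(x, x) for x in b) / N - Sigma, "fro") ** 2
          for b in batches]
sigma2_N = float(np.mean(devs_N))

ratio = sigma2 / sigma2_N  # should concentrate near N = 8
print(ratio)
```

The ratio fluctuates around N because both variances are themselves Monte-Carlo estimates; the identity &#963; 2 N = &#963; 2 /N is exact in expectation.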
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Splitter</head><p>[Figure 1: Schematic of the high-rate streaming setups. (a) A data splitter divides the incoming stream x 1 , x 2 , . . . into N parallel streams, with processor i receiving the substream x i , x i+N , x i+2N , . . . (b) An equivalent setting in which the data naturally arrives in a distributed manner at the N nodes.]</p><p>The two algorithms proposed in this paper, namely, D-Krasulina and DM-Krasulina, are initialized with a random vector v 0 &#8712; R d that is randomly generated over the unit sphere in R d with respect to the uniform (Haar) measure. All analysis in this paper is with respect to the natural probability space (&#8486;, F, P) given by the stochastic process (v 0 , x 1 , x 2 , . . . ) and filtered versions of this probability space.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">System Model</head><p>Let R p denote the number of data samples that a single processing node in the system can process in one second using an iteration of the form (4). <ref type="foot">3</ref> The focus of this paper is on the high-rate streaming setting, which corresponds to the setup in which the data arrival rate R s is strictly greater than the data processing rate R p . A naive approach to deal with this computation-streaming mismatch is to discard all but a 1/&#945; fraction of the samples arriving each second, where &#945; := R s /R p &gt; 1. Such an approach, however, slows down the convergence rate by the same factor of &#945;. We pursue an alternative to this approach in the paper that involves the simultaneous use of N &#8805; &#8968;&#945;&#8969; interconnected processors, each individually capable of processing R p samples per second, within the system. In particular, we advocate the use of such a network of N processors in the following two manners to achieve near-optimal convergence rates (as a function of the number of samples arriving at the system) for estimates of the top eigenvector of &#931; in high-rate streaming settings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.1">Distributed Processing Over a Network of Processors</head><p>We assume the fast data stream terminates into a data splitter, which splits the original stream with data rate R s samples per second into N parallel streams, each with data rate R s /N samples per second, that are then synchronously input to the interconnected network of N processors; see Figure <ref type="figure">1</ref>(a) for a schematic rendering of such splitting. In order to simplify notation in this setting, we reindex the data samples associated with the i-th processor / data stream in the following as {x i,t } t&#8712;Z+ , where the reindexing map (i, t) &#8594; t &#8242; is simply defined as t &#8242; = i + (t -1)N .</p><p>We also assume the network of processors implements some message passing protocol that allows it to compute sums of locally stored vectors, i.e., N i=1 a i for the set of local vectors {a i } N i=1 , within the network. This could, for instance, be accomplished using either Reduce or AllReduce primitives within most message passing implementations. We let R c (N ) denote the number of these primitive (sum) operations that the message passing protocol can carry out per second in the network of N processors. Note that, in addition to the number of nodes N , this parameter also depends upon the problem dimension d, message passing implementation, network topology, and inter-node communications bandwidth, all of which are being abstracted here through R c (N ). We will also be suppressing the dependence of the parameter R c on N within the notation beyond this section for ease of exposition.</p><p>Data splitting among this network of N processors effectively slows down the data streaming rate at each processing node by a factor of N . 
It is under this system model that we present a distributed variant of Krasulina's method, termed D-Krasulina, in Section 3.1 that involves two key operations after each round of splitting of the N samples: (i) per-processor computation of the form (4), which requires 1/R p seconds for completion, and (ii) a network-wide vector-sum operation, which requires an additional 1/R c (N ) seconds for completion. Under such an algorithmic framework, we can express the effective network-wide data processing rate, denoted by R e , in terms of the per-node data processing rate R p and the network-wide sum rate R c (N ) by noticing that the effective time it takes to process one sample within the network is given by T e = 1/(N R p ) + 1/R c (N ).</p><p>Here, the first algorithmic operation gives rise to the first term in the above expression, since it takes 1/R p seconds for computation of the form (4) for N samples, and the network-wide vector-sum operation gives rise to the second term. It then follows that the effective data processing rate is R e = 1/T e = N R p R c (N )/(N R p + R c (N )). In the high-rate streaming setting, D-Krasulina can only operate as long as the streaming rate R s does not exceed the effective processing rate R e , i.e., R s &#8804; R e .
This gives rise to the following condition on the number of processors N that enables our algorithmic framework to cope with the high-rate streaming data: N &#8805; R s R c (N )/(R p (R c (N ) -R s )), assuming R c (N ) &gt; R s .</p><p>Our developments in the following will be based on the assumption that the condition on N in ( <ref type="formula">7</ref>) is met.</p><p>The main analytical challenge for us is understanding the scenarios under which distributed processing over a network of N processors using D-Krasulina still yields near-optimal performance; we address this challenge in Section 4.1 by improving on the analysis in <ref type="bibr">(Balsubramani et al., 2013)</ref> for Krasulina's method.</p><p>We conclude by remarking on the nature of the lower bound in (7) on the number of processors N , given that the bound itself is a function of N through its dependence on the parameter R c (N ) that is expected to decrease with an increase in N . The high-performance computing community has access to copious amounts of trace data for different parallel computing environments that allows one to relate the parameter R c (N ) for a given dimension d to the number of processors N ; see, e.g., <ref type="bibr">(Kavulya et al., 2010)</ref>. One such relationship could be, for instance, that R c (N ) &#8733; 1/N &#954; for some &#954; &#8712; (0, 1]. The final bound on the number of processors N can then be obtained by plugging such a relationship into the provided bound.<ref type="foot">foot_2</ref> </p><p>Remark 1. It is straightforward to see that our developments in this paper are also applicable to the setting in which data naturally arrives in a distributed manner at N different nodes, as in Figure <ref type="figure">1</ref>(b). In addition, our analysis of D-Krasulina is equivalent to that of a mini-batch Krasulina's method running on a powerful-enough single processor that uses a mini-batch of N samples in each iteration.</p></div>
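The rate bookkeeping above can be made concrete with a short sketch. This is an illustrative calculation, not code from the paper: it computes the effective per-sample time T e = 1/(N R p ) + 1/R c , the resulting network-wide rate R e , and the smallest N satisfying R s &#8804; R e . For simplicity, R c is treated here as a constant in N, whereas the paper allows it to decay with N (e.g., R c (N ) &#8733; 1/N &#954; ); all rate values are made up for illustration.

```python
import math

def effective_rate(N, R_p, R_c):
    # T_e = 1/(N R_p) + 1/R_c seconds per sample, network-wide.
    T_e = 1.0 / (N * R_p) + 1.0 / R_c
    return 1.0 / T_e  # equals N R_p R_c / (N R_p + R_c)

def min_processors(R_s, R_p, R_c):
    # Rearranging R_s <= R_e gives N >= R_s R_c / (R_p (R_c - R_s)),
    # which is only satisfiable when the sum rate exceeds the streaming rate.
    assert R_c > R_s, "network-wide sum rate must exceed the streaming rate"
    return math.ceil(R_s * R_c / (R_p * (R_c - R_s)))

R_s, R_p, R_c = 10_000.0, 1_000.0, 50_000.0  # samples/sec, illustrative values
N = min_processors(R_s, R_p, R_c)
print(N, effective_rate(N, R_p, R_c) >= R_s)
```

With these numbers a single node is an order of magnitude too slow, and the bound asks for N = 13 processors rather than the naive ceil(R_s/R_p) = 10, the extra nodes compensating for communication time.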
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.2">Distributed Processing Coupled with Mini-batching</head><p>Mini-batching in (centralized) stochastic methods, as discussed in Section 1.2, helps reduce the wall-clock time by reducing the number of read operations per iteration. Mini-batching of samples in distributed settings has the added advantage of reduction in the average number of primitive (sum) operations per processed sample, which further reduces the wall-clock time. It is in this vein that we put forth a mini-batched variant of D-Krasulina, which is termed DM-Krasulina, in Section 3.2.</p><p>Similar to the case of D-Krasulina (cf. Figure <ref type="figure">1</ref>), there are several equivalent system models that can benefit from the DM-Krasulina framework. In keeping with our theme of fast streaming data, as well as for the sake of concreteness, we assume the system buffers (i.e., mini-batches) B := bN &#8805; &#8968;R s /R p &#8969; samples of the incoming data stream every B/R s seconds for some parameter b &#8712; Z + . This network-wide mini-batch of B samples is then split into N parallel (local) mini-batches, each comprising b = B/N samples, which are then synchronously input to the interconnected network of N processors at a rate of R s /N samples per second and collaboratively processed by DM-Krasulina. In each iteration t of DM-Krasulina, therefore, the network processes a total of B &#8805; N samples, as opposed to N samples for D-Krasulina. 
In order to simplify notation in this mini-batched distributed setting, we reindex the b data samples in the mini-batch associated with the i-th processor in iteration t of DM-Krasulina as {x i,j,t } j=b j=1,t&#8712;Z+ , where the reindexing map (i, j, t) &#8594; t &#8242; is defined as t &#8242; = i + (j -1)N + (t -1)B.</p><p>It is straightforward to see from the discussion surrounding ( <ref type="formula">7</ref>) that the DM-Krasulina framework can process all data samples arriving at the system as long as N &#8805; bR c R s /(R p (bR c -R s )) and (bR c -R s ) &gt; 0. However, when this condition is violated due to a faster streaming rate R s , a slower processing rate R p , a slower summation rate R c , or any combination thereof, it becomes necessary for DM-Krasulina to discard &#181; := bR s /R p + N R s /R c -B samples at the splitter per iteration. The main analytical challenges for DM-Krasulina are, therefore, twofold: first, assuming &#181; = 0, characterize the mini-batch size B that leads to near-optimal convergence rates for DM-Krasulina in terms of the total number of samples arriving at the system; second, when discarding of samples becomes necessary, characterize the interplay between B and &#181; that allows DM-Krasulina to still achieve (order-wise) near-optimal convergence rates. We address both these challenges in Section 4.2.</p></div>
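The sample accounting for DM-Krasulina can be sketched as follows. This is an illustrative calculation under the paper's rate model: one iteration processes B = bN samples in roughly b/R p seconds of local computation plus N/R c seconds attributable to the vector-sum, so &#181; = bR s /R p + N R s /R c -B excess samples arrive and must be discarded whenever &#181; &gt; 0. All rate values below are made up for illustration.

```python
def discarded_per_iteration(b, N, R_s, R_p, R_c):
    # mu = b R_s/R_p + N R_s/R_c - B, clipped at zero (no discard when negative).
    B = b * N
    mu = b * R_s / R_p + N * R_s / R_c - B
    return max(0.0, mu)

def can_keep_up(b, N, R_s, R_p, R_c):
    # Equivalent to N >= b R_c R_s / (R_p (b R_c - R_s)) with b R_c > R_s.
    return b * R_c > R_s and discarded_per_iteration(b, N, R_s, R_p, R_c) == 0.0

R_s, R_p, R_c = 10_000.0, 1_000.0, 2_000.0
print(can_keep_up(8, 27, R_s, R_p, R_c))               # enough nodes: no samples lost
print(discarded_per_iteration(8, 8, R_s, R_p, R_c))    # too few nodes: mu > 0
```

Note how mini-batching helps: the communication term N R s /R c is paid once per B = bN samples instead of once per N samples, so larger b relaxes the requirement on N.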
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Proposed Distributed Stochastic Algorithms</head><p>We now formally describe the two stochastic algorithms, termed D-Krasulina and DM-Krasulina, that can be used to solve the 1-PCA problem from high-rate streaming data under the two setups described in Section 2.2.1 and Section 2.2.2, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Distributed Krasulina's Method (D-Krasulina) for High-rate Streaming Data</head><p>Recall from the discussion in Section 2.2.1 that each node i in the network receives data sample x i,t in iteration t of the distributed implementation, which comprises N processing nodes. Unlike the centralized Krasulina's method (cf. ( <ref type="formula">4</ref>)), therefore, any distributed variant of Krasulina's method needs to process N samples in every iteration t. Using A i,t as a shorthand for x i,t x T i,t , one natural extension of (4) that processes N samples in each iteration is as follows: v t = v t-1 + (&#947; t /N ) N i=1 [A i,t v t-1 -(v T t-1 A i,t v t-1 / &#8741;v t-1 &#8741; 2 2 )v t-1 ].</p><p>One natural question here is whether (8) can be computed within our distributed framework. The answer to this is in the affirmative under the assumption N &#8805; R s R c /(R p (R c -R s )), with the implementation (termed D-Krasulina) formally described in Algorithm 1. <ref type="foot">5</ref></p><p>Algorithm 1 Distributed Krasulina's Method (D-Krasulina). Input: Incoming data streams at N processors, expressed as x i,t i.i.d. &#8764; P x , and a step-size sequence {&#947; t &#8712; R + } t&#8712;Z+ . Initialize: All processors initialize with v 0 &#8712; R d randomly generated over the unit sphere. 1: for t = 1, 2, . . . , do 2: (In Parallel) Processor i receives data sample x i,t and updates &#958; i,t locally as follows: &#958; i,t &#8592; A i,t v t-1 -(v T t-1 A i,t v t-1 / &#8741;v t-1 &#8741; 2 2 )v t-1 3: Compute &#958; t &#8592; 1 N N i=1 &#958; i,t in the network using a distributed vector-sum subroutine 4: Update eigenvector estimate in the network as follows: v t &#8592; v t-1 + &#947; t &#958; t 5: end for Return: An estimate v t of the eigenvector q * of &#931; associated with &#955; 1 (&#931;)</p><p>Notice that unlike classical Krasulina's method, which processes a total of t samples after t iterations, D-Krasulina processes a total of N t samples after t iterations in order to provide an estimate v t of the top eigenvector q * of &#931;. Another natural question, therefore, is whether the estimate v t returned by D-Krasulina can converge to q * at the near-optimal rate of O (1/# of processed samples). Convergence analysis of D-Krasulina in Section 4 establishes that the answer to this is also in the affirmative under appropriate conditions that are specified in Theorem 1. An important interpretation of this result is that our proposed distributed implementation of Krasulina's method can lead to linear speed-up as a function of the number of processing nodes N in the network.</p></div>
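A single-process simulation of Algorithm 1 can be sketched as follows. This is an illustrative sketch rather than the paper's implementation: the N "nodes" are simulated sequentially, the averaging step stands in for the distributed vector-sum subroutine, the local direction is assumed to take the classical Krasulina form &#958; i,t = A i,t v -(v T A i,t v/&#8741;v&#8741; 2 )v, and the data model and step-size constants are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, T = 10, 4, 4000

# Population covariance with a clear spectral gap lambda_1 - lambda_2.
eigvals = np.linspace(2.0, 0.1, d)
eigvals[0] = 5.0
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
Sigma = Q @ np.diag(eigvals) @ Q.T
q_star = Q[:, 0]  # top eigenvector of Sigma

v = rng.standard_normal(d)
v /= np.linalg.norm(v)
c, L = 1.0, 50.0  # step size gamma_t = c / (L + t); constants illustrative

for t in range(1, T + 1):
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=N)  # one sample per node
    xi = np.zeros(d)
    for x in X:  # local Krasulina directions, computed "in parallel"
        Av = x * (x @ v)  # A_{i,t} v without forming the d x d matrix
        xi += Av - ((v @ Av) / (v @ v)) * v
    xi /= N                      # network-wide average (vector-sum / N)
    v = v + (c / (L + t)) * xi   # shared iterate update

# Potential function: squared sine of the angle between v and q_star.
psi = 1.0 - (v @ q_star) ** 2 / (v @ v)
print(psi)
```

Each local direction is exactly orthogonal to the current iterate, so the update rotates v toward q * without collapsing its norm; after 4000 iterations (16000 samples) the error psi is typically small.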
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Mini-batched D-Krasulina (DM-Krasulina) for High-rate Streaming Data</head><p>The distributed, mini-batched setup described in Section 2.2.2 entails each node i receiving a mini-batch of b = B/N data samples, {x i,j,t } b j=1 , in each iteration t, for a total of B = bN samples across the network in every iteration. Similar to (8), these B samples can in principle be processed by the following variant of the original Krasulina's iteration: v t = v t-1 + (&#947; t /B) N i=1 b j=1 [A i,j,t v t-1 -(v T t-1 A i,j,t v t-1 / &#8741;v t-1 &#8741; 2 2 )v t-1 ],</p><p>where A i,j,t is a shorthand for x i,j,t x T i,j,t . Practical computation of (9) within our distributed framework, however, requires consideration of two different scenarios.</p><p>&#8226; Scenario 1: The mini-batched distributed framework satisfies N &#8805; bR c R s /(R p (bR c -R s )). This enables incorporation of every sample arriving at the system into the eigenvector estimate.</p><p>&#8226; Scenario 2: The mini-batched distributed framework leads to the condition N &lt; bR c R s /(R p (bR c -R s )). This necessitates discarding of &#181; = bR s /R p + N R s /R c -B samples per iteration in the system. Stated differently, the system receives B + &#181; samples per iteration in this scenario, but only B samples per iteration are incorporated into the eigenvector estimate.</p><p>We now formally describe the algorithm (termed DM-Krasulina) that implements (9) under both these scenarios in Algorithm 2. 
1: for t = 1, 2, . . . , do 2: (In Parallel) Each processor i initializes &#958; i,t &#8592; 0 3: for j = 1, 2, . . . , b do 4: (In Parallel) Processor i receives data sample x i,j,t and updates &#958; i,t locally as follows: &#958; i,t &#8592; &#958; i,t + A i,j,t v t-1 -(v T t-1 A i,j,t v t-1 / &#8741;v t-1 &#8741; 2 2 )v t-1 5: end for 6: Compute &#958; t &#8592; 1 B N i=1 &#958; i,t in the network using a distributed vector-sum subroutine 7: Update eigenvector estimate in the network as follows: v t &#8592; v t-1 + &#947; t &#958; t 8: if &#181; &gt; 0 then 9: The system (e.g., data splitter/buffer) receives (B + &#181;) additional samples during execution of Steps 2-7, out of which &#181; &#8712; Z + samples are discarded 10: end if 11: end for Return: An estimate v t of the eigenvector q * of &#931; associated with &#955; 1 (&#931;)</p><p>Notice that DM-Krasulina incorporates B = bN samples into its eigenvector estimate per algorithmic iteration while requiring only a single network-wide vector-sum operation. While this makes DM-Krasulina particularly attractive for systems with slower communication links, the major analytical hurdle here is understanding the interplay between the different problem parameters that still allows DM-Krasulina to achieve near-optimal convergence rates in terms of the number of samples received at the system. We tease out this interplay as part of the convergence analysis of DM-Krasulina in Section 4.</p></div>
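One iteration of the mini-batched update can be sketched compactly. This is an illustrative sketch (function and variable names are ours): each of N simulated nodes accumulates Krasulina directions over its local mini-batch of b samples, after which a single averaging step, standing in for the one vector-sum per iteration, combines all B = bN directions.

```python
import numpy as np

def dm_krasulina_step(v, node_batches, gamma):
    """One DM-Krasulina iteration; node_batches is a list of N arrays of shape (b, d)."""
    B = sum(len(batch) for batch in node_batches)
    xi_locals = []
    for batch in node_batches:      # executed in parallel across nodes
        xi_i = np.zeros_like(v)
        for x in batch:             # local mini-batch accumulation, no communication
            Av = x * (x @ v)
            xi_i += Av - ((v @ Av) / (v @ v)) * v
        xi_locals.append(xi_i)
    xi = sum(xi_locals) / B         # the single network-wide vector-sum per iteration
    return v + gamma * xi

rng = np.random.default_rng(2)
d, N, b = 6, 3, 4
v = rng.standard_normal(d)
batches = [rng.standard_normal((b, d)) for _ in range(N)]
v_next = dm_krasulina_step(v, batches, gamma=0.01)
print(v_next.shape)
```

As with D-Krasulina, every accumulated direction is orthogonal to v, so the step changes the direction of the iterate but leaves v T (v next -v) at zero up to floating-point error.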
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">A Note on the Processing of Non-centered and Non-i.i.d. Data</head><p>Both D-Krasulina and DM-Krasulina have been developed under the assumptions of zero-mean (i.e., centered) and i.i.d. data samples. In this section, we discuss one possible approach to handling non-centered data using the two algorithms and also provide a rationale for the applicability of D-Krasulina and DM-Krasulina in the face of any shifts in the data distribution.</p><p>In the case of non-centered data, one simple strategy that works for D-Krasulina and DM-Krasulina is to maintain a (network-wide) running average of the non-centered data samples, and then use it to center the data samples at each processor before applying Step 2 (resp., Step 4) in Algorithm 1 (resp., Algorithm 2). While such a modification requires an extension of the convergence analysis presented in the next section, this can be accomplished in a manner similar to the analytical extension in (Zhou &amp; Bai, 2021) for the centralized Oja's rule with non-centered data.</p><p>Next, while the forthcoming convergence analysis for D-Krasulina and DM-Krasulina has been provided under the assumption of i.i.d. data samples, the two algorithms are expected to remain effective in the non-i.i.d. data setting. This is because D-Krasulina and DM-Krasulina first essentially compute a new gradient-like quantity using the latest batch of data samples at each time t (cf. Step 2 in Algorithm 1 and Step 4 in Algorithm 2), and then update their respective eigenvector estimates using this quantity. In particular, any shifts in the data distribution can be tracked by the two algorithms because of such an update rule. 
It is for this reason that algorithms such as Oja's rule and Krasulina's method are also often employed for the problem of subspace tracking (see, e.g., <ref type="bibr">(Bin Yang, 1995;</ref><ref type="bibr">Chatterjee, 2005;</ref><ref type="bibr">Doukopoulos &amp; Moustakides, 2008)</ref>). Providing a formal analysis of such tracking capabilities of D-Krasulina and DM-Krasulina for non-i.i.d. data is beyond the scope of this paper, however, and we leave it for future work.</p></div>
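The centering strategy described above, maintaining a running average of the raw samples and subtracting it before the update, can be sketched as follows. This is a minimal illustrative sketch (the class and names are ours, not the paper's); it uses the standard incremental-mean recursion and checks on synthetic shifted data that the running mean recovers the shift.

```python
import numpy as np

class RunningCenterer:
    """Maintains a running sample mean and centers each incoming sample."""
    def __init__(self, d):
        self.mean = np.zeros(d)
        self.count = 0

    def center(self, x):
        # Incremental mean update: m_k = m_{k-1} + (x_k - m_{k-1}) / k.
        self.count += 1
        self.mean += (x - self.mean) / self.count
        return x - self.mean  # centered sample, fed to the Krasulina update

rng = np.random.default_rng(3)
shift = np.array([5.0, -2.0, 1.0])  # unknown non-zero mean of the stream
centerer = RunningCenterer(3)
centered = np.array([centerer.center(rng.standard_normal(3) + shift)
                     for _ in range(5000)])
print(np.abs(centerer.mean - shift).max())
```

In the distributed setting this running mean would itself be maintained network-wide (e.g., via the same vector-sum subroutine), which is the extra bookkeeping the text alludes to.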
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Convergence Analysis of D-Krasulina and DM-Krasulina</head><p>Our convergence analysis of D-Krasulina and DM-Krasulina is based on understanding the rate at which the so-called potential function &#936; t of these methods converges to zero as a function of the number of algorithmic iterations t. Formally, this potential function &#936; t is defined as follows.</p><p>Definition 2 (Potential function). Let q * be the eigenvector of &#931; associated with &#955; 1 (&#931;) and let v t be an estimate of q * returned by an iterative algorithm in iteration t. Then the quality of the estimate v t can be measured in terms of the potential function &#936; t : v t &#8594; [0, 1] that is defined as &#936; t := 1 -(v T t q * ) 2 / &#8741;v t &#8741; 2 2 .</p><p>Notice that &#936; t is a measure of estimation error, which approaches 0 as v t converges to any scalar multiple of q * . This measure, which essentially computes the sine squared of the angle between q * and v t , is frequently used in the literature to evaluate the performance of PCA algorithms. In particular, when one initializes an algorithm with a random vector v 0 uniformly distributed over the unit sphere in R d , then it can be shown that E{&#936; 0 } &#8804; 1 -1/d <ref type="bibr">(Balsubramani et al., 2013)</ref>. While this is a statement in expectation for t = 0, our analysis relies on establishing such a statement in probability for any t &#8805; 0 for both D-Krasulina and DM-Krasulina. Specifically, we show in Theorem 3 that sup t&#8805;0 &#936; t &#8804; 1 -O(1/d) with high probability as long as &#947; t = c/(L + t) for any constant c and a large-enough constant L.</p><p>All probabilistic analysis in the following uses a filtration (F t ) t&#8805;0 of sub &#963;-algebras of F on the sample space &#8486;, where the &#963;-algebra F t captures the progress of the iterates of the two proposed stochastic algorithms up to iteration t. 
Mathematically, let us define the sample covariance matrix A t as A t := 1 N N i=1 A i,t and A t := 1 B N i=1 b j=1 A i,j,t for D-Krasulina and DM-Krasulina, respectively. In order to simplify notation and unify some of the analysis of D-Krasulina and DM-Krasulina, we will be resorting to the use of random matrices A t , as opposed to x i,t and x i,j,t , in the following. We then have the following definition of &#963;-algebras in the filtration.</p><p>Definition 3 (&#963;-algebra F t ). The &#963;-algebra F t &#8838; F on sample space &#8486; for both D-Krasulina and DM-Krasulina is defined as the &#963;-algebra generated by the vector-/matrix-valued random variables (v 0 , A 1 , . . . , A t ), i.e., F t := &#963;(v 0 , A 1 , . . . , A t ).</p><p>In addition to the filtration (F t ) t&#8805;0 , the forthcoming analysis also uses a sequence of nested sample spaces that is defined as follows.</p><p>Definition 4 (Nested sample spaces). Let (t 0 , &#1013; 0 ), (t 1 , &#1013; 1 ), (t 2 , &#1013; 2 ), . . . , (t J , &#1013; J ) be a sequence of pairs such that 0 = t 0 &lt; t 1 &lt; t 2 &lt; . . . &lt; t J and &#1013; 0 &gt; &#1013; 1 &gt; &#1013; 2 &gt; . . . &gt; &#1013; J &gt; 0 for any non-negative integer J. We then define a sequence (&#8486; &#8242; t ) t&#8805;1 of nested sample spaces as &#8486; &#8242; t := {&#969; &#8712; &#8486; : &#936; l (&#969;) &#8804; 1 -&#1013; j for all j &#8712; {0, . . . , J} and all l &#8712; {t j , . . . , min(t, t j+1 ) -1}}, where we set t J+1 := &#8734;.</p><p>Here, &#969; denotes an outcome within the sample space &#8486; and &#936; l (&#969;) is the (random) potential function after the l-th iteration of D-Krasulina / DM-Krasulina that is being explicitly written as a function of the outcomes &#969; in the sample space.</p><p>In words, the sample space &#8486; &#8242; t corresponds to that subset of the original sample space for which the error &#936; l in all iterations l &#8712; {t j , . . . , t -1} is below 1 -&#1013; j , where j &#8712; {0, . . . , J}. In the following, we use the notation E t {&#8226;} and P t (&#8226;) to denote conditional expectation and conditional probability, respectively, with respect to &#8486; &#8242; t . 
An immediate implication of Definition 4 is that, for appropriate choices of &#1013; j 's, it allows us to focus on those subsets of the original sample space that ensure convergence of iterates of the proposed algorithms to the top eigenvector q * at the desired rates. The main challenge here is establishing that such subsets have high probability measure, i.e., P( &#8745; t&gt;0 &#8486; &#8242; t ) &#8805; 1 -&#948; for any &#948; &gt; 0. We obtain such a result in Theorem 4 in the following. We are now ready to state our main results for D-Krasulina and DM-Krasulina.</p></div>
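The potential function and the random-initialization bound can be checked numerically. The sketch below assumes the form &#936; t = 1 -(v T t q * ) 2 /&#8741;v t &#8741; 2 (the squared sine of the angle between v t and q * , per Definition 2), and runs a Monte-Carlo check that a uniform unit-sphere initialization gives E{&#936; 0 } close to 1 -1/d, consistent with the bound E{&#936; 0 } &#8804; 1 -1/d cited from (Balsubramani et al., 2013). The dimension and sample count are illustrative.

```python
import numpy as np

def potential(v, q_star):
    # Psi = 1 - (v^T q*)^2 / ||v||^2; invariant to rescaling of v.
    return 1.0 - (v @ q_star) ** 2 / (v @ v)

rng = np.random.default_rng(4)
d = 20
q_star = np.eye(d)[0]  # any fixed unit vector works by rotational symmetry

# v_0 uniform on the unit sphere: normalize an isotropic Gaussian draw
# (the potential is scale-invariant, so normalization can even be skipped).
draws = rng.standard_normal((100_000, d))
psis = [potential(v, q_star) for v in draws]
print(np.mean(psis))  # concentrates near 1 - 1/d = 0.95
```

For the uniform initialization the bound holds with equality, E{&#936; 0 } = 1 -1/d, which is why the empirical mean sits on either side of 0.95 by sampling noise alone.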
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Convergence of D-Krasulina (Algorithm 1)</head><p>The first main result of this paper shows that D-Krasulina results in linear speed-up in convergence rate as a function of the number of processing nodes, i.e., the potential function for D-Krasulina converges to 0 at a rate of O(1/N t). Since the system receives a total of N t samples at the end of t iterations of D-Krasulina, this result establishes that D-Krasulina is order-wise near-optimal in terms of sample complexity for the streaming PCA problem. The key to proving this result is characterizing the convergence behavior of D-Krasulina in terms of the variance of the sample covariance matrix A t := 1 N N i=1 A i,t that is implicitly computed within D-Krasulina. We denote this variance as &#963; 2 N , which has the form &#963; 2 N := E{&#8741;A t -&#931;&#8741; 2 F }.</p><p>It is straightforward to see from Definition 1 and ( <ref type="formula">12</ref>) that &#963; 2 N = &#963; 2 /N . This reduction in variance of the sample covariance matrix within D-Krasulina essentially enables the linear speed-up in convergence. In terms of specifics, we have the following convergence result for D-Krasulina.</p><p>Theorem 1. Fix any &#948; &#8712; (0, 1) and pick c := c 0 /(2(&#955; 1 -&#955; 2 )) for any c 0 &gt; 2. Next, define </p><p>where C 1 and C 2 are constants defined as</p><p>Remark 2. While we can obtain a similar result for the case of c 0 &#8804; 2, that result does not lead to any convergence speed-up. In particular, the convergence rate in that case becomes O(t -c0/2 ), which matches the one in <ref type="bibr">(Balsubramani et al., 2013)</ref>.</p><p>Discussion. A proof of Theorem 1, which is influenced by the proof technique employed in <ref type="bibr">(Balsubramani et al., 2013)</ref>, is provided in Section 5. Here, we discuss some of the implications of this result, especially in relation to <ref type="bibr">(Balsubramani et al., 2013)</ref> and <ref type="bibr">(Jain et al., 2016)</ref>. 
The different problem parameters affecting the performance of stochastic methods for streaming PCA include: (i) the dimensionality of the ambient space, d; (ii) the eigengap of the population covariance matrix, (λ_1 - λ_2); (iii) the upper bound on the norm of the received data samples, r; and (iv) the variance of the sample covariance matrix, σ^2 and/or σ_N^2. Theorem 1 characterizes the dependence of D-Krasulina on all these parameters and significantly improves on the related result provided in <ref type="bibr">(Balsubramani et al., 2013)</ref>.</p><p>First, Theorem 1 establishes that D-Krasulina can achieve the convergence rate O(σ_N^2/t) ≡ O(σ^2/(Nt)) with high probability (cf. (<ref type="formula">14</ref>)). This is in stark contrast to the result in <ref type="bibr">(Balsubramani et al., 2013)</ref>, which is independent of the variance of the sample covariance matrix and thereby only guarantees a convergence rate of O(r^4/t) for D-Krasulina and its variants. This ability of variants of Krasulina's method to achieve faster convergence through variance reduction is arguably one of the most important aspects of our analysis. Second, in comparison with <ref type="bibr">(Balsubramani et al., 2013)</ref>, Theorem 1 also results in an improved lower bound on the choice of L by splitting it into two quantities, viz., L_1 and L_2 (cf. (<ref type="formula">13</ref>)). This improved bound allows larger step sizes, which also results in faster convergence. In terms of specifics, L_1 in the theorem is on the order of Ω(r^4 d/δ^2), which is an improvement over the Ω(r^4 d^2/δ^4) bound of <ref type="bibr">(Balsubramani et al., 2013)</ref>. On the other hand, while L_2 has the same dependence on δ and d as <ref type="bibr">(Balsubramani et al., 2013)</ref>, it depends on σ_N^2 instead of r^4 and, therefore, it decreases with an increase in N. 
Third, the improved lower bound on L also allows for an improved dependence on the dimensionality d of the problem. Specifically, for large enough t and N, the dependence on d in (<ref type="formula">14</ref>) is due to the higher-order (first) term and is of the order O(d^{5/(2 ln 2) + c_0/2}), as opposed to O(d^{5/(2 ln 2) + c_0}) for <ref type="bibr">(Balsubramani et al., 2013)</ref>. It is worth noting here, however, that this is still looser than the result in <ref type="bibr">(Jain et al., 2016)</ref>, which has only a log^2(d) dependence on d in the higher-order error terms.</p><p>Fourth, in terms of the eigengap, our analysis has the optimal dependence of 1/(λ_1 - λ_2)^2, which also matches the dependence in <ref type="bibr">(Balsubramani et al., 2013)</ref> and <ref type="bibr">(Jain et al., 2016)</ref>. It is important to note here, however, that knowledge of the eigengap (λ_1 - λ_2) is not necessary to run Oja's rule, Krasulina's method, D-Krasulina, or any of the related stochastic methods in a practical setting. Specifically, it can be seen from Theorem 1 that the eigengap is only needed to set the step size in D-Krasulina for the optimal convergence rate. In practice, however, step sizes of the form c/t work well for D-Krasulina and the related methods, and a simple yet highly effective strategy for setting the step size in these methods is to estimate the parameter c by running multiple instances of the method during a warm-up phase. Such an approach is akin to approximating several problem-related parameters using a single parameter c, and is the one we have followed for the numerical experiments discussed in Section 6.</p><p>Finally, we compare the recommended step-size sequence γ_t = c/(L + t) in Theorem 1 to the ones in <ref type="bibr">(Balsubramani et al., 2013)</ref> and <ref type="bibr">(Jain et al., 2016)</ref>. 
Since the step sizes in these two prior works also take the form &#947; t = c/(L + t), all three works are equivalent to each other in terms of scaling of the step size as a function of t. But in terms of the initial step size, and assuming small enough &#948; in Theorem 1, &#947; 1 is the largest for <ref type="bibr">(Jain et al., 2016)</ref>, second-largest for this work, and the smallest for <ref type="bibr">(Balsubramani et al., 2013)</ref>. In relation to our work, this difference in the initial step size in the case of <ref type="bibr">(Balsubramani et al., 2013</ref>) is due to the improved lower bound on L in Theorem 1. In the case of <ref type="bibr">(Jain et al., 2016)</ref>, this difference is attributable to the fact that the parameter L is independent of &#948; in that work. Stated differently, we are able to vary the probability of success 1 -&#948; in this work by making the parameter L be a function of &#948;, with the caveat being that the initial step size &#947; 1 gets smaller as &#948; decreases. In contrast, a fixed L in <ref type="bibr">(Jain et al., 2016)</ref> can be thought of as one of the reasons the probability of success is fixed at 3/4 in that work. We conclude by noting that this dependence of the performance of D-Krasulina on different problem parameters is further highlighted through numerical experiments in Section 6.</p><p>Remark 3. While Theorem 1 is for (a distributed variant of) Krasulina's method, Oja's rule can also be analyzed using similar techniques; see, e.g., the discussion in <ref type="bibr">(Balsubramani et al., 2013)</ref>.</p><p>Remark 4. Recall from the discussion in Section 1.1 that an iteration of Krasulina's method is similar to that for SGD applied to the optimization problem (5). A natural question to ask then is whether D-Krasulina can be "accelerated" in much the same way SGD can be accelerated by adding a momentum term to its iteration expression. 
The authors in <ref type="bibr">(De Sa et al., 2018)</ref>, however, argue that naively applying momentum to Oja's rule or the power iteration, both of which are closely related to Krasulina's method, results in worse performance, since momentum amplifies the effect of the noise within the iterates. And while the noise within the iterates can be controlled through variance reduction techniques, as done in <ref type="bibr">(De Sa et al., 2018)</ref> to accelerate the power iteration for eigenvector computation, such techniques typically require multiple data passes and are therefore not suited for the setting in which data samples continuously stream into the system.</p></div>
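The warm-up strategy for setting the step-size constant c, mentioned in the discussion of Theorem 1, can be sketched as follows. The held-out Rayleigh-quotient selection criterion, the helper names, and all constants here are assumptions of this sketch rather than the paper's exact procedure; since q* is unknown in practice, the Rayleigh quotient on a held-out sample covariance serves as a practical proxy for alignment with q*.

```python
import numpy as np

def krasulina_iterates(X, v, c, L=1.0):
    """Plain Krasulina's method over a finite warm-up stream X (one sample per row)."""
    for t in range(X.shape[0]):
        x = X[t]
        g = c / (L + t + 1.0)
        Ax = x * (x @ v)                               # (x x^T) v
        v = v + g * (Ax - ((v @ Ax) / (v @ v)) * v)    # Krasulina update
        v = v / np.linalg.norm(v)                      # renormalize: does not change the
                                                       # direction sequence, avoids overflow
    return v

def warmup_select_c(X_warm, X_holdout, candidates, rng):
    """Run one short instance per candidate c; keep the c whose iterate maximizes
    the Rayleigh quotient on a held-out sample covariance (a proxy criterion)."""
    A_hold = X_holdout.T @ X_holdout / X_holdout.shape[0]
    v0 = rng.standard_normal(X_warm.shape[1])
    best_c, best_val = None, -np.inf
    for c in candidates:
        v = krasulina_iterates(X_warm, v0.copy(), c)
        val = v @ A_hold @ v
        if val > best_val:
            best_c, best_val = c, val
    return best_c

# Illustrative warm-up data: 5-dimensional Gaussian with eigengap 0.2
rng = np.random.default_rng(1)
eigvals = np.array([1.0, 0.8, 0.6, 0.4, 0.2])
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
Sigma_half = Q * np.sqrt(eigvals)
X_warm = rng.standard_normal((2000, 5)) @ Sigma_half.T
X_hold = rng.standard_normal((2000, 5)) @ Sigma_half.T
c_star = warmup_select_c(X_warm, X_hold, candidates=[0.1, 1.0, 10.0, 100.0], rng=rng)
```

The candidate grid and warm-up length would in practice be chosen relative to the streaming rate; the point is only that a single scalar c is tuned before the main run, as described above.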
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Convergence of DM-Krasulina (Algorithm 2)</head><p>The convergence analysis of DM-Krasulina follows from slight modifications of the proof of Theorem 1 for D-Krasulina. The final set of results, which covers the two scenarios of zero data loss (µ = 0) and some data loss (µ &gt; 0) in each iteration, is characterized in terms of the variance of the (mini-batched) sample covariance matrix A_t := (1/B) Σ_{i,j} x_{i,j,t} x_{i,j,t}^T associated with DM-Krasulina. We denote this variance as σ_B^2, which is given by</p><p>It is once again straightforward to see that σ_B^2 = σ^2/B. We now split our discussion of the convergence of DM-Krasulina according to the two scenarios discussed in Section 3.2.</p></div>
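The relation σ_B^2 = σ^2/B can be checked numerically. The following is a small Monte-Carlo sketch under an arbitrary 5-dimensional Gaussian model; the function name, spectrum, and trial count are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
eigvals = np.array([1.0, 0.8, 0.6, 0.4, 0.2])
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
Sigma = (Q * eigvals) @ Q.T          # population covariance
Sigma_half = Q * np.sqrt(eigvals)

def mse_sample_cov(B, trials=2000):
    """Monte-Carlo estimate of E||A_t - Sigma||_F^2 for the size-B
    mini-batched sample covariance A_t = (1/B) sum_j x_j x_j^T."""
    total = 0.0
    for _ in range(trials):
        X = rng.standard_normal((B, d)) @ Sigma_half.T
        A = X.T @ X / B
        total += np.sum((A - Sigma) ** 2)
    return total / trials

ratio = mse_sample_cov(1) / mse_sample_cov(10)   # close to 10: variance scales as 1/B
```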
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.1">Scenario 1-DM-Krasulina with no data loss:</head><p>Analytically, this scenario is similar to D-Krasulina, with the only difference being that we are now incorporating an average of B sample covariances x_{i,j,t} x_{i,j,t}^T in the estimate in each iteration (as opposed to N sample covariances for D-Krasulina). We therefore have the following generalization of Theorem 1 in this scenario.</p><p>Theorem 2. Let the parameters and constants be as specified in Theorem 1, except that the parameter L_2 is now defined as</p><p>ln(4/δ). Then, as long as Assumptions [A1] and [A2] hold, we have for DM-Krasulina that P(∩_{t&gt;0} Ω′_t) ≥ 1 - δ and</p><p>The proof of this theorem can be obtained from that of Theorem 1 by replacing 1/N and σ_N^2 in there with 1/B and σ_B^2, respectively. Similar to the case of D-Krasulina, this theorem establishes that DM-Krasulina can also achieve a linear speed-up in convergence as a function of the network-wide mini-batch size B with very high probability, i.e., E_t{Ψ_t} = O(σ_B^2/t) ≡ O(σ^2/(Bt)). Our discussions of D-Krasulina and DM-Krasulina have so far been focused on the infinite-sample regime, in which the number of algorithmic iterations t for both algorithms can grow unbounded. We now focus on the implications of our results for the finite-sample regime, in which a final estimate is produced at the end of the arrival of a total of T ≫ 1 samples. 6 This finite-sample regime leads to an interesting interplay between N (resp., B) and the total number of samples T for linear speed-up of D-Krasulina (resp., DM-Krasulina). We describe this interplay in the following for DM-Krasulina; the corresponding result for D-Krasulina follows by simply replacing B with N in this result.</p><p>Corollary 1. Let the parameters and constants be as specified in Theorem 2. Next, pick parameters </p><p>Proof. 
Substituting t = T_B in (<ref type="formula">16</ref>) and using simple upper bounds yields</p><p>Since σ_B^2 = σ^2/B and T_B = T/B, (18) reduces to the following expression:</p><p>The proof now follows from the assumption that B ≤ T^{1-2/c_0}.</p><p>6 An implicit assumption here is that T is large enough that it precludes the use of a batch PCA algorithm.</p><p>Discussion. Corollary 1 dictates that the linear convergence speed-up for DM-Krasulina (resp., D-Krasulina) occurs in the finite-sample regime provided the network-wide mini-batch size B (resp., number of processing nodes N) scales sublinearly with the total number of samples T. In particular, the proposed algorithms achieve the best (order-wise) convergence rate of O(1/T) for appropriate choices of system parameters. We also corroborate this theoretical finding with numerical experiments involving synthetic and real-world data in Section 6. The statement of Theorem 2 for DM-Krasulina in the lossless setting immediately carries over to the resource-constrained setting that causes loss of µ (&gt; 0) samples per iteration. The implication of this result is that DM-Krasulina can achieve a convergence rate of O(1/(Bt)) in the infinite-sample regime after receiving a total of (B + µ)t samples. Therefore, it trivially follows that DM-Krasulina can achieve an order-wise near-optimal convergence rate in the infinite-sample regime as long as µ = O(B).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.2">Scenario 2-DM-Krasulina with data loss:</head><p>We now turn our attention to understanding the interplay between µ, B, and the total number of samples T arriving at the system for the resource-constrained finite-sample setting for DM-Krasulina. To this end, we have the following generalization of Corollary 1.</p><p>Corollary 2. Let the parameters and constants be as specified in Corollary 1, and define the final number of algorithmic iterations for DM-Krasulina as T_B^µ := T/(B + µ). Then, as long as Assumptions [A1] and [A2] hold, we have that</p><p>Proof. The proof of this corollary follows from replacing T_B with T_B^µ in (18) and subsequently substituting the values of T_B^µ and σ_B^2 in there.</p><p>Discussion. Recall that since the distributed framework receives a total of T samples, it is desirable to achieve a convergence rate of O(1/T). It can be seen from Corollary 2 that the first and the third terms in (<ref type="formula">19</ref>) are the ones that dictate whether DM-Krasulina can achieve the (order-wise) optimal rate of O(1/T).</p><p>To this end, the first term in (<ref type="formula">19</ref>) imposes the condition (B + µ) ≤ T^{1-2/c_0}, i.e., the total number of samples received at the system (both processed and discarded) per iteration must scale sublinearly with the final number of samples T. In addition, the third term in (<ref type="formula">19</ref>) imposes the condition µ = O(B), i.e., the number of samples discarded by the system in each iteration must scale no faster than the number of samples processed by the system in each iteration. Once these two conditions are satisfied, Corollary 2 guarantees near-optimal convergence for DM-Krasulina.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Proof of the Main Result</head><p>The main result of this paper is given by Theorem 1, which can then be applied to any algorithm that (implicitly or explicitly) involves an iteration of the form (8). We develop a proof of this result in this section, which-similar to the approach taken in <ref type="bibr">(Balsubramani et al., 2013)</ref> for the analysis of Krasulina's method-consists of characterizing the behavior of D-Krasulina in three different algorithmic epochs. The result concerning the initial epoch is described in terms of Theorem 3 in the following, the behavior of the intermediate epoch, which comprises multiple sub-epochs, is described through Theorem 4, while the behavior of D-Krasulina in the final epoch is captured through a formal proof of Theorem 1 at the end of this section.</p><p>Before proceeding, recall that our result requires the existence of a sequence (Ω′_t)_{t∈Z_+} of nested sample spaces that are defined in terms of a sequence of pairs (t_0 ≡ 0, ϵ_0), (t_1, ϵ_1), . . . , (t_J, ϵ_J). Our analysis of the initial epoch involves showing that for the step size γ_t chosen as in Theorem 1, the error for all t ≥ 0 will be less than (1 - ϵ_0) with high probability for some constant ϵ_0. We then define the remaining ϵ_j's as ϵ_j = 2^j ϵ_0, j = 1, . . . , J, where J is defined as the smallest integer satisfying ϵ_J ≥ 1/2. Our analysis in the intermediate epoch then focuses on establishing lower bounds on the number of iterations t_j for which D-Krasulina is guaranteed to have error less than 1 - ϵ_j with high probability. Stated differently, the intermediate epoch characterizes the sub-epochs {1 + t_{j-1}, . . . , t_j} during which the error is guaranteed to decrease from (1 - ϵ_{j-1}) to (1 - ϵ_j) with high probability.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Initial Epoch</head><p>Our goal for the initial epoch is to show that if we pick the step size appropriately, i.e., we set L to be large enough (cf. (<ref type="formula">13</ref>)), then the error Ψ_t will not exceed a certain value with high probability. This is formally stated in the following result.</p><p>Theorem 3. Fix any δ ∈ (0, 1), define ϵ ∈ (0, 1) as ϵ := δ^2/(8e), and let</p><p>Then, if Assumptions [A1] and [A2] hold and we choose the step size to be γ_t = c/(L + t), we have</p><p>In order to prove Theorem 3, we need several helper lemmas, which are stated in the following. We only provide the lemma statements in this section and defer the proofs to Appendix A. We start by writing the recursion for the error metric Ψ_t in the following lemma.</p><p>Lemma 1. Defining a scalar random variable</p><p>we get the following recursion:</p><p>Proof. See Appendix A.1.</p><p>Part (i) of this lemma will be used to analyze the algorithm in the final epoch for the proof of Theorem 1, while Part (ii) will be used to prove Theorem 3 for this initial epoch and Theorem 4 for the intermediate epoch.</p><p>Next, we bound the moment generating function of Ψ_t conditioned on F_{t-1} (Definition 3). For this, we need an upper bound on the conditional variance of z_t, which is given below.</p><p>Lemma 2. The conditional variance of the random variable z_t is given by</p><p>Proof. See Appendix A.2.</p><p>Using this upper bound on the conditional variance of z_t, we can now upper bound the conditional moment generating function of Ψ_t. In order to simplify notation, much of our discussion in the following will revolve around the moment generating function with parameter s ∈ S := [d/(4ϵ), (2/ϵ_0) ln(4/δ)]. Note, however, that similar results can be derived for any positive-valued parameter s ∈ R.</p><p>Lemma 3. 
The conditional moment generating function of Ψ_t for s ∈ S is upper bounded as</p><p>Proof. See Appendix A.3.</p><p>Note that this result is similar to <ref type="bibr">(Balsubramani et al., 2013, Lemma 2.3</ref>), with the difference being that the last term here involves the sample variance σ_N^2, as opposed to the upper bound ∥x′_t∥_2 ≤ r on the input in <ref type="bibr">(Balsubramani et al., 2013, Lemma 2.3</ref>). This difference prompts changes in the subsequent steps of the analysis of D-Krasulina, and it also enables us to characterize improvements in the convergence rate of Krasulina's method using iterations of the form (8).</p><p>We are now ready to prove the statement of Theorem 3, which is based on Lemmas 1 and 3.</p><p>Proof of Theorem 3. We start by constructing a supermartingale from the sequence of errors Ψ_t. First, restricting ourselves to s ∈ S, we define the quantities</p><p>Now, taking the expectation of M_t conditioned on the filtration F_{t-1}, we get</p><p>≤ exp (sΨ_{t-1}</p><p>Here, (a) is due to Lemma 3 and the fact that E{z_t|F_{t-1}} ≥ 0 <ref type="bibr">(Balsubramani et al., 2013, Theorem 2.1)</ref>. These calculations show that the sequence {M_t} forms a supermartingale. Using the sequence M_t, we can now use Doob's martingale inequality <ref type="bibr">(Durrett, 2010, pg. 231)</ref> to show that Ψ_t will be bounded away from 1 with high probability. Specifically, for any ∆ ∈ (0, 1), we have</p><p>Substituting ∆ = 1 - ϵ/d and using <ref type="bibr">(Balsubramani et al., 2013, Lemma 2.5)</ref> to bound E{e^{sΨ_0}}, we get</p><p>Next, we need to bound Σ_{l&gt;0} β_l and Σ_{l&gt;0} ζ_l. 
First, we get</p><p>Again, using a similar procedure, we get</p><p>Combining (<ref type="formula">26</ref>) and (<ref type="formula">27</ref>), along with the definition of τ_t at the beginning, we get</p><p>Now, using the lower bound on L, we get τ_0 ≤ ϵ/d for s = d/(4ϵ), as shown in Proposition 4 in Appendix D. Substituting this in (25), we get</p><p>Finally, substituting s = d/(4ϵ), we get</p></div>
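For reference, the maximal inequality invoked in the proof above, stated for a nonnegative supermartingale (M_t)_{t ≥ 0}, takes the following form (the choice of threshold below is generic, not the paper's exact instantiation):

```latex
% Doob's maximal inequality for a nonnegative supermartingale (M_t)_{t \ge 0}:
\mathbb{P}\Big( \sup_{t \ge 0} M_t \ge a \Big) \;\le\; \frac{\mathbb{E}\{M_0\}}{a},
\qquad a > 0.
```

Roughly speaking, applying this with a threshold of the form a = e^{s∆} to the supermartingale M_t constructed from e^{sΨ_t} converts a bound on E{M_0} into a bound on the probability that Ψ_t ever exceeds ∆, uniformly over t.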
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Intermediate Epoch</head><p>In Theorem 3 we have shown that if we choose L such that it satisfies the lower bound given in Theorem 3, then the error Ψ_t exceeds 1 - ϵ_0 (here, ϵ_0 = δ^2/(8ed)) with probability at most δ. Next, our aim is to show that if we perform enough iterations t_J of D-Krasulina, then for any t ≥ t_J the error in the iterate will be bounded as Ψ_t ≤ 1/2 with high probability. In order to prove this, we divide our analysis into different sub-epochs that are indexed by j ∈ {1, . . . , J}. Starting from 1 - ϵ_0, we provide a lower bound on the number of iterations t_j such that we progressively increase ϵ_j in each sub-epoch until we reach ϵ_J.</p><p>Theorem 4. Fix any δ ∈ (0, 1) and pick c := c_0/(2(λ_1 - λ_2)) for any c_0 &gt; 2. Next, let the number of processing nodes N &gt; 1, the parameter L ≥ 8r^4 max(1, c^2)</p><p>ln(4/δ), and the step size γ_t := c/(L + t). Finally, select a schedule (0, ϵ_0), (t_1, ϵ_1), . . . , (t_J, ϵ_J) such that the following conditions are satisfied:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>and</head><p>[C2] t_{j+1} + L + 1 ≥ e^{5/c_0} (t_j + L + 1) for 0 ≤ j &lt; J.</p><p>In order to prove this theorem, we need Lemmas 4-7, which are stated as follows.</p><p>Lemma 4. For t &gt; t_j, the moment generating function of</p><p>Proof. See Appendix B.1.</p><p>Lemma 5. For t &gt; t_j and s ∈ S, we have</p><p>Proof. See Appendix B.2.</p><p>Using Lemma 5, our next result deals with a specific value of t, namely, t = t_{j+1}.</p><p>Lemma 6. Suppose Conditions [C1]-[C2] are satisfied. Then for 0 ≤ j &lt; J and s ∈ S, we get</p><p>Lemma 7. Then picking any δ ∈ (0, 1), we have</p><p>Proof. See Appendix B.4.</p><p>Proof. (Proof of Theorem 4) Using the results from Lemma 7 and Theorem 3 and applying the union bound, we get the statement of Theorem 4.</p></div>
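Condition [C2] fixes how fast the sub-epoch boundaries t_j must grow. The following sketch generates the smallest schedule compatible with [C2] under the doubling rule ϵ_j = 2^j ϵ_0; epoch_schedule is a hypothetical helper and the parameter values are illustrative.

```python
import math

def epoch_schedule(L, c0, eps0):
    """Smallest iteration thresholds t_j satisfying condition [C2],
    t_{j+1} + L + 1 >= e^{5/c0} (t_j + L + 1), starting from t_0 = 0,
    with J the smallest integer such that eps_J = 2^J eps0 >= 1/2."""
    J = max(0, math.ceil(math.log2(0.5 / eps0)))
    growth = math.exp(5.0 / c0)                      # per-sub-epoch growth factor
    ts = [0]
    for _ in range(J):
        ts.append(math.ceil(growth * (ts[-1] + L + 1) - L - 1))
    return ts

ts = epoch_schedule(L=100.0, c0=4.0, eps0=1e-4)      # J = 13 sub-epochs here
```

Because t_j + L + 1 grows geometrically with rate e^{5/c_0}, the total warm-up length t_J scales like (L + 1) e^{5J/c_0}, i.e., polynomially in 1/ϵ_0.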
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Final Epoch</head><p>Now that we have shown that Ψ_t ≤ 1/2 with probability 1 - δ for all t ≥ t_J, we characterize in the final epoch how Ψ_t decreases further as a function of the algorithmic iterations. The following result captures the rate at which Ψ_t decreases during this final epoch.</p><p>Lemma 8. For any t &gt; t_J and c := c_0/(λ_1 - λ_2), the (conditional) expected error Ψ_t is given by</p><p>Proof. See Appendix C.</p><p>We are now ready to prove our main result, which is given by Theorem 1.</p><p>Proof. (Proof of Theorem 1) Recall the definitions of the sub-epochs corresponding to the pairs (t_j, ϵ_j) that satisfy the two conditions in Theorem 4. Following the same procedure as in the proof of <ref type="bibr">(Balsubramani et al., 2013, Theorem 1.1)</ref>, notice that</p><p>Defining</p><p>and using Lemma 8 for t &gt; t_J, we have</p><p>Now, using Proposition 1 from Appendix C with c_0 &gt; 2, we get</p><p>Here, the inequality in (a) is due to (30) and we have also used the fact that (1 + x)^a ≤ exp(ax) for x ≥ -1. In addition, since (4ed/δ^2)^{5/(2 ln 2)} ≥ 1, we get</p><p>This completes the proof of the theorem.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Numerical Results</head><p>In this section, we utilize numerical experiments to validate the theoretical findings of this work in terms of the ability of implicit/explicit mini-batched variants of the original Krasulina's method <ref type="bibr">(Krasulina, 1969)</ref> to estimate the top eigenvector of a covariance matrix from (fast) streaming data. Instead of repeating the same set of experiments for the original Krasulina's method, D-Krasulina, and DM-Krasulina, we present our results parameterized by the network-wide mini-batch size B ∈ {1} ∪ {bN : b ∈ Z_+} that appears in DM-Krasulina. This is because B = 1 trivially corresponds to the original Krasulina's iterations, while B = N corresponds to iterations that characterize D-Krasulina. Our goals for the numerical experiments are threefold: (i) showing the impact of (implicit/explicit) mini-batching on the convergence rate of DM-Krasulina, (ii) establishing robustness of DM-Krasulina against the loss of µ &gt; 0 samples per iteration for the case when N &lt; ⌊R_c⌋R_s/(R_p(⌊R_c⌋ - R_s)), and (iii) experimental validation of the scaling of the convergence rate in terms of the problem parameters predicted by our theoretical findings, namely, the eigengap (λ_1 - λ_2), the dimensionality (d), and the upper bound on the input samples (∥x′_t∥_2 ≤ r).</p><p>In the following, we report results of experiments on both synthetic and real-world data to highlight these points. Since the main purpose is to corroborate the scaling behaviors within the main results-and not to investigate additional system-related issues concerned with large-scale implementations-the real-world datasets are chosen to facilitate their processing on low-cost compute machines.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Experiments on Synthetic Data</head><p>In the following experiments, we generate T = 10^6 samples from some probability distribution (specified later for each experiment) and perform 200 Monte-Carlo trials per experiment. In all the experiments in the following, we use a step size of the form γ_t = c/t. We performed experiments with multiple values of c and report the results for the value of c that achieves the best convergence rate. Further details about each experiment are provided in the following sections.</p></div>
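The synthetic-data setup used throughout this subsection, a zero-mean Gaussian with λ_1 = 1 and a prescribed eigengap, can be reproduced along the following lines. Here, make_covariance and its linearly decaying tail spectrum are one illustrative choice, not necessarily the exact spectrum used in the experiments.

```python
import numpy as np

def make_covariance(d, gap, lam1=1.0, rng=None):
    """Random covariance with top eigenvalue lam1 and eigengap lam1 - lam2 = gap;
    the remaining eigenvalues decay linearly below lam2 (an arbitrary choice)."""
    rng = np.random.default_rng() if rng is None else rng
    lam2 = lam1 - gap
    eigvals = np.concatenate(([lam1], np.linspace(lam2, lam2 / 2.0, d - 1)))
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthonormal eigenbasis
    return (Q * eigvals) @ Q.T, Q[:, 0]                # covariance and top eigenvector q*

rng = np.random.default_rng(3)
Sigma, q_star = make_covariance(5, 0.2, rng=rng)
X = rng.multivariate_normal(np.zeros(5), Sigma, size=1000)   # a slice of the sample stream
```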
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.1">Impact of mini-batch size on the performance of DM-Krasulina</head><p>For a covariance matrix Σ ∈ R^{5×5} with λ_1 = 1 and eigengap λ_1 - λ_2 = 0.2, we generate T = 10^6 samples from the N(0, Σ) distribution. The first set of experiments here deals with the resourceful regime, i.e., N ≥ ⌊R_c⌋R_s/(R_p(⌊R_c⌋ - R_s)), with mini-batches of sizes B ∈ {1, 10, 100, 500, 1000, 2000}. A second set of experiments deals with the resource-constrained regime in which µ &gt; 0 samples are discarded per iteration; the corresponding figure shows that the error Ψ_{T/(B+µ)} for µ = 10 is comparable to that for µ = 0, but the error for µ = 200 is an order of magnitude worse than the nominal error.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.2">Impact of the eigengap on the performance of DM-Krasulina</head><p>For this set of experiments, we again generate data in R^5 from a normal distribution N(0, Σ), where the covariance matrix Σ has the largest eigenvalue λ_1 = 1. We then vary the remaining eigenvalues to ensure an eigengap that takes values from the set {0.1, 0.2, 0.3, 0.4, 0.5}. The corresponding values of c that give the best convergence rate for each eigengap satisfy c ∈ {180, 110, 90, 70, 60}. The final results for these experiments are plotted in Figure <ref type="figure">3</ref>(a) for the case of B = 1000 and µ = 0. These results establish that the final error after observing T = 10^6 data samples is indeed on the order of O(1/(λ_1 - λ_2)^2), as suggested by the theoretical analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.3">Impact of dimensionality on the performance of DM-Krasulina</head><p>For this set of experiments, we generate data in R^d from a normal distribution N(0, Σ) whose dimensionality is varied such that d ∈ {5, 10, 15, 20}. In addition, we fix the largest eigenvalue of Σ to be λ_1 = 1 and its eigengap to be 0.2. The values of c corresponding to each value of d that provide the best convergence rate in these experiments satisfy c ∈ {110, 110, 100, 100}; contrary to our theoretical analysis, this seems to suggest that the optimal step-size sequence does not have a strong dependence on d, at least for small values of d. We also plot the potential function for each d as a function of the number of received samples in Figure <ref type="figure">3</ref>(b), which similarly shows little dependence of the performance of DM-Krasulina on d. Both these observations suggest that our theoretical analysis is not tight in terms of its dependence on the dimensionality d of the streaming data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.4">Impact of upper bound on the performance of DM-Krasulina</head><p>In order to understand the impact of the upper bound ∥x′_t∥_2 ≤ r on the convergence behavior of DM-Krasulina, we generate x′_t ∈ R^5 as x′_t = Cu′_t, with u′_t ∈ R^5 having independent entries drawn from the uniform distribution U(-a, a) and C chosen to ensure an eigengap of 0.2 for the covariance matrix. As we vary the value of a within the set {1, 2, 3, 10}, we generate four different datasets of T = 10^6 samples for which the resulting r ∈ {1.45, 2.9, 4.5, 14.5}. The values of c that provide the best convergence for these values of r satisfy c ∈ {8, 2, 1, 0.08}. The final set of results is displayed in Figure <ref type="figure">4</ref> for B = 1 and µ = 0. It can be seen from this figure that changing r does not affect the convergence behavior of DM-Krasulina. This behavior can be explained by noticing that the parameter r appears in our convergence results only through the lower bound on L (cf. (<ref type="formula">13</ref>)) and within the non-dominant term in the error bound. The dependence on L is reflected here in our choice of the step-size parameter c that results in the best convergence. In addition, we hypothesize that the non-dominant error term in our experiments is so small compared to the dominant one that it masks the dependence of the final error on r.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Experiments on Real-world Datasets</head><p>In this section, we evaluate the performance of DM-Krasulina on two real-world datasets, namely, the MNIST dataset <ref type="bibr">(LeCun, 1998)</ref> and the Higgs dataset <ref type="bibr">(Baldi et al., 2014)</ref>. The MNIST dataset corresponds to d = 784 and has a total of T = 6 × 10^4 samples, while the Higgs dataset is d = 28 dimensional and comprises 1.1 × 10^7 samples. It is worth noting here that since it is straightforward to store all the samples in these datasets at a single machine, one can always solve the 1-PCA problem for these datasets without resorting to a distributed streaming framework. Nonetheless, it is still possible to use these datasets in a simulated distributed streaming setting in order to highlight the agreement between the scaling behavior predicted by our theoretical results and the scaling behavior observed on real-world data; this is indeed the purpose of the following sets of experiments.</p><p>Our first set of experiments is for the MNIST dataset, in which we use the step size γ_t = c/t with c ∈ {0.6, 0.9, 1.1, 1.5, 1.6} for network-wide mini-batch sizes B ∈ {1, 10, 100, 300, 1000} in the resourceful regime (µ = 0). The results, which are averaged over 200 random initializations and random shufflings of the data, are given in Figure <ref type="figure">5</ref>(a). It can be seen from this figure that the final error stays relatively unchanged as B increases from 1 to 100, but it starts getting affected significantly as the network-wide mini-batch size is further increased to B = 300 and B = 1000. Our second set of experiments for the MNIST dataset corresponds to the resource-constrained regime with (N, B) = (10, 100) and step-size parameter c ∈ {0.6, 0.9, 1.1, 1.5, 1.6} for the number of discarded samples µ ∈ {0, 10, 20, 40, 100}. 
The results, averaged over 200 trials and given in Figure <ref type="figure">5</ref>(b), show that the system can tolerate the loss of some data samples per iteration without a significant increase in the final error; the increase in error, however, becomes noticeable as µ approaches B. Both these observations are in line with the insights of our theoretical analysis.</p><p>We now turn our attention to the Higgs dataset. Our results for this dataset, averaged over 200 trials and using c = 0.07, for the resourceful and resource-constrained settings are given in Figure <ref type="figure">6</ref>(a) and Figure <ref type="figure">6</ref>(b), respectively. In the former setting, corresponding to B ∈ {1, 10^2, 10^3, 10^4, 2 × 10^4}, we once again see that the error stays relatively unchanged for values of B that are significantly smaller than T; in particular, since T for the Higgs dataset is larger than for the MNIST dataset, it can accommodate a larger value of B without significant loss in performance. In the latter resource-constrained setting, corresponding to N = 10, B = 1000, and µ ∈ {0, 10, 100, 1000, 2000}, we similarly observe that small (relative to B) values of µ do not significantly impact the performance of DM-Krasulina. Once again, these results corroborate our theoretical findings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusion</head><p>In this paper, we studied the problem of estimating the principal eigenvector of a covariance matrix from independent and identically distributed data samples. Our particular focus was on developing and analyzing two variants, termed D-Krasulina and DM-Krasulina, of a classical stochastic algorithm that can estimate the top eigenvector in a near-optimal fashion from fast streaming data that overwhelms the processing capabilities of a single processor. Unlike the classical algorithm, which must discard data samples in high-rate streaming settings and thus sacrifice the convergence rate, the proposed algorithms manage the high-rate streaming data by trading off the processing capability of a single node against additional computational resources and communications infrastructure. Specifically, both D-Krasulina and DM-Krasulina virtually slow down the rate of streaming data by spreading the processing of data samples across a network of processing nodes. In addition, DM-Krasulina can overcome slower communication links and/or the lack of a sufficient number of processing nodes through a network-wide mini-batching strategy, coupled with the discarding of a small number of data samples per iteration.</p><p>Our theoretical analysis, which fundamentally required a characterization of the error incurred by the proposed algorithms as a function of the variance of the sample covariance matrix, substantially improved the variance-agnostic analysis in <ref type="bibr">(Balsubramani et al., 2013)</ref> and established the conditions under which a near-optimal convergence rate is achievable in the fast streaming setting, even when some data samples need to be discarded due to a lack of sufficient computational and/or communication resources. 
We also carried out numerical experiments on both synthetic and real-world data to validate our theoretical findings.</p><p>In terms of future work, extension of our algorithmic and analytical framework for estimation of the principal subspace comprising multiple eigenvectors remains an open problem. In addition, tightening our theoretical analysis to better elucidate the role of dimensionality of data in the performance of the proposed algorithmic framework is an interesting problem. Finally, investigation of additional practical issues (e.g., processor failures, variable compute costs, and network coordination costs) concerning processing of data in large-scale systems provides another avenue for future research.</p></div>
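To make the update rule concrete, the following is a minimal NumPy sketch of a serial-equivalent mini-batch Krasulina step of the kind that underlies D-Krasulina and DM-Krasulina; the function name, toy covariance, batch size, and step-size schedule are all illustrative choices of ours, not the tuned settings from the paper:

```python
import numpy as np

def krasulina_minibatch_step(v, X, gamma):
    """One serial-equivalent mini-batch Krasulina update.

    v     : current estimate of the principal eigenvector, shape (d,)
    X     : mini-batch of B samples, shape (B, d)
    gamma : step size for this iteration
    """
    A = X.T @ X / X.shape[0]          # mini-batch sample covariance
    Av = A @ v
    rayleigh = (v @ Av) / (v @ v)     # Rayleigh quotient of v w.r.t. A
    # Krasulina step: move along A v minus the component that merely rescales v;
    # the update direction is exactly orthogonal to the current iterate v.
    return v + gamma * (Av - rayleigh * v)

# Toy usage: recover the top eigenvector of a diagonal covariance matrix.
rng = np.random.default_rng(0)
d = 5
Sigma = np.diag([4.0, 1.0, 0.5, 0.25, 0.1])   # top eigenvector is e_1
v = rng.standard_normal(d)
for t in range(1, 2001):
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=8)
    v = krasulina_minibatch_step(v, X, gamma=1.0 / (t + 10))
v /= np.linalg.norm(v)
```

Distributing the computation of `X.T @ X @ v` across N nodes and combining the partial products is what allows the distributed variants to keep up with a high streaming rate.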
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Appendix A Proofs of Lemmas for the Initial Epoch</head><p>A.1 Proof of Lemma 1</p><p>In order to prove Lemma 1, we first need the following result. Lemma 9. The second moment of the update vector &#958; t in D-Krasulina is upper bounded as</p><p>Proof. We start by writing E &#8741;&#958; t -E{&#958; t }&#8741; 2 2 in terms of E &#8741;&#958; t &#8741; 2 2 as follows:</p><p>Rearranging the above equation, we get</p><p>Next, substituting the value of &#958; t from (8), we get</p><p>Since &#931; is a positive semi-definite matrix, we can write its eigenvalue decomposition as &#931; = &#8721; d i=1 &#955; i q i q i T , where &#955; 1 &gt; &#955; 2 &#8805; &#8226; &#8226; &#8226; &#8805; &#955; d &#8805; 0 and q 1 (&#8801; q * ), q 2 , . . . , q d are the eigenvalues and corresponding eigenvectors of &#931;, respectively. It follows that</p><p>Finally, we get from the definition of &#936; t-1 that</p><p>This completes the proof of the lemma.</p><p>Using Lemma 9, we can now prove Lemma 1.</p><p>Proof of Lemma 1. From (10), we have</p><p>Substituting v t from (8), we get</p><p>Here, (a) and (b) are due to <ref type="bibr">(Balsubramani et al., 2013, Lemma A.1)</ref>, where (a) is true because v t-1 is perpendicular to &#958; t and (b) is true because &#8741;v t-1 &#8741; 2 &#8804; &#8741;v t &#8741; 2 . The second term in the above inequality can be bounded as</p><p>where (c) is due to Lemma 9. Substituting (34) into (33) completes the proof of Part (i) of Lemma 1. Next, we prove Part (ii) of the lemma by defining the normalized iterate v&#772; t-1 := v t-1 /&#8741;v t-1 &#8741; 2 and noting that</p><p>Here, (d) follows from the Cauchy-Schwarz inequality and the last inequality is due to Assumption [A1]. Now substituting this into (33) completes the proof.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2 Proof of Lemma 2</head><p>We begin by writing</p><p>Substituting the value of &#958; t into this, we get</p><p>where the last inequality in (36) is due to the fact that (q * T v t-1 /&#8741;v t-1 &#8741; 2 ) 2 &#8804; 1. We can see that both of the remaining terms in (36) are Rayleigh quotients of the matrix (&#931; -(1/N ) &#8721; N i=1 A i,t ) and hence the largest eigenvalue of (&#931; -(1/N ) &#8721; N i=1 A i,t ) maximizes both terms. Using this fact, we get</p><p>Using (12), we get E{(z t -E{z t }) 2 |F t-1 } &#8804; 16&#947; t 2 &#963; 2 /N , which completes the proof.</p><p>Note that the second inequality in ( <ref type="formula">41</ref>) is due to <ref type="bibr">(Balsubramani et al., 2013, Lemma 2.8)</ref>.</p><p>Here, the last inequality is true because &#936; t j (&#969;) &#8804; 1 -&#1013; j for &#969; &#8712; &#8486; &#8242; t j +1 and 1 -x &#8804; e^(-x) for x &#8804; 1. Next, we bound the summations in (42) as follows:</p><p>Substituting these bounds into (42), we get the desired result.</p></div>
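The proof of Lemma 2 above invokes the standard fact that the Rayleigh quotient of a symmetric matrix is bounded by, and attained at, its largest eigenvalue. A quick numerical sanity check of this fact; the random symmetric matrix below is a stand-in of our own, not the paper's error matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
# Random symmetric matrix standing in for the (symmetric) error matrix
# Sigma - (1/N) sum_i A_{i,t} appearing in the proof.
M = rng.standard_normal((6, 6))
M = (M + M.T) / 2
eigvals, eigvecs = np.linalg.eigh(M)      # eigenvalues in ascending order
top_val, top_vec = eigvals[-1], eigvecs[:, -1]

def rayleigh(M, v):
    """Rayleigh quotient v^T M v / v^T v."""
    return (v @ M @ v) / (v @ v)

# The Rayleigh quotient of any vector never exceeds the largest eigenvalue ...
max_q = max(rayleigh(M, rng.standard_normal(6)) for _ in range(1000))
# ... and is attained at the corresponding eigenvector.
attained = rayleigh(M, top_vec)
```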
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.3 Proof of Lemma 6</head><p>This lemma uses Lemma 5 and deals with the specific value t = t j+1 . For t = t j+1 , (29) gives</p><p>Using conditions [C1] and [C2] and the fact that e^(-2x) &#8804; 1 -x for 0 &#8804; x &#8804; 3/4, we get</p><p>(1 -&#1013; j ) ((t j + L + 1)/(t j+1 + L + 1))^(c 0 &#1013; j ) &#8804; e^(-&#1013; j ) (e^(-5/c 0 ))^(c 0 &#1013; j ) = e^(-6&#1013; j ) &#8804; 1 -3&#1013; j &#8804; 1 -&#1013; j+1 -&#1013; j .</p><p>Substituting this into (43), we obtain the desired result.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.4 Proof of Lemma 7</head><p>Constructing a supermartingale sequence M t in the same way as we did in Theorem 3 for s &#8712; S and applying Doob's martingale inequality, we get</p><p>Using Lemma 6 then results in</p><p>Taking expectation conditioned on F t-1 , we get</p><p>where the second term is due to Lemma 9. Now using the upper bound on -E{z t |F t-1 } from <ref type="bibr">(Balsubramani et al., 2013, Lemma A.4)</ref>, we get the following:</p><p>Finally, taking expectation over &#8486; &#8242; t , substituting &#947; t = c 0 /(2(t + L)(&#955; 1 -&#955; 2 )), and using the facts that &#8486; &#8242; t is F t-1 -measurable and that, for t &gt; t J , &#936; t-1 &#8804; 1/2 and we lie in the sample space &#8486; &#8242; t with probability greater than 1 -&#948; (Theorem 3), we obtain</p><p>This completes the proof of the lemma.</p><p>Proposition 1. Let a 1 , b &gt; 0 and a 2 &gt; 1 be some constants. Consider a nonnegative sequence (u t : t &gt; t J ) that satisfies</p><p>Then we have:</p><p>Proof. Recursive application of the bound on u t gives:</p><p>Using <ref type="bibr">(Balsubramani et al., 2013, Lemma D.1)</ref>, we can bound the product terms as (i + L + 1)^(a 2 -2) .</p><p>Again applying <ref type="bibr">(Balsubramani et al., 2013, Lemma D.1)</ref>, we get the final result as follows:</p><p>This completes the proof of the proposition.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Appendix D Other Auxiliary Results</head><p>Proposition 2 (Bennett's Inequality <ref type="bibr">(Boucheron et al., 2013)</ref>). Consider a zero-mean, bounded random variable X i &#8712; R (i.e., |X i | &#8804; h almost surely) with variance &#963; i 2 . Then for any s &#8712; R, we have E e^(sX i ) &#8804; exp( &#963; i 2 s 2 (e^(sh) -1 -sh)/(sh) 2 ).</p><p>Proposition 3. Let h := 2&#947; t r 2 and s &#8712; {d/(4&#1013;), (2/&#1013; 0 ) ln(4/&#948;)}. It then follows that (e^(sh) -1 -sh)/(sh) 2 &#8804; 1.</p><p>Proof. It is straightforward to see that (e^(sh) -1 -sh)/(sh) 2 &#8804; 1 as long as sh &#8804; 7/4. Therefore, in order to prove this proposition, it suffices to show that the lower bound on L implies sh &#8804; 7/4 for s &#8712; {d/(4&#1013;), (2/&#1013; 0 ) ln(4/&#948;)}. We establish this claim as two separate cases for the two values of s. Proof. We prove this by proving the following two statements:</p></div>
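Proposition 2's moment-generating-function bound can be sanity-checked numerically. The symmetric two-point distribution below is our own toy example of a zero-mean variable bounded by h, not one taken from the paper:

```python
import math

# X takes values +h and -h with probability 1/2 each:
# zero mean, |X| <= h almost surely, variance sigma_i^2 = h^2.
h, s = 0.5, 1.3
sh = s * h
# E[e^{sX}] = (e^{sh} + e^{-sh}) / 2 = cosh(sh), computed exactly.
mgf = math.cosh(sh)
# Bennett's bound: exp( sigma_i^2 s^2 (e^{sh} - 1 - sh) / (sh)^2 ).
bennett_bound = math.exp(h**2 * s**2 * (math.exp(sh) - 1 - sh) / sh**2)
```

Since sigma_i^2 s^2 = (sh)^2 here, the bound simplifies to exp(e^(sh) - 1 - sh), and the exact MGF indeed sits below it.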
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Case</head><p>We start by proving the first statement: Proof. We begin by noting that</p><p>Next, we prove the second statement as follows: Proof. The proof follows by first using Markov's inequality to bound the conditional probability</p><p>then using the bound on E t {&#936; t } in Theorem 1 to further bound the conditional probability by &#948;, and finally removing the conditioning on the nested sample spaces by using the facts that (i) P(A) &#8804; P(A|B) + P(B c ) for any two events A and B, and (ii)</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>In contrast, the iterate of Oja's rule is given by v t = v t-1 + &#947; t (x t x t T v t-1 -(v t-1 T x t x t T v t-1 ) v t-1 ).</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1"><p>Note that the parameter Rp, among other factors, is a function of the problem dimension d; this dependence is being suppressed in the notation for ease of exposition.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2"><p>As an illustrative toy example, let us consider a scenario for a fixed d in which Rs = 10^3 sec^-1, Rp = 50 sec^-1, and the parameter Rc(N ) scales as Rc(N ) = C R N sec^-1 for a constant C R that depends on the message passing implementation, network topology, and inter-node communications bandwidth. In a "slow" network, corresponding to C R &#8804; 10^3 , there exists no N that satisfies the condition (7). In a "faster" network, corresponding to C R &gt; 10^3 N , a necessary condition on the number of processors is N &lt; 10^-3 C R . But while this necessary condition is satisfied for 1 &#8804; N &lt; 80 for C R as high as 8 &#215; 10^4 , there still does not exist any N that satisfies (7). In the case of even faster networks, however, we are able to find feasible values of N for running D-Krasulina without discarding any samples; e.g., any 28 &#8804; N &#8804; 72 satisfies the condition (7) for C R = 10^5 .</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4"><p>Here, and in the following, the implicit assumption is that the quantity Rc -Rs (resp., bRc -Rs) is strictly positive for D-Krasulina (resp., DM-Krasulina).</p></note>
		</body>
		</text>
</TEI>
