<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Concentration inequalities for sums of Markov-dependent random matrices</title></titleStmt>
			<publicationStmt>
				<publisher>Journal of the IMA</publisher>
				<date>09/20/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10663163</idno>
					<idno type="doi">10.1093/imaiai/iaae032</idno>
					<title level='j'>Information and Inference: A Journal of the IMA</title>
<idno>2049-8772</idno>
<biblScope unit="volume">13</biblScope>
<biblScope unit="issue">4</biblScope>					

					<author>Joe Neeman</author><author>Bobby Shi</author><author>Rachel Ward</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[<title>Abstract</title> <p>We give Hoeffding- and Bernstein-type concentration inequalities for the largest eigenvalue of sums of random matrices arising from a Markov chain. We consider time-dependent matrix-valued functions on a general state space, generalizing previous results that had only considered Hoeffding-type inequalities, and only for time-independent functions on a finite state space. In particular, we study a kind of non-commutative moment generating function, provide tight bounds on this object and use a method of Garg etal. (A matrix expander Chernoff bound. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, page 1102–1114, New York, NY, USA, 2018. Association for Computing Machinery) to turn this into tail bounds. Our proof proceeds spectrally, bounding the norm of a certain perturbed operator. In the process we make an interesting connection to dynamical systems and Banach space theory to prove a crucial result on the limiting behaviour of our moment generating function that may be of independent interest.</p>]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The study of concentration inequalities has now become textbook material, with a variety of applications <ref type="bibr">[6]</ref>. Two of the most widely used and studied inequalities for scalar-valued random variables in the independent setting are Hoeffding's inequality <ref type="bibr">[28]</ref>, which operates under a boundedness assumption, and Bernstein's inequality <ref type="bibr">[5]</ref>, which operates under an additional bounded variance assumption.</p><p>With the wide usage of these inequalities, various generalizations have been made. One line of work has sought to relax the independence assumption, deriving concentration inequalities for sums of Markov-dependent random variables. Starting with the result of Gillman <ref type="bibr">[19]</ref>, improvements and refinements have been made to address the general setting of time-independent functions of a nonreversible Markov chain on a general state space <ref type="bibr">[15,</ref><ref type="bibr">26,</ref><ref type="bibr">33,</ref><ref type="bibr">41,</ref><ref type="bibr">42,</ref><ref type="bibr">43,</ref><ref type="bibr">54]</ref>, resulting in Hoeffding and Bernstein-type concentration inequalities that generalize nicely the independent setting; these results are largely spectral in nature.</p><p>Another line of work has sought to generalize the scalar setting to concentration inequalities of sums of independent random matrices, beginning with the work of Ahlswede and Winter <ref type="bibr">[2]</ref>. Several authors have followed the Ahlswede-Winter approach to develop analogs of Hoeffding's and Bernstein's inequalities for sums of random matrices; some works include <ref type="bibr">[8,</ref><ref type="bibr">24,</ref><ref type="bibr">52,</ref><ref type="bibr">53,</ref><ref type="bibr">57]</ref>. However, results using this framework often have variance proxies that are suboptimal. To this end, <ref type="bibr">[63,</ref><ref type="bibr">64]</ref> developed techniques to circumvent this barrier, allowing for concentration inequalities that are much tighter in many cases. These results are powerful and easy to use <ref type="bibr">[65,</ref><ref type="bibr">66]</ref>, with various applications <ref type="bibr">[13,</ref><ref type="bibr">45,</ref><ref type="bibr">47]</ref>.</p><p>We combine the above two lines of study in our work to develop concentration inequalities -Hoeffding and Bernstein-type -for the largest eigenvalue of sums of Markov dependent Hermitian random matrices, by combining the Ahlswede-Winter style argument for random matrices with the spectral techniques used to study Markov dependent scalar-valued random variables. The starting point is a deep result of Garg et al. <ref type="bibr">[16]</ref>, which provides a multi-matrix Golden-Thompson inequality that supports a spectral approach. Our results are broad and general: we provide inequalities for timedependent matrix-valued functions of a nonreversible Markov chain on general (continuous) state spaces. In this way, our Hoeffding-type bounds vastly generalize previous work and provide sharper constants, and our Bernstein-type bounds are new in the literature (Section 6). Along the way we prove a novel Perron-Frobenius type limit lemma for the matrix setting (Lemmas 5.4, 5.5), necessitated by our assumptions of nonversibility of the chain on continuous state spaces, that may be of independent interest.</p><p>Thus, this work can be seen as a first systematic study of concentration inequalities in the Markovdependent matrix setting. Each of our results involve two bounds: one bounding a noncommutative moment generating function of the form</p><p>and one bounding the tail probabilities</p><p>for Hermitian random matrices X 1 ,... ,X n arising as functions of a Markov chain. We expect our results on the former to be optimal, directly generalizing the scalar setting. These results are readily applicable to a wide variety of settings. For example, the best known guarantees for offline principal component analysis (PCA) are derived from a combination of the standard matrix Bernstein's inequality along with Wedin's perturbation theorem <ref type="bibr">[32,</ref><ref type="bibr">38,</ref><ref type="bibr">69]</ref>. Now, if the sample matrices are not independent, but instead arise from a Markov chain, we again have a bound for leading eigenvector estimation using our bounds. Our results are also directly applicable to sums of matrix-valued random variables sampled from a random walk on an expander; Theorem 2.5 improves on the best known results in this setting <ref type="bibr">[16,</ref><ref type="bibr">70]</ref>. There is also a line of work in machine learning that realizes stochastic gradient descent (SGD) with constant step size as a Markov chain with a stationary distribution <ref type="bibr">[4,</ref><ref type="bibr">11,</ref><ref type="bibr">48]</ref>. Additionally, Hessian information has been used widely to give improved algorithms and better understand model performance <ref type="bibr">[18,</ref><ref type="bibr">44]</ref>. Our results are able to give precise bounds on the convergence of the largest eigenvalue of the empirical mean of the Hessian matrices along the path of SGD to the largest eigenvalue of the expected Hessian with respect to the stationary measure. This has many interesting and relevant applications in statistical learning and generalization. More broadly, our bounds are useful in the study of online algorithms, where matrix-valued data is received and processed in a streaming fashion from an underlying Markov chain. In Section 7 we more closely focus on an application of our main results to offline PCA for samples arising as a function of an underlying Markov chain; we present a bound that is a direct generalization of the best known bound for offline PCA in the i.i.d. setting. This result was first stated in <ref type="bibr">[38]</ref>, where to our knowledge it was the first in the literature to give this sort of bound for offline Markov PCA ; it directly uses our Theorem 2.6.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1.">Related work</head><p>In terms of results, the literature on concentration inequalities for sums of Markov-dependent random matrices is fairly small, in contrast to the scalar setting. Recent progress in this area builds on the work of <ref type="bibr">[16,</ref><ref type="bibr">60]</ref>, which develop a powerful Golden-Thompson inequality <ref type="bibr">[21,</ref><ref type="bibr">62]</ref> that allows for a spectral analysis of a useful moment generating function; the latter provides a corresponding Hoeffding's inequality for when the stationary distribution is uniform and the chain is stationary. The work of <ref type="bibr">[56]</ref> extends this to arbitrary stationary distributions and initial distributions; these works are valid only for finite state spaces and reversible chains. Our work generalizes these results to general state spaces and nonreversible chains using significantly different techniques, which allows us to also improve the constants in the bounds. Furthermore, our Hoeffding-type result is in terms of both the largest and smallest eigenvalues, matching the type given in the scalar setting; and we give a Bernstein's inequality, which is new in the literature.</p><p>In terms of techniques, most related to ours are the works of <ref type="bibr">[15,</ref><ref type="bibr">26,</ref><ref type="bibr">33,</ref><ref type="bibr">41,</ref><ref type="bibr">42,</ref><ref type="bibr">49,</ref><ref type="bibr">54]</ref>, which give Hoeffding's and Bernstein's inequalities for sums of Markov-dependent scalar random variables. Broadly speaking, the idea is to bound the largest norm of a perturbed operator. We prove the corresponding spectral properties of a Markov operator that is the tensored form of the operator appearing in these works in Section 5; in particular, we generalize to continuous state spaces using related functional analytic ideas. However, previous techniques are not sufficient for us to prove our main results; our proof of the result on the limit of the moment generating function (Lemma 5.4) takes a detour through dynamical systems theory and Banach space theory.</p><p>In contrast to the Markov-dependent setting, the sums of independent random matrices have been studied in depth, and is the model for which we give our bounds. In particular, our results mirror the type given in <ref type="bibr">[2,</ref><ref type="bibr">8,</ref><ref type="bibr">24]</ref> in terms of variance proxy, and the analysis proceeds from the same starting point: bounding a matrix moment generating function; in contrast, later works of <ref type="bibr">[52,</ref><ref type="bibr">63]</ref> sharpen the variance proxy. Our analysis of the moment generating function arising from the matrix-valued Golden-Thompson inequality result in bounds resembling more of the former; we believe our analysis is likely sharp given the form of the moment generating function, so an improvement in terms of the variance proxy would likely require a different approach and is an interesting open problem.</p><p>Besides the spectral approach, there have been a myriad of ways to derive concentration inequalities relaxing the independence assumption, both in the scalar and matrix settings. The papers of <ref type="bibr">[1,</ref><ref type="bibr">20]</ref> exploit regeneration-type minorization conditions to derive exponential concentration for ergodic sums; though these works do not assume a nonzero spectral gap they often have less explicit constants. The work of <ref type="bibr">[54]</ref> uses Marton couplings to obtain concentration inequalities for a scalar-valued function of a single random sample. The papers of <ref type="bibr">[46,</ref><ref type="bibr">55]</ref> leverage Efron-Stein inequalities and exchangeability to develop concentration inequalities for random matrices, also relaxing the independence assumption. Similarly, recent works of <ref type="bibr">[3,</ref><ref type="bibr">30,</ref><ref type="bibr">36]</ref> demonstrate that matrix functional inequalities, i.e., Poincar&#233;, directly translate to matrix concentration inequalities; <ref type="bibr">[36]</ref> develops a Bernstein's inequality for strongly Rayleigh distributions; <ref type="bibr">[39]</ref> gives a Chernoff bound for a similar setting. These largely address single sample concentration of (Lipschitz) functionals; however, some methods can be generalized to product measures <ref type="bibr">[31]</ref>, recovering some of the Efron-Stein results. And the work of <ref type="bibr">[16]</ref> sketches a method to reduce studying concentration of random variables sampled from a Markov chain to concentration of sums of martingale random variables <ref type="bibr">[9,</ref><ref type="bibr">63]</ref>; though the resulting bounds are suboptimal, they show that qualitatively concentration for Markov chains is more generic than just the scalar or matrix settings (see <ref type="bibr">[40]</ref>). These techniques may be very useful in improving the variance proxy, better matching the independent setting, where a direct spectral analysis may prove insufficient; in particular, functional inequalities have emerged as a powerful general tool to derive concentration. It may be interesting to see if this can be demonstrated in the scalar setting as well.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Main Results</head><p>In this section we give our main results. Each will hold under a combination of the following assumptions.</p><p>Assumption 2.1 P is a discrete-time Markov chain on a continuous state space X with stationary distribution &#181; and absolute spectral gap &#955; (Definition 3.1). s 1 ,... ,s n is a sequence of states driven by P with initial distribution &#181;. Assumption 2.2 F j : X &#8594; R d&#215;d , j = 1,... ,n, is a sequence of functions each mapping to real symmetric d &#215; d matrices. Each F j is &#8467; 2 (&#181; &#8855; 1) measurable (see Section 3). Assumption 2.3 For all j, E &#181; [F j (x)] = 0 and a j I F j (x) b j I for all x &#8712; X . Assumption 2.4 For all j, E &#181; [F j (x)] = 0 and there exists a constant V j such that E &#181; F j (x) 2 &#8771; V j for all x &#8712; X . Moreover, there exists an absolute constant M such that F j (x) &#8771; M for all j, x &#8712; X .</p><p>The following two theorems each consist of two statements. The first bounds a certain type of moment generating function arising from the expected Frobenius norm of a product of matrix exponentials. The second statement turns this into a tail bound. We only give upper tail bounds; the corresponding statement for lower tail bounds are clear and the proof proceeds in almost the exact same way.</p><p>Theorem 2.5 (Markov Matrix Hoeffding, Real) Instate Assumptions 2.1, 2.2, 2.3. Then for any &#952; &gt; 0,</p><p>and for any t &gt; 0,</p><p>Here &#945;(&#955; ) = (1 + &#955; )/(1 &#8656; &#955; ), where &#955; is the absolute spectral gap (Definition 3.1).</p><p>The first statement in the above result reveals the sub-Gaussian nature of the moment generating function; in the scalar case, it exactly reduces to the standard moment generating function. Using our methods, this bound matches the scalar case given in <ref type="bibr">[15]</ref>. The second statement provides a large deviation tail bound; the bound on the lower eigenvalue follows analogously. Our results offer four main improvements to that of <ref type="bibr">[56]</ref>: first, we give strictly better constants, both in terms of d and the absolute constants in the exponent (the &#945;(&#955; ) differs slightly from their equivalent term for the mixing of the Markov chain, but is strictly better and is more classical); second, we allow for time-dependent functions F j , which is a strict generalization; third, our inequality is in terms of both a lower and upper eigenvalue bound (the a j , b j parameters), giving a result that is directly comparable to the classic Hoeffding's lemma; and fourth, our bounds apply to general state spaces.</p><p>Furthermore, if &#952; &lt; (1 &#8656; &#955; )/(8M /&#960;), then for any t &gt; 0,</p><p>Here</p><p>where &#955; is the absolute spectral gap (Definition 3.1).</p><p>This is to our knowledge the first result giving a Bernstein-type inequality for sums of random matrices arising from markov chains. The first statement in the above result is made up of two terms; the (e M &#952; &#8656; M &#952; &#8656; 1)&#963; 2 /M 2 term coincides with the convex function that makes up the classic Bernstein's and Bennett's inequality. The second term in the exponent reflects the influence of the Markov chain; compared to previous scalar results <ref type="bibr">[33,</ref><ref type="bibr">54]</ref>, our techniques are able to recover a slightly improved version of this second term in terms of absolute constants. A linear algebraic perspective of the first statement above is that the first term bounds how much the product of matrix exponentials increases the magnitude of vectors in the direction of 1, which is the leading eigenvector of the operator P. The second term is a bound on how much the product of matrix exponentials increases the magnitude of orthogonal directions, hence involving &#955; .</p><p>We now give two corollaries of the above results, generalizing the assumptions above; the proofs are in the appendix and are classical. Our first corollary generalizes to complex Hermitian matrices via a complexification technique <ref type="bibr">[12]</ref>: Corollary 2.7 (Extension to Complex Matrices) Instate Assumption 2.1. Now assume that F j : X &#8594; C d&#215;d , j = 1,... ,n, is a sequence of functions each mapping to complex Hermitian d &#215; d matrices.</p><p>Under Assumption 2.3 (resp. Assumption 2.4), the conclusion of Theorem 2.5 (resp. Theorem 2.6) holds with an extra multiplicative factor of 2 on the right-hand side.</p><p>Our second corollary generalizes to the case that the Markov chain starts at a distribution &#957;: Corollary 2.8 (Extension to Nonstationary Chains) Let P be a Markov chain on general state space X with stationary distribution &#181; and absolute spectral gap &#955; . Let s 1 ,... ,s n be a sequence of states driven by P with initial distribution &#957;, where &#957; &#8658; &#181;. Instate Assumption 2.2. Let ess sup d&#957; d&#181; be the essential supremum of the Radon-Nikodym derivative d&#957; d&#181; . Under Assumption 2.3 (resp. Assumption 2.4), the conclusion of Theorem 2.5 (resp. Theorem 2.6) holds with an extra multiplicative factor of ess sup d&#957; d&#181; on the right-hand side.</p><p>Remark 2.9 In many applications these bounds are used to determine how many samples are needed (i.e., how long the chain has to run) in order for the tail probability to be less than some fixed value; in these cases the number of samples will depend only logarithmically on ess sup d&#957; d&#181; . Alternatively, the probability of an event under the Markov chain driven by P initialized at &#957; can be bounded by the probability of an event under the Markov chain driven by P initialized at &#181; up to an extra multiplicative factor of 1 + &#967; 2 (&#957; &#181;); see Lemma 29 of <ref type="bibr">[37]</ref>.</p><p>We now give some remarks on the above theorems. First, for the first statements regarding the moment generating function in both Theorems 2.5 and 2.6, the variance proxies are asymptotically optimal for a class of Markov chains. Let &#181; be a distribution on X and let P be a transition kernel such that P(x, y) = &#955; x=y + (1 &#8656; &#955; )&#181;(y), so that &#181; is the stationary distribution. Let f : X &#8594; R be any scalar-valued function, and let F : X &#8594; R d&#215;d be defined as F(x) = f (x)I, so that these are all real symmetric. It is straightforward that</p><p>since all F(x) commute. Under this setup, start with our Bernstein assumptions. Let <ref type="bibr">[17]</ref> states that for a two state Markov chain driven by a P of our above form that</p><p>which is called the asymptotic variance. Now suppose that V is such that E &#181; [ f 2 ] &#8771; V ; it is classical that any such variance proxy V that satisfies g(&#952; ) = nV &#952; 2 /2 + o(&#952; 2 ) for a function g that upper bounds the scalar moment generating function via</p><p>upper bounds Var 1 &#8730; n &#8721; n j=1 f (s j ) <ref type="bibr">[33]</ref>. This lower bound is indeed obtained by our Markov kernel P and choice of function f , and therefore the same holds for our matrix-valued function F. Now assume f are Rademacher, i.e., the probability under &#181; that f (x) = 1 is 1/2 and f (x) = &#8656;1 is 1/2. Define F the same way as above, and so we see that E &#181; [F] = 0 and &#8656;I F(x) I for all x &#8712; X . Via Equation 2.1 we again reduce to the scalar case since F commute, and through 2.2 we have that V asy = (1 + &#955; )/(1 &#8656; &#955; ), since the variance of f is 1. Now in this case we have</p><p>as can be seen from our results with d = 1. We see that n&#945;(&#955; ) is the variance proxy for a sub-Gaussian random variable and so naturally upper bounds the asymptotic variance. This lower bound is again attained for P and f , and therefore the same holds for the matrix-valued function F.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Roadmap</head><p>We begin with the quantity</p><p>for &#966; &#8712; [&#8656;&#960;/2, &#960;/2] and &#952; &gt; 0, developed first in <ref type="bibr">[16,</ref><ref type="bibr">60]</ref>. This will play the role of our moment generating function. Using properties of the Kronecker product, in Section 4 we show that this moment generating function is equal to</p><p>where P is the lifted Markov operator and E &#952; /2 j is a certain multiplication operator (Definition 3.4). This quantity is a matrix version of a classical quantity that appears in the analysis of sums of Markovdependent scalar random variables -indeed, when d = 1 these are equivalent. Thus, we can apply a spectral analysis to bound this quantity.</p><p>In Section 5 we proceed with this study; this section addresses the main challenges of extending to continuous state spaces. Using Cauchy-Schwartz we bound the above by</p><p>is a sort of limit for a corresponding moment generating function. More precisely, we show that</p><p>where s 1 ,... ,s n is driven by the Markov chain represented by the Leon-Perron operator P; this result appears in Lemma 5.4 and 5.5. The realness assumption on the F j is important for the proof of this result -we invoke an interesting generalization of the Perron-Frobenius theorem that applies to linear transformations leaving invariant a cone. This result is our main technical innovation and is crucial in extending existing results to the continuous state space setting; conveniently it also allows us to sharpen known bounds in discrete state spaces. Section 6 gives our final bounds. We first show how to transfer bounds on our moment generating function to tail bounds; the technique is straightforward and uses the multi-matrix Golden-Thompson inequality from <ref type="bibr">[16]</ref>. We then give these tight bounds on the moment generating function under both Hoeffding and Bernstein-type assumptions. For Hoeffding-type assumptions we use a coupling technique to exhibit a two-state chain that acts as the "limit" of our chain. The leading eigenvalue of the corresponding operator can be solved exactly, giving optimal bounds. For Bernstein-type assumptions, we use a robust linear algebraic approach to bound the operator norm of a related matrix directly; this approach will also give optimal results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Preliminaries</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Notation</head><p>Lower case, unbolded a denotes scalars or scalar-valued functions, bold a denotes vectors or vectorvalued functions, and upper case A denotes matrices or matrix-valued functions.</p><p>The operator &#8855; will denote the Kronecker product, where for matrices A, B of size a &#215; b, c &#215; d, respectively, </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Markov chains</head><p>Let X be a general state space with &#963; -algebra B := B(X ). Let P be a Markov kernel on X defined, for a sequence of random variables X 1 ,... ,X n , in the standard way as</p><p>with stationary measure &#181; so that</p><p>for h : X &#8594; C measurable. This is a a Hilbert space equipped with the following inner product:</p><p>The norm of a linear operator T is defined in the usual way:</p><p>A transition kernel P acts as an operator on &#8467; 2 (&#181;):</p><p>The projection operator &#928; corresponding to the distribution &#181; is defined as</p><p>(1 is the function such that 1(x) = 1 for all x); this is a rank-1 operator by definition, and is indeed a projection onto</p><p>An important subset of elements of &#8467; 2 (&#181;) will be the class of "mean-zero functions," denoted</p><p>Then we have the following definition:</p><p>Definition 3.1 (Absolute spectral gap) A Markov kernel P with stationary measure &#181; admits an absolute spectral gap 1 &#8656; &#955; (P) if</p><p>Note that &#955; (P) &#8771; 1 always, as Ph &#181; &#8771; h &#181; by Jensen's inequality, with equality for h = 1. When it is clear from context we will just use &#955; = &#955; (P). The absolute spectral gap characterizes the convergence of a Markov chain to its invariant measure <ref type="bibr">[59]</ref>. For reversible finite-state chains, the value &#955; (P) corresponds to the second largest eigenvalue and the existence of the gap corresponds to ergodicity. A Markov chain driven by P is reversible if and only if P is a self-adjoint operator on &#8467; 2 (&#181;).</p><p>We define what is known as a Leon-Perron operator <ref type="bibr">[15,</ref><ref type="bibr">33,</ref><ref type="bibr">41]</ref> that will be important in the sequel: We can interpret Pc as a transition kernel such that at a state, it stays at that state with probability c or samples a new state independently from &#181; with probability 1 &#8656; c. Note that if P admits an absolute spectral gap then P does so as well with the same absolute spectral gap.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Behavior in tensored space</head><p>As we work with matrices, the above will have to be lifted to a tensored space as necessary. We first consider the product space</p><p>with inner product induced from the standard Euclidean product, i.e., a &#8855; c, b &#8855; d = a, b c, d , and extend by linearity. Oftentimes we will simply identify an element as y &#8712; C d 2 ; it is straightforward that the standard Euclidean inner product on this larger space is equivalent to the tensored inner product. So it is no loss to move between one representation to the other. Now define &#8467; 2 (&#181; &#8855; 1) as the following "lift" of &#8467; 2 (&#181;): formally, define</p><p>measurable as a vector-valued function. This is equipped with the inner product</p><p>and corresponding norm f &#181; = &#63729; f, f &#181; . This gives the understanding of &#8467; 2 (&#181; &#8855; 1) as a direct integral of Hilbert spaces indexed by the points of X ; decomposable elements in this space are understood to be of the form</p><p>Norms of linear operators are thus defined with respect to this norm in the usual way.</p><p>For a Markov operator P on &#8467; 2 (&#181;) we can define the operator P := P &#8855; I d 2 as an operator on &#8467; 2 (&#181; &#8855; 1). This operator acts as</p><p>note that ( Ph)(x) is a vector. Whereas for the operator P the leading eigenfunction was 1 and spans the entire eigenspace for the leading eigenvalue of 1 (assuming nonzero spectral gap), for P the leading eigenspace is</p><p>For the projection operator &#928; we can define the lifted version as &#928; := &#928; &#8855; I d 2 , again satisfying &#928; P = P &#928; = &#928;, that performs exactly this operation, i.e., &#928;(</p><p>. This can all be extended by linearity and convergence in &#8467; 2 . Then there is a corresponding lift of "mean-zero functions" as</p><p>and can also be identified as</p><p>Similarly, define</p><p>The following lemma, Lemma 3 from <ref type="bibr">[56]</ref>, will imply that if P admits an absolute spectral gap, then so does P with the same absolute spectral gap. Thus, many of the essential spectral properties of P and P are the same. The proof is deferred to the appendix.</p><p>Lastly, we can define Leon-Perron operators of P as the corresponding lifts of Leon-Perron operators of P, which is tensorization by I d 2 . With some abuse of notation, we will often interchange the notation P to refer to both Leon-Perron operators of P and Leon-Perron operators of P, but the usage will be clear from context.</p><p>Recall for a scalar-valued function g, the multiplication operator M g is defined as (M g h)(x) = g(x)h(x). For a matrix-valued function G we have the following definition: Definition 3.4 (Multiplication operator) Let G be a matrix-valued function mapping X to C d 2 &#215;d 2 matrices. Define M G be the operator on &#8467; 2 (&#181; &#8855; 1) such that for any h &#8712; &#8467; 2 (&#181; &#8855; 1), (M G h)(x) = G(x)h(x). We also call these block diagonal operators.</p><p>In other words, M G is multiplication by a function G in the direct integral of Hilbert spaces <ref type="bibr">[50]</ref>. We will often use the multiplication operator E &#952; H defined as</p><p>, sometimes dropping the subscript H when it is clear from context.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Starting point</head><p>Our starting point is the following matrix moment generating function:</p><p>where &#966; &#8712; [&#8656;&#960;/2, &#960;/2], &#952; &gt; 0, and s 1 ,... ,s n is a sequence of states driven by the Markov chain with transition matrix P.</p><p>To simplify this moment generating function, we first use the property that tr[AB ] = vec(I d ) (A &#8855; B) vec(I d ), where &#8855; is the Kronecker product and I d is the d &#215; d identity matrix to write</p><p>and then use successive applications of the property (AC)</p><p>Lastly, we use the fact that exp(A)</p><p>where H 1 ,... ,H n is a sequence of matrix-valued function on X , implicitly with respect to &#966; , defined as</p><p>for x &#8712; X . Note that unfortunately H j (x) is not in general Hermitian for any x, though its real and imaginary parts are both symmetric. We emphasize the identity</p><p>as we will switch back and forth from these two forms as needed.</p><p>The above allows us to focus on the matrices H j , which are measurable and satisfy many of the same probabilistic properties as F j -a more precise statement is given in the next section. We then write</p><p>Recall that E &#952; j the multiplication operator defined as (E &#952; j h)(x) = exp(&#952; H j (x))h(x) for any vectorvalued function h -see Definition 3.4, and recall that P = P &#8855; I d 2 . E &#952; j and P act via</p><p>and</p><p>as operators. Then simplifying Equation <ref type="formula">4</ref>.5, and using the definition of the inner product on &#8467; 2 (&#181; &#8855; 1), we have</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Bounding the operator norm</head><p>Having derived the above, in this section we give a series of bounds that shift our focus to bounding the norm of a perturbed operator; we address the main technical challenges of studying general continuous state spaces.</p><p>From Equation <ref type="formula">4</ref>.6, we apply Cauchy-Schwartz and submultiplicativity of the spectral norm to obtain</p><p>(5.1)</p><p>Now P possesses many of the same spectral properties as P, most notably the result from Lemma 3.3. Recall that P is the "lifted" Leon-Perron operator for P, defined as P = (&#955;</p><p>The following lemma allows us to replace P with its Leon-Perron version P.</p><p>Lemma 5.1 The operator P and its Leon-Perron version P satisfy the following:</p><p>(1) For any g, h</p><p>(3) For any multiplication operator M G with respect to a measurable matrix valued function G, and for</p><p>Proof For (1), we have</p><p>where the first inequality follows because &#928; P = P &#928; = &#928; and because &#928; is a projection and is thus self-adjoint on &#8467; 2 (&#181; &#8855; 1). Because of this we also have</p><p>For (2), we have</p><p>The result follows by noting that</p><p>where the second to last line follows because I &#8656; &#928; is a projection. We can express these terms more explicitly. First, for the denominator, we have</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Now for any Hermitian matrix A and any vector</head><p>Rearranging finishes the proof.</p><p>Going back to Equation <ref type="formula">5</ref>.1, applying Lemma 5.1 parts ( <ref type="formula">2</ref>) and ( <ref type="formula">3</ref>), and noticing that vec</p><p>Thus, we have transferred the study of the problem from P, which represents a general, nonreversible Markov chain, to P, which represents a reversible chain and thus is much simpler to analyze. Therefore, the operators</p><p>are self-adjoint on &#8467; 2 (&#181; &#8855; 1). Now we focus on bounding the leading eigenvalues of these operators; we make one more simplification:</p><p>. Let E &#952; T j be the operator defined as</p><p>Then the norms of E &#952; /2 j PE &#952; /2 * j and E &#952; /2 T j PE &#952; /2 T j are the same. Proof The operator E &#952; /2 j PE &#952; /2 * j is similar to the operator PE &#952; /2 * j E &#952; /2 j . The operator E &#952; /2 * j E &#952; /2 j acts via</p><p>where the first equality holds because</p><p>T j have the same spectrum; as they are both self-adjoint, the spectral radius equals the operator norm.</p><p>At this point, we have reduced our problem to a considerably simpler form. We can now show that the operators T j satisfy many of the same probabilistic properties as F j ; the proof is in the appendix. </p><p>where all expectations are taken with respect to a measure on X , and F and T are measurable.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">The leading eigenvalue as the limit</head><p>In this subsection we will prove a limit lemma, justifying the use of the leading eigenvalue as a characterization of the moment generating function, mirroring the scalar case as laid out in <ref type="bibr">[10]</ref> essentially, what we will prove is that the log limit of the moment generating function is the logarithm of the leading eigenvalue. This will allow us to shift between time-dependent and time-independent functions as needed. The main difficulty in extending this theory from the scalar case is proving the limit statement in part (3) of Lemma 5.5, and in particular, a specific lower bound in the proof. This lower bound is not difficult to prove in the scalar setting; however, the matrix setting is more subtle and requires some more care.</p><p>Our main lemma is the following:</p><p>Lemma 5.4 Let F be a map from X to real symmetric d &#215; d matrices with aI F(x) bI for all x &#8712; X . Let T (x) = 1 2 (F(x) &#8855; I d + I d &#8855; F(x)), x &#8712; X , and let E &#952; /2 T be the operator defined as</p><p>Let &#181; be a distribution, let s 1 ,... ,s n be a sequence of states driven by the stationary Markov chain with transition matrix &#955; I + (1 &#8656; &#955; )1&#181; , and let P be this transition matrix tensored with I d 2 . Then</p><p>.</p><p>In other words, we take s 1 ,... ,s n to be driven by a Leon-Perron operator, where at a given state the chain stays at that state with probability &#955; and samples a state from &#181; with probability 1 &#8656; &#955; . We first address the case of a simple matrix-valued function F. </p><p>Assume that for all B j , aI B j bI, and let these be tight. Let T be the operator defined as</p><p>, with exponential multiplication operator E &#952; T . Define the matrix-valued function</p><p>Then the following hold:</p><p>(1) Let r * be such that F (r * ) has an eigenvalue of 1, and let v be an eigenvector corresponding to the eigenvalue 1 of F (r * ). Then r * is an eigenvalue of E</p><p>with corresponding eigenfunction h such that</p><p>Conversely, every eigenvalue of E</p><p>The third term is equal to</p><p>Adding the first two and subtracting the third gives zero, and shows that r * is indeed an eigenvalue with eigenfunction h.</p><p>T , then solving for h in the equation</p><p>Multiplying by exp &#952; 2 T (x) &#63725; on both sides and taking expectations, we see that</p><p>. For (2), the function F (r) is continuous in r when r is large enough (so that the inverses exist); indeed, since F is simple the expectation decomposes as a finite sum, so F is continuous if all of the summands are continuous. In particular, since the eigenvalues of all B j are bounded above by b, when r &gt; &#955; e &#952; b F (r) is continuous in r. Now when r &#8594; &#8734; the entries of F (r) approach zero, implying that all eigenvalues of F (r) also approach zero as the eigenvalues are continuous in the entries. On the other hand, F (r) decomposes as a finite sum. Consider the summand corresponding to some B j such that b is its largest eigenvalue. As r &#8594; &#955; e &#952; b from above it is clear that the largest eigenvalue of this summand goes to &#8734;; then it is clear that the largest eigenvalue of this finite sum must also go to &#8734;; implying that at least one eigenvalue of F (r) goes to &#8734;. Thus, as F (r) is entrywise continuous in r, and as the eigenvalues are continuous in the entries, for some r * &#8712; (&#955; e &#952; b , &#8734;) F (r * ) has an eigenvalue of 1. By part <ref type="bibr">(1)</ref>. this implies that r * is an eigenvalue of</p><p>T . Now let &#961; be the largest of all such eigenvalues, and so we have shown that &#961; &gt; &#955; e &#952; b .</p><p>For (3), let &#963; (E</p><p>is self-adjoint on &#8467; 2 (&#181; &#8855; 1), the spectrum is real and decomposes as the disjoint union of the essential spectrum</p><p>T ), the latter consisting of all eigenvalues with finite multiplicity. Define the essential spectral radius as the largest element of the essential spectrum and the spectral radius as the largest element of the spectrum, so the latter is always at least the former.</p><p>and thus is a compact perturbation of the operator &#955; E &#952; T , as (1 &#8656; &#955; )E &#952; /2 &#928;E &#952; /2 is finite rank (rank at most d 2 ). Weyl's perturbation theorem [67] ensures that</p><p>is a block diagonal operator (Definition 3.4), its spectrum is the union of the spectra of &#955; exp(&#952; T (x)), x &#8712; X <ref type="bibr">[50]</ref>, and so the essential spectral radius is bounded above by &#955; e &#952; b . Therefore, by part <ref type="bibr">(2)</ref>. the largest eigenvalue &#961; is equal to the spectral radius of</p><p>To prove the second statement, from Lemma 5.1 we have</p><p>This implies lim sup</p><p>Let h be a unit norm eigenfunction corresponding to &#961;, and let h be the projection of</p><p>by definition of h being an eigenfunction and it being a projection of E &#952; /2</p><p>T (1 &#8855; vec(I d )). Next, we have</p><p>is positive semidefinite on &#8467; 2 (&#181; &#8855; 1) -this follows from E &#952; /2 T being self-adjoint and P being positive semidefinite on &#8467; 2 (&#181; &#8855; 1). Then</p><p>Putting Equations 5.3 and 5.5 together proves the statement. We now show that</p><p>To do this we make a connection to dynamical systems and the theory of positive operators in Banach spaces, and which uses our realness assumption on F. For a cone K in a Banach space, an operator A is positive with respect to K if AK &#8838; K. In our case, let the Banach space under consideration be the space of d &#215; d symmetric matrices equipped with the Frobenius norm, or equivalently, the space of length d 2 vectors whose reshaping to a d &#215; d matrix is symmetric equipped with the Euclidean norm; we will show that it is no loss to restrict to this space. Let S d + be the cone of real positive semidefinite d &#215; d matrices; it is a closed, convex cone. It is also reproducing in the space of real symmetric d &#215; d matrices, in that every symmetric matrix X can be written as X = X 1 &#8656; X 2 for X 1 , X 2 positive semidefinite <ref type="bibr">[27]</ref>. We can also identify S d + as the space of length d 2 vectors whose reshaping to a d &#215; d matrix is positive semidefinite; we now make this identification. Now let K &#8834; &#8467; 2 (&#181; &#8855; 1) be the infinite direct product of S d + indexed by elements of X ; in other words, for h &#8712; &#8467; 2 (&#181; &#8855; 1), h &#8712; K if and only if h(x) &#8712; S d + for all x &#8712; X . This cone is is closed, convex, and reproducing in the space of all h &#8712; &#8467; 2 (&#181; &#8855; 1) such that h(x) can be reshaped to a symmetric d &#215; d matrix for all x &#8712; X .</p><p>Next, we show that we can indeed restrict ourselves to the Banach space of symmetric matrices equipped with the Frobenius norm, in the sense that any eigenfunction h of E is positive with respect to K, and the spectral radius &#961; is strictly larger than the essential spectral radius, which is at most &#955; e &#952; b . Then there is a nonzero eigenfunction h in K &#8838; &#8467; 2 (&#181; &#8855; 1).</p><p>This allows us to assert that in the derivation of Equation <ref type="formula">5</ref>.5 we could have chosen h &#8712; K. With this choice of h it is straightforward in showing</p><p>&#63732;&#63732; .</p><p>Each of these matrices exp &#952; 4 F(x) &#63725; &#293;(x) exp &#952; 4 F(x) &#63725; are positive semidefinite; moreover, as h is nonzero in the &#8467; 2 space &#293;(x) is nonzero on a set of positive measure (i.e., &#293;(x) cannot be nonzero only on a set of measure zero). Therefore, since exp &#952; 4 F(x)</p><p>&#63725; is full rank, by Sylvester's inertia theorem <ref type="bibr">[61]</ref> we have that exp &#952; 4 F(x) &#63725; &#293;(x) exp &#952; 4 F(x)</p><p>&#63725; has at least one strictly positive eigenvalue for all x in a set of positive measure, meaning that the trace is strictly positive. Then</p><p>Using this lemma, we are now able to give the proof of Lemma 5.4.</p><p>Proof of Lemma 5. <ref type="bibr">4</ref> We first reduce to the simple case. The range of F is compact, as F can be seen as a continuous map of the compact set O(d) &#215; [a, b]. Then for &#949; &gt; 0 let N &#949; be an &#949;-net of the range of F with respect to the metric induced by the operator norm. Define F &#949; as</p><p>.</p><p>In fact, we can give a more refined proof of part (3). of Lemma 5.5 to show that lim n&#8594;&#8734; a &#949;,n uniformly converges in &#949;. Let &#961; &#949; be the leading eigenvalue of E &#952; /2</p><p>T &#949; . We first have by Lemma 5.1 that</p><p>Next, we showed for any unit norm eigenfunction h &#949; corresponding to the leading eigenvalue &#961; &#949; that</p><p>In the proof of part (3). of Lemma 5.5 we showed that h &#949; can in fact be chosen such that for all x &#8712; X , h &#949; (x) can be reshaped to a positive semidefinite d &#215; d matrix. With such structure we lower bounded</p><p>We can actually give a more quantitative lower bound of this quantity that is independent of &#949;; in addition, we can give an upper bound &#961; &#949; also independent of &#949;.</p><p>First note that under an &#949;-net of the range of F, since aI F(x) bI, that aI F &#949; (x) bI also. Then the upper bound is simple; we have</p><p>For the lower bound we first write</p><p>&#63732; .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Since we have established exp</head><p>&#63725; is positive semidefinite for all x &#8712; X , the trace is equal to the nuclear norm, which upper bounds the Frobenius norm. Therefore,</p><p>Now since F &#949; takes only finite number of values, we have that h &#949; can also only take a finite number of values (see part (1) of Lemma 5.5). Letting A j = F &#8656;1 (B j ) and denoting h j = h &#949; (x) for x &#8712; A j , we have</p><p>&#181;(A j ) h j .</p><p>Now as h &#949; &#181; = 1 we must have h j &#8771; 1, and so then</p><p>&#8594; e &#952; a , and we thus have a &#949;,n &#8656; log &#961; &#949; &#8594; (&#952; (a &#8656; b))/n. Putting this all together, we have shown</p><p>which implies uniform convergence, since d, &#952; , a, and b are all constants independent of &#949;. We can then swap limits: lim</p><p>&#949;&#8594;0 lim n&#8594;&#8734; a &#949;,n = lim n&#8594;&#8734; lim &#949;&#8594;0 a &#949;,n .</p><p>But as E &#952; T &#949; converges to E Then we have the chain of equalities</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Final bounds</head><p>In this section we prove our main theorems regarding the moment generating function and give concentration inequalities. We first show how to go from bounds on the moment generating function of Equation <ref type="formula">4</ref>.1 to tail bounds.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">From the MGF to tail bounds</head><p>The reason we have studied the moment generating function of Equation <ref type="formula">4</ref>.1 is due to the following result:</p><p>Theorem 6.1 (Multi-matrix Golden-Thompson inequality, <ref type="bibr">[16]</ref>) Let H 1 ,... ,H k &#8712; C d&#215;d be Hermitian matrices. Then</p><p>where &#958; is some probability distribution on &#8656; &#960; 2 , &#960; 2 .</p><p>From this, we can obtain tail bounds. Recall that F j : X &#8594; R d&#215;d are a sequence of measurable functions from X to d &#215; d real symmetric matrices, and let E[F j ] = 0. First,</p><p>We would like to apply Theorem 6.1; we follow the standard outline given in <ref type="bibr">[16]</ref>. An immediate application of Jensen's inequality on the right-hand side of Theorem 6.1 furnishes an upper bound, and then taking exponents, we have Combining the above two lines, adding in &#952; , and taking expectation, we have</p><p>Now let M F (&#952; ) be such that</p><p>that is independent of &#966; , i.e., an upper bound on the moment generating function in Equation <ref type="formula">4</ref>.1 valid for any &#966; &#8712; [&#8656;&#960;/2, &#960;/2]. Then</p><p>where the last line holds because &#958; is a probability distribution and M F does not depend on &#966; . We now give our final bounds on the moment generating function</p><p>under both Hoeffding and Bernstein-type assumptions. In Sections 4 and 5 we have reduced the problem to estimating the leading eigenvalues of the operators</p><p>for j = 1,... ,n and P the lifted Leon-Perron version of P. Lemma 5.4 has shown that each eigenvalue is a limit a corresponding moment generating function. Thus, the general idea will be to bound</p><p>for each j, and where s 1 ,... ,s n is a sequence of states driven by the Markov chain corresponding to P. In these bounds we will proceed by way of defining a function M F to bound these operators.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">Hoeffding</head><p>In this section, fix a matrix-valued function F, and let E &#181; [F(x)] = 0 and aI F(x) bI for all x &#8712; X . For a scalar convex functions &#936;, we have the following proposition: Proposition 6.2 Let H be a Hermitian d &#215; d matrix such that aI H bI. Then for any scalar convex function &#936;,</p><p>Here &#936; acts spectrally; if H = U&#923;U * , then &#936;(H) = U&#936;(&#923;)U * , where &#936; acts on the diagonal matrix &#923; diagonally, i.e., &#936;(&#923;) is the diagonal matrix such that &#936;(&#923;) ii = &#936;(&#923; ii ).</p><p>Proof We simply need to show that for any unit vector u that</p><p>It suffices to show this for eigenvectors v of H. If &#955; is the corresponding eigenvalue, then</p><p>Since &#936; is a scalar convex function and a &#8771; &#955; &#8771; b,</p><p>For general u, let u = &#8721; d k=1 c k v k be the representation of u in the basis of eigenvectors of H. Then</p><p>If in the above H is a random matrix such that E[H] = 0, then taking expectations we have This allows us to make the following argument about a convex ordering of distributions, with proof in the appendix: Lemma 6.3 Let F be a function from X to Hermitian d &#215; d matrices, and assume E &#181; [F] = 0 and aI F(x) bI for all x &#8712; X . Let s 1 ,... ,s n be driven by the Leon-Perron operator P. Let Q, &#181;, G be as above. Let y 1 ,... ,y n be driven by Q. Then for any scalar nonnegative convex function &#936;,</p><p>Remark 6.4 The condition on &#936; above does not seem to be very strict. Indeed, it applies to many natural matrix-valued functions, such as even matrix powers and the matrix exponential.</p><p>Applying this to the scalar convex function x &#8704; &#8594; exp &#952; cos(&#966; )</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>2</head><p>x yields</p><p>Now let M &#952; &#8712; R 2d 2 &#215;2d 2 be the operator on the two-state state space {0, 1} such that</p><p>Then Lemma 5.4 indicates that</p><p>where &#951; is the largest eigenvalue of</p><p>Then by properties of the Kronecker product, the eigenvalues of the matrix on the left-hand side are equal to the eigenvalues of K &#952; , a 2 &#215; 2 matrix, the largest of which can be solved for explicitly. Putting this all together,</p><p>With the above we can establish the following result, with proof in the appendix.</p><p>Lemma 6.5 Let K &#952; be the 2 &#215; 2 matrix defined in Equation <ref type="formula">6</ref>.8, and let &#951; be its largest eigenvalue. Then</p><p>where &#945; :</p><p>Putting our work above together, we can now prove the two statements of Theorem 2.5.</p><p>Proof of Theorem 2.5 The discussion in Lemma 5.1 (in particular Equation <ref type="formula">5</ref>.2) and Proposition 5.2 reveal</p><p>where recall</p><p>T j is the multiplication operator with respect to the matrix-valued function T j and P = (&#955; I +(1 &#8656; &#955; )&#928;) &#8855; I d 2 . Then Lemma 5.4 and Lemma 6.3 allow us to bound each of these norms with the norms of corresponding matrices arising from a two state chain, each of which can be bounded using Lemma 6.5. In other words, for each j we have, verifying that the operators T j satisfy the assumptions of the lemmas using Proposition 5.3,</p><p>Then</p><p>and setting &#966; = 0 we obtain the first statement of the theorem.</p><p>To turn this into a tail bound, since cos(&#966; ) 2 &#8771; 1, we can let M F (&#952; ), as described in Equation 6.3, be</p><p>Using Equation <ref type="formula">6</ref>.4, we then have</p><p>Optimizing this quadratic over &#952; we have</p><p>as desired.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3.">Bernstein</head><p>In this section, fix a matrix-valued function F, and let E &#181; [F(x)] = 0, E &#181; [F(x) 2 ] &#8771; V , and F(x) &#8771; M for all x &#8712; X . We take a slightly different approach to bounding the norms of the operators E &#952; /2</p><p>T j . In particular, consider again Equation <ref type="formula">4</ref>.1 for a time-independent function T and the Leon-Perron operator P. Then</p><p>with the last line following from &#928; P = &#928; so we can introduce another P. Now from Lemma 5.4 and the above we have</p><p>Although the above may seem a bit contrived, it allows for a general recursive method used in <ref type="bibr">[16,</ref><ref type="bibr">26,</ref><ref type="bibr">56,</ref><ref type="bibr">68]</ref>; more importantly, we can extend the final result to time-dependent functions via our limit lemma. In particular, for an initial function z 0 , we can track the updates z k := ( PE &#952; T ) k z 0 recursively, and then provide a bound. This method is strong enough to give the sorts of inequalities we are after. Now we introduce some notation. For an element z &#8712; &#8467; 2 (&#181; &#8855; 1) let z = &#928;z and z &#8869; = (I &#8656; &#928;)z. In other words, z decomposes as z = z &#181; (1 &#8855; v) + z &#8869; &#181; &#961; for some v and some &#961; orthogonal to</p><p>As per our discussion above, we will bound the evolution of PE &#952; T z for any vector z. We first have the following proposition: Proposition 6.6 For any z &#8712; &#8467; 2 (&#181; &#8855; 1),</p><p>Proof This follows immediately from the definitions.</p><p>We next have a crucial lemma, the proof deferred to the appendix. For ease of exposition, we will assume that &#966; = 0 throughout, since the factor of cos(&#966; ) is just a scalar and can be pulled through.</p><p>where</p><p>With this lemma we have the following two claims, with proofs in the appendix. These are recursive bounds.</p><p>Claim 6.9 For any k, if &#952; &lt; log(1/&#955; )/M , then</p><p>We are now able to prove Theorem 2.6.</p><p>Proof of Theorem 2.6 As in the proof of Theorem 2.5, Lemma 5.1 and Proposition 5.2 give</p><p>Assume throughout that &#966; = 0 for simplicity, so we can directly use Lemma 6.7 without modification. Equation <ref type="formula">6</ref>.18 indicates that log E &#952; /2</p><p>Claim 6.9 exactly gives a bound on this term; letting z 0 &#8712; &#8467; 2 (&#181; &#8855; 1), the claim indicates that</p><p>with &#945; j,1 , &#945; j,2 , &#945; j,3 as in Lemma 6.7 and using V j in the definitions of the &#945; j as we are dealing with time-dependent functions. Then &#928;( PE &#952;</p><p>implying</p><p>and thus</p><p>To turn this into a tail bound, we will first assume M = 1, i.e., F j &#8771; 1 for all j, and then drop this assumption later. Using Equation <ref type="formula">6</ref>.4 we have</p><p>Then</p><p>&#955; (e 4&#952; /&#960; &#8656; 1) 2  1 &#8656; &#955; e 4&#952; /&#960; .</p><p>(6.26)</p><p>Write L = 4/&#960;. Using the inequality e x &#8656; 1 &#8771; xe x , the above becomes</p><p>(6.27)</p><p>Now we argue that &#963; 2 &#955; &#952; 2 e 2L&#952; 1 &#8656; &#955; e L&#952; &#8771; &#963; 2 &#955; &#952; 2 1 &#8656; &#955; &#8656; 2L&#952; when 0 &#8771; &#952; &lt; (1 &#8656; &#955; )/2L, which is less than log(1/&#955; ) when 0 &lt; &#955; &lt; 1. Indeed, comparing both sides, it suffices to show e 2L&#952; 1 &#8656; &#955; e L&#952; &#8771; 1 1 &#8656; &#955; &#8656; 2L&#952; for this range of &#952; , which is equivalent to showing e &#8656;2L&#952; &#8656; &#955; e &#8656;L&#952; &#8594; 1 &#8656; &#955; &#8656; 2L&#952; . At &#952; = 0 they are equivalent, so we compare derivatives. The derivative of the left-hand side is &#955; Le &#8656;L&#952; &#8656; 2Le &#8656;2L&#952; = Le &#8656;L&#952; (&#955; &#8656; 2e &#8656;L&#952; ), while the derivative of the right-hand side is just &#8656;2L. Then</p><p>which is true by seeing that the left-hand side is a decreasing function in &#952; , so for &#952; &#8594; 0 attains its maximum at &#952; = 0. At this value of &#952; , the left-hand side is 1 &#8656; &#955; /2 &lt; 1.</p><p>Then</p><p>To recap, at this point we have bounded</p><p>where</p><p>It is standard that the conjuguate g * (t) = sup &#952; {&#952;t &#8656; g(&#952; )} bounds the tail probability via</p><p>for a convex function g <ref type="bibr">[22]</ref>. Using Lemma D.2 with L = 4/&#960;, the second statement in Theorem 2.6 is immediate in the case M = 1.</p><p>More generally, we have</p><p>Plugging this in and rearranging delivers the second statement of the theorem.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">An Application to Covariance Estimation and Markov PCA</head><p>Principal Component Analysis (PCA) is one of the most fundamental problems in data science, used to reduce the dimensionality of multidimensional datasets while retaining as much of the variance as possible and implemented in a wide variety of downstream applications <ref type="bibr">[35]</ref>. The task is to estimate the largest eigenvector of a population covariance matrix from samples. Classically, vector-valued samples x 1 ,... ,x n are received i.i.d., and the leading eigenvector of the sample covariance matrix is used to estimate the leading eigenvector of the population covariance matrix. More precisely, we fix a distribution &#181; and let &#931; = E &#181; xx &#8712; R d&#215;d . Let &#967; 1 &#8594; &#8226;&#8226;&#8226; &#8594; &#967; d be the eigenvalues of &#931; in descending order, and let v 1 be the leading eigenvector of &#931;. The following is a standard consequence of the classical matrix Bernstein inequality <ref type="bibr">[64]</ref> and Wedin's theorem <ref type="bibr">[69]</ref>: Theorem 7.1 (Offline PCA, <ref type="bibr">[32]</ref>) Fix &#948; &#8712; (0, 1). Let x 1 ,... ,x n be drawn i.i.d. from &#181;. Assume that E &#181; x j x j &#8656; &#931; &#63725; 2 &#8771; V j and x j x j &#8656; &#931; &#8771; M almost surely; let &#963; 2 = &#8721; n j=1 V j . Let v be the leading eigenvector of 1 n &#8721; n j=1 x j x j . Then with probability 1 &#8656; &#948;</p><p>where C 1 ,C 2 are absolute constants.</p><p>In many data tasks the samples x 1 ,... ,x n are not drawn i.i.d. but have instead some dependent structure; for example, this can occur with time series data <ref type="bibr">[29,</ref><ref type="bibr">34]</ref>. Specifically, the dependent structure can be Markovian, where the vectors x j := x j (s j ) are vector-valued functions of an underlying Markov chain. An example of this is in the context of token algorithms for Federated PCA, where multiple machines are connected via a graph <ref type="bibr">[14,</ref><ref type="bibr">23,</ref><ref type="bibr">25]</ref>: each machine contains some fraction of the total dataset and the goal is to obtain the principal component of the entire dataset with respect to some target stationary distribution; work has also been done to adapt this to the streaming setting <ref type="bibr">[38]</ref> (whereas our following bound is applicable to the offline setting). The following bound is the adaptation of Theorem 7.1 to the Markov setting, which is a consequence of Theorem 2.6 and Wedin's theorem; it was first found in <ref type="bibr">[38]</ref> and is derived as an immediate application of our results. Thus, our theorems are able to directly give the first known bounds for offline Markov PCA in the literature. Theorem 7.2 (Offline Markov PCA, <ref type="bibr">[38]</ref>) Fix &#948; &#8712; (0, 1). Let P be a discrete-time Markov chain on general state space with stationary distribution &#181; and absolute spectral gap &#955; . Let s 1 ,... ,s n be a sequence of states driven by P with initial distribution &#181;. Consider a sequence of vector valued functions x j := x j (s j ), j = 1,... ,n. Assume that E &#181; x j x j &#8656; &#931; &#63725; 2 &#8771; V j and x j x j &#8656; &#931; &#8771; M almost surely; let &#963; 2 = &#8721; n j=1 V j . Let v be the leading eigenvector of 1 n &#8721; n j=1 x j x j . Then with probability 1 &#8656; &#948;</p><p>where C 1 ,C 2 are absolute constants.</p><p>Observe that the bound from Theorem 7.2 is a natural generalization of the bound from Theorem 7.1 up to constants, the dependence on the absolute spectral gap &#955; , and the extra 2 &#8656; &#960;/4 factor in the dimension (this can be removed with a slightly worse probability of success); when &#955; = 0, which recovers the independent setting, the bound essentially matches that of Theorem 7.1. Note that there are now two spectral gaps: the absolute spectral gap &#955; corresponding to the underlying Markov chain, which governs mixing and thus the spectral norm bound between the sample and population covariances, and the difference &#967; 1 &#8656; &#967; 2 corresponding to the population covariance matrix &#931; that governs the quality of the estimated eigenvector from Wedin's theorem.</p><p>More generally, our bounds can be useful for analyzing algorithms for covariance matrix estimation for a stationary distribution based on dependent samples from a Markov chain <ref type="bibr">[37]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Discussion</head><p>In this work we gave a systematic study of concentration inequalities for sums of Markov-dependent random matrices, namely, sums of time-dependent matrix-valued functions of a nonreversible Markov chain on continuousgi state spaces. Our techniques broadly follow the classic literature on spectral methods for sums of Markov-dependent random variables, as we bound the largest eigenvalue of certain perturbed operators. To address time-dependent Markov chains on continuous state spaces we first gave a crucial limit lemma justifying the analysis of the largest eigenvalue of a certain operator, analogous to the scalar case; in the process, we make an interesting connection to the Krein-Rutman theorem, the theory of operators leaving invariant a cone, to tackle the spectral analysis.</p><p>In addition to this, we gave a tighter bound on the moment generating function in the Hoeffding setting, exposing the sub-Gaussian nature of the sum, and used this to give improved constants in the final tail bound. Our technique here was to construct a natural coupling with a two-state chain via a convex majorization argument, similar to the proof of the classical Hoeffding's lemma. We also gave the first Bernstein-type inequality for sums of Markov-dependent random matrices via a recursive, linear algebraic, argument -our bound on the moment generating function nicely reveals the Markov dependence, and provides corresponding tail bounds. Both the Hoeffding and Bernstein inequalities thus generalize exactly, and in the Bernstein case even improves, the scalar setting. In addition, we expect our general spectral framework to readily extend to other concentration inequalities, such as Bennett's inequality and Bernstein's inequality with unbounded summands (when there is control over the moments). We hope that our work opens up avenues for application and contributes to a growing body of work on concentration inequalities of sums of random matrices beyond independence.</p><p>One interesting direction is a possible improvement of Corollary 2.8. In general, finite moments of the Radon-Nikodym derivative d&#957; d&#181; can be much smaller than the essential supremum; specifically, it is possible in the scalar setting to tradeoff an improved multiplicative factor involving the p th moment of the Radon-Nikodym derivative with worse constants in the exponent -the proof is just H&#246;lder's inequality. However, for matrices this argument would force us to consider not the Frobenius norm squared in the moment generating function but the Frobenius norm raised to some power depending on p. It would be interesting if such a tradeoff could be made in our setting as well.</p><p>Another interesting direction is with regards to the spectral gap of the chain. Our "absolute spectral gap" is another name for multiplicative reversibilization, commonly used to quantify the mixing of nonreversible chains and is used to reduce analysis to a reversible chain. Though widespread in the literature, it can be a pessimistic estimate of mixing <ref type="bibr">[7]</ref>; it would be nice to have a bound that depends on a more "nonreversible" quantity. In addition, compared to the independent setting, our variance proxy is different and potentially slightly suboptimal; it is in term of a "sum of norms" rather than a "norm of a sum." However, we doubt this can be recovered using spectral methods; we believe an improved variance proxy would require new techniques -such as an application of functional inequalities -that are not spectral in nature or at least do not rely on the Golden-Thompson inequality utilized here, and thus imply an entirely new approach for ergodic sums of scalar random variables as well.</p><p>NSF IFML grant 2019844. The authors thank Sanjay Shakkottai for useful references and Purnamrita Sarkar for helpful discussions. Now we prove (2). For some x &#8712; X , let &#961; 1 (x) &#8594; &#8226;&#8226;&#8226; &#8594; &#961; d (x) be the eigenvalues of F(x) (the &#961; are implicitly scalar-valued functions). We have that the eigenvalues of F(x) &#8855; I d are exactly these eigenvalues but each with multiplicity d -this also holds for I d &#8855; F(x). Now as F(x) &#8855; I d and I d &#8855; F(x) are Hermitian, a crude estimate is that the eigenvalue of the sum is bounded above by 2&#961; 1 (x) &#8771; 2b and the minimum eigenvalue of the sum is bounded below by 2&#961; d (x) &#8594; 2a. The statement follows.</p><p>For (3), we have</p><p>and the same holds for</p><p>for some unit vector u. This is equal to (u * F(x)u) 2 . Now by Cauchy-Schwartz, Proof of Lemma 6.3 We first note that for any nonnegative convex function &#936;, the function &#936; p is convex for any p &#8594; 1. We first describe a generating process for s j . Draw random variables I j with I 1 = 1 and I j &#8764; Ber(1 &#8656; &#955; ) for j &gt; 1 and variables Z j &#8764; &#181; uniformly and independent from each other and let</p><p>We can verify that this is a valid construction for a Markov chain driven by P. We now couple (y j , s j ) by drawing random variables B j &#8764; &#181; independent from each other and let</p><p>We can again verify that this is a valid construction for a Markov chain driven by Q and is also a valid coupling. Then</p><p>We can expand this out as a matrix polynomial. Each monomial is of degree n and can be determined by a tuple &#945; &#8712; {0, 1} n with &#945; 1 = 1 always, each entry corresponding to I j and &#936;(F(Z j )). For each &#945; construct a coefficient c &#945; and a partition a of n as follows: if &#945; j = 0 then (1 &#8656; I j ) appears in c &#945; and a j = 0. If &#945; j = 1 then I j appears in c &#945; and a j is equal to the one plus the number of zeros appearing sequentially after &#945; j until the next one. Then we have c &#945; &#8719; n j=1 &#936;(F(Z j )) a j as our matrix monomial corresponding to &#945;. There are clearly 2 n&#8656;1 monomials in this matrix polynomial.</p><p>For example, let n = 6. The monomial corresponding to &#945; = (1, 0, 1, 0, 1, 0) is</p><p>the monomial corresponding to &#945; = (1, 0, 1, 0, 0, 1) is and the monomial corresponding to &#945; = (1, 0, 0, 0, 0, 0) is</p><p>Moreover, it is not hard to see that the monomials in this matrix polynomial are pairwise orthogonal (with respect to the trace inner product) using the property that (1 &#8656; I j )I j = 0. Therefore where we use the coupling between y j and s j . using the independence of the Z j . Now as &#936; is nonnegative and convex, &#936; a j is convex. Therefore, by Proposition 6.2 and the fact that F is mean zero, we have</p><p>so that Now since G maps to matrices that are a constant times the identity, the matrices E &#181; &#936;(G(B j )) 2a j commute with all other matrices. This allows us to write We can iterate this process, and using independence of the B j and rearranging (recall all these resulting matrices commute) we obtain the statement of the lemma.</p><p>Proof of Lemma 6. Then &#951; is simply the larger of these two. Instead of solving directly for &#951;, we can choose any value that upper bounds &#951; that has a considerably simpler form. In particular, if a quadratic is of the form x 2 &#8656; bx + c, we can choose any x such that x 2 &#8656; bx + c &#8594; 0 and x 2 &#8594; c; it can be verified that such an x upper bounds both roots of the quadratic. Thus, with our choice of &#951;(&#952; We show that this is at least &#8656;1/&#945;(&#955; ). A sufficient condition for this is  </p><p>We see this by definition of the multiplication operator E &#952; T and the definition of the inner product of &#8467; 2 (&#181; &#8855; 1). Then</p><p>&#63726;&#63723; .</p><p>(D.17)</p><p>We used the fact that E &#181; [T (x)] = 0 to remove the k = 1 term. Next,</p><p>But the inner products in the sum are bounded by the spectral norm of E &#181; [T (x) k ], so we apply Lemma 6.6.2 from <ref type="bibr">[64]</ref> and Proposition 5.3. Thus</p><p>(D. </p><note type="other">19</note></div></body>
		</text>
</TEI>
