<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Scalable Symmetric Tucker Tensor Decomposition</title></titleStmt>
			<publicationStmt>
				<publisher>SIAM Journal on Matrix Analysis and Applications, 2024</publisher>
				<date>12/31/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10663164</idno>
					<idno type="doi">10.1137/23M1582928</idno>
					<title level='j'>SIAM Journal on Matrix Analysis and Applications</title>
<idno>0895-4798</idno>
<biblScope unit="volume">45</biblScope>
<biblScope unit="issue">4</biblScope>					

					<author>Ruhui Jin</author><author>Joe Kileel</author><author>Tamara G Kolda</author><author>Rachel Ward</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Not Available]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Tensor decompositions <ref type="bibr">[36]</ref> provide powerful tools to represent and compress higher-order arrays. In this work, we focus on the symmetric rank-r Tucker decomposition. Applied to a d-th order symmetric tensor of size n &#8594; . . . &#8594; n, it expresses the input using a symmetric core tensor C C C &#8712; R r&#8594;&#8226;&#8226;&#8226;&#8594;r (r n) and a matrix Q &#8712; R n&#8594;r with orthonormal columns. See Figure <ref type="figure">1</ref> for an illustration. Symmetric tensors abound in statistics as a means of representing higher-order moments and cumulants of a random vector. They are commonly used to analyze dependences in multivariate distributions <ref type="bibr">[12,</ref><ref type="bibr">30,</ref><ref type="bibr">14]</ref>, learn latent variable models via the method-of-moments <ref type="bibr">[4,</ref><ref type="bibr">24]</ref>, solve important signal processing problems <ref type="bibr">[9,</ref><ref type="bibr">10]</ref>, etc.</p><p>In most cases, the exact moment of the ground-truth distribution is not available. The idea of moment estimation is to use the d-th order sample moment tensor, denoted M M M d , instead. This is formed from observations {x 1 , . . . , x p } &#8838; R n of a random variable x as</p><p>There is a need for economical representations of M M M d , which would otherwise take O(n d ) space to store. Using the symmetric Tucker decomposition, one approximates M M M d solely by determining the rank-r factor matrix Q, which takes just rn memory space. (The core C C C is a byproduct once Q is identified; see Remark 1.) However, decomposing the sample moment still has upfront costs of O(pn d ) and O(n d ) to initially construct and store M M M d . For large data volumes p and high dimensionalities n, we turn to implicit (or tensor-free) and streaming strategies to escape such prohibitive overhead. The symmetric Tucker decomposition of moment tensors has other important motivations, besides compression. One major application is to extract the key features Q &#8712; R n&#8594;r from moment tensors. Without the need for the core tensor C C C &#8712; S d (R r ), this usage further reduces the output's storage. In some scientific domains, data has high volatility, e.g., in combustion simulations <ref type="bibr">[1]</ref>, outlier detection <ref type="bibr">[25,</ref><ref type="bibr">55]</ref> and asset allocation <ref type="bibr">[49,</ref><ref type="bibr">8,</ref><ref type="bibr">6]</ref>. Higher-order moments and cumulants (d &#8805; 3) are sensitive to heavy tails and anomalies. Thus, finding the key subspace Q of variability in higher-order statistics is important for non-Gaussian (also known as non-normal) data analysis. This is akin to the role of principal component analysis (PCA) and the covariance matrix for Gaussian data.</p><p>In other settings, data observations are assumed to lie in a low-dimensional subspace and follow the low-rank factor model <ref type="bibr">[61]</ref>. For example, this holds in the Fama-French model <ref type="bibr">[22]</ref> of finance and also in psychometrics <ref type="bibr">[48]</ref>. The resulting moment tensors naturally admit a Tucker format. Thus the decomposition enforces the low Tucker rank structure and improves moment estimation accuracy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Related works</head><p>Due to the NP-hard nature of tensor decompositions <ref type="bibr">[29]</ref>, an approximate symmetric Tucker decomposition is computed by higher-order eigenvector decomposition (HOEVD) <ref type="bibr">[15]</ref>. It is the eigendecomposition of the flattened tensor. It can be viewed as the symmetric version of higher-order singular value decomposition (HOSVD) <ref type="bibr">[14]</ref> and efficient variants <ref type="bibr">[64,</ref><ref type="bibr">43,</ref><ref type="bibr">62,</ref><ref type="bibr">45,</ref><ref type="bibr">11,</ref><ref type="bibr">2]</ref>. Despite their popularity, HOEVD o"ers no guarantee of returning the exact critical points for the problem.</p><p>We instead turn to iterative search methods to achieve the best possible solution, using HOEVD as an initialization. Newton <ref type="bibr">[21,</ref><ref type="bibr">33]</ref>, quasi-Newton <ref type="bibr">[57]</ref> and trust-region <ref type="bibr">[32]</ref> methods have been considered for general Tucker decomposition. These second-order optimizations rely on Riemannian Hessian updates on a the manifold of orthonormal matrices. For symmetric tensors, higher-order orthogonal iterations (HOOI) <ref type="bibr">[16]</ref>, Jacobi rotations <ref type="bibr">[31,</ref><ref type="bibr">41]</ref> and power method <ref type="bibr">[54]</ref> were proposed. Regalia briefly mentions a shifted variant of the power method in <ref type="bibr">[54]</ref>, which is akin to a gradient descent method, although strong convergence analysis is omitted in the work.</p><p>Our work is primarily motivated by moment tensor decompositions, which are fundamental objects in statistics and other disciplines. Neither the existing HOEVD nor iterative algorithms for Symmetric Tucker take advantage of the sample moment tensor structure <ref type="bibr">(1)</ref>. We show that streaming and implicit techniques can be applied to address the computational burden in the data-rich and high-dimensional regime. For covariance matrices which are centered second-order moments, this is relevant to the streaming PCA scheme, also known as Oja's method <ref type="bibr">[51]</ref>, and its adaptations <ref type="bibr">[46,</ref><ref type="bibr">67,</ref><ref type="bibr">28]</ref>.</p><p>Recently, a similar strategy was applied to optimization with moment tensor CP decomposition for learning Gaussian mixture models, see <ref type="bibr">[59,</ref><ref type="bibr">52]</ref>. Building on techniques from <ref type="bibr">[59]</ref>, the present work also passes over only a subset of samples at a time and drives the optimization without forming gigantic moment tensors. Despite using similar approaches for increasing efficiency, we emphasize that CP and Tucker decompositions are useful for their own independent purposes. The motivations we list for feature extraction and data factor analysis specifically apply to Tucker decompositions. We refer readers to Section 4.5 for a concrete comparison.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">Our contributions</head><p>In this paper, our major contribution is developing scalable algorithms for symmetric Tucker decomposition on moment tensors.</p><p>To start with, we propose the projected gradient descent (PGD) framework to the general symmetric Tucker tensor decomposition problem, see Algorithm 1. The approach is led by a first-order gradient update governed by the Tucker approximation cost. It is free of non-trivial second-order updates on manifolds.</p><p>Driven by their importance in data science, our attention then focuses on decomposing sample moment tensors <ref type="bibr">(1)</ref>. By exploiting the special input structure, we develop scalable PGD (SPGD) in Algorithm 2.</p><p>There are two main highlights. First, we adopt the implicit technique and operate on the input sample vectors directly. This much cheaper in memory and computation than constructing the sample moment tensor. Proposed in <ref type="bibr">[59,</ref><ref type="bibr">52]</ref>, the implicit strategy greatly reduces the e"ort for gradient computation. Second, SPGD is built on a streaming model. By passing over small batches of data observations, both storage and computational complexity are further reduced. Similar ideas have been used in CP moment tensor decomposition <ref type="bibr">[59]</ref> for separate purposes.</p><p>We also design a scalable implementation of HOEVD (SHOEVD) for sample moment tensors <ref type="bibr">(1)</ref>, see Algorithm 3. This method is akin to the streaming PCA idea. We use the SHOEVD approximation (Algorithm 3) as the initialization to SPGD (Algorithm 2). Moreover, SHOEVD may be of independent interest due to its broad usages, e.g. <ref type="bibr">[13,</ref><ref type="bibr">3]</ref>. The savings from the streaming and implicit strategies allow SHOEVD and SPGD to handle large-scale computations. The gains are summarized in Table <ref type="table">1</ref>. Here p is the total sample number and b is the batch size at each iteration. Usually, the user sets b p. These efficiency advantages are corroborated by extensive numerical experiments.</p><p>We demonstrate the key motivations of sample moment decompositions, including feature extraction and moment tensor low-rank estimation. Our proposed SPGD is respectively applied to real datasets in anomaly detection and portfolio allocation.</p><p>In terms of theoretical guarantee, the basic PGD (Algorithm 1) convergence result is provided. Employing the theories from manifold optimization and dynamical systems, we prove that for any symmetric tensor, PGD converges monotonically to a first-order critical point given arbitrary initialization. Furthermore, it converges to a second-order critical point almost surely. Scalars, vectors, matrices and tensors are represented as lowercase letters x, lowercase bold letters x, uppercase bold letters X, and calligraphic bold letters X X X, respectively. The (i 1 , . . . , i d )-th element of a d-way tensor is denoted by x i1,...,i d . The j-th column of a matrix X is x j . The spectral norm of a matrix is &#8226; 2 . The norm of a tensor is &#8226; , see Definition 5. The identity matrix is I n &#8712; R n&#8594;n . The Moore-Penrose pseudoinverse of a matrix is denoted using &#8224;. The matrix symmetrization operator is written sym : R n&#8594;n &#8656; S 2 (R n ) and given by sym(X) =<ref type="foot">foot_0</ref> 2 (X + X ) for X &#8712; R n&#8594;n . We write orth for the function that selects the orthonormal factor Q &#8712; R n&#8594;r from the QR decomposition of a full-rank n &#8594; r matrix. 1 The notation X [d] is the elementwise exponentiation, where each coordinate of X is raised to the d-th power. The set {1, . . . , n} is denoted by <ref type="bibr">[n]</ref> when n is a positive integer.</p><p>Definition 1 (symmetric tensors). A tensor X X X &#8712; R n&#8594;&#8226;&#8226;&#8226;&#8594;n is symmetric if its entries are invariant under any permutation of indices, i.e.</p><p>where &#928;(d) is the permutation group on <ref type="bibr">[d]</ref>. Here, d and n are called the order and dimension of X X X. The space of all real-valued symmetric tensors of order d and dimension n is denoted by S d (R n ).</p><p>Definition 2 (symmetric outer product). The outer product of a vector x &#8712; R n with itself d times is denoted as x &#8855;d &#8712; S d (R n ). The (i 1 , . . . , i d )-th entry of this product is</p><p>Definition 3 (symmetric Tucker product). For a symmetric tensor X X X &#8712; S d (R n ) and a matrix Y &#8712; R n&#8594;m , their symmetric Tucker product is defined as X X X with Y multiplied along each mode, written X X X &#8226; (Y, . . . , Y) &#8712; S d (R m ). The (j 1 , . . . , j d )-th entry of the product is</p><p>, their inner product in all modes except the first one <ref type="bibr">[21,</ref><ref type="bibr">17]</ref> is defined as:</p><p>where z j1,j 1 =</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Problem statement</head><p>Let X X X &#8712; S d (R n ) be a symmetric tensor. Fix a user-specified rank r (typically r n). Our goal is to find the best approximation of X X X represented by the symmetric Tucker product of a core tensor C C C &#8712; S d (R r ) and a rank-r orthonormal matrix Q &#8712; R n&#8594;r . The problem is formulated as follows:</p><p>In fact, the minimization (2) is equivalent to solving for a rank-r basis Q that maximizes the following cost function: max</p><p>Remark 1. The problem formulation transforms from (2) to ( * ) by solving the unconstrained linear least squares in (2) for the core, i.e.,</p><p>As X X X 2 is a constant in Q, minimizing the above quantity is inded the same as maximizing X X X &#8226; (Q, . . . , Q) 2 .</p><p>Remark 2. The full Tucker decomposition (2) takes O(nr + r d ) in memory to store the basis Q &#8712; R n&#8594;r and the core C C C &#8712; S d (R r ). Though still exponential in d, this cost savings can be substantial compared to the original X X X &#8712; S d (R n ) requiring O(n d ) space if the rank r is much less than the ambient dimension n. Further, in applications such as tensor PCA <ref type="bibr">[47]</ref> and anomaly detection Section 5.1, one only desires the feature subspace Q, and the core C C C need note be computed or stored. This reduces the cost to O(nr).</p><p>The feasible domain in ( * ) for Q is the Stiefel manifold:</p><p>which is a compact Riemannian submanifold of R n&#8594;r with dimension r(2n-r-1)</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>2</head><p>. Further, the cost function F is rotationally invariant. In fact, it only depends on the orthogonal projector P = QQ &#8712; S 2 (R n ), since X X X &#8226; (Q, . . . , Q) = X X X &#8226; (P, . . . , P) . So, the optimization ( * ) naturally runs on the Grassmannian manifold, which consists of all orthogonal projectors P:</p><p>Here, Gr(n, r) is a compact Riemannian manifold of S 2 (R n ) with dimension r(nr).</p><p>Remark 3. No simple characterization of the critical points of ( * ) is known when d &#8805; 3. This is in contrast to case of d = 2, where the classical Eckart-Young theorem <ref type="bibr">[20]</ref> holds. Various possible extensions of Eckart-Young to tensors have been shown to fail, e.g., see <ref type="bibr">[35]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">PGD algorithm</head><p>This paper develops a projected gradient descent (PGD) method to solve the constrained optimization problem ( * ). The PGD framework works as follows: given an initial point, we move the current iterate Q t-1 along the negative gradient direction by a step size of &#949; t &gt; 0 where the gradient is:</p><p>Then we project the update back to the Stiefel manifold (4) by selecting the Q factor from the QR decomposition. The procedure is described in Algorithm 1. See Section 6 for a convergence analysis of PGD.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 1 PGD for symmetric Tucker decomposition</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">HOEVD initialization</head><p>An approximate solution to the best symmetric Tucker decomposition ( * ) is the HOEVD <ref type="bibr">[15]</ref> result:</p><p>given by the leading r eigenvectors of mat(X X X) mat(X X X) .</p><p>HOEVD is widely used for Tucker approximation, beause it can be computed via a direct matrix eigendecomposition.</p><p>Generally speaking, the HOEVD solution is not a critical point of ( * ) <ref type="bibr">[15]</ref>. Its approximation error to a d-th order tensor X X X is bounded [26, Theorems 6.9-10] as</p><p>Here, X X X hoevd = X X X &#8226; (Q hoevd Q hoevd , . . . , Q hoevd Q hoevd ) as in Remark 1 and X X X best is the optimum of (2). We observe the suboptimality of the HOEVD solution in Section 4.4. However according to [14, section 3.4], the HOEVD solution usually belongs to an "attractive" region around the desired local optima in numerical experiments. So it is a good candidate for initialization.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Scalable sample moment tensor decomposition</head><p>The main interest in this work is the decomposition of the sample moment tensor M M M d &#8712; S d (R n ) (1). We design scalable algorithms for the symmetric Tucker decomposition ( * ), and the HOEVD approximation <ref type="bibr">(7)</ref>, of M M M d . An average of symmetric outer products, the sample moment structure <ref type="bibr">(1)</ref> is the key to the scalability of the decomposition computation. Tensor operations with outer products can be reduced to much cheaper computations involving only their factor vectors. Specifically we use the following identities: for vectors x, y &#8712; R n , z &#8712; R m and a matrix Y &#8712; R n&#8594;m , it holds <ref type="bibr">[26]</ref> x</p><p>x &#8855;d , y &#8855;d &#8660; = x, y&#8660; d &#8712; R, and</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Scalable PGD algorithm</head><p>In Algorithm 2, we introduce the scalable PGD (SPGD) when the input tensor is a sample moment <ref type="bibr">(1)</ref>. The symmetric Tucker cost ( * ) now depends solely on the data observations X = [x 1 , . . . , x p ] &#8712; R n&#8594;p . Indeed due to <ref type="bibr">(8)</ref>,</p><p>We actually do not need to create or store the sample moment M M M d . Such an idea was called the implicit technique in <ref type="bibr">[59,</ref><ref type="bibr">52]</ref>.</p><p>Commonly, there is a huge amount of data in exceedingly high dimensions. We hence further turn to the streaming strategy to speed up the computations. Here data is revealed as a sequence of sample batches {X 1 , . . . , X t , . . . } &#8838; R n&#8594;b with batch size b (b p). We read each batch as it arrives but discard it afterwards. This streaming model is called the turnstile model <ref type="bibr">[50,</ref><ref type="bibr">42,</ref><ref type="bibr">63]</ref>.</p><p>Therefore, as the t-th (t &#8805; 1) batch X t = [x t,1 , . . . , x t,b ] &#8712; R n&#8594;b arrives, the cost function <ref type="bibr">(11)</ref> is recast stochastically:</p><p>In light of <ref type="bibr">(10)</ref>, we can compute the stochastic gradient in a tensor-free manner:</p><p>Here, ( &#8226; ) [d-1] means raising each element of the input matrix to the (d -1)-th power. Computing the gradient implicitly via online samples reduces the cost to O(rnb + rb 2 ) flops per iteration. This improves upon O(rnp + rp 2 ) using the full amount of samples, and a cost that is exponential in d if we operate on the explicit tensor in S d (R n ). Moreover, a scalable HOEVD solver, shoevd in line 1 of Algorithm 2, is utilized for good initialization; see Section 3.2 for details. We also use AdaGrad <ref type="bibr">[19]</ref> to automatically tune step sizes. We implement the column-wise extension of AdaGrad proposed in <ref type="bibr">[28]</ref>, since it is natural to independently update the orthonormal basis vectors in Q. The step size (&#949; &#949; &#949;) t becomes a row vector in R 1&#8594;r , rather than a scalar. See lines 4 and 5, where ./ denotes columnwise division.</p><p>SPGD is a two-phased iterative algorithm. It consists of</p><p>&#8226; Phase I -SHOEVD approximation (line 1),</p><p>&#8226; Phase II -symmetric Tucker decomposition (lines 2-8).</p><p>Algorithm 2 SPGD for sample moment Tucker decomposition Input:</p><p>step size 5:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Scalable HOEVD</head><p>We present the scalable HOEVD (SHOEVD) method in Algorithm 3. This serves as the Phase I initialization in SPGD (Algorithm 2), but it may be of independent interest. As HOEVD is essentially a matrix eigendecomposition, we follow the streaming PCA idea. The solver uses the PGD framework, but is applied to a di"erent cost function than <ref type="bibr">(11)</ref>. Specifically, we solve HOEVD iteratively finding Q &#8712; St(n, r) that maximizes the following stochastic cost formed by a data batch</p><p>Using <ref type="bibr">(10)</ref>, the gradient can be calculated implicitly:</p><p>It costs O(rnb 1 + nb 2 1 + rb 2 1 ) flops to compute the gradient. Remark 4. SPGD and SHOEVD (Algorithms 2 and 3) obviate the upfront cost to form the tensor, which would require O(pn d ) flops. Each iteration temporarily stores an updated matrix Q t &#8712; R n&#8594;r and an online sample batch X t &#8712; R n&#8594;b , taking only nr + nb memory. This explains the complexity counts shown in Table <ref type="table">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 3 SHOEVD for sample moment decomposition</head><p>for t = 1, . . . , T 1 do 5:</p><p>:</p><p>t + colnorm(G t ) [2]   [1/2] 7:</p><p>end for <ref type="formula">3</ref>) can be evaluated as follows. After obtaining Q &#8712; St(n, r) by SPGD or SHOEVD (Algorithms 2 and 3), we draw fresh data X c = [(x c ) 1 , . . . , (x c ) pc ] &#8712; R n&#8594;pc and then set:</p><p>The core computation costs O(r d ) in storage and O(p c rn + p c r d ) in flops.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Synthetic numerical tests</head><p>In this section, we utilize synthetic tests to study the proposed algorithms for sample moment tensor decomposition. And real-world datasets are considered in Section 5. In this section, we show:</p><p>&#8226; the scalability of implicit compared to explicit methods in Figure <ref type="figure">2</ref>,</p><p>&#8226; the speedup of streaming compared to full data usage in Figure <ref type="figure">3</ref>,</p><p>&#8226; the robustness of the HOEVD approximation and the symmetric Tucker decomposition to noise and target rank choice in Figures <ref type="figure">4</ref> and <ref type="figure">5</ref>, and</p><p>&#8226; a comparison between symmetric Tucker and CP in Table <ref type="table">2</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Set-up</head><p>In these tests, we generate moment tensors with approximate low-rank symmetric Tucker structure (3). One way is to assume data instances x &#8712; R n follow a linear factor model:</p><p>Here we fix the factor loading matrix B &#8712; R n&#8594;rtrue . The latent factor f &#8712; R rtrue is drawn from a coordinateindependent, zero-mean and unit variance normal-inverse Gaussian (NIG) distribution <ref type="bibr">[5,</ref><ref type="bibr">6]</ref>. This distribution is skewed and heavy-tailed, aligning with some motivations discussed in Section 1. The noise</p><p>Note that in <ref type="bibr">(15)</ref>, the d-th moment of x roughly admits a low-rank Tucker format:</p><p>We define the inverse signal-to-noise ratio</p><p>The above quantity (rather than the deviation &#963;) is to measure the model's noise level.</p><p>Remark 6. Our goal is to simply perform the symmetric Tucker decomposition of the sample moment tensor M M M d &#8712; S d (R n ). This is different from inference procedures on the linear factor model, e.g., independent component analysis (ICA) <ref type="bibr">[14]</ref>.</p><p>Remark 7. Besides the model <ref type="bibr">(15)</ref> or other heavy-tailed datasets, we emphasize that our proposed algorithms are able to efficiently decompose M M M d when the sample vectors come from any probabilistic model.</p><p>All numerical tests are implemented using MATLAB R2022a on a MacBook Pro with 16GB RAM memory. Each reported plot is averaged over 5 independent simulations with independent, random initializations. The matrix B in ( <ref type="formula">15</ref>) is randomly generated with i.i.d. standard normal entries and then fixed for all runs. We generate new f and &#949; &#949; &#949; &#949; &#949; &#949; &#949; &#949; &#949; to form the data samples in each stochastic simulation. The parameters of each experiment, namely n, r true , r, p, d, SNR -1 , are specified below.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Implicit versus explicit algorithms</head><p>We illustrate the scalability of implicit and explicit implementations in Figure <ref type="figure">2</ref>. The experiments are nonstreaming. The implicit methods run on a fixed set of observations X &#8712; R n&#8594;p . The explicit approaches require constructing the sample moment tensor M M M d &#8712; S d (R n ) as overhead, and then operating on it. Due to resource limitations, the maximal tensor dimension n for explicit methods is smaller than maximal dimension for the implicit methods. We set the sample size to be p = 5n and choose the rank as r = r true = 5. We consider non-streaming methods, so b 1 = b 2 = p. The sample moment formation uses the symktensor package in the Tensor Toolbox for MATLAB <ref type="bibr">[37]</ref>. The noise level is SNR -1 = 0.5, see <ref type="bibr">(16)</ref>. We conduct the experiments in two phases for order-3 and 4 sample moments M M M d . For Phase I (initialization), we compare the efficiency of the implicit HOEVD result,</p><p>and its explicit counterpart <ref type="bibr">(7)</ref>. The equivalence of the two methods is by <ref type="bibr">(10)</ref>. Then in Phase II (iterations), we run the SPGD Algorithm 2 and various explicit algorithms, namely the explicit PGD Algorithm 1, HOOI <ref type="bibr">[15]</ref>, quasi-Newton <ref type="bibr">[57]</ref> and trust-region <ref type="bibr">[32]</ref> methods. <ref type="foot">2</ref>For the last three algorithms, we use the code tucker als.m in Tensor Toolbox <ref type="bibr">[37]</ref>, symalgQNlc.m in package <ref type="bibr">[56]</ref> and lmlra3 rtr.m in TensorLab <ref type="bibr">[65]</ref>, respectively. HOOI and trust-region apply to general asymmetric tensors. We omit the trust-region method in d = 4 case as its code does not apply to the fourth order. The step size for Algorithm 2 is c 2 = 1000. Note that the implicit Algorithm 2 matches the explicit Algorithm 1 iteration-by-iteration, because there is no streaming. However, they calculate the updates di"erently because Algorithm 2 is implicit.</p><p>Figure <ref type="figure">2</ref> shows the running time of the two phases as the dimension n increases. The time on the vertical axes shows when the algorithm reaches a relative gradient to be less than 10 -12 . This measure of convergence precision is used in <ref type="bibr">[21,</ref><ref type="bibr">57,</ref><ref type="bibr">32]</ref> and defined as:</p><p>Here grad F (Q) = (I n -QQ )&#8711;F (Q) &#8712; R n&#8594;r denotes the Riemannian gradient on the Stiefel manifold (4), where &#8711;F (Q) &#8712; R n&#8594;r is the Euclidean gradient <ref type="bibr">(6)</ref>. We use the Riemannian gradient grad F (Q) because  the possible update direction on the Stiefel manifold is constrained to the tangent space at Q. The time for checking the relative gradient is included in the runtime. It is evident from Figure <ref type="figure">2</ref> that the implicit HOEVD <ref type="bibr">(17)</ref> and SPGD (Algorithm 2) are much faster than the explicit algorithms. The advantage of being tensor-free allows the implicit implementation to handle computation in higher dimensions. Meanwhile, the prohibitive upfront and storage costs in the explicit methods restrict their scalability.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Streaming versus full data usage</head><p>We stress the gains provided by the streaming technique. This is demonstrated in stochastic experiments for (a) SHOEVD (Algorithm 3) and (b) SPGD (Phase II of Algorithm 2) in Figure <ref type="figure">3</ref>. The vertical axes represent the approximation accuracy evaluated by the metric:</p><p>The measure is rotationally invariant. We plot the subspace error against runtime.</p><p>To compare streaming and full data methods, we first generate a pool of observations X &#8712; R n&#8594;p . Each streaming method selects a subset of samples X(:, b(t-1)+1 : bt) from the pool at step t. We ensure the total sample number p is just enough for the streaming case with the largest batch size b max . The non-streaming counterparts use all the samples X at once. We set the number of iterations T = 250, and the total sample size p = T b max = 25, 000.</p><p>In Phase I (a), the full HOEVD refers to the implicit way <ref type="bibr">(17)</ref>, which is already more efficient than explicit HOEVD <ref type="bibr">(7)</ref>. The step size for all streaming cases of Algorithm 3 is c 1 = 0.05. In Phase II (b), all  plots start at the same initialization by Algorithm 3 with T 1 = 20, b 1 = 50 and c 1 = 1. These parameters for Phase I are di"erent from the (a) setting. They do not a"ect the efficiency of (b). The streaming and full cases for the actual Phase II use step size c 2 = 1 and T 2 = 10 respectively.</p><p>It is seen in Figure <ref type="figure">3</ref> that the streaming Algorithms 2 and 3 have limited loss of accuracy compared to the methods that use the full dataset. On the other hand, the runtimes for all streaming cases are nearly negligible compared to timings for the full methods.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Robustness</head><p>We test how the noise level in the observations and the choice of target rank a"ect HOEVD and symmetric Tucker decomposition in Figures <ref type="figure">4</ref> and <ref type="figure">5</ref>. In the two experiments below, we plot the evolution of both phases of Algorithm 2, see Figures <ref type="figure">4</ref> and <ref type="figure">5</ref>. As the HOEVD phase (in dashed lines) plateaus, its last iterate gives the initialization for Phase II (in solid lines). The number of iterations are shown on the horizontal axes. The parameters in Algorithm 2 are set to be</p><p>We first consider the algorithms' sensitivity to noise in the data samples. To test this, we generate a sequence of clean samples and then plant Gaussian noise with di"erent levels of SNR -1 . The algorithms' results are shown in Figure <ref type="figure">4</ref>. In particular, note the improvement in Phase II over the HOEVD phase. The discrepancy is enlarged in the high noise regime (SNR -1 &#8805; 1), when the HOEVD solver provides a poor solution. In contrast, relatively good approximation is obtained using the symmetric Tucker decomposition (Phase II) when SNR -1 is no bigger than 1.</p><p>Next, we test various rank choices in Figure <ref type="figure">5</ref>. All cases pass over the same set of data observations. The algorithms' recovery qualities are reported in terms of the objective value F ( * ), and compared to the "ground-truth" value F (Q * ). As there is no deterministic moment M M M d in the streaming case, we estimate F using an additional set of testing samples X test &#8712; R n&#8594;ptest drawn from <ref type="bibr">(15)</ref>. We calculate F implicitly using <ref type="bibr">(9)</ref>:</p><p>When r &#8805; r true , the solutions obtained by Algorithm 2 fully approximate the moment tensor in the sense that their objective values match the ground truth. When r &lt; r true , we can only partially recover the tensor.  The less the rank is, the lower the objective value. Also note that as r decreases, HOEVD's initialization moves further away from the ground truth and the gap between the phases is more pronounced.</p><p>Figures <ref type="figure">4</ref> and <ref type="figure">5</ref> show that the robustness of the symmetric Tucker decomposition is stronger than that of the HOEVD approximation. For "hard" problem instances where noise is high or the rank is under-specified, the HOEVD phase (in dashed lines) is clearly improved upon during Phase II (in solid lines).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Tucker versus CP decomposition</head><p>This subsection shows the situations where symmetric Tucker decomposition is more appropriate to use than the symmetric CP counterpart. We provide numerical evidence in the moment tensor approximation quality for low-rank factor model, while maintaining a low storage budget.</p><p>We employ the relative error as the accuracy metric:</p><p>The testing sample moment M M M d test &#8712; S d (R n ) serves as the reference solution and is built from the dataset X test &#8712; R n&#8594;ptest via (1). The approximation tensor M M M d &#8712; S d (R n ) is obtained via either the symmetric Tucker or the symmetric CP <ref type="bibr">[59]</ref> decompositions. These solutions admit the formats:</p><p>In <ref type="bibr">(18)</ref>, we denote the Tucker and CP rank by r and r respectively. The Tucker approximation is determined by the basis Q &#8712; R n&#8594;r and its core C C C &#8712; S d (R r ) is via (3). The CP format consists of the coefficients {&#955; i } r i=1 &#8834; R and the vector components {v i } r i=1 &#8834; R n . For the Tucker and CP solutions, we respectively apply implicit stochastic methods: our Algorithm 2 and Algorithm cp isym.m <ref type="bibr">[59]</ref> with Adam optimization from Tensor Toolbox <ref type="bibr">[37]</ref>.</p><p>The data observations are drawn from the linear factor model <ref type="bibr">(15)</ref>. Here, the latent factor f &#8712; R rtrue follows the standard Gaussian distribution with zero mean and identity covariance. The noise level is SNR -1 = 0.05. In one simulation, we first draw the full data set X test &#8712; R n&#8594;ptest with dimension n = 500 and sample size p test = 10000. We form order d = 4 moment tensors. At each iteration of both algorithms, we compute the gradient based on a sample batch X t &#8712; R n&#8594;b with batch size b = 50. Their objectives are evaluated based on the full testing tensor M M M d test . The algorithms take a complete pass over X test in one simulation (for Tucker) and one epoch (for CP). All epochs of one CP simulation operate on the same data X test .</p><p>In Table <ref type="table">2</ref>, we compare the accuracies of the symmetric Tucker versus CP decompositions, in reference to M M M d test . Regarding the target rank, the Tucker method sets it equal to the true rank: r true . There are two choices for the CP method: r true and r d-1 true . For each table entry, we report the lowest error from 5 independent simulations. We see that the Tucker estimation achieves the greatest accuracy on a low rank choice. On the other hand, CP decomposition su"ers substantially larger error (10000 times worse than Tucker) when its target rank is small. CP manages to fit moment tensors with high precision similar to Tucker's level for r = r d-1 true . However, comparing the storage costs, Tucker is much more economical in representing the moment tensor. From <ref type="bibr">(18)</ref>, Tucker costs rn + r+d-1 d &#8776; r true n to store Q and C C C, whereas the CP decomposition costs r (n + 1) &#8776; r d-1</p><p>true n in storage to achieve a similar approximation. We conclude that there are important situations where the Tucker structure provides a more natural and more accurate format for moment tensors than CP does. This holds especially when the data samples lie close to a low-rank subspace.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Applications</head><p>This section illustrates the applicability of moment Tucker decompositions to real datasets from anomaly detection and portfolio allocation.</p><p>The computations are not wholly streaming, as we need to store and preprocess the data first. However, only data batches are used while running Algorithms 2 and 3. The task results are based on one simulation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Anomaly detection in hyperspectral imaging</head><p>Outliers can be discovered using higher-order statistics, when the mean and covariance information are not sufficient <ref type="bibr">[18,</ref><ref type="bibr">1]</ref>. In hyperspectral imaging (HSI) <ref type="bibr">[53]</ref>, the detection goal is to find a low-rank feature subspace, denoted by Q &#8712; St(n, r). After transforming HSI pixels by Q, unusual targets in the image are revealed <ref type="bibr">[25]</ref>. Intuitively if the anomalies are captured by colspan(Q), they should stand out as bright spots in the transformed image.</p><p>This section tests three methods to extract the feature subspace from HSI data: Tucker factorization of the skewness (order-3 standardized moment) <ref type="bibr">[44]</ref>, HOEVD applied to the skewness, and eigendecompostion of the covariance. Here, skewness and covariance are formed from HSI pixels. The results are qualitatively and quantitatively evaluated according to their performance in capturing anomalies, as in <ref type="bibr">[25]</ref>.</p><p>Figure <ref type="figure">6</ref> shows the data structure of a typical HSI image. It is a 3-way hypercube X X X &#8712; R n&#8594;p1&#8594;p2 . The data consists of n band images in R p1&#8594;p2 and p = p 1 p 2 HSI pixels in R n . We use a real hyperspectral image from the Airport-Beach-Urban dataset. <ref type="foot">3</ref> The scene was taken at Gulfport, Florida by the Airborne band</p><p>Visible/Infrared Imaging Spectrometer (AVIRIS) sensor. The top two images of Figure <ref type="figure">7</ref> are the false-color composite image and the detection reference image (with white spots depicting the anomalous aircraft). The number of bands is n = 191 and the size of each band is p 1 &#8594; p 2 = 100 &#8594; 100. We use scalable algorithms for the moment decomposition task.</p><p>First, the HSI cube X X X &#8712; R n&#8594;p1&#8594;p2 is reshaped into an n &#8594; p matrix. This is whitened <ref type="bibr">[23]</ref> to give X &#8712; R n&#8594;p . We choose rank r = 4, and seek the leading rank-r subspaces Q &#8712; R n&#8594;r from the skewness tensor M M M 3 &#8712; S 3 (R n ) and the covariance matrix M 2 &#8712; S 2 (R n ), both formed using the columns in X. The subspaces are found using SPGD (Algorithm 2), SHOEVD (Algorithm 3), and PCA. The parameters for Algorithms 2 and 3 are b 1 = b 2 = 100, c 1 = 0.35, c 2 = 0.5 and T 1 = 500, T 2 = 1500. For each subspace Q obtained, X is transformed into Q X &#8712; R r&#8594;p . Each row of Q X is reshaped to a p 1 &#8594; p 2 matrix. The resulting bands 1 -4 are shown in Figure <ref type="figure">7</ref>.</p><p>Since the reference map in Figure <ref type="figure">7</ref> is sparse, we further pursue a sparsest detection image by optimizing over weights on the bands. This is achieved by a constrained &#8467; 1 -minimization:</p><p>This is solved by the MATLAB Optimization Toolbox function fmincon. Then we reshape X Qw * into a band image. The resulting sparse images are in the first column of Figure <ref type="figure">7</ref>. Inspecting Figure <ref type="figure">7</ref>, we see that SPGD and SHOEVD based on skewness distinguish between abnormal and normal pixels well. The subspace provided by SPGD enables especially clear detection, since the targets are the most pronounced. It outlines both aircraft individually in the sparse image. The covariance information, however, fails to identify anomalies.</p><p>false-color image reference map sparsest SPGD (Alg. 2) (d=3) band 1 band 2 band 3 band 4 SHOEVD (Alg. 3) (d=3) PCA (d=2) In Figure <ref type="figure">8</ref>, we show a quantitative evaluation in anomaly detection, via the receiver operating characteristic-area under curve (ROC-AUC). This quantifies how accurate a method is in classifying normal and abnormal objects, where a higher AUC value indicates better detection. To draw the ROC curve, we whiten the matrix X Q and assign the anomaly score for each pixel as its &#8467; 2 -norm. The dashed curve indicates a random classifier. Note that the ROC of skewness-based detection schemes lie strictly above that of PCA. The AUC values corresponding to SPGD, SHOEVD and PCA are 0.9947, 0.9900 and 0.8893.</p><p>10 -4 10 -3 10 -2 10 -1 10 0 0 0.2 0.4 0.6 0.8 1 false alarm rate detection probability SPGD (Alg. 2) SHOEVD (Alg. 3) PCA random </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Moment estimation for portfolio allocation</head><p>Moment estimation is sometimes improved by enforcing a low-rank structure <ref type="bibr">[6]</ref>. In portfolio allocation, we seek a distribution over investments to maximize future returns. Judicious predictions can be obtained by including higher-order moment estimations in allocation optimization, as financial data is heavy-tailed <ref type="bibr">[34]</ref>.</p><p>Higher moments of financial data should roughly admit a low-rank symmetric Tucker format, because of the Fama-French model <ref type="bibr">[22]</ref>. In this section, we test two ways to calculate moments: the sample moment M M M d (1) and the low-rank symmetric Tucker approximation to M M M d . We evaluate the moments by deriving portfolios based on them and comparing the portfolio performances, as in <ref type="bibr">[6]</ref>.</p><p>We use the real HFR hedge fund dataset. <ref type="foot">4</ref> It consists of daily return rates (in percentage) of n = 30 HFRX indices from 01/02/2009 to 12/31/2019. The time series is split into in-sample data (2009 -2016) X in &#8712; R n&#8594;pin (p in = 2016) and out-of-sample data (2017 -2019) X out &#8712; R n&#8594;pout (p out = 754). The insample data is centered. We calculate the averaged sample moments (1):</p><p>Then their low-rank Tucker approximations are computed by SPGD (Algorithm 2). The rank is set to r = 15, and (3) is used to solve for the cores once the subspaces Q are determined. We run Phase II of Algorithm 2 twice. Starting from a random initialization, we obtain Q for M 2 . This is used an initialization for decomposing M M M 3 . Finally we tackle M M M 4 , starting from the Q for M M M 3 . The parameters are set to c 2 = 1, 10 -2 , 10 -5 ; T 2 = 1000, 1000, 200; and b 2 = 500, 500, 500 for the three runs respectively. This produces</p><p>Given moment estimates, we select the allocation weights that maximize the expected utility (EU) [6, Supplementary Material, Eqn. ( <ref type="formula">17</ref>)]:</p><p>where the risk aversion parameter &#181; is set to be 1. We estimate E (xx) &#8855;d &#8776; M M M d in <ref type="bibr">(19)</ref>, and then apply the MATLAB Optimization Toolbox function fmincon to output the optimal weights. Meanwhile, estimating E (xx) &#8855;d &#8776; M M M d Tucker results in another weight vector.</p><p>The two portfolios are evaluated on the out-sample data. The acquired daily returns {y t } &#8838; R are collected in a vector y = X out w * &#8712; R pout . In Figure <ref type="figure">9</ref>, we plot the cumulative daily return rates ( &#63737; t t =1 (1 + y t 100 ) -1) &#8594; 100 at days t = 1, . . . , p out . We observe that the low-rank moment estimates produce a better return rate than the plain sample moments do. Despite mild fluctuations, the two strategies always produce positive return rates. A naive strategy, the equally-weighted portfolio, is the most unstable. It sometimes yields negative returns.</p><p>01/03/17 01/02/18 01/02/19 01/02/20 -4 -2 0 2 4 6 8 time return rate(%) cumulative daily return (2017-2019) low-rank moment sample moment equally weighted Other statistical measures are shown in Table <ref type="table">3</ref>. The annual geometric mean of returns over the years 2017 -2019 is</p><p>(yt-y) d pout . Large geometric mean and low absolute values of moments are desirable allocation outcomes. In each row, we highlight the preferred score in bold. They are all achieved by the portfolio based on low-rank Tucker approximations to the moments. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Convergence analysis</head><p>This section presents a convergence theory for the basic PGD method (Algorithm 1) applied to symmetric Tucker decomposition ( * ). We note several points in the analysis: 1) The tensor X X X needs not arise as a moment tensor; even when it does, we do not analyze the e"ects of streaming. 2) Proving convergence of the non-streaming algorithm to critical points is needed first, because ( * ) is non-convex, with a complex landscape when d &#8805; 3 (recall Remark 3). 3) Our results do not immediately follow from known general properties of plain GD or projected GD. We use the specific properties of the Tucker objective function and QR retraction.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Main statements</head><p>Recall the set-up in Section 2.2. Because the Tucker factorization cost ( * ) is non-unique as</p><p>the convergence of {Q t } is not relevant. Instead we focus on the convergence of the projectors, P t := Q t Q t &#8712; Gr(n, r). Thus, while Algorithm 1 runs on the Stiefel manifold and generates the sequence of bases {Q t } &#8838; St(n, r), we analyze the sequence of subspaces {P t } &#8838; Gr(n, r) on the Grassmannian manifold. Given scalar step sizes {&#949; t } &#8838; R &#8805;0 , the update scheme starts from the PGD iteration on Q:</p><p>The corresponding update on P is defined as</p><p>St(n, r)</p><p>Gr(n, r)</p><p>The sequence {P t } is assessed in terms of the cost function</p><p>As noted in Section 2.2, it is equivalent to the cost in ( * ), as f (P) = F (Q) when P = QQ . We prove that the iterates {P t } converge to a critical point of f on the Grassmannian. This is stronger than just showing convergence of the objective values {f (P t )} in R.</p><p>The first and second-order criticality conditions are stated as follows in the manifold setting.</p><p>Here, grad f (P) &#8712; T P Gr(n, r) &#8838; S 2 (R n ) denotes the Riemannian gradient, which is defined as the Euclidean gradient &#8711;f (P) &#8712; S 2 (R n ) projected to T P Gr(n, r). The tangent space at P is given by</p><p>Here, Hess f (P) :</p><p>The first theorem is a uniform guarantee: Algorithm 1 always converges to a first-order critical point regardless where the iteration begins. Theorem 6.1. Let X X X &#8712; S d (R n ) be any symmetric tensor, and consider the objective function f defined by <ref type="bibr">(22)</ref>. There exists a constant &#915; * (X X X, r) &gt; 0 such that if the step sizes {&#949; t } &#8838; R &#8805;0 satisfy sup t &#949; t &#8734; &#915; * and inf t &#949; t &gt; 0, the following holds. For all initializations Q 0 &#8712; St(n, r), the sequence {P t } &#8838; Gr(n, r) generated via (20)-( <ref type="formula">21</ref>) by Algorithm 1 converges monotonically to a first-order critical point (23) of f . The convergence is at no less than an algebraic rate.</p><p>Next, in fact, Algorithm 1 to a second-order critical point, for almost all initializations. Theorem 6.2. Let X X X &#8712; S d (R n ) be any symmetric tensor, and consider the cost f defined by <ref type="bibr">(22)</ref>. Use a constant step size &#949; for all iterates. There exist a constant &#915; * * (X X X, r) &gt; 0 and a measure-zero subset I &#8838; St(n, r) such that if 0 &lt; &#949; &#8734; &#915; * * , then for all initializations Q 0 &#8712; St(n, r) \ I, the sequence {P t } &#8838; Gr(n, r) generated via (20)-( <ref type="formula">21</ref>) converges to a second-order critical point (25) of f . We build the proofs of Theorems 6.1 and 6.2 respectively in Sections 6.2 and 6.3. The main technical tools are a convergence theorem for real-analytic cost functions (based on the ' Lojasiewicz inequality) and the center-stable manifold theorem. These are stated in Appendix A. Detailed proofs for supporting propositions and lemmas are in the Supplementary Materials. They involve non-trivial calculations for the QR retraction and the symmetric Tucker cost function <ref type="bibr">(22)</ref>.</p><p>Remark 8. The convergence of PGD in Gr(n, r) is not for free. Previously, Regalia found example tensors X X X where our Algorithm 2 with a step size approaching infinity shows an oscillatory cycle, rather than convergence. See [54, Figures <ref type="figure">1</ref> and <ref type="figure">4</ref>].</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Monotonic convergence to first-order critical points</head><p>To establish Theorem 6.1, we apply the guarantee stated in Theorem A.1, originally from <ref type="bibr">[58,</ref><ref type="bibr">Theorem 2.3]</ref>. It requires us to prove {P t } meets C1 -C4 conditions in Theorem A.1.</p><p>We first compute the Riemannian gradient of f (22) on Gr(n, r). Although Theorem 6.1 is in terms of P &#8712; Gr(n, r), our proof uses Q &#8712; St(n, r) due to the explicit update rule in <ref type="bibr">(20)</ref>. Thus, we state the relation between grad f (P) and Q, &#8711;F (Q). As short-hand notation, define w : Gr(n, r) &#8656; S 2 (R n ) by</p><p>Note that w(P) is a positive semidefinite (PSD) matrix.</p><p>Proposition 6.3. For Q &#8712; St(n, r) and P = QQ &#8712; Gr(n, r), the following holds grad f (P) = sym 2 (I n -P) w(P)P = sym</p><p>Proposition 6.3 is proven in Appendix B. Next, we analyze the update rule on P &#8712; Gr(n, r). For t &#8805; 0 and step size &#949; t &gt; 0, define &#945; t : Gr(n, r) &#8656; S 2 (R n ) by &#945; t (P) = P + 2&#949; t Pw(P)P &#8712; S 2 (R n ).</p><p>The expression &#945; t (P) and the gradient grad f (P) help to represent the update on P. Lemma 6.4. Let a step t &#8805; 0. The difference between the current iterate P t and the next P t+1 (20)-( <ref type="formula">21</ref>) is approximated as follows:</p><p>The proof of Lemma 6.4 is in Appendix C. It relies on the Taylor expansion of <ref type="bibr">(20)</ref>.</p><p>The following lemma gives properties of &#945; t (P) and shows that when the step size is sufficiently small, grad f (P t )&#945; t (P) &#8224; and grad f (P t ) are on the same order. Lemma 6.5. Let t &#8805; 0 and P &#8712; Gr(n, r). The pseudoinverse &#945; t (P) &#8224; &#8712; S 2 (R n ) satisfies &#945; t (P) &#8224; &#945; t (P) = P. Moreover, there exists a constant &#915; 1 (X X X, r) &gt; 0 such that if</p><p>Please refer to Appendix D for the proof of Lemma 6.5. It utilizes the fact that &#945; t is a bounded continuous function of P.</p><p>Proof Theorem 6.1. With the above useful lemmas and proposition in place, in this proof, we show that our update scheme for P &#8712; Gr(n, r) (20)-( <ref type="formula">21</ref>) satisfies the conditions C1 -C4 in Theorem A.1. Fix a step t &#8805; 0. Assume 0 &lt; &#949; t &#8734; &#915; 1 where &#915; 1 is as in Lemma 6.5.</p><p>(C1) We apply <ref type="bibr">(30)</ref> that O grad f (P t )&#945; t (P t ) &#8224; has the same scale as O ( grad f (P t ) ). By Lemma 6.4, it follows</p><p>It implies that there exists a constant C 1 (X X X, r) &gt; 0 such that</p><p>By <ref type="bibr">(59)</ref> in Appendix C, grad f (P t )&#945; t (P t ) &#8224; = (I n -P t )w(P t )P t &#945; t (P t ) &#8224; = (I n -P t )w(P t )&#945; t (P t ) &#8224; , and so &#63734; grad f (P t )&#945; t (P t ) &#8224; , &#945; t (P t ) &#8224; grad f (P t ) &#63726; = trace (I n -P t )w(P t )&#945; t (P t ) &#8224; (I n -P t )w(P t )&#945; t (P t ) &#8224; = 0. It implies sym grad f (P t )&#945; t (P t ) &#8224; = &#8712; 2 2 grad f (P t )&#945; t (P t ) &#8224; .</p><p>By triangle inequality,</p><p>Here, the second inequality is due to the first half of <ref type="bibr">(30)</ref>. We further constrain the step size as follows</p><p>C 1 sup P&#8712;Gr(n,r) grad f (P) ,</p><p>so that</p><p>Note that sup P&#8712;Gr(n,r) grad f (P) is finite because grad f is continuous on Gr(n, r) and the Grassmann is compact. (We also exclude the trivial case when grad f (P) &#8801; 0.) Continuing from <ref type="bibr">(32)</ref> and substituting <ref type="bibr">(34)</ref>, we have</p><p>)&#949; t grad f (P t ) = 2 3 &#949; t grad f (P t ) .</p><p>Letting &#954; = inf t&#8594;&#8734; 2 3 &#949; t &gt; 0 ensures that C1 (46) holds. In preparation for C2, we derive an upper bound for P t+1 -P t . The idea is similar to <ref type="bibr">(32)</ref>:</p><p>The second inequality is obtained from the latter half of ( <ref type="formula">30</ref>) and <ref type="bibr">(34)</ref>.</p><p>(C2) Keep the assumption (33) &#949; t . Consider the first-order Taylor's expansion for f on Gr(n, r) [7, section 10.7]. Inserting <ref type="bibr">(31)</ref>, the di"erence of the current and next f is:</p><p>grad f (P t ), P t+1 -P t &#8660; + O P t+1 -P t 2 = &#63728; grad f (P t ), 4&#949; t sym grad f (P t )&#945; t (P t ) &#8224; &#63739; + O &#949; 2 t max grad f (P t ) 3 , grad f (P t ) 2 . (36) Since grad f (P t ) is symmetric, the first term in the result of (36) is &#63728; grad f (P t ), 4&#949; t sym grad f (P t )&#945; t (P t ) &#8224; &#63739; = 4&#949; t &#63734; grad f (P t ), grad f (P t )&#945; t (P t ) &#8224; &#63726; = 4&#949; t &#63734; grad f (P t )P t , grad f (P t )&#945; t (P t ) &#8224; P t &#63726; = 4&#949; t &#63734; grad f (P t )&#945; t (P t ) &#8224; &#945; t (P t ), grad f (P t )&#945; t (P t ) &#8224; P t &#63726; . The second equation is due to &#945; t (P t ) &#8224; P t = &#945; t (P t ) &#8224; . The definition (28) gives &#945; t (P t ) = P t (I n + 2&#949; t w(P t ))P t . Continuing from above, &#63728; grad f (P t ), 4&#949; t sym grad f (P t )&#945; t (P t ) &#8224; &#63739; = 4&#949; t &#63734; grad f (P t )&#945; t (P t ) &#8224; P t (I n + 2w(P t )), grad f (P t )&#945; t (P t ) &#8224; P t &#63726; &#8805; 4&#949; t grad f (P t )&#945; t (P t ) &#8224; P t 2 = 4&#949; t grad f (P t )&#945; t (P t ) &#8224; 2 &#8805; 8 9 &#949; t grad f (P t ) 2 .</p><p>(</p><p>Here we used that I n + 2&#949; t w(P t ) I n for &#949; t &#8805; 0, since w(P t ) &#8712; S 2 (R n ) is PSD. The second inequality is derived from <ref type="bibr">(30)</ref>.</p><p>Considering the remainder term in <ref type="bibr">(36)</ref>, there is a constant C 2 (X X X, r) &gt; 0 such that</p><p>If the step size &#949; t is small enough to satisfy</p><p>where &#915; 2 &gt; 0 is defined in <ref type="bibr">(33)</ref>, then</p><p>Inserting into <ref type="bibr">(38)</ref>,</p><p>The final inequality is by <ref type="bibr">(35)</ref>. Thus C2 holds if we set &#963; = 1</p><p>Proof of Theorem 6.2. Fix step size 0 &lt; &#949; &#8734; &#915; * * . From Theorem 6.1, the sequence {P t } always converges to a first-order critical point of f (23), thus a fixed point of &#966; (21) by C3 in Section 6.2. In this proof, we consider the case when the limit P &#8712; Gr(n, r) is a "bad" critical point, meaning it is a first-order critical point but fails the second-order criticality <ref type="bibr">(25)</ref>. The idea is to apply Theorem A.2 to show that the set of Q 0 that converge to bad critical points has zero measure in St(n, r).</p><p>Because &#966; is a local di"eomorphism by Lemma 6.8, we are eligible to apply Theorem A.2. It gives an open neighborhood B P &#8838; Gr(n, r) around each bad critical point P satisfying the conditions in Theorem A.2. As Gr(n, r) is second-countable, it has the Lindel&#246;f property, i.e. every open cover has a countable subcover. So, there exists a countable set C &#8838; Gr(n, r) consisting of some of the bad critical points such that &#63729; P&#8712;Gr(n,r) is a bad critical point</p><p>Define the subset</p><p>We claim that P 0 &#8712; Gr(n, r) : &#966; t (P 0 ) converges to a bad critical point as t &#8656; &#8704; &#8838; J .</p><p>To see this, assume lim t&#8594;&#8734; &#966; t (P 0 ) = P is bad. By <ref type="bibr">(43)</ref>, P &#8712; B P for some P &#8712; C. Since B P is open, there is t 0 &#8805; 0 such that &#966; t (P 0 ) &#8712; B P for all t &#8805; t 0 . As P is a fixed point of &#966;, by <ref type="bibr">(50)</ref> in Theorem A.2, we obtain &#966; t0 (P 0 ) &#8712; W P . In other words, P 0 &#8712; &#966; -t0 (W P ). So, P 0 &#8712; J as claimed.</p><p>We claim that J is measure-zero in Gr(n, r). For each P &#8712; C, at least one eigenvalue of Hess f (P) is greater than 0. By Lemma 6.7, we know at least one eigenvalue of D&#966;(P) is greater than 1. By the first result of Theorem A.2, the dimension of W P is strictly less than the dimension of Gr(n, r). So W P is measure-zero in Gr(n, r) by <ref type="bibr">[40,</ref><ref type="bibr">Corollary 6.12]</ref>. Given that the step size satisfies 0 &lt; &#949; &#8734; &#915; * * , the map &#966; is a local di"eomorphism by Lemma 6.8, and so is &#966; t for all t &#8805; 0. Thus their pre-images of measure-zero sets have zero measure. As (&#966; t ) -1 &#8801; &#966; -t , the set &#966; -t (W P ) has zero measure for each t &#8805; 0. Then J is a countable union of measure-zero sets, so has zero measure.</p><p>To conclude, define I &#8838; St(n, r) as the pre-image of J &#8838; Gr(n, r) under the map Q &#8656; QQ . Since the map is a submersion from the Stiefel manifold onto the Grassmannian, I has zero measure in St(n, r) as J does in Gr(n, r). For all initializations Q 0 in St(n, r) but outside the null set I, the sequence {&#966; t (P 0 )} = {P t } converges to a second-order critical point of f if 0 &lt; &#949; &#8734; &#915; * * . The proof of Theorem 6.2 is complete.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusion and future work</head><p>In this paper, we developed the PGD method (Algorithm 1) for symmetric Tucker tensor decomposition. It has a simple update scheme compared to other iterative solvers running on Riemannian manifolds. We further designed the SPGD method (Algorithm 2) and SHOEVD solver (Algorithm 3) to decompose the sample moment tensor <ref type="bibr">(1)</ref>. The algorithms are free of constructing and storing moment tensors, and their iterations are conducted by streaming only a small subset of samples. Tremendous computational savings were seen in numerical experiments with minimal loss of accuracy. We also showed the applicability of the Tucker decomposition of moment tensors in comparison to the CP format, as well as to detect anomalies and allocate assets on real datasets. Finally, utilizing Riemannian manifold optimization theory, we derived theoretical guarantees for the PGD sequence on the Grassmannian. The sequence always converges to a first-order critical point, and almost surely to a second-order critical point.</p><p>There are a number of directions that would be interesting to study in the future.</p><p>1. Implementation for cumulants. Cumulant tensors are central to non-Gaussian data analysis. They are nonlinear combinations of moments. It would be useful to derive scalable implementations for sample cumulants.</p><p>2. Extending the analysis. It is certainly worth investigating convergence guarantees in the streaming setting with adaptive step sizes of Algorithms 2 and 3. We also want to explore under what conditions is the limiting point guaranteed to be a global optimizer.</p><p>The gradient of f (22) is &#8711;f (P) = d X X X &#8226; (I n , P, . . . , P) , X X X &#8226; (P, . . . , P)&#8660; -1 + d X X X &#8226; (P, . . . , P) , X X X &#8226; (I n , P, . . . , P)&#8660; -1 = w(P)P + Pw(P). </p><p>Next, we relate the gradients of f (P) and F (Q). The Euclidean gradient of F is</p><p>Therefore, I n -QQ &#8711;F (Q)Q = 2 (I n -P) w(P)QQ = 2 (I n -P) w(P)P.</p><p>By (52), we have grad f</p><p>The proof of Proposition 6.3 is complete.</p><p>C Proof of Lemma 6.4</p><p>Proof. We derive the update on P t (21) based on the iterate Q t <ref type="bibr">(20)</ref>. We decompose the update direction &#949; t &#8711;F (Q t ) into two parts respectively in colspan(Q t ) and colspan(Q</p><p>denotes the QR factorization. (Note this quantity is full-rank according to C3 in Section 6.2 and orth : St(n, r) &#8656; St(n, r) is C &#8734; at full-rank inputs <ref type="bibr">[66]</ref>. ) By the derivative formula of the QR factorization given in [68, Algorithm 1] (originally <ref type="bibr">[66]</ref>),</p><p>To see the second equality, I n -Q t Q t = I n -Q t Q t , because Q t and Q t are both the orthonormal bases for colspan(Q t ). Moreover, given the linear operator s : R r&#8594;r &#8656; R r&#8594;r , 5 the term</p><p>Here the second line refers to the symmetric term</p><p>. We transform the right-hand side of ( <ref type="formula">56</ref>) from an expression in terms of Q to one with P.</p><p>For the first component of ( <ref type="formula">56</ref>), we separate</p><p>Also,</p><p>by the fact that pseudoinversion satisfies X &#8224; Y &#8224; = (YX) &#8224; if X and Y respectively have orthonormal rows and columns. Inserting <ref type="bibr">(54)</ref> gives</p><p>where the last equality is by <ref type="bibr">(53)</ref> and the definition of &#945; t <ref type="bibr">(28)</ref>. Moreover,</p><p>Combining ( <ref type="formula">57</ref>) and ( <ref type="formula">58</ref>),</p><p>The formula of grad f (52) is used in the last equality. Therefore, the first component of (56) reads</p><p>As for second component of (56), note</p><p>= 4 grad f (P t )&#945; t (P t ) &#8224; 2 . 5 The operator s is a composition of element-wise multiplication and skew-symmetrization. See [68, Algorithm 1] for details.</p><p>The of Q t , which preserves norms, gives the first equality. The second is from <ref type="bibr">(59)</ref>. The third component of ( <ref type="formula">56</ref>) is dealt with directly by <ref type="bibr">(27)</ref>,</p><p>Thus, (56) consists of the main component <ref type="bibr">(60)</ref> and the residuals O &#949; 2 t grad f (P t )&#945; t (P t ) &#8224; 2 , O &#949; 2 t grad f (P t ) 2 as needed in <ref type="bibr">(29)</ref>. This completes the proof of Lemma 6.4.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D Proof of Lemma 6.5</head><p>Proof. Consider the step size &#949; t &gt; 0 and projector P &#8712; Gr(n, r). Definition (28) reads &#945; t (P) = P + 2&#949; t Pw(P)P = P (I n + 2&#949; t w(P)) P.</p><p>(</p><p>So colspan(&#945; t (P)) &#8838; colspan(P). Also for any vector x &#8712; colspan(P),</p><p>x &#945; t (P)x = x P(I n + 2&#949; t w(P))Px = x (I n + 2&#949; t w(P))x &#8805; x 2 , as Px = x and I n + 2&#949; t w(P) I n . It follows colspan(&#945; t (P)) = colspan(P), and &#945; t (P) has only r non-zero singular values, each greater than 1. By properties of pseudoinverses, &#945; t (P) &#8224; &#945; t (P) = Proj range(&#945;t(P)) = P,</p><p>and &#945; t (P) &#8224; has only r non-zero singular values, each less than 1. Hence</p><p>Constrain the step size by 0 &lt; &#949; t &#8734; &#915; 1 := 1 4 &#8712; r sup P&#8712;Gr(n,r) Pw(P)P ,</p><p>where sup P&#8712;Gr(n,r) Pw(P)P is finite because the function P &#8656; Pw(P)P is continuous and Gr(n, r) is compact. (We exclude the trivial case Pw(P)P &#8801; 0 when X X X = 0.) From ( <ref type="formula">62</ref>), P&#945; t (P) &#8224; = &#945; t (P) &#8224; &#945; t (P)&#945; t (P) &#8224; P = -&#945; t (P) &#8224; (P&#945; t (P)).</p><p>Then by ( <ref type="formula">63</ref>) and ( <ref type="formula">64</ref>), P&#945; t (P) &#8224; =&#945; t (P) &#8224; (P&#945; t (P)) &#8734; &#945; t (P) &#8224; P&#945; t (P)</p><p>We now compare grad f (P) and grad f (P)&#945; t (P) &#8224; . By <ref type="bibr">(27)</ref>, grad f (P)P = (I n -P)w(P)P.</p><p>Inserting the above equation into <ref type="bibr">(27)</ref>, grad f (P) = sym 2(I n -P)w(P)P = sym 2 grad f (P)P . </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Since</head><p>Then due to triangle inequality and (65), grad f (P) &#8734; &#8712; 2 grad f (P)&#945; t (P) &#8224; + &#8712; 2 grad f (P) P&#945; t (P) &#8224; = &#8712; 2 grad f (P)&#945; t (P) &#8224; + &#8712; 2 grad f (P)&#945; t (P) &#8224; P&#945; t (P) &#8224; &#8734; &#8712; 2 grad f (P)&#945; t (P) &#8224; 1 + P&#945; t (P)</p><p>grad f (P)&#945; t (P) &#8224; .</p><p>Inserting <ref type="bibr">(66)</ref> shows the last inequality. Thus, grad f (P)&#945; t (P) &#8224; &#8805; &#8712; 2 3 grad f (P) .</p><p>On the other hand, as &#945; t (P) &#8224; = P&#945; t (P) &#8224; , it holds grad f (P)&#945; t (P) &#8224; &#8734; grad f (P)P &#945; t (P) &#8224; &#8734; &#8712; 2r 2 grad f (P) .</p><p>The results ( <ref type="formula">67</ref>) and ( <ref type="formula">63</ref>) give the last derivation. Combining the above two inequalities completes the proof of Lemma 6.5.</p><p>E Proof of Proposition 6.6</p><p>Proof. Let P &#8712; Gr(n, r) and %P &#8712; T P Gr(n, r). We first show the computation of the Riemannian Hessian. Applying </p><p>Recall (51) that &#8711;f (P) = w(P)P + Pw(P). We di"erentiate &#8711;f (P) to obtain the Euclidean Hessian of f :</p><p>&#8711; </p><p>where the di"erential is</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Dw(P)[%P]</head><p>= 2d(d -1) sym X X X &#8226; (I n , %P, P, . . . , P) , X X X &#8226; (I n , P, P, . . . , P)&#8660; -1 = d(d -1) X X X &#8226; (I n , (%P)P + P%P, P, . . . , P) , X X X &#8226; (I n , I n , P, . . . , P)&#8660; -1 = d(d -1) X X X &#8226; (I n , %P, P, . . . , P) , X X X &#8226; (I n , I n , P, . . . , P)&#8660; -1 .</p><p>(70)</p><p>The last equality follows from (%P)P + P%P = %P in <ref type="bibr">(24)</ref>. (71)</p><p>We compute the matrix M. Due to the rank constraint on P, we utilize its block structure to facilitate the computation. W.l.o.g., we suppose the rank-r orthogonal projection P is: P = &#63725; I r 0 r&#8594;(n-r) 0 (n-r)&#8594;r 0 (n-r)&#8594;(n-r)</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#63736;</head><p>&#8712; Gr(n, r), by choosing an appropriate for R n . From <ref type="bibr">(52)</ref>, grad f (P) = sym 2(I n -P)w(P)P = 0, and ( <ref type="formula">24</ref>), %P = (%P) P + P%P for all %P &#8712; T P Gr(n, r), we obtain block matrices for w(P) and {%P i }. Namely there are PSD matrices A &#8712; S 2 (R r ), C &#8712; S 2 (R n-r ) and matrices H i &#8712; R (n-r)&#8594;r for i &#8712; [r(nr)] such that</p><p>Since %P i are eigenmatrices of Hess f (P), by <ref type="bibr">(41)</ref> we see Here we note that since A is PSD and &#949; &gt; 0, the inverse (I r + 2&#949;A) -1 exists and is symmetric. Thus the left component in the inner product (80) is 2 sym</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>For ease of analysis, we ensure the uniqueness of the QR factorization by using the one where all diagonal entries of the R factor are positive. But none of our algorithms require this choice.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>Similar to our algorithm scheme, the comparable algorithms are mostly gradient and Hessian-driven methods. Since we could not identify an implementation for the Jacobi rotation method<ref type="bibr">[31,</ref><ref type="bibr">41]</ref>, it is not included in the comparison.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>The dataset is available at http://xudongkang.weebly.com/data-sets.html.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3"><p>The HFRX indices dataset can be found at: https://www.hfr.com/family-indices/hfrx .</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_4"><p>&#8730; r in<ref type="bibr">(47)</ref>. (C3) "&#8658;": From grad f (P) = 0 and (27), we knowgrad f (P)Q = 0 = grad F (Q) = (I n -QQ )&#8711;F (Q),</p></note>
		</body>
		</text>
</TEI>
