<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Sum-of-squares meets square loss: Fast rates for agnostic tensor completion</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2019</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10112398</idno>
					<idno type="doi"></idno>
					<title level='j'>Conference on Learning Theory</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Dylan J. Foster</author><author>Andrej Risteski</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We study tensor completion in the agnostic setting. In the classical tensor completion problem, we receive n entries of an unknown rank-r tensor and wish to exactly complete the remaining entries. In agnostic tensor completion, we make no assumption on the rank of the unknown tensor, but attempt to predict unknown entries as well as the best rank-r tensor.For agnostic learning of third-order tensors with the square loss, we give the first polynomial time algorithm that obtains a "fast" (i.e., O(1 n)-type) rate improving over the rate obtained by reduction to matrix completion. Our prediction error rate to compete with the best d × d × d tensor of rank-r is Õ(r 2 d 3 2 n). We also obtain an exact oracle inequality that trades off estimation and approximation error.Our algorithm is based on the degree-six sum-of-squares relaxation of the tensor nuclear norm. The key feature of our analysis is to show that a certain characterization for the subgradient of the tensor nuclear norm can be encoded in the sum-of-squares proof system. This unlocks the standard toolbox for localization of empirical processes under the square loss, and allows us to establish restricted eigenvaluetype guarantees for various tensor regression models, with tensor completion as a special case. The new analysis of the relaxation complements Barak and Moitra (2016), who gave slow rates for agnostic tensor completion, and Potechin and Steurer (2017), who gave exact recovery guarantees for the noiseless setting. Our techniques are user-friendly, and we anticipate that they will find use elsewhere.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Recovering structured mathematical objects from partial measurements is a fundamental task in machine learning and statistical inference. One important example, which has been a mainstay of modern research in machine learning and high-dimensional statistics, is matrix completion. Here, we receive n entries from an unknown d &#215; d matrix, and the goal is to complete the remaining entries when n is as small is possible.</p><p>The key structural assumption that enables recovery when n &#8810; d 2 is that the underlying matrix is low-rank. A celebrated line of research on matrix completion <ref type="bibr">(Srebro and Shraibman, 2005;</ref><ref type="bibr">Cand&#232;s and Recht, 2009;</ref><ref type="bibr">Cand&#232;s and Tao, 2010;</ref><ref type="bibr">Keshavan et al., 2010;</ref><ref type="bibr">Gross, 2011;</ref><ref type="bibr">Recht, 2011)</ref> has culminated in the following guarantee: to exactly recover an incoherent rank-r matrix, n = &#213;(rd) uniformly sampled entries suffice.</p><p>While low-rank matrix completion has seen successful application across many problem domains (most famously in the context of the Netflix Problem), for many tasks it is natural to consider not just pairwise interactions but higher-order interactions, leading to the problem of tensor completion. Tensor completion poses significant computational hurdles compared to the matrix case, but in an impressive recent work, <ref type="bibr">Potechin and Steurer (2017)</ref> provide an efficient algorithm based on the sum-of-squares hierarchy that exactly recovers a d &#215; d &#215; d tensor of rank-r with incoherent and orthogonal components from &#213;(rd 3 2 ) measurements. While this undershoots the optimal statistical rate of &#213;(rd), there is evidence that this is optimal amongst polynomial time algorithms under certain average-case hardness assumptions <ref type="bibr">(Barak and Moitra, 2016)</ref>.</p><p>In real-world applications, the assumption that the underlying tensor is truly low-rank may be too strong, and model misspecification is unavoidable. This is the main thrust of agnostic learning <ref type="bibr">(Haussler, 1992;</ref><ref type="bibr">Kearns et al., 1994)</ref> which, rather than attempting to recover an unknown model in some class, attempts to predict as well as the best model. The aim of this paper is to develop guarantees for agnostic tensor completion: predicting as well as the best low rank tensor, even when exact recovery is impossible.</p><p>We work in the following agnostic tensor regression model, which captures tensor completion as a special case: we receive examples (X 1 , Y 1 , ), . . . , (X n , Y n ) i.i.d. from an unknown distribution D, where each instance X t is a d &#215; d &#215; d tensor and Y t is a real-valued response. Letting &#10216;&#8901;, &#8901;&#10217; denote the usual inner product, we measure predictive performance of a given d &#215; d &#215; d tensor T via its square loss risk L D (T ) = E (X,Y )&#8764;D (&#10216;T, X&#10217; -Y ) 2 . Our goal is to use the samples to produce a predictor Tn that enjoy low excess risk:</p><p>L D ( Tn )inf</p><p>where the bound &#949;(n, r, d) converges to zero as n &#8594; &#8734;. When the observations X are uniformly distributed indicators (i.e. X = e i &#8855; e j &#8855; e k , where (i, j, k) is uniform) this recovers the usual measurement model for tensor completion. If the model is well-specified in the sense that Y = &#10216;X, T &#8902; &#10217; + &#958;, where T &#8902; is a low-rank tensor and E[&#958; X] = 0, then low excess risk implies approximate recovery of T &#8902; . In general, the guarantee ( <ref type="formula">1</ref>) is interesting because it implies non-trivial predictive performance even in the presence of severe model misspecification.</p><p>In the matrix case, agnostic excess risk guarantees of the type in (1) were characterized by <ref type="bibr">Koltchinskii et al. (2011)</ref>. There it was shown that to compete with the best rank-r matrix with bounded entries, empirical risk minimization with nuclear norm penalization obtains a fast rate of the form &#949;(n, r, d) = O(rd log d n), and also showed that this is optimal. <ref type="foot">1</ref> The nomenclature fast rate is intended to contrast the 1 n dependency on n with slow rates, which have a 1 &#8730; n dependency on n, and which are typically much simpler to obtain (see <ref type="bibr">Gaiffas and Lecu&#233; (2011)</ref> for slow rates in the matrix setting).</p><p>The results of <ref type="bibr">Koltchinskii et al. (2011)</ref> leverage strong understanding of the (matrix) nuclear norm, namely matrix concentration and decomposability/subgradient properties. In this paper, we tackle the following questions:</p><p>&#8226; Can we give similar guarantees for agnostic tensor completion?</p><p>&#8226; Can we obtain fast rates while at the same time relaxing the strong statistical assumptions (low rank observations, incoherence) needed for exact recovery?</p><p>&#8226; What are the best rates we can obtain subject to employing a polynomial-time algorithm?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Our contributions</head><p>Our main result is to give the first polynomial time algorithm with a fast O(1 n)-type rate for agnostic completion of third order tensors that improves over the rate obtained by the natural reduction to matrix completion. Our main theorem gives excess risk bounds relative to low-rank orthogonal tensors, i.e. tensors of the form</p><p>where u i 2 = v i 2 = w i 2 = 1 and {u i } are orthogonal, as are {v i } and {w i }. The result is as follows.</p><p>Theorem 1 (informal). There is a convex set of d&#215; d&#215; d tensors T derived from a sum-of-squares relaxation of the tensor nuclear norm, for which the empirical risk minimizer Tn &#8758;= arg min T &#8712;T 1 n &#8721; n t=1 (&#10216;T, X t &#10217; -Y t )<ref type="foot">foot_1</ref> can be computed in polynomial time, and guarantees that with probability at least 1o(1), L D ( Tn )inf</p><p>(3)</p><p>The guarantee applies to both random indicator measurements (tensor completion) and gaussian measurements (tensor compressed sensing).</p><p>The full version of the theorem is stated in Section 3. The result be thought of as a generalization of the agnostic matrix completion results of <ref type="bibr">Koltchinskii et al. (2011)</ref> to higher-order tensors. We also achieve a more general exact oracle inequality that trades off approximation and estimation error. This takes the form</p><p>where r(T ) denotes the rank of T . 2</p><p>Algorithm. Our algorithm is based on the sum-of-squares (SoS) hierarchy of convex relaxations <ref type="bibr">(Shor, 1987;</ref><ref type="bibr">Parrilo, 2000;</ref><ref type="bibr">Lasserre, 2001)</ref>, applied to the tensor nuclear norm. For k &#8712; N, the sum-of-squares hierarchy defines a sequence of outer convex relaxations &#8901; nuc k that give an increasingly tight approximation to the (hard to approximate <ref type="bibr">(Hillar and Lim, 2013)</ref>) tensor nuclear norm &#8901; nuc as k is increased. The SoS nuclear norm was previously used in the work of <ref type="bibr">Barak and Moitra (2016)</ref>, who gave O(1 &#8730; n) rates for tensor completion in the agnostic model with absolute loss. Their main contribution was to show how certain spectral bounds arising in 3-SAT refutation <ref type="bibr">(Coja-Oghlan et al., 2004)</ref> bound the Rademacher complexity for the unit ball of the degree-six SoS nuclear norm, thereby controlling the usual empirical process via uniform convergence. Our guarantees are based on empirical risk minimization over the (scaled) ball in this norm, specifically the degree-six relaxation &#8901; nuc 6 , and our analysis builds on their Rademacher complexity bounds.</p><p>Restricted eigenvalue and subgradient lemmas. While the O(1 &#8730; n)-type rates provided in <ref type="bibr">Barak and Moitra (2016)</ref> are optimal (in terms of n dependence) for generic Lipschitz losses, it is not immediately obvious whether their result can be used to provide O(1 n)-type fast rates for strongly convex losses like the square loss. It has long been recognized that to obtain fast rates for prediction with strongly convex losses, more refined control of the empirical process is necessary. In particular, it is well-known that local Rademacher complexities and related fixed point complexities characterize the rates for empirical risk minimization in a data-dependent fashion. To bound such localized complexities under convex relaxations, the typical approach is to establish a restricted eigenvalue property for the empirical design matrix <ref type="bibr">(Negahban et al., 2012;</ref><ref type="bibr">Bartlett et al., 2012;</ref><ref type="bibr">Lecu&#233; and</ref><ref type="bibr">Mendelson, 2017, 2018)</ref>. Establishing such guarantees for regularizers such as the &#8467; 1 norm, nuclear norm, and so on is usually done by appealing to properties of the subgradient of the norm and proving that the norm (approximately) decomposes across certain subspaces <ref type="bibr">(Negahban et al., 2012)</ref>. The main tool we establish here, which allows us to unlock the full power of localized complexities and establish rates fast rates, is a new guarantee of this type for the SoS nuclear norm. Informally, we show: Theorem 2 (informal). Let T &#8902; be a fixed orthogonal tensor, and let T = {T &#8758; T nuc 6 &#8804; T &#8902; nuc 6 }. Then:</p><p>This theorem is a consequence of a more general result we prove in Section 2, which gives a characterization of the subgradient of &#8901; nuc 4 at any orthogonal tensor. The basic idea is to show that a characterization for the subgradient of the tensor nuclear norm given in <ref type="bibr">Yuan and Zhang (2016)</ref> can be captured in the sumof-squares proof system. The final result (Theorem 4) is fairly user-friendly, and packages the complexity of SoS into a self-contained statement about geometric properties of the norm &#8901; nuc 4 . We hope that this will find use in other applications. As one concrete example, in Section 3 we show how the subgradient characterization can also be used to give agnostic fast rates for the problem of low-rank tensor sensing with gaussian measurements. Here we also obtain &#213;(r 2 d 3 2 n)-type fast rates.</p><p>Lower bounds. Lastly, we prove that our results can't be strengthened significantly while still obtaining computationally efficient algorithms. It is straightforward to show that there is an inefficient predictor that obtains O(rd n) excess risk, whereas the dimension scaling in Theorem 1 is O(d 3 2 ). This scaling is shared by the other results based on the SoS nuclear norm <ref type="bibr">(Barak and Moitra, 2016;</ref><ref type="bibr">Potechin and Steurer, 2017)</ref>, and <ref type="bibr">Barak and Moitra (2016)</ref> show that finding relaxations for which the Rademacher complexity grows as o(d 3 2 ) is at least as hard as refuting random instances of 3-XOR. In Section 4 we give a computational lower bound for agnostic learning that shows that obtaining square loss excess risk scaling as o(d 3 2 ) is at least as hard as a certain distinguishing problem for random 3-XOR, sometimes called learning sparse parities with noise.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">Related work</head><p>Early algorithms for computationally efficient tensor completion relied on unfolding: reshaping the tensor into a matrix and applying a matrix completion algorithm <ref type="bibr">(Tomioka et al., 2011;</ref><ref type="bibr">Tomioka and Suzuki, 2013;</ref><ref type="bibr">Romera-Paredes and Pontil, 2013;</ref><ref type="bibr">Mu et al., 2014;</ref><ref type="bibr">Jain and Oh, 2014)</ref>. This approach yields suboptimal results for third-order tensors and other odd-order tensors. For example, for a third-order tensor in (R d ) &#8855;3 , the most "balanced" unfolding of the tensor is a d &#215; d 2 matrix, and so directly reducing to an algorithm for agnostic matrix completion (e.g. <ref type="bibr">Koltchinskii et al. (2011)</ref>) would yield suboptimal O(d 2 )-type sample complexity.</p><p>Recent results of <ref type="bibr">Montanari and Sun (2016)</ref>, <ref type="bibr">Potechin and</ref><ref type="bibr">Steurer (2017), and</ref><ref type="bibr">Xia et al. (2017)</ref> all give sub-O(d 2 ) type rates, but apply only to noiseless or well-specified models, and do not obviously extend to the agnostic setting. <ref type="bibr">Potechin and Steurer (2017)</ref> give exact completion in the noiseless case after &#213;(rd 3 2 ) entries are observed. <ref type="bibr">Montanari and Sun (2016)</ref> show that a refined spectral approach based on unfolding can obtain sub-O(d 2 ) rates for prediction error in the noiseless setting, but they obtain a rate of O(1 n 2 3 ) that falls short of the O(1 n)-type rate we provide. Finally, <ref type="bibr">Xia et al. (2017)</ref> recently showed that an algorithm based on power iteration provides O(d n)-type rates once n = &#8486;(d 3 2 ), but their result only applies to well-specified models, and it seems unlikely that this algorithmic approach succeeds in the fully agnostic setting.</p><p>Our results build on the seminal work of <ref type="bibr">Barak and Moitra (2016)</ref> and <ref type="bibr">Potechin and Steurer (2017)</ref>, both of which use sum-of-squares to give O(d 3 2 )-type guarantees that improve on unfolding. The former was the first paper to study the agnostic setting, and gave slow excess risk guarantees for the absolute loss (i.e., rates growing as 1 &#8730; n ). Technically, our results build on their Rademacher complexity bounds for the SoS norms <ref type="bibr">(Barak and Moitra, 2016)</ref>, as well as spectral bounds from <ref type="bibr">Hopkins et al. (2015)</ref>. Obtaining fast rates, however, requires developing new technical tools and necessitates that we control the subgradient of the SoS norms. Our analysis here builds on ideas used to construct dual certificates for tensor completion in <ref type="bibr">Potechin and Steurer (2017)</ref>.</p><p>Lastly, we mention that various recent works have begun to explore that power of SoS in other agnostic learning settings. Notably, <ref type="bibr">Klivans et al. (2018)</ref> provide square loss risk bounds for SoS algorithms for robust regression. To the best of our knowledge our work is the first to provide fast rates for SoS algorithms in any agnostic setting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.3">Preliminaries</head><p>We let &#8901; p denote the &#8467; p norm, i.e. if x &#8712; R d is a vector then x p = &#8721; d i=1 x i p 1 p . For a matrix A we let A op denote the operator norm/spectral norm and let A nuc denote the nuclear norm. For matrices or tensors we let &#8901; F denote the element-wise &#8467; 2 norm. For any norm &#8901; we let &#8901; &#8902; denote the dual.</p><p>We use &#213; to suppress factors logarithmic in d, r, and 1 &#948;, where &#948; is the failure probability.</p><p>Tensor notation. The outer product between two vectors u &#8712; R d 1 and v &#8712; R d 2 is denoted u &#8855; v, and belongs to the space R d 1 &#8855; R d 2 . For a given vector u, we write u &#8855;k = u &#8855; &#8943; &#8855; u (k times), and likewise define (R d ) &#8855;k = R d &#8855; &#8943;&#8855; R d . This paper develops algorithms for completion of 3-tensors in (R d ) &#8855;3 , which we frequently identify with elements of R d &#215; R d &#215; R d . In more detail, for a tensor T &#8712; (R d ) &#8855;3 , we let T i,j,k be such that T = &#8721; i,j,k T i,j,k &#8901; e i &#8855; e j &#8855; e k . For a pair of matrices A, B, we let A &#8855; B denote the Kronecker product, which obeys the relation</p><p>Whenever T is an orthogonal tensor of the form (2), we let r(T ) denote the rank.</p><p>2 Subgradient of the sum-of-squares nuclear norm 2.1 Tensor nuclear norm and sum-of-squares relaxation</p><p>In the classical results on matrix completion (Cand&#232;s and <ref type="bibr">Recht, 2009;</ref><ref type="bibr">Cand&#232;s and Tao, 2010;</ref><ref type="bibr">Recht, 2011;</ref><ref type="bibr">Koltchinskii et al., 2011)</ref>, a central object is the nuclear norm, which arises as a convex relaxation of the rank. A natural candidate to develop efficient algorithms for tensor completion is the tensor nuclear norm (e.g. Hillar and Lim (2013); <ref type="bibr">Friedland and Lim (2018)</ref>), which may be defined via</p><p>and whose dual is the injective tensor norm X inj = sup x 2 = y 2 = z 2 =1 &#10216;X, x &#8855; y &#8855; z&#10217;. Unfortunately, the optimization problem here-maximizing a degree-three polynomial over the sphere-is intractable in general.</p><p>The approach we take, following <ref type="bibr">Barak and Moitra (2016)</ref> and <ref type="bibr">Potechin and Steurer (2017)</ref>, is to employ the sum-of-squares hierarchy of convex relaxations <ref type="bibr">(Shor, 1987;</ref><ref type="bibr">Parrilo, 2000;</ref><ref type="bibr">Lasserre, 2001)</ref> which provides an increasingly tight sequence of relaxations of the optimization problem in (4). To describe the relaxations, we require the notion of a pseudodistribution. Definition 1 (Pseudodistribution <ref type="bibr">(Barak and Steurer (2016)</ref>)). Let &#181; &#8758; R d &#8594; R be a finitely supported function and let &#7868;&#181; f = &#8721; x&#8712;supp(&#181;) &#181;(x)f (x). &#181; is said to be a degree-k pseudodistribution if &#7868;&#181; 1 = 1 and &#7868;&#181; f 2 &#8805; 0 for all polynomials f of degree at most k 2.</p><p>Given a degree-s pseudodistribution &#181; and sytem of polynomial inequalities A = {f 1 &#8805; 0, . . . , f m &#8805; 0} &#8746; {g 1 = 0, . . . , g m &#8242; = 0}, we write &#181; &#8871; A if for all S &#8838; [m] and all sum-of-squares polynomials h such that</p><p>and &#7868;&#181; [g i q] = 0 for all i &#8712; [m &#8242; ] and all polynomials q such that deg(g i q) &#8804; &#8467;.</p><p>With the pseudodistribution formalism, we define the degree-k SoS injective norm as follows:<ref type="foot">foot_2</ref> </p><p>The degree-k SoS nuclear norm is simply defined as the dual:</p><p>and then</p><p>The SoS nuclear norm and injective norm can be evaluated in d O(k) time <ref type="bibr">(Gr&#246;tschel et al., 1981;</ref><ref type="bibr">Barak and Steurer, 2016)</ref>. Moreover, the norms obey the ordering T nuc &#8805; . . . &#8805; T nuc k &#8805; . . . &#8805; T nuc 2 &#8805; T F , and likewise</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Subgradient and norm compatibility</head><p>Our algorithms are based on empirical risk minimization with SoS nuclear norm constraints, i.e., algorithms that minimize the empirical loss over the set T = T &#8712; (R d ) &#8855;3 T nuc k &#8804; &#964; for appropriate choice of &#964; and k. The technical challenge to analyzing this type of relaxation is that even if measurements are realized by a rank-r tensor, there is nothing that guarantees a-priori that the tensor Tn output by the algorithm is itself low rank. Letting T be a rank-r orthogonal tensor, our main technical result here shows that if &#964; = T nuc k , then for all elements of T &#8242; &#8712; T , the error &#8710; = T &#8242; -T is "approximately low rank" in a sense that suffices to guarantee good generalization performance. Theorem 3 (Formal version of Theorem 2). Let k &#8805; 4. Let T &#8712; (R d ) &#8855;3 be an orthogonal rank-r tensor, and let T &#8242; &#8712; (R d ) &#8855;3 be an arbitrary tensor with</p><p>In the analysis of nuclear norm regularization for matrix completion-and more broadly, throughout highdimensional statistics-the key tool used to establish guarantees along the lines of ( <ref type="formula">9</ref>) is a characterization for the subgradient for the nuclear norm, and the related notion of decomposability <ref type="bibr">(Negahban et al., 2012;</ref><ref type="bibr">Negahban and Wainwright, 2012)</ref>. It is known clasically <ref type="bibr">(Watson, 1992)</ref> that for any matrix W with singular value decomposition W = U &#931;V &#8890; ,</p><p>Our approach in the remainder of this section is to establish a similar result for the subgradient of the SoS nuclear norm &#8901; nuc k at any orthogonal tensor T . From here Theorem 3 will quickly follow.</p><p>As a first step, we need to define certain subspaces and projection operators associated with T .</p><p>Subspaces For the remainder of the section we let T be a rank-r orthogonal tensor as in (2). Define U = span{u i }, V = span{v i }, and W = span{w i }, and note that each subspace has dimension at most r.</p><p>orthogonal projections onto these subspaces and P U &#8869; , P V &#8869; and P W &#8869; be the projections onto the respective orthogonal complements. We define projection operators from (R d ) &#8855;3 to (R d ) &#8855;3 for all 2 3 combinations of subspaces:<ref type="foot">foot_4</ref> </p><p>Lastly, we define two subspaces that play a central role in our analysis:</p><p>One can verify via multilinearity that</p><p>, and that any tensor in the range of Q T &#8741; spans at most r dimensions along at least two modes. We can now state our main theorem for the subgradient.</p><p>Theorem 4 (Subgradient of SoS nuclear norm). Let k &#8805; 4, and let T = &#8721; r i=1 &#955; i &#8901;u i &#8855;v i &#8855;w i be an orthogonal rank-r tensor. Define</p><p>, and for all</p><p>In other words, Theorem 4 states that</p><p>which we may view as a generalization of the matrix subgradient characterization (10). <ref type="bibr">Yuan and Zhang (2016)</ref> proved a similar result for the (exact) tensor nuclear norm. Our proof of Theorem 4 shows that the essence of their proof can be captured by a low-degree sum-of-squares proof. It builds on the approach introduced in Potechin and Steurer (2017) to provide dual certificates for exact tensor completion.</p><p>With the subgradient lemma in hand, the path to the "approximately low rank" result of Theorem 3 is clear. Suppose that T &#8242; nuc k+2 &#8804; T nuc k+2 , and let &#8710; = T &#8242; -T . By appropriately choosing the dual tensor X in (13), we can show that</p><p>which is a consequence of the earlier remark that all the projections used to define Q T &#8741; in ( <ref type="formula">12</ref>) project into r dimensions along at least two modes. This full argument is in Appendix B.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Agnostic tensor completion</head><p>We now state our main learning results, which use the SoS nuclear norm to give efficient algorithms with fast rates for agnostic tensor completion and tensor sensing. For both results we receive observations</p><p>the goal is to obtain low square loss excess risk in the sense of equation ( <ref type="formula">1</ref>). We let &#202;n denote the empirical expectation, which is uniform over the examples {(X t , Y t )} n t=1 .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Tensor completion</head><p>In the tensor completion model we take observations X t to be of the form X t = e it &#8855; e jt &#8855; e kt , where</p><p>is selected uniformly at random. <ref type="foot">6</ref> In the noiseless or well-specified setting, this corresponds to observing a single entry of an unknown tensor, but we make no assumption on the responses Y t other than boundedness. The main theorem is as follows.</p><p>Theorem 5 (Formal version of Theorem 1). Let &#964; &gt; 0 be fixed. Suppose that Y &#8804; R almost surely, and let Tn be the empirical risk minimizer over the tensor class</p><p>Let us spend a moment interpreting the theorem. First, let T &#8902; = arg min T &#8758;rank-r L D (T ). Then, by setting &#964; = T &#8902; nuc 6 , we are guaranteed that with probability at least 1&#948;,</p><p>More generally, (15) implies an exact oracle inequality <ref type="bibr">(Koltchinskii et al., 2011;</ref><ref type="bibr">Gaiffas and Lecu&#233;, 2011)</ref>: With probability at least 1&#948;, we have</p><p>Let us compare the result in detail with <ref type="bibr">Barak and Moitra (2016)</ref>, which is the only other polynomial time agnostic tensor completion result with sub-O(d 2 ) sample complexity. For general noise distributions, their analysis gives an excess risk bound that scales as &#213;</p><p>. The bound in (15) matches this dependence on all the parameters, but is squared, and is thus always tighter. The result has excess risk against arbitrary tensors, however, while our bound requires orthogonality of the benchmark. We do not know whether this restriction can be removed.</p><p>Interestingly, while <ref type="bibr">Barak and Moitra (2016)</ref> give excess risk bounds against incoherent tensors, we do not require incoherence. This is because we control the complexity of the benchmark through the &#8467; &#8734; norm of the entries rather than through the Frobenius norm; this parallels the situation in the matrix setting <ref type="bibr">(Koltchinskii et al., 2011;</ref><ref type="bibr">Gaiffas and Lecu&#233;, 2011)</ref>. Applying the spectral bounds of <ref type="bibr">Barak and Moitra (2016)</ref> without incoherence requires slightly tightening the analysis.</p><p>Lastly, we remark that the guarantee ( <ref type="formula">16</ref>) requires setting the parameter &#964; based on the norm of the unknown benchmark T &#8902; . It is likely that this can be relaxed by appealing to penalized empirical risk minimization rather than empirical risk minimization as in <ref type="bibr">Koltchinskii et al. (2011)</ref>, but we leave this for future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Tensor sensing</head><p>In this setting we give agnostic learning guarantees for a setting we call tensor sensing, which generalizes the matrix compressed sensing setup studied in <ref type="bibr">Negahban and Wainwright (2011)</ref>. We assume that observations X &#8712; (R d ) &#8855;3 have independent entries from N (0 ; 1) and-as in the tensor completion setting-allow Y &#8712; R to be arbitrary. For each tensor T , define R(T ) to be the smallest almost-sure bound on &#10216;T, X&#10217; -Y . As in the tensor completion setup, the main result is a fast rate with &#213; r 2 d 3 2 n -type scaling.</p><p>Theorem 6. Let &#964; &gt; 0 be fixed, and let Tn be the empirical risk minimizer over the tensor class</p><p>for all orthogonal tensors T &#8902; &#8712; (R d ) &#8855;3 for which n = &#8486;(r 2 (T &#8902; )d 3 2 log 1 2 d + log(1 &#948;)) and T &#8902; nuc 6 = &#964; . Overview of analysis We now sketch how the subgradient theorem can be combined with empirical process arguments to prove Theorem 5 and Theorem 6. We follow a generic recipe given in Appendix C.1specifically, Theorem 8-which shows that to control the generalization error of empirical risk minimization, it suffices to bound a certain "offset" or "shifted" empirical process. For any fixed benchmark T &#8902; , the excess risk relative to T &#8902; is bounded as</p><p>The offset process on the right-hand side was used to obtain high-probability fast rates for misspecified models by <ref type="bibr">Liang et al. (2015)</ref>, and its analysis is closely related to that of <ref type="bibr">Lecu&#233; and Mendelson (2013)</ref>; <ref type="bibr">Mendelson (2014)</ref>. To bound the process, it suffices to establish a type of lower isometry/restricted eigenvalue property, which we state here for the case of tensor regression: let X n &#8758; (R d ) &#8855;3 &#8594; R n be the data operator, which maps any tensor T to the sequence &#10216;T, X 1 &#10217;, . . . , &#10216;T, X n &#10217;, and let &#931; = E X [XX &#8890; ] &#8712; R d 3 &#215;d 3 be the covariance matrix for the vectorized measurements. Then it suffices to show that with high probability, the following restricted eigenvalue bound holds:</p><p>where c &gt; 1 &#8730; 2 is a sufficiently large constant.</p><p>Our starting point to establish the guarantee is to borrow a bound from Hopkins et al. ( <ref type="formula">2015</ref>), which states that E X &#297; nj 4 = O(d 3 4 log 1 4 d) under gaussian measurements, and suffices to bound the Rademacher complexity of our tensor class. Using this bound in conjunction with standard gaussian concentration arguments and the "peeling" method (e.g. <ref type="bibr">(Negahban and Wainwright, 2011</ref>)), we prove Theorem 10, which states that with high probability,</p><p>Combined with Theorem 3, which asserts that all</p><p>To establish the analogous bound in the tensor completion model we use the SoS Rademacher bound from <ref type="bibr">Barak and Moitra (2016)</ref>, but utilize somewhat different concentration arguments. Indeed, due to the sparse nature of the measurement distribution one cannot hope to exactly establish the restricted eigenvalue property for X n , and must instead show that it holds up to a small additive error.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Computational lower bounds</head><p>In the rank-one case, the excess risk bound of Theorem 5 scales as &#213;(d 3 2 n), while the excess risk attained by the natural inefficient algorithm scales as &#213;(d n). It is natural to ask whether this O(d 1 2 ) gap can be improved or whether it poses a fundamental barrier. In the slow rate regime, <ref type="bibr">Barak and Moitra (2016)</ref> gave a computational lower bound showing that finding efficiently computable classes of tensors for which the Rademacher complexity grows as o( d 3 2 n) is at least as hard as refuting random instances of 3-XOR with o(d 3 2 ) clauses. In this section we show that this computational hardness is also present in the fast rate regime: Under plausible average-case hardness assumptions, no polynomial time algorithm can obtain a fast rate for square loss scaling as O(d 3 2-&#949; n) for any &#949; &gt; 0.</p><p>Our improper learning lower bound applies to any algorithm that obtains low excess risk in the sense of (15), and states that under conjectured hardness of a certain distinguishing problem for 3-XOR it is not possible to improve the O(d 3 2 ) dependence on dimension.</p><p>We reduce from the 3-XOR problem over variables x &#8712; {&#177;1} d . A 3-XOR instance consists of a sequence of m clauses of the form</p><p>where z ijk &#8712; {&#177;1} is a target. We consider two families of instances:</p><p>&#8226; Planted. Fix an arbitrary assignment a &#8712; {&#177;1} d . Select m triples (i, j, k) uniformly at random with replacement.<ref type="foot">foot_6</ref> For each such triple (i, j, k), include a clause</p><p>with probability &#951;.</p><p>Note, we sample the value z i,j,k for a triple (i, j, k) only once: if the triple is sampled multiple times, the value z ijk will be the same.</p><p>&#8226; Random. Select m triples (i, j, k) uniformly at random with replacement, and take each clause to be x i &#8901; x j &#8901; x k = z ijk , where z ijk is drawn from {&#177;1} uniformly at random. Again, we sample the value z i,j,k for a triple (i, j, k) only once.</p><p>An algorithm for the distinguishing problem takes m clauses as input and outputs either "Planted" or "Random". The algorithm is said to succeed if it outputs "Planted" for planted instances and "Random" for random instances with probability at least 1o(1) over the draw of the instance. Note that the problem becomes easier as &#951; gets smaller, and in particular when &#951; = 0 the problem can be solved in polynomial time using Gaussian elimination.</p><p>Conjecture 1. There is some constant &#951; &lt; 1 4 such that no algorithm that succeeds for the 3-XOR distinguishing problem with m = o(d 3 2 ) runs in polynomial time.</p><p>All known polynomial time algorithms for distinguishing require m = &#8486;(d 3 2 ) clauses, and conjectured hardness of the closely related problem of strong refutation for random 3-XOR for with o(d 3 2 ) clauses has been used as a basis to establish hardness of other learning problems <ref type="bibr">(Daniely, 2016;</ref><ref type="bibr">Raghavendra et al., 2017;</ref><ref type="bibr">Kothari et al., 2017;</ref><ref type="bibr">Feldman et al., 2018)</ref>. Theorem 7. Let &#949; &gt; 0 be fixed. Assuming the 3-XOR distinguishing conjecture, there is no polynomial time algorithm for agnostic tensor completion that guarantees that for any distribution D, with probability at least 1o(1),</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>Our results demonstrate the power of the sum-of-squares hierarchy for agnostic statistical learning, and show that sum-of-squares algorithms can obtain fast rates for prediction with the square loss. We hope our work will serve as a starting point for applying sum-of-squares to obtain polynomial time algorithms with fast rates in statistical learning for broader classes of models.</p><p>A few immediate technical questions emerge. Can the dependence on rank in our results be improved? Can the subgradient results be extended to the general undercomplete or even overcomplete case? Can similar agnostic learning results be obtained with a more practical algorithm that does not rely on solving the full sum-of-squares SDP?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Preliminaries</head><p>A.1 Sum-of-squares proof system</p><p>and where and deg(p S &#8719; i&#8712;S f i ) &#8804; &#8467; for all S and deg(q i g i ) &#8804; &#8467; for all i. <ref type="foot">8</ref> We write A &#8866; &#8467; {h &#8805; 0} whenever such a proof exists. Our proofs going forward utilize the well-known duality of SoS proofs and pseudodistributions. See O'Donnell and Zhou (2013) and <ref type="bibr">Barak and Steurer (2016)</ref> for further discussion, as well as inference rules for the SoS proof system.</p><p>We note the following well-known, but useful lemma: Lemma 1 (Pseudo-Cauchy Schwarz). Let f and g be polynomials and let &#8467; = 2 max{deg f, deg g}. Then for any &#951; &gt; 0,</p><p>As a consequence, if &#181; is a degree-s pseudodistribution with s &#8805; &#8467;, then</p><p>A.2 Basic technical results</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2.1 Sum-of-squares norms</head><p>We state a few lemmas capturing useful properties of the sum-of-squares norms. Proposition 1. The SoS nuclear norm and SoS injective norm are dual:</p><p>Proof of Proposition 1. It is immediate from the norm definitions that</p><p>The other direction is a consequence of the standard duality theory for finite-dimensional Banach spaces.</p><p>See, e.g., <ref type="bibr">Theorem 15.4 in Rockafellar (1970)</ref>.</p><p>Lemma 2. Let T = &#8721; r i=1 &#955; i &#8901; u i &#8855; v i &#8855; w i be an orthogonal rank-r tensor. Then for all k &#8805; 4.</p><p>Lemma 2 states that the SoS relaxations of the nuclear norm and injective norm are essentially "integral" for orthogonal tensors. Note, this should not be a huge surprise since it is well-known that polynomial time methods such as power iteration succeed at decomposing orthogonal tensors <ref type="bibr">(Kolda and Bader, 2009)</ref>. We should mention it doesn't seem possible to directly apply such results to give agnostic learning guarantees along the lines of our main theorem. While our benchmark is an orthogonal tensor, the data itself may have no orthogonal structure, and thus there is no clear object to which one might apply such a decomposition.</p><p>Proof of Lemma 2. We may assume &#955; i &#8805; 0 without loss of generality. We first prove equality for the injective norms. Let i &#8902; = arg max i&#8804;r &#955; i . As a starting point, for any k we have</p><p>For the upper bound, let x, y, z be indeterminates and-exploiting orthogonality-let us change coordinates such that</p><p>From equation (20), we have</p><p>We also have</p><p>. By the additivity of SoS proofs, and since we have assumed &#955; i &#8805; 0, this implies</p><p>Putting everything together, we see that A &#8866; 4 &#10216;T, x &#8855; y &#8855; z&#10217; &#8804; max i&#8804;r &#955; i . Thus, since the &#8467; 2 norm is preserved under change of basis, it follows that if &#181; is any feasible degree-4 pseudodistribution for the maximization problem (6), we must have &#7868;&#181; &#10216;T, x &#8855; y &#8855; z&#10217; &#8804; max i &#955; i , and so T &#297; nj k &#8804; max i &#955; i .</p><p>We now establish equality for the nuclear norms. We trivially have</p><p>as a feasible solution to the minimization problem in (4). For the other direction, define X &#8902; = &#8721; r i=1 u i &#8855; v i &#8855; w i , and observe that the equality we just established for the injective norm implies X &#8902; &#297; nj k &#8804; 1. Thus, using the duality of the SoS nuclear norm and injective norm from Proposition 1, we have</p><p>where the last equality uses that {u i }, {v i }, and {w i } are all orthogonal.</p><p>Proposition 2. Let k &#8805; 4. For any degree-k pseudodistribution &#181;,</p><p>Furthermore, the following statements hold:</p><p>Proof of Proposition 2. Equation ( <ref type="formula">23</ref>) follows by rescaling a given pseudodistribution &#181; by using x &#8242; =</p><p>x &#7868;&#181; x 2 2 and so forth, so that the pseudodistribution is feasible for the maximization problem (6).</p><p>For (24), let &#181; be a degree-k pseudodistribution with &#181; &#8871; z 2 2 &#8804; 1 . Then, using (23) and the AM-GM inequality we get &#32384;T, &#7868;&#181;</p><p>Using linearity of the pseudoexpectation operator we have &#7868;&#181;</p><p>2 + y 2 2 -&#10216;T, x &#8855; y &#8855; z&#10217; &#8805; 0, and so (24) follows from the duality of pseudoexpectations and sum-of-squares proofs. The remaining statements follow by symmetry.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2.2 Projections</head><p>Here we state some basic results regarding the projection operators defined in Section 2.2. Proposition 3. Let x, y, z &#8712; R d be given. If at least two of the follow conditions hold:</p><p>Proof of Proposition 3. Suppose that x &#8712; U and y &#8712; V. Then P U &#8869; (x) = 0, and so</p><p>We also have P V &#8869; (y) = 0, and so</p><p>The remaining cases follow by symmetry.</p><p>Lemma 3. Let k &#8805; 4. For any tensor X &#8712; (R d ) &#8855;3 , and any subspaces U, V, W we have</p><p>and in particular</p><p>Proof of Lemma 3. For any degree-4 pseudodistribution &#181; over indeterminates x, y, z we have</p><p>where the first inequality uses Proposition 2 and the second uses that &#8866; 2 P X x 2 2 &#8804; x 2 2 for any subspace X. This establishes the first result. Now observe from (11</p><p>We thus obtain the second result by applying the first to each of the summands.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2.3 Flattenings</head><p>The multilinear rank of a tensor T &#8712; (R d ) &#8855;3 is the triple (r 1 , r 2 , r 3 ), where</p><p>is the dimension of the space spanned by the mode-1 fibers and r 2 (T ) and r 3 (T ) are defined likewise for the second and third mode.</p><p>We define the ith flattening map &#9837;i &#8758;</p><p>A standard result is that rank(&#9837; i (T )) = r i (T ) <ref type="bibr">(Friedland and Lim, 2018)</ref>. We also have the following comparison between the nuclear norm of the tensor and its flattenings. Lemma 4 <ref type="bibr">(Friedland and Lim (2018)</ref>, Theorem 9.4). For any tensor T &#8712; (R d ) &#8855;3 ,</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2.4 Concentration</head><p>Lemma 5 (Talagrand-type concentration for supremum of empirical process). Let F be a class of functions of the form f &#8758; Z &#8594; R. Let z 1 , . . . , z n be sampled i.i.d. from a distribution D over Z that satisfies E[f (z)] = 0 and has f (z) &#8804; c almost surely. Let &#963; 2 = sup f &#8712;F E f 2 (z). Then for any &#948; &gt; 0, with probability at least 1 -2&#948; over the i.i.d. draw of z 1 , . . . , z n ,</p><p>Proof of Lemma 5. Follows from Theorem A.1 of <ref type="bibr">Bartlett et al. (2005)</ref> applied to the classes F and -F separately, along with the standard in-expectation symmetrization lemma for uniform convergence.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Proofs from Section 2</head><p>The main result in this section is to prove Theorem 4, then use this result to prove Theorem 3. Before proceeding to the main proofs we state an intermediate result.</p><p>Lemma 6 <ref type="bibr">(Potechin and Steurer (2017)</ref>). Let x, y, z &#8712; R d be indeterminates. Let A = y 2 2 = 1 . Then for any r &#8712; [d],</p><p>Proof of Corollary 1. We will show that A &#8866; 6 -&#8721; d i=1 &#8721; j&#8800;i y 2 i y 2 j x 2 2 = -&#8721; d i=1 &#8721; j&#8800;i y 2 i y 2 j ; the term involving z 2 2 follows from the same reasoning. The desired inequality is equivalent to &#8721; d i=1 &#8721; j&#8800;i y 2 i y 2 j ( x 2 2 -1) = 0, which is clearly the product of a degree-4 polynomial and the equality constraint x 2 2 -1 = 0 .</p><p>Proof of Theorem 4. Preliminaries. We first claim that the following equalities hold:</p><p>Indeed, it is immediate from the definition of X &#8902; that Q 0 T &#8741; (X &#8902; ) = X &#8902; , and it follows from Lemma 2 that X &#8902; &#297; nj k = 1 and &#10216;X &#8902; , T &#10217; = T nuc k for all k &#8805; 4.</p><p>Bounding dual norm is sufficient. To establish the inequality (13), we reduce to a simpler problem. The claim is as follows: Fix a constant &#945; &gt; 0. If for all X &#8712; (R d ) &#8855;3 with X &#297; nj k &#8804; &#945;, we have</p><p>then ( <ref type="formula">13</ref>) holds for all X with X &#297; nj k &#8804; &#945;. To see that this is the case, observe that for any T &#8242; and all such X we have</p><p>where the first inequality uses Proposition 1, the second inequality uses ( <ref type="formula">27</ref>), and the final equality uses the definition of X &#8902; and that &#10216;Q T &#8869; (X), T &#10217; = &#10216;X, Q T &#8869; (T )&#10217; = 0. Rearranging the inequality yields (13).</p><p>Bounding the dual norm. The remainder of the proof establishes that ( <ref type="formula">27</ref>) holds for &#945; = 1 64.</p><p>Let x, y, z &#8712; R d be indeterminates. We will provide a degree-(k + 2) SoS upper bound on the polynomial &#10216;X &#8902; + Q T &#8869; (X), x &#8855; y &#8855; z&#10217;, which will suffice to establish (27).</p><p>Let</p><p>), and W = span({w i } r i=1 ), so that {u i } r i=1 is a basis for U, and likewise for the other modes. Let {u i } d i=r+1 be an arbitrary orthonormal basis for U &#8869; and likewise with {v i } d i=r+1 for V &#8869; , and {w i } d i=r+1 for W &#8869; . We perform a change of basis and let u i = v i = w i = e i , where e i is the ith standard basis vector. Then with x, y, z expressed in the new basis we can write</p><p>We now handle the second term in (28). We will establish that</p><p>under the assumption that X &#297; nj k &#8804; &#945;. To do this it suffices show that for each i individually,</p><p>with an extra additive factor of O(&#945;) &#8901; y 2 i (x 2 i + z 2 i ) when i &gt; r. Let 1 &#8804; i &#8804; d be fixed and let x &#8242; = xx i e i , y &#8242; = yy i e i , and z &#8242; = zz i e i . We write</p><p>Case: i &#8804; r. Observe that with our change of basis we have e i &#8712; U along the first mode, e i &#8712; V along the second mode, and e i &#8712; W along the third mode. In view of Proposition 3, this means we have</p><p>Consequently, using multilinearity we can write</p><p>In what follows we will repeatedly invoke that</p><p>2 &#8804; 1, and so forth. We will also use that if X &#297; nj k &#8804; &#945; then-via Proposition 2 and Lemma 3-for any indeterminates a, b, c,</p><p>These inequalities allow us to bound the terms in (31) as follows:</p><p>Adding these inequalities, we get</p><p>and by the multiplication rule for SoS proofs,</p><p>Case: i &gt; r. As in the previous case, we split the expression &#10216;Q T &#8869; (X), (x i e i + x &#8242; ) &#8855; (y i e i + y &#8242; ) &#8855; (z i e i + z &#8242; )&#10217; in (30) using multilinearity, however we can no longer argue that four of the eight terms vanish. As a starting point, for the four terms that appeared in the i &#8804; r case we can adopt the same upper bound to get</p><p>We bound the four remaining terms as follows:</p><p>Adding together all of these inequalities, we get</p><p>and</p><p>Putting everything together. Taking the inequalities we proved for the individual coordinates i and summing them up, we have</p><p>Since y 2 2 = 1 &#8834; A, we get</p><p>Returning to (28), this inequality plus the earlier bound from Corollary 1 imply</p><p>By the duality of SoS proofs and pseudodistributions, we have</p><p>&#8804; 1 as desired.</p><p>Proof of Theorem 3. We first establish equation ( <ref type="formula">14</ref>). We combine the assumption that T &#8242; nuc k+2 &#8804; T nuc k with equation ( <ref type="formula">13</ref>) to get</p><p>for all X with X &#297; nj k &#8804; 1 64 and X &#8902; as in Theorem 4. Rearranging, this yields</p><p>We now use that X &#8902; &#297; nj k &#8804; 1 (from Theorem 4) and choose X to be a point obtaining the supremum in</p><p>We now establish equation ( <ref type="formula">9</ref>). Observe that we can write</p><p>Combining this with ( <ref type="formula">14</ref>), we get</p><p>Using the triangle inequality, we upper bound the first term as</p><p>To proceed, we flatten each tensor in the summation above into a matrix and use Lemma 4 to show that the nuclear norm of the flattening leads to an upper bound. Let r 1 = dim U, r 2 = dim V, and r 3 = dim W. Then the following inequalities hold</p><p>The first inequality in each line above follows from Lemma 4 and the definitions in ( <ref type="formula">11</ref>). The second follows from the fact that A nuc &#8804; rank(A) A F for any matrix A, along with the fact that rank(&#9837; i (T )) = r i (T ) for any tensor, and that flattening does not change the entrywise &#8467; 2 norm. We have</p><p>F as well by the same argument, though the choice of r 1 r 2 r 3 in this case is arbitrary. To combine all the bounds, we use that r 1 , r 2 , r 3 &#8804; r and that orthogonal projection decreases the &#8467; 2 norm, which yields</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C Proofs from Section 3</head><p>This section of the appendix is structured as follows.</p><p>First, in Appendix C.1, we provide we provide a generalization bound for general classes of tensors and measurement models, from which all of our main statistical results will follow as special cases. This bound assumes that a restricted eigenvalue-type property holds for the tensor class and measurement model under consideration.</p><p>In Appendix C.2 and Appendix C.3 we establish that this restricted eigenvalue property holds for the measurement models in Section 3.</p><p>In Appendix C.4 we combine these results to prove the main results of that section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.1 Agnostic generalization bounds for tensor classes</head><p>In this section we given generalization guarantees for empirical risk minimization in a general learning setup. We receive a set of examples S &#8758;= (X 1 , Y 1 ), . . . , (X n , Y n ) i.i.d. from a distribution D over (R d ) &#8855;3 &#215; R. We assume that a convex class of tensors eT &#8838; (R d ) &#8855;3 is given, and that our goal is to achieve excess risk against an unknown benchmark T &#8902; &#8712; T . We analyze the performance of empirical risk minimization over T :</p><p>where Ln is the empirical square loss. The main result from this section is Theorem 8, which bounds the performance of ERM under various assumptions on the data distribution, the class T , and the benchmark T &#8902; . To state the result, recall from Section 3 that &#931; is the population correlation matrix and X n is the empirical design operator. Theorem 8. Let a benchmark T &#8902; &#8712; T be fixed, and let &#958; t = (Y t -&#10216;T &#8902; , X t &#10217;). Suppose there exist a pair of dual norms &#8901; and &#8901; &#8902; for which the following conditions hold:</p><p>1. There are constants 0 &#8804; c &lt; 2, &#947; n &gt; 0, and &#948; 0 &#8805; 0 such that with probability at least 1&#948; 0 ,</p><p>2. There are constants M &#8805; 0 and &#948; 1 &#8805; 0 such that with probability at least</p><p>3. There is a constant &#954; &#8805; 0 such that</p><p>Then with probability at least 1 -(&#948; 0 + &#948; 1 ),</p><p>Proof of Theorem 8. Since T is convex, and since Tn minimizes the (strongly convex) empirical risk, we have Ln (T &#8902; ) -Ln ( Tn ) &#8805; &#32384;&#8711; Ln ( Tn ),</p><p>By rearranging and expanding the definition of Ln , this implies</p><p>Since the left-hand side is non-negative, we can add it to the population excess risk, which implies</p><p>Rearranging, this is equal to</p><p>Since Tn -T &#8902; is an element of T -T &#8902; , we move to an upper bound by taking a supremum over elements of this set. &#8804; sup</p><p>This establishes inequality (18). It remains to use the assumptions in the theorem statement to bound the process. Property 1 states that with probability at least 1&#948; 0 , E&#10216;&#8710;, X&#10217; 2 &#8804; c &#8901; &#202;n &#10216;&#8710;, X&#10217; 2 + &#947; n for all &#8710; &#8712; T -T &#8902; . Define c &#8242; = 2c &gt; 0, so that conditioned on this event we have sup</p><p>We expand the first term in the supremum as</p><p>It follows from H&#246;lder's inequality that this expression is bounded as</p><p>, the development so far states that with probability at least 1&#948; 0 ,</p><p>Using Property 3 we upper bound the leading term by</p><p>and the event we already conditioned on implies that this is at most</p><p>where the second inequality follows from AM-GM. Finally, by Property 2 we have that with probability at least 1&#948; 1 , &#968; n &#8804; M &#8730; n, which leads to the final bound of</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.2 Restricted eigenvalue for tensor completion</head><p>The main result in this section is Theorem 9, which relates the empirical covariance and population covariance for all tensors with bounded entries and bounded SoS nuclear norm under sampling model for tensor completion. This result is then used in the proof of Theorem 5 to establish a restricted eigenvalue guarantee.</p><p>Theorem 9. Let d 3 2 &#8804; n &#8804; d 3 , and let &#949; &#8712; (0, 1 2). Suppose observed entries are drawn with replacement. Then for any &#948; &gt; 0, with probability at least</p><p>The proofs in this section pass back and forth between the with-replacement sampling model for tensor completion used in the main body of the paper and a without-replacement model, in which each entry is only observed a single time for tensor completion. To distinguish between the models, we use S &#8764; D n wr to refer to the draw of the dataset under the with-replacement sampling model and S &#8764; D n w o to refer to the draw under the without-replacement model. &#8486; &#8838; [d] 3 will denote the support of the set of observed entries.</p><p>Before proving Theorem 9, we state and prove a number of auxiliary lemmas, from which the main result will follow.</p><p>Proof of Proposition 4. This is an immediate consequence of Bernstein's inequality for without-replacement sampling. See <ref type="bibr">Bardenet et al. (2015)</ref>, Proposition 1.4.</p><p>Lemma 7. The Rademacher complexity of &#8901; &#297; nj 4 under without-replacement sampling is bounded as</p><p>Additionally, if n &#8804; d 3 , then the Rademacher complexity under with-replacement sampling is bounded as</p><p>Proof of Lemma 7. We first bound the Rademacher complexity in the without-replacement sampling case, then handle the with-replacement case by reduction. This analysis follows <ref type="bibr">Barak and Moitra (2016)</ref>, except that we handle certain "diagonal" terms that arise in the analysis slightly more carefully so as to get the right scaling for our setup, which differs from theirs in that it does not assume incoherence.</p><p>Consider a fixed draw of &#491; 1&#8758;n and X 1&#8758;n , and let Z = &#8721; n t=1 &#491; t X t . Leting x, y, z be indeterminates, we will give a degree-four SoS upper bound on the polynomial &#10216;Z, x &#8855; y &#8855; z&#10217;. This will imply that the pseudoexpectation of &#10216;Z, x &#8855; y &#8855; z&#10217; is bounded for any feasible pseudodistribution in the maximization problem defining &#8901; &#297; nj 4 .</p><p>To begin, for any fixed constant &#951; &gt; 0, Lemma 1 implies that</p><p>and so</p><p>and</p><p>We bound the first term by the operator norm of B via</p><p>For the second term, define another matrix</p><p>By the duality of sum-of-squares proofs and pseudodistributions, this implies that any degree-four pseu-</p><p>Optimizing over &#951;, we conclude that</p><p>The bound for the matrix B is taken care of Theorem 4.4 of <ref type="bibr">Barak and Moitra (2016)</ref>, which implies that</p><p>. For the matrix R, observe that under the without-replacement sampling model, for any j, k we have</p><p>Proposition 4 thus implies that for any fixed j, k, with probability at least 1&#948;,</p><p>and so, by taking a union bound and integrating out the tail, we have</p><p>Combining the bounds on B and R and using Jensen's inequality yields</p><p>This completes the bound for the without-replacement case. For the with-replacement case we reduce to the bound on the second moment above. Condition on the draw of S, and let</p><p>Then we have</p><p>Introduce a new sequence of Rademacher random variables &#963; &#8712; {&#177;1} d 3 . Then the right-hand side above is equal to</p><p>, where the inequality follows from the standard Lipschitz contraction lemma for Rademacher complexity <ref type="bibr">(Ledoux and Talagrand, 1991)</ref>. By the Hoeffding bound, we have that for any fixed index (i, j, k) with probability at least 1&#948;,</p><p>Taking a union bound and integrating out the tail, we have</p><p>We now move to the final bound by taking the expectation over S. Using Cauchy-Schwarz, the development above implies</p><p>Bernstein's inequality and the union bound imply that</p><p>where the second inequality uses that n &#8804; d 3 . For the second term, we have</p><p>where the second inequality uses the bound we proved for the without-replacement setting and the second inequality uses m &#8804; n and that the probabilities sum to one. Combining these two bounds completes the proof.</p><p>Under the with-replacement sampling model, when d 3 2 &#8804; n &#8804; d 3 , we have that for any &#948; &gt; 0, with probability at least 1&#948;,</p><p>Proof of Lemma 8. To begin, we write</p><p>Using this representation, since entries are drawn i.i.d. with replacement, we can apply Lemma 5 with the function class F = X &#8614; &#10216;T, X&#10217; 2 -E&#10216;T, X&#10217; 2 T &#8712; T . In particular, note that for all T &#8712; T we have &#10216;T, X t &#10217; = T it,jt,kt &#8804; R, and furthermore</p><p>Consequently, Lemma 5 implies that with probability at least 1&#948;, for all T &#8712; T ,</p><p>Using Jensen's inequality and splitting the supremum, we have</p><p>Using the Lipschitz contraction lemma for Rademacher complexity, we remove the square</p><p>Finally, using Lemma 7, we have</p><p>The final bound follows by using n &#8805; d 3 2 to simplify this expression.</p><p>Proof of Theorem 9. Let T = T &#8712; (R d ) &#8855;3 T &#8734; &#8804; R . Recall that for tensor completion, the population correlation matrix &#931; under with-replacement sampling is equal to 1</p><p>n, and let N = &#8968;log(&#964; max &#964; min )&#8969; + 1 and M = &#8968;log(&#946; max &#946; min )&#8969; + 1. For each i &#8712; [N ] and j &#8712; [M ] define &#964; i = &#964; max e 1-i and &#946; j = &#946; max e 1-j . Define</p><p>Using Lemma 8 and a union bound, we get that with probability at least 1&#948;, for all i, j simultaneously, for all T &#8712; T i,j ,</p><p>Now consider a fixed tensor T &#8712; T . There are two cases. First, if T nuc 4 &#8805; &#964; min and &#931; 1 2 T F &#8805; &#946; min , then there must be indices i and j for which &#964; i+1 &#8804; T nuc &#8804; &#964; i , &#946; j+1 &#8804; &#931; 1 2 T F &#8804; &#946; j . Consequently, the uniform bound above implies</p><p>On the other hand, if either T nuc &#8804; &#964; min or &#931; 1 2 T F &#8804; &#946; min , we trivially have</p><p>Combining these cases, and using the values for N and M , we get that with probability at least 1&#948;, for all T &#8712; T ,</p><p>Using the AM-GM inequality on the second-to-last term and rearranging, we have that for any &#949; &gt; 0,</p><p>When &#949; &#8712; (0, 1 2), this implies</p><p>In the remainder of the section we prove Lemma 9. The result follows from fairly standard techniques (e.g. <ref type="bibr">Wainwright (2019)</ref>), but we include the proof for completeness. We first restate some basic results on gaussian concentration. Lemma 11 (Concentration for Lipschitz functions <ref type="bibr">(Milman and Schechtman, 1986)</ref>). Let Z &#8712; R n have entries drawn i.i.d. from N (0 ; 1), and let f &#8758; R n &#8594; R be L-Lipschitz with respect to the &#8467; 2 norm. Then for all t &#8805; 0, <ref type="bibr">Lemma 12 (Gordon's Inequality (Davidson and Szarek, 2001)</ref>). Let Z a,b and Y a,b be zero-mean gaussian processes indexed by A &#215; B. Suppose that</p><p>Proof of Lemma 9. Part 1: Bound at a single scale.</p><p>We will prove that with probability at least 1 -2e</p><p>We will prove a lower bound on the random variable min &#8710;&#8712;B(&#964; )</p><p>1 &#8730; n X n (&#8710;) 2 , which is equivalent to providing an upper bound on the random variable</p><p>As a starting point, we show how to upper bound the expectation</p><p>. Note that Z &#8710;,u is a gaussian process with variance n -1 , since &#8710; F = 1. We now define a new gaussian process that will serve as an upper bound through Gordon's inequality. Let g &#8712; R p and h &#8712; R n be standard gaussian random variables, and define</p><p>Note that for any (&#8710;, u) and (&#8710; &#8242; , u &#8242; ) we have</p><p>Interpreting &#8710; and &#8710; &#8242; as vectors in R d 3 , we also have</p><p>where we have used that</p><p>It is also easily seen from the representation above that for any triple (&#8710;, u, u &#8242; ) we have equality:</p><p>. This means that the preconditions of Lemma 12 are satisfied, and so</p><p>The provides the desired upper bound in expectation. To establish the high probability result we appeal to gaussian concentration for Lipschitz functions. Let X &#8712; R n&#215;d 3 denote the sequence of measurements X 1 , . . . , X n , interpreted as a matrix with vectorized measurements as rows. Define</p><p>and so f is n -1 2 -Lipschitz with respect to &#8467; 2 . Lemma 11 therefore implies that</p><p>In other words, (39) holds.</p><p>Part 2: Bound at all scales. We will show that for any &#949; &lt; 2 &#960;, with probability at least 1 -2e</p><p>Set &#181; = &#949; 2 and consider the classes B(0, &#181;) and B(2 i-1 &#181;, 2 i &#181;) for i &#8712; N. Note that if equation ( <ref type="formula">40</ref>) fails to hold and &#8710; &#8712; B(0, &#181;) then</p><p>Furthermore, if equation ( <ref type="formula">40</ref>) fails to hold and</p><p>Our development in part 1 of the proof implies that for any fixed &#964; &#8467; &#8804; &#964; u , with probability at least 1 -</p><p>Thus, by a union bound, we get that with probability at least 1 -2e -n 32 &#8721; &#8734; i=0 e -2 2i n&#949; 2 8</p><p>, or conservatively at</p><p>or in other words, equation ( <ref type="formula">40</ref>) holds.</p><p>We extend the guarantee in equation ( <ref type="formula">40</ref>) to arbitrary &#8710; &#8712; R p by rescaling so that &#8710; F = 1 and then exploiting homogeneity to get</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.4 Proofs of main results</head><p>Proof of Theorem 5. We prove the theorem by appealing to the generic result of Theorem 8 using norms &#8901; = &#8901; &#297; nj 4 and &#8901; &#8902; = &#8901; nuc 4 . Let T &#8902; be an arbitrary rank-r orthogonal tensor with T &#8902; nuc 6 = &#964; and T &#8902; &#8734; &#8804; R. First, observe that all elements T &#8712; T have T nuc 6 &#8804; T &#8902; nuc 6 . Consequently, Theorem 3 implies that all</p><p>which establishes Property 3 with &#954; = O(r &#8901; d 3 2 ).</p><p>To establish Property 2 we appeal to Theorem 9, which implies that for any &#949; &lt; 1 2, with probability at least</p><p>Using equation ( <ref type="formula">42</ref>) this is upper bounded by</p><p>Using the AM-GM inequality, this is further upper bounded by</p><p>Rearranging, this is equivalent to</p><p>and since &#949; &lt; 1 2 this implies</p><p>Making the somewhat arbitrary choice of &#949; = 1 100, and simplifying the right-hand-side, we get</p><p>So we can take c = 1.1 and</p><p>Finally, we establish Property 1. Lemma 5 establishes that with probability at least 1&#948;,</p><p>Using Lemma 7 with the standard in-expectation Lipschitz contraction lemma (e.g., <ref type="bibr">Ledoux and Talagrand (1991)</ref>), we have</p><p>Since n = &#8486;(d 3 2 ), these bounds together imply that we can take M = O R</p><p>.</p><p>Theorem 8 now implies the claimed result.</p><p>Proof of Theorem 6. As in the tensor completion case, we prove the theorem by appealing to Theorem 8 using norms &#8901; = &#8901; &#297; nj 4 and &#8901; &#8902; = &#8901; nuc 4 .</p><p>Let T &#8902; be an arbitrary orthogonal tensor with</p><p>and thus, since &#931; = I, we may take &#954; = 68r to establish Property 3.</p><p>To establish Property 2 we appeal to Theorem 10. This implies that for any &#949; &lt; 2 &#960;, with probability at</p><p>where C &gt; 0 is some absolute constant. Using equation ( <ref type="formula">43</ref>), we have that for all &#8710; &#8712; T -T &#8902; , conditioned on the event above,</p><p>This means that when n = &#8486;(r 2 (T &#8902; )d 3 2 log 1 2 d &#949;), we have</p><p>It suffices to set &#949; = 1 40 to get</p><p>So, simplifying, Property 2 is satisfied with c = 1.8 and &#947; n = 0 with probability at least 1e -n 64 when n = &#8486;(r 2 (T &#8902; )d 3 2 log 1 2 d).</p><p>To establish Property 1 we use Lemma 10, but some care needs to be taken to establish that this applies with high probability. Pick a constant &#964; &#8805; 0, and observe via Lemma 11 that</p><p>Define a truncated sequence X &#8242; t = X t &#189; X t F &#8804; d 3 2 + &#964; , and set &#948; 0 = 2ne -&#964; 2 2 . Observe that with probability at least 1&#948; 0 , X t = X &#8242; t for all t, and so .</p><p>We apply Lemma 5 to the truncated complexity, which implies that with probability at least 1 -(&#948; + &#948; 0 ),</p><p>where &#963; 2 = sup T nuc 4 &#8804;1 E &#10216;T, X &#8242; &#10217; 2 (&#10216;T &#8902; , X t &#10217; -Y t ) 2 &#8804; sup T nuc 4 &#8804;1 T 2 F &#8901; 4R 2 &#8804; 4R 2 . We now apply Lipschitz contraction to bound the in-expectation Rademacher complexity as</p><p>.</p><p>Integrating out the tail, we can bound the error due to truncation as</p><p>for some absolute constant C. We pick &#964; = ( (d 3 2 n + log(1 &#948;)) C ), so that the quantity above is o(1) and &#948; 0 &#8804; &#948;. Lastly, since X 1&#8758;n are gaussian, Lemma 10 implies</p><p>When n &#8805; d 3 2 , we have d 3 2 &#8804; d 3 4 &#8730; n, and so we conclude that with probability at least 1&#948;, </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D Proofs from Section 4</head><p>Proof of Theorem 7. Let &#949; &gt; 0 be fixed and let m = d 3 2-&#949; 2 .</p><p>We will generate an instance of tensor completion from the 3-XOR instance by treating the indices (i, j, k) in each clause as an observed entry. Precisely, for each clause, we set the corresponding X = e i &#8855; e j &#8855; e k , and the corresponding Y = z ijk . The induced risk for each tensor T in this setting is simply L D Z (T ) &#8758;= 1 d 3 &#8721; i,j,k (T i,j,k -Z i,j,k ) 2 , where the tensor Z &#8712; ({&#177;1} d ) &#8855;3 is defined as follows: in the random case, the entries of Z are selected uniformly at random from {&#177;1}; in the planted case with planted assignment a &#8712; {&#177;1} d , we start off with the tensor a &#8855;3 , then get Z by flipping each coordinate independently with probability &#951;.</p><p>If we set n = &#8486;(m), then we are guaranteed to receive &#937;(m) unique clauses under with-replacement sampling, since there are at most logarithmically many repeats with inverse polynomial probability for m = o(d 3 2 ). With this observation, it is clear that any algorithm that takes as input the examples S and outputs "Planted" with probability at least 1o(1) when Z comes from the planted distribution and "Random" with probability at least 1o(1) when Z is from the random distribution is a successful distinguisher for the planted 3-XOR problem with &#920;(m) clauses. Going forward we assume n = &#8486;(d), since this is the interesting regime for distinguishing and refutation.</p><p>Now, suppose we have an algorithm that enjoys the excess risk bound (19) for any n, and let TS denote its output given input dataset S. Since Z &#8712; ({&#177;1} d ) &#8855;3 we can take TS &#8734; &#8804; 1 without loss of generality. We will turn TS into a successful distinguishing algorithm for the planted 3-XOR problem as follows: Define &#947; = 1 -4&#951;, and note the assumption that &#951; &lt; 1 4 implies that &#947; &gt; 0. Split the sample set S into halves S 1 and S 2 , and let LS 1 and LS 2 denote the empirical risk on the respective halves. Let TS 1 be the output of the assumed algorithm on S 1 . If LS 2 ( TS 1 ) &#8805; 1&#947; 4 2 return "Random", else return "Planted".</p><p>We will show that the algorithm succeeds in both the planted and random case by analyzing the value of LS 2 ( TS 1 ) in each case.</p><p>Random case. Let S &#8242; 2 be the result of removing all entries that appear in S 1 from S 2 . The number of repeats is at most O(n 2 d 3 + log(1 &#948;)) with probability at least 1&#948;. It follows that with probability at least, say, 1 -O(d -1 ), S &#8242; 2 S 2 &#8805; 1o(1), and LS 2 ( TS 1 ) &#8805; LS &#8242; 2 ( TS 1 )o(1).</p><p>Observe that for every remaining example X = e i &#8855;e j &#8855;e k in S &#8242; 2 , the value of TS 1 is statistically independent of the value of Z i,j,k . Abbreviating TS 1 to T , this means that we have</p><p>2 X=e i &#8855;e j &#8855;e k &#8712;S &#8242; 2 ( Ti,j,k ) 2 + 1 &#8805; 1.</p><p>Condition on S &#8242; 2 . Since TS 1 and Z have bounded entries, it follows from Hoeffding's inequality that with probability at least 1&#948; over the choice of Z, This proves that our strategy will indeed return "Random" for random instances.</p><p>Planted case. For any Z, the assumed excess risk bound implies that for any rank-1 tensor T &#8902; with bounded entries, we have L D Z ( TS 1 ) -L D Z (T &#8902; ) &#8804; o(1),</p><p>when n = &#969;(d 3 2-&#949; ). We choose T &#8902; = a &#8855;3 , where a &#8712; {&#177;1} d is the planted assignment. Taking expectation over the flips in Z, we have</p><p>Applying Bernstein's inequality, we have that with probability at least 1 -O(d -1 ) over the choice of Z,</p><p>Finally, we use Bernstein once more to show that the empirical loss for S 2 converges to the population loss, leveraging that S 2 is an independent hold-out set for TS 1 . We have that for any choice of S 1 , with probability at least 1&#948; over the draw of S 2 ,</p><p>where c 1 and c 2 are absolute constants. Applying AM-GM, this implies that LS 2 ( TS 1 ) &#8804; (1+&#947; 2 )L D Z ( TS 1 )+ o(1) with probability at least 1-O(d -1 ), and so by combining this with the excess risk bound and the bound on L D Z (a &#8855;3 ), we have LS 2 ( TS 1 ) &#8804; 1&#947; 4 + o(1).</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Interestingly, their results also show that incoherence-usually taken as necessary in matrix completion for positive results -is not necessary to obtain prediction error bounds; boundedness suffices.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>Here the word "exact" refers to the fact that the leading constant in front of the loss on the right-hand-side is 1; such guarantees generally do not easily follow from the usual machinery used to analyze well-specified models.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>There are minor technical differences (e.g., scaling) between the SoS injective/nuclear norm definitions we use and those of<ref type="bibr">Barak and Moitra (2016)</ref>;<ref type="bibr">Potechin and Steurer (2017)</ref>.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3"><p>This equivalence is proven in Appendix A for completeness.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4"><p>Following the convention in Section 1.3, if X = &#8721; i ai &#8855; bi &#8855; ci, then (P U &#8855; P V &#8855; P W )X = &#8721; i (P U ai) &#8855; (P V bi) &#8855; (P W ci). It is also useful to note that for any x, y, z we have &#10216;(P U &#8855; P V &#8855; P W ), x &#8855; y &#8855; z&#10217; = &#10216;X, (P U x) &#8855; (P V y) &#8855; (P W z)&#10217;.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5"><p>Note that we sample entries with replacement, whereas related works use without-replacement sampling<ref type="bibr">(Barak and Moitra, 2016;</ref><ref type="bibr">Potechin and Steurer, 2017)</ref>.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6"><p>We work in the slightly non-standard with-replacement model to simplify the mapping onto the with-replacement tensor completion model in Theorem 5.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_7"><p>We use the convention &#8719; i&#8712;&#8709; fi = 1, so that if &#8866; &#8467; h then h itself is a degree-&#8467; sum of squares.</p></note>
		</body>
		</text>
</TEI>
