<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Adaptive Reduced Rank Regression.</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2020 December</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10212766</idno>
					<idno type="doi"></idno>
					<title level='j'>Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Felix Ming Qiong Wu</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We study the low rank regression problem $\my = M\mx + \epsilon$, where $\mx$ and $\my$ are d1 and d2 dimensional vectors respectively. We consider the extreme high-dimensional setting where the number of observations n is less than d1+d2. Existing algorithms are designed for settings where n is typically as large as $\Rank(M)(d_1+d_2)$. This work provides an efficient algorithm which only involves two SVD, and establishes statistical guarantees on its performance. The algorithm decouples the problem by first estimating the precision matrix of the features, and then solving the matrix denoising problem. To complement the upper bound, we introduce new techniques for establishing lower bounds on the performance of any algorithm for this problem. Our preliminary experiments confirm that our algorithm often out-performs existing baselines, and is always at least competitive.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>We consider the regression problem y = Mx + in the high dimensional setting, where x &#8712; R d 1 is eatures, y &#8712; R d 2 is a vector of responses, M &#8712; R d 2 &#215;d 1 are the learnable parameters, and N &#8764; (0, &#963; 2 I d 2 &#215;d 2 ) is a noise term. High-dimensional setting refers to the case where the number of observations n is insufficient for recovery and hence regularization for estimation is necessary <ref type="bibr">[26,</ref><ref type="bibr">30,</ref><ref type="bibr">12]</ref>. This high-dimensional model is widely used in practice, such as identifying biomarkers <ref type="bibr">[ 48]</ref>, understanding risks associated with various diseases <ref type="bibr">[18,</ref><ref type="bibr">7]</ref>, image recognition <ref type="bibr">[34,</ref><ref type="bibr">17]</ref>, forecasting equity returns in financial markets <ref type="bibr">[33,</ref><ref type="bibr">39,</ref><ref type="bibr">28,</ref><ref type="bibr">8]</ref>, and analyzing social networks <ref type="bibr">[46,</ref><ref type="bibr">35]</ref>.</p><p>We consider the "large feature size" setting, in which the number of features d 1 is excessively large and can be even larger than the number of observations n . This setting frequently arises in practice because it is often straightforward to perform feature-engineering and produce a large number of potentially useful features in many machine learning problems. For example, in a typical equity forecasting model, n is around 3,000 (i.e., using 10 years of market data), whereas the number of potentially relevant features can be in the order of thousands <ref type="bibr">[ 33,</ref><ref type="bibr">22,</ref><ref type="bibr">25,</ref><ref type="bibr">13]</ref>. In predicting the popularity of a user in an online social network, n is in the order of hundreds (each day is an observation and a typical dataset contains less than three years of data) whereas the feature size can easily be more than 10k <ref type="bibr">[36,</ref><ref type="bibr">6,</ref><ref type="bibr">38]</ref>.</p><p>Existing low-rank regularization techniques (e.g., <ref type="bibr">[3,</ref><ref type="bibr">23,</ref><ref type="bibr">26,</ref><ref type="bibr">30,</ref><ref type="bibr">27]</ref> ) are not optimized for the large feature size setting. These results assume that either the features possess the so-called restricted isometry property <ref type="bibr">[10]</ref>, or their covariance matrix can be accurately estimated <ref type="bibr">[30]</ref>. Therefore, their sample complexity n depends on either d 1 or the smallest eigenvalue value &#955; min of x 's covariance matrix. For example, a mean-squared error (MSE) result that appeared in <ref type="bibr">[ 30]</ref> is of the form O r(d 1 +d 2 ) n&#955; 2 min . When n &#8804; d 1 /&#955; 2 min , this result becomes trivial because the forecast &#375; = 0 produces a comparable MSE. We design an efficient algorithm for the large feature size setting. Our algorithm is a simple two-stage algorithm. Let X &#8712; R n&#215;d 1 be a matrix that stacks together all features and Y &#8712; R n&#215;d 2 be the one that stacks the responses. In the first stage, we run a principal component analysis (PCA) on X to obtain a set of uncorrelated features &#7824; . In the second stage, we run another PCA to obtain a low rank approximation of &#7824; T Y and use it to construct an output.</p><p>While the algorithm is operationally simple, we show a powerful and generic result on using PCA to process features, a widely used practice for "dimensionality reduction" <ref type="bibr">[11,</ref><ref type="bibr">21,</ref><ref type="bibr">19]</ref>. PCA is known to be effective to orthogonalize features by keeping only the subspace explaining large variations. But its performance can only be analyzed under the so-called factor model <ref type="bibr">[40,</ref><ref type="bibr">39]</ref>. We show the efficacy of PCA without the factor model assumption. Instead, PCA should be interpreted as a robust estimator of x's covariance matrix. The empirical estimator C = 1 n XX T in the high-dimensional setting cannot be directly used because n d 1 &#215; d 2 , but it exhibits an interesting regularity: the leading eigenvectors of C are closer to ground truth than the remaining ones. In addition, the number of reliable eigenvectors grows as the sample size grows, so our PCA procedure projects the features along reliable eigenvectors and dynamically adjusts &#7824; 's rank to maximally utilize the raw features. Under mild conditions on the ground-truth covariance matrix C * of x, we show that it is always possible to decompose x into a set of near-independent features and a set of (discarded) features that have an inconsequential impact on a model's MSE.</p><p>When features x are transformed into uncorrelated ones z, our original problem becomes y = Nz+ , which can be reduced to a matrix denoising problem <ref type="bibr">[16]</ref> and be solved by the second stage. Our algorithm guarantees that we can recover all singular vectors of N whose associated singular values are larger than a certain threshold &#964; . The performance guarantee can be translated into MSE bounds parametrized by commonly used variables (though, these translations usually lead to looser bounds). For example, when N 's rank is r , our result reduces the MSE from O( r(d 1 +d 2 )</p><p>for a suitably small constant c . The improvement is most pronounced when n d 1 .</p><p>We also provide a new matching lower bound. Our lower bound asserts that no algorithm can recover a fraction of singular vectors of N whose associated singular values are smaller than &#961;&#964; , where &#961; is a "gap parameter". Our lower bound contribution is twofold. First, we introduce a notion of "local minimax", which enables us to define a lower bound parametrized by the singular values of N . This is a stronger lower bound than those delivered by the standard minimax framework, which are often parametrized by the rank r of N <ref type="bibr">[26]</ref>. Second, we develop a new probabilistic technique for establishing lower bounds under the new local minimax framework. Roughly speaking, our techniques assemble a large collection of matrices that share the same singular values of N but are far from each other, so no algorithm can successfully distinguish these matrices with identical spectra.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Preliminaries</head><p>Notation. Let X &#8712; R n&#215;d 1 and Y &#8712; R n&#215;d 2 be data matrices with their i -th rows representing the i -th observation. For matrix A , we denote its singular value decomposition as</p><p>T is the rank r approximation obtained by keeping the top r singular values and the corresponding singular vectors. When the context is clear, we drop the superscript A and use U, &#931;, and V ( U r , &#931; r , and V r ) instead. Both &#963; i (A) and &#963; A i are used to refer to i -th singular value of A . We use MATLAB notation when we refer to a specific row or column, e.g., V 1,: is the first row of V and V :,1 is the first column. kAk F , kAk 2 , and kAk * are Frobenius, spectral, and nuclear norms of A . In general, we use boldface upper case (e.g., X ) to denote data matrices and boldface lower case (e.g., x) to denote one sample. Regular fonts denote other matrices. Let C * = IE[xx T ] and C = 1 n X T X be the empirical estimate of C * . Let C * = V * &#923; * (V * ) T be the eigen-decomposition of the matrix C * , and &#955; * 1 &#8805; &#955; * 2 , . . . , &#8805; &#955; * d 1 &#8805; 0 be the diagonal entries of &#923; * . Let {u 1 , u 2 , . . . u `} be an arbitrary set of column vectors, andSpan({u 1 , u 2 , . . . , u `}) be the subspace spanned by it. An event happens with high probability means that it happens with probability &#8805; 1 -n -5 , where 5 is an arbitrarily chosen large constant and is not optimized. (1) is a tunable parameter.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Absolute value thresholding.</head><p>3 &#952; is a suitable constant; &#963; is std. of the noise.</p><p>5 return Pk 2 ( N+ ) for most results we develop. We assume a PAC learning framework, i.e., we observe a sequence {(x i , yi )} i&#8804;n of independent samples and our goal is to find an M that minimizes the test error IE x,y [k Mx -Mxk 2  2 ]. We are specifically interested in the setting in which</p><p>The key assumption we make to circumvent the d 1 &#8805; n issue is that the features are correlated. This assumption can be justified for the following reasons: (i) In practice, it is difficult, if not impossible, to construct completely uncorrelated features. (ii) When n d 1 , it is not even possible to test whether the features are uncorrelated <ref type="bibr">[5]</ref>. (iii) When we indeed know that the features are independent, there are significantly simpler methods to design models. For example, we can build multiple models such that each model regresses on an individual feature of x , and then use a boosting/bagging method <ref type="bibr">[19,</ref><ref type="bibr">37]</ref> to consolidate the predictions.</p><p>The correlatedness assumption implies that the eigenvalues of C * decays. The only (full rank) positive semidefinite matrices that have non-decaying (uniform) eigenvalues are the identity matrix (up to some scaling). In other words, when C * has uniform eigenvalues, x has to be uncorrelated.</p><p>We aim to design an algorithm that works even when the decay is slow, such as when &#955; i (C * ) has a heavy tail. Specifically, our algorithm assumes &#955; i 's are bounded by a heavy-tail power law series: Assumption 2.1. The &#955; i (C * ) series satisfies &#955; i (C * ) &#8804; c &#8226; i -&#969; for a constant c and &#969; &#8805; 2. We do not make functional form assumptions on &#955; i 's. This assumption also covers many benign cases, such as when C * has low rank or its eigenvalues decay exponentially. Many empirical studies report power law distributions of data covariance matrices <ref type="bibr">[2,</ref><ref type="bibr">31,</ref><ref type="bibr">44,</ref><ref type="bibr">14]</ref>. Next, we make standard normalization assumptions.</p><p>, and &#963; &#8805; 1 . Remark that we assume only the spectral norm of M is bounded, while its Frobenius norm can be unbounded. Also, we assume the noise &#963; &#8805; 1 is sufficiently large, which is more important in practice. The case when &#963; is small can be tackled in a similar fashion. Finally, our studies avoid examining excessively unrealistic cases, so we assume d 1 &#8804; d 3  2 . We examine the setting where existing algorithms fail to deliver non-trivial MSE, so we assume that n &#8804; rd 1 &#8804; d 4  2 .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Upper bound</head><p>Our algorithm (see Fig. <ref type="figure">1</ref>) consists of two steps.</p><p>Step 1. Producing uncorrelated features. We run a PCA to obtain a total number of k 1 orthogonalized features. See STEP -1-PCA-X in Fig. <ref type="figure">1</ref>. Let the SVD of X be X = U&#931;(V ) T . Let k 1 be a suitable rank chosen by inspecting the gaps of X 's singular values (Line 5 in STEP -1-PCA-X ). &#7824; + = &#8730; nU k 1 is the set of transformed features output by this step. The subscript + in &#7824; + reflects that a dimension reduction happens so the number of columns in &#7824; + is smaller than that in X . Compared to standard PCA dimension reduction, there are two differences: (i) We use the left leading singular vectors of X (with a re-scaling factor &#8730; n ) as the output, whereas the PCA reduction outputs Pk 1 (X) . (ii) We design a specialized rule to choose k 1 whereas PCA usually uses a hard thresholding or other ad-hoc rules.</p><p>Step 2. Matrix denoising. We run a second PCA on the matrix ( N+ ) T , 1</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Intuition of the design</head><p>While the algorithm is operationally simple, its design is motivated by carefully unfolding the statistical structure of the problem. We shall realize that applying PCA on the features should not be viewed as removing noise from a factor model, or finding subspaces that maximize variations explained by the subspaces as suggested in the standard literature <ref type="bibr">[19,</ref><ref type="bibr">40,</ref><ref type="bibr">41]</ref>. Instead, it implicitly implements a robust estimator for x's precision matrix, and the design of the estimator needs to be coupled with our objective of forecasting y, thus resulting in a new way of choosing the rank.</p><p>Design motivation: warm up. We first examine a simplified problemy = Nz + , where variables in z are assumed to be uncorrelated. Assume d = d 1 = d 2 in this simplified setting. Observe that</p><p>where E is the noise term and E can be approximated by a matrix with independent zero-mean noises.</p><p>Solving the matrix denoising problem. Eq. 1 implies that when we compute Z T Y , the problem reduces to an extensively studied matrix denoising problem <ref type="bibr">[16,</ref><ref type="bibr">20]</ref>. We include the intuition for solving this problem for completeness. The signal N T is overlaid with a noise matrix E . E will elevate all the singular values of N T by an order of &#963; p d/n . We run a PCA to extract reliable signals: when the singular value of a subspace is &#963; p d/n , the subspace contains significantly more signal than noise and thus we keep the subspace. Similarly, a subspace associated a singular value . &#963; p d/n mostly contains noise. This leads to a hard thresholding algorithm that sets N T = P r (N T + E) , where r is the maximum index such that &#963; r (N T + E) &#8805; c p d/n for some constant c . In the general setting y = Mx + , x may not be uncorrelated. But when we set z = (&#923; * ) -1 2 (V * ) T x , we see that IE[zz T ] = I . This means knowing C * suffices to reduce the original problem to a simplified one. Therefore, our algorithm uses Step 1 to estimate C * and Z , and uses Step 2 to reduce the problem to a matrix denoising one and solve it by standard thresholding techniques.</p><p>Relationship between PCA and precision matrix estimation. In step 1, while we plan to estimate C * , our algorithm runs a PCA on X . We observe that empirical covariance matrix C = 1 n X T X =  When k 1 &lt; d 1 , PCA uses a low rank approximation of C as an estimator for C * . We now explain why this is effective. First, note that C is very far from C * when n d 1 , therefore it is dangerous to directly plug in C to find &#7825;. Second, an interesting regularity of C exists and can be best explained by a picture. In Fig. <ref type="figure">2</ref>, we plot the pairwise angles between eigenvectors of C and those of C * from a synthetic dataset. Columns are sorted by the C * 's eigenvalues in decreasing order. When C * and C coincide, this plot would look like an identity matrix. When C and C * are unrelated, then the plot behaves like a block of white Gaussian noise. We observe a pronounced pattern: the angle matrix can be roughly divided into two sub-blocks (see the red lines in Fig. <ref type="figure">2</ref>). The upper left sub-block behaves like an identity matrix, suggesting that the leading eigenvectors of C are close to those of C * . The lower right block behaves like a white noise matrix, suggesting that the "small" eigenvectors of C are far from those of C * . When n grows, one can observe the upper left block becomes larger and this the eigenvectors of C will sequentially get stabilized. Leading eigenvectors are first stabilized, followed by smaller ones. Our algorithm leverages this regularity by keeping only a suitable number of reliable eigenvectors from C while ensuring not much information is lost when we throw away those "small" eigenvectors.</p><p>Implementing the rank selection. We rely on three interacting building blocks:</p><p>1. Dimension-free matrix concentration. First, we need to find a concentration behavior of C for n &#8804; d 1 to decouple d 1 from the MSE bound. We utilize a dimension-free matrix concentration inequality <ref type="bibr">[32]</ref>. Roughly speaking, the concentration behaves as kC -C * k 2 &#8776; n -1 2 . This guarantees that |&#955; i (C) -&#955; i (C * )| &#8804; n -1 2 by standard matrix perturbation results <ref type="bibr">[24]</ref>. 2. Davis-Kahan perturbation result. However, the pairwise closeness of the &#955; i 's does not imply the eigenvectors are also close. When &#955; i (C * ) and &#955; i+1 (C * ) are close, the corresponding eigenvectors in C can be "jammed" together. Thus, we need to identify an index i , at which &#955; i (C * ) -&#955; i+1 (C * ) exhibits significant gap, and use a Davis-Kahan result to show thatPi (C) is close to Pi (C * ). On the other hand, the map &#928; * (, (&#923; * ) -1 2 (V * ) T ) we aim to find depends on the square root of inverse (&#923; * ) -1 2 , so we need additional manipulation to argue our estimate is close to (&#923; * ) -1 2 (V * ) T .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">The connection between gap and tail.</head><p>Finally, the performance of our procedure is also characterized by the total volume of signals that are discarded, i.e., P</p><p>where k 1 is the location that exhibits the gap. The question becomes whether it is possible to identify a k 1 that simultaneously exhibits a large gap and ensures the tail after it is well-controlled, e.g., the sum of the tail is O(n -c ) for a constant c . We develop a combinatorial analysis to show that it is always possible to find such a gap under the assumption that &#955; i (C * ) is bounded by a power law distribution with exponent &#969; &#8805; 2. Combining all these three building blocks, we have: Proposition 1. Let &#958; and &#948; be two tunable parameters such that&#958; = &#969;(log 3 n/ &#8730; n) and &#948; 3 = &#969;(&#958;) .</p><p>Assume that &#955; * i &#8804; c&#8226;i -&#969; . Consider running STEP -1-PCA-X in Fig. <ref type="figure">1</ref>, with high probability, we have (i) Leading eigenvectors/values are close: there exists a unitary matrix W and a constant c</p><p>(ii) Small tail:</p><p>for a constant c 2 . Prop. 1 implies that our estimate &#7825;+ = &#928;(x) is sufficiently close to z = &#928; * (x) , up to a unitary transform. We then execute STEP -2-PCA-D ENOISE to reduce the problem to a matrix denoising one and solve it by hard-thresholding. Let us refer to y = Nz + , where z is a standard multivariate Gaussian and N = MV * (&#923; * ) 1 2 as the orthogonalized form of the problem. While we do not directly observe z, our performance is characterized by spectra structure of N . </p><p>The expectation is over the randomness of the test data.</p><p>Theorem 1 also implies that there exists a way to parametrize &#958; and &#948; such that</p><p>+ O(n -c 0 ) for some constant c 0 . We next interpret each term in <ref type="bibr">(2)</ref>.</p><p>are typical for solving a matrix denoising problem N T + + E(&#8776; N T + E) : we can extract signals associated with ` * leading singular vectors of N , so P i&gt;` * (&#963; N i ) 2 starts at i &gt; ` * . For each direction we extract, we need to pay a noise term of order &#952; 2 &#963; 2 d 2 n , leading to the term</p><p>come from the estimations error of &#7825;+ produced from Prop. 1, consisting of both estimation errors of C * 's leading eigenvectors and the error of cutting out a tail. We pay an exponent of 1  4 on both terms (e.g., &#948; &#969;-1 &#969;+1</p><p>in Prop. 1 becomes &#948; &#969;-1 4(&#969;+1)</p><p>) because we used Cauchy-Schwarz (CS) twice. One is used in running matrix denoising algorithm with inaccurate z + ; the other one is used to bound the impact of cutting a tail. It remains open whether two CS is can be circumvented.</p><p>Sec. 4 explains how Thm 1 and the lower bound imply the algorithm is near-optimal. Sec. 5 compares our result with existing ones under other parametrizations, e.g. rank(M ) . n . However, it may well happen that most of the spectral 'mass' of N lies only slightly below this threshold &#964; . In this section, we establish that no algorithm can do better than us, in a bi-criteria sense, i.e. we show that any algorithm that has a slightly smaller sample than ours can only minimally outperform ours in terms of MSE.</p><p>We establish 'instance dependent' lower bounds: When there is more 'spectral mass' below the threshold, the performance of our algorithm will be worse, and we will need to establish that no algorithm can do much better. This departs from the standard minimax framework, in which one examines the entire parameter space of N , e.g. all rank r matrices, and produces a large set of statistically indistinguishable 'bad' instances <ref type="bibr">[43]</ref>. These lower bounds are not sensitive to instancespecific quantities such as the spectrum of N , and in particular, if prior knowledge suggests that the unknown parameter N is far from these bad instances, the minimax lower bound cannot be applied.</p><p>We introduce the notion of local minimax. We partition the space into parts so that similar matrices are together. Similar matrices are those N that have the same singular values and right singular vectors; we establish strong lower bounds even against algorithms that know the singular values and right singular vectors of N . An equivalent view is to assume that the algorithm has oracle access to C * , M 's singular values, and M 's right singular vectors. This algorithm can solve the orthogonalized form as N 's singular values and right singular vectors can easily be deduced. Thus, the only reason why the algorithm needs data is to learn the left singular vectors of N . The lower bound we establish is the minimax bound for this 'unfair' comparison, where the competing algorithm is given more information. In fact, this can be reduced further, i.e., even if the algorithm 'knows' that the left singular vectors of N are sparse, identifying the locations of the non-zero entries is the key difficulty that leads to the lower bound.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition 1 (Local minimax bound). Consider a model y = Mx +</head><p>, where x is a random vector, so C * (x) = IE[xx T ] represents the co-variance matrix of the data distribution, and</p><p>is an equivalence relation and let the equivalence class of(M, x) be R(M, x) = {(M 0 , x 0 ) : &#931; M 0 = &#931; M , V M 0 = V M , and C * (x 0 ) = C * (x)}. The local minimax bound for y = Mx + with n independent samples and N &#8764; (0,</p><p>It is worth interpreting (3) in some detail. For any two (M, x) , (M 0 , x 0 ) in R(M, x) , the algorithm has the same 'prior knowledge', so it can only distinguish between the two instances by using the observed data, in particular M is a function only of X and Y , and we denote it as M(X, Y) to emphasize this. Thus, we can evaluate the performance of M by looking at the worst possible (M 0 , x 0 ) and considering the MSE IEk M(X, Y)x 0 -M 0 x 0 k 2 .</p><p>Proposition 2. Consider the problem y = Mx + with normalized form y = Nz + . Let &#958; be a sufficient small constant. There exists a sufficiently small constant &#961; 0 (that depends on &#958; ) and a constant c such that for any &#961; &#8804; &#961; 0 , r(x, M, n, &#963; ) &#8805; (1 -c&#961;</p><p>, where</p><p>n . Proposition 2 gives the lower bound on the MSE in expectation; it can be turned into a high probability result with suitable modifications. The proof of the lower bound uses a similar 'trick' to the one used in the analysis of the upper bound analysis to cut the tail. This results in an additional term O &#961; 1 2 -&#958; d &#969;-1 2 which is generally smaller than the n -c 0 tail term in Theorem 1 and does not dominate the gap.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Gap requirement and bi-criteria approximation algorithms</head><p>n . Theorem 1 asserts that any signal above the threshold &#952;&#964; can be detected, i.e., the MSE is at most P &#963; N i &gt;&#952;&#964; &#963; 2 i (N ) (plus inevitable noise), whereas Proposition 2 asserts that any signal below the threshold &#961;&#964; cannot be detected, i.e., the MSE is approximately at least P</p><p>There is a 'gap' between &#952;&#964; and &#961;&#964; , as &#952; &gt; 1 and &#961; &lt; 1. See Fig. <ref type="figure">3(a)</ref>. This kind of gap is inevitable because both bounds are 'high probability' statements. This gap phenomenon appears naturally when the sample size is small as can be illustrated by this simple example. Consider the problem of estimating &#181; when we see one sample from N (&#181;, &#963; 2 ). Roughly speaking, when &#181; &#963; , the estimation is feasible, and whereas &#181; &#963; , the estimation is impossible. For the region &#181; &#8776; &#963; , algorithms fail with constant probability and we cannot prove a high probability lower bound either.</p><p>While many of the signals can 'hide' in the gap, the inability to detect signals in the gap is a transient phenomenon. When the number of samples n is modestly increased, our detection threshold</p><p>n shrinks, and this hidden signal can be fully recovered. This observation naturally leads to a notion of bi-criteria optimization that frequently arises in approximation algorithms.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition 2. An algorithm for solving the y = Mx +</head><p>problem is (&#945;, &#946;) -optimal if, when given an i.i.d. sample of size &#945;n as input, it outputs an estimator whose MSE is at most &#946; worse than the local minimax bound, i.e., IE[k&#375;yk 2  2 ] &#8804; r(x, M, n, &#963; ) + &#946;. Corollary 1. Let &#958; and c 0 be small constants and &#961; be a tunable parameter. Our algorithm is (&#945;, &#946;) -optimal for &#945; = 2  2 that is directly characterized by the signal strength and an additive term O(n -c 0 ) = o <ref type="bibr">(1)</ref> . Assuming that kMxk = &#8486;(1) , i.e., the signal is not too weak, the term &#946; becomes a single multiplicative bound O(&#961; 2  2 . This gives an easily interpretable result. For example, when our data size is n log n, the performance gap between our algorithm and any algorithm that uses n samples is at most o(kMxk 2  2 ). The improvement is significant when other baselines deliver MSE in the additive form that could be larger than kMxk 2   2   in the regime n &#8804; d 1 .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Preview of techniques.</head><p>Let N = U N &#931; N (V N ) T be the instance (in orthogonalized form). Our goal is to construct a collection N = {N 1 , . . . , N K } of K matrices so that (i) For any N i &#8712; N , &#931; N i = &#931; N and V N i = V N . (ii) For any two N i , N j &#8712; N , kN -N 0 k F is large, and (iii) K = exp(&#8486;(poly(&#961;)d 2 )) (cf. <ref type="bibr">[43,</ref><ref type="bibr">Chap. 2]</ref>) Condition (i) ensures that it suffices to construct unitary matrices U N i 's for N , and that the resulting instances will be in the same equivalence class. Conditions (ii) and (iii) resemble standard construction of codes in information theory: we need a large 'code rate', corresponding to requiring a large K as well as large distances between codewords, corresponding to requiring that kU i -U j k F be large. Standard approaches for constructing such collections run into difficulties. Getting a sufficiently tight concentration bound on the distance between two random unitary matrices is difficult as the matrix entries, by necessity, are correlated. On the other hand, starting with a large collection of random unit and using its Cartesian product to build matrices does not necessarily yield unitary matrices.</p><p>We design a two-stage approach to decouple condition (iii) from (i) and (ii) by only generating sparse matrices U N i . See Fig. <ref type="figure">3</ref> The existence of such objects can easily be proved using the probabilistic method. Thus, in the first stage, we can build up a large number of sparsity patterns. In the second stage (Step 3 in Fig. <ref type="figure">3(d</ref>)), we carefully fill in values in the non-zero positions for each U N i . When the number of non-zero entries is not too small, satisfying the unitary constraint is feasible. As the overlap of sparsity patterns of any two matrices is small, we can argue the distance between them is large. By carefully trading off the number of non-zero positions and the portion of overlap, we can simultaneously satisfy all three conditions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Related work and comparison</head><p>In this section, we compare our results to other regression algorithms that make low rank constraints on M . Most existing MSE results are parametrized by the rank or spectral properties of M , e.g.</p><p>[30] defined a generalized notion of rank Bq(R A q ) &#8712; A &#8712; R d 2 &#215;d 1 : P d 2 i=1 |&#963; A i | q &#8804; R q , where q &#8712; [0, 1], A {N, M } &#8712; , i.e. R N q characterizes the generalized rank of N whereas R M q characterizes that of M . When q = 0, R N q = R M q is the rank of the N because rank(N ) = rank(M ) in our setting.</p><p>In their setting, the MSE is parametrized by R M and is shown to be</p><p>In the special case when q = 0 , this reduces to</p><p>. On the other hand, the MSE in our case is bounded by (cf. Thm. 1). We have</p><p>The improvement here is twofold. First, our bound is directly characterized by N in orthogonalized form, whereas result of <ref type="bibr">[30]</ref> needs to examine the interaction between M and C * , so their MSE depends on both R M q and &#955; * min . Second, our bound no longer depends on d 1 and pays only an additive factor n -c 0 , thus, when n &lt; d 1 , our result is significantly better. Other works have different parameters in the upper bounds, but all of these existing results require that n &gt; d 1 to obtain nontrivial upper bounds <ref type="bibr">[26,</ref><ref type="bibr">9,</ref><ref type="bibr">12,</ref><ref type="bibr">26]</ref>. Unlike these prior work, we require a stochastic assumption on X (the rows are i.i.d.) to ensure that the model is identifiable when n &lt; d 1 , e.g. there could be two sets of disjoint features that fit the training data equally well. Our algorithm produces an adaptive model whose complexity is controlled by k 1 and k 2 , which are adjusted dynamically depending on the sample size and noise level. <ref type="bibr">[9]</ref> and <ref type="bibr">[12]</ref> also point out the need for adaptivity; however they still require n &gt; d 1 and make some strong assumptions. For instance, <ref type="bibr">[9]</ref> assumes that there is a gap between &#963; i (XM T ) and &#963; i+1 (XM T ) for some i . In comparison, our sufficient condition, the decay of &#955; * i , is more natural. Our work is not directly comparable to standard variable selection techniques such as LASSO <ref type="bibr">[42]</ref> because they handle univariatey. Column selection algorithms <ref type="bibr">[15]</ref> generalize variable selection methods for vector responses, but they cannot address the identifiability concern.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Experiments</head><p>We apply our algorithm on an equity market and a social network dataset to predict equity returns and user popularity respectively. Our baselines include ridge regression ("Ridge"), reduced rank ridge regression <ref type="bibr">[29]</ref> ("Reduced ridge"), LASSO ("Lasso"), nuclear norm regularized regression ("Nuclear norm"), reduced rank regression <ref type="bibr">[45]</ref> ("RRR"), and principal component regression <ref type="bibr">[1]</ref> ("PCR").</p><p>Predicting equity returns. We use a stock market dataset from an emerging market that consists of approximately 3600 stocks between 2011 and 2018. We focus on predicting the next 5-day returns. For each asset in the universe, we compute its past 1-day, past 5-day, and past 10-day returns as features. We use a standard approach to translate forecasts into positions <ref type="bibr">[ 4,</ref><ref type="bibr">47]</ref>. We examine two universes in this market: (i) Universe 1 is equivalent to S&amp; P 500 and consists of 983 stocks, and (ii) Full universe consists of all stocks except for illiquid ones.</p><p>Results. Table <ref type="table">1</ref> (left) reports the forecasting power and portfolio return for out-of-sample periods in Full universe (see our full version for Universe 1). We observe that (i) The data has a low signal-tonoise ratio. The out-of-sample R 2 values of all the methods are close to 0. (ii) ADAPTIVE -RRR has the highest forecasting power. (iii) ADAPTIVE -RRR has the smallest in-sample and out-of-sample gap (see column out -in ), suggesting that our model is better at avoiding spurious signals.</p><p>Predicting user popularity in social networks. We collected tweet data on political topics from Oct. 2016 to Dec. 2017. Our goal is to predict a user's next 1-day popularity, which is defined as the sum of retweets, quotes, and replies received by user. There are a total of 19 million distinct users, and due to the huge size, we extract the subset of 2000 users with the most interactions for evaluation. For each user in the 2000-user set, we use its past 5 days' popularity as features. We further randomly sample 200 users and make predictions for them, i.e., setting d 2 = 200 to make d 2 of the same magnitude as n .</p><p>Results. We randomly sample users for 10 times and report the average MSE and correlation (with standard deviations) for both in-sample and out-of-sample data (see full version for more results). In Table <ref type="table">1</ref> (right) we can see results consistent with the equity returns experiment: (i) ADAPTIVE -RRR yields the best performance in out-of-sample MSE and correlation. (ii)ADAPTIVE -RRR achieves the best generalization error by having a much smaller gap between training and test metrics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusion</head><p>This paper examines the low-rank regression problem under the high-dimensional setting. We design the first learning algorithm with provable statistical guarantees under a mild condition on the features' covariance matrix. Our algorithm is simple and computationally more efficient than low rank methods based on optimizing nuclear norms. Our theoretical analysis of the upper bound and lower bound can be of independent interest. Our preliminary experimental results demonstrate the efficacy of our algorithm. The full version explains why our (algorithm) result is unlikely to be known or trivial.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Broader Impact</head><p>The main contribution of this work is theoretical. Productionizing downstream applications stated in the paper may need to take six months or more so there is no immediate societal impact from this project.</p></div></body>
		</text>
</TEI>
