<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Nonparametric Testing under Randomized Sketching</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>02/28/2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10282595</idno>
					<idno type="doi"></idno>
					<title level='j'>IEEE transactions on pattern analysis and machine intelligence</title>
<idno>2160-9292</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Meimei Liu</author><author>Zuofeng Shang</author><author>Yun Yang</author><author>Guang Cheng</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[A common challenge in nonparametric inference is its high computational complexity when data volume is large. In this paper, we develop computationally efficient nonparametric testing by employing a random projection strategy. In the specific kernel ridge regression setup, a simple distance-based test statistic is proposed. Notably, we derive the minimum number of random projections that is sufficient for achieving testing optimality in terms of the minimax rate. An adaptive testing procedure is further established without prior knowledge of regularity. One technical contribution is to establish upper bounds for a range of tail sums of empirical kernel eigenvalues. Simulations and real data analysis are conducted to support our theory.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>A Number of computationally efficient statistical meth- ods have been proposed for analyzing massive data sets. Examples include divide-and-conquer approaches <ref type="bibr">[1]</ref>- <ref type="bibr">[4]</ref>; low-rank approximations: random projection methods <ref type="bibr">[5]</ref>- <ref type="bibr">[8]</ref>, subsampling methods <ref type="bibr">[9]</ref>- <ref type="bibr">[11]</ref>, Nystr&#246;m approximations <ref type="bibr">[12]</ref>, <ref type="bibr">[13]</ref>; and online learning methods <ref type="bibr">[14]</ref>- <ref type="bibr">[16]</ref>.</p><p>An interesting question arising from these new methods is the minimum computational cost required for obtaining statistically satisfactory solutions. This might be viewed as a type of "computational limit" from a statistical perspective. Such an issue has been addressed in certain situations. For divide-and-conquer approaches, <ref type="bibr">[4]</ref> derived a sharp upper bound for the number of distributed computing units in the smoothing spline setup, while <ref type="bibr">[17]</ref> estimated the quantile regression process under an additional sharp lower bound on the number of quantile levels. For random projection methods, the literature nonetheless only focused on parametric cases such as compressed sensing. For example, <ref type="bibr">[18]</ref> showed that the minimum number of random projections is s log n for signal recovery, where n is the number of measurements and s is the number of nonzero components in the true signal. Recently, <ref type="bibr">[7]</ref> proposed the randomly sketched kernel ridge regression (KRR) estimator and studied the minimax optimal nonparametric estimation under random projection. To our knowledge, the computational limit for random projection methods remains unknown in nonparametric models.</p><p>There are two purposes in this paper: (i) develop an optimal nonparametric testing procedure based on random projection; (ii) explore its computational limit in the kernel ridge regression setup. We remark that classical nonparametric testing methods, e.g., the locally most powerful test, the generalized/penalized likelihood ratio test and the distance-based test <ref type="bibr">[19]</ref>- <ref type="bibr">[23]</ref>, may not be directly applied to big data due to their high computational costs.</p><p>Specifically, we consider the following nonparametric model</p><p>where x i &#8712; X &#8838; R a for a fixed a &#8805; 1 are i.i.d. random design points, and i are random noise following Normal distribution with mean zero, variance &#963;<ref type="foot">foot_0</ref> . The regression function f belongs to a reproducing kernel Hilbert space (RKHS) H. The hypothesis of interest is</p><p>where f 0 is a hypothesized function. Testing optimality has been well studied in literature. <ref type="bibr">[24]</ref> [25] established the minimax testing rate for Gaussian sequence model on the Sobolev class or the Besov class. Recently, <ref type="bibr">[26]</ref> established the minimax nonparametric testing rate under a general eigen-decaying framework including the polynomial decay and exponential decay kernels. In practice, testing in (2) has wide applications. One motivating example is in the signal detection in cognitive radio and other wireless applications, it is assumed under the null hypothesis that the signal is completely specified, e.g., that no signal is present. Another example is testing the adequacy of a parametric linear model in nonparametric regression, where the null hypothesis assumes f 0 has a linear structure; see <ref type="bibr">[27]</ref>, <ref type="bibr">[28]</ref>.</p><p>Focusing on the testing in <ref type="bibr">(2)</ref>, we construct a distancebased test statistic T n,&#955; = f R -f 0 compared with the classic KRR estimator f n ( <ref type="bibr">[29]</ref>) defined as</p><p>where f 2 H = f, f H with &#8226;, &#8226; H the inner product of H, &#955; &gt; 0 is a smoothing parameter. Solving ( <ref type="formula">3</ref>) is an ndimensional quadratic program, which involves the computational cost and storage occupation of orders O(n 3 ) and O(n 2 ), respectively. Instead, f R is achieved by randomly projecting the row and column subspaces of the ndimensional kernel matrix to an s-dimensional subspace, reducing (3) to an s-dimensional quadratic program, with time and storage costs as O(s 3 ) and O(s 2 ) under s( n) random projections; see Section 2 for detailed algorithm. The pre-processing step in computing the kernel approximation normally takes O(sn 2 ), and can be easily reduced to O(n 2 (log s)) for suitably chosen random matrices (see <ref type="bibr">[30]</ref>), which can be further reduced to O(n 2 (log s)/t) by using t clusters in a parallel fashion. After f R is obtained, T n,&#955; can be computed in a parallel fashion. Hence, s can be viewed as a simple proxy for computing and storage costs.</p><p>In this paper, we reveal a phase transition phenomenon in terms of s to guarantee the testing optimality. Specifically, a sharp lower bound for s is established: when s is above this bound, T n,&#955; is minimax optimal; otherwise, minimax optimality becomes impossible even when the best possible &#955; is chosen. We next illustrate more subtle details using the following Figure <ref type="figure">1</ref>, where the strength of the weakest detectable signals (SWDS) is characterized given any s and &#955;. In general, we require s &gt; s &#955; for any &#955;, where s &#955; is determined by kernel eigenvalues and &#955;. An important observation is that the smallest SWDS can be achieved at &#955; = &#955; * and s &gt; s &#955; * := s * (note that when s s * , our testing procedure under a proper &#955; is still powerful as long as SWDS becomes sufficiently large). Both &#955; * and s * have precise orders in specific situations. For example, in an morder polynomial decay kernel, the smallest SWDS achieves the minimax optimal rate n -2m 4m+1 established in <ref type="bibr">[24]</ref>, <ref type="bibr">[25]</ref> when &#955; * = n -4m 4m+1 and s * = n 2 4m+1 . As a by-product, we also derive a sharp lower bound for s for obtaining the minimax optimal estimation. Our results hold for a general class of random projection matrix, such as the sub-Gaussian matrix or certain data-dependent matrix.</p><p>It is worth mentioning that the construction of T n,&#955; crucially relies on the regularity of H, which is often unavailable in practice. Hence, we propose an adaptive test statistic based on the maximum of a sequence of (standardized) nonadaptive test statistics corresponding to various regularities. Based on a recent Gaussian approximation result in <ref type="bibr">[31]</ref>, we prove that the null limit distribution is an extreme value distribution.</p><p>The proofs of main results rely on the behavior of the tail sum of empirical kernel eigenvalues. One technical contribution of this work is to derive upper bounds for a range of tail sums such that nonparametric estimation and testing can now be analyzed in a unified framework. This is obtained by flexibly adjusting the size of the function class associated with the Rademacher average in the local Rademacher complexity theory ( <ref type="bibr">[32]</ref>). In simulation studies, we find that the size and power of the proposed non-adaptive and adaptive test statistics are both satisfactory. In particular, the power cannot be further improved as the number of random projections grows beyond some threshold, as predicted by our theory. For an illustration purpose, we also demonstrate that when n = 2 12 , conducting testing based on f R only takes 3.2 seconds in comparison with 42 seconds based on f n . In practice, the smoothing parameter &#955; can be directly selected via generalized cross validation. We would like to point out that this is an advantage of the random projection method over the divide-and-conquer method <ref type="bibr">[1]</ref> in estimation, where the selection of the smoothing parameter is nontrivial; see <ref type="bibr">[33]</ref>.</p><p>The rest of this paper is organized as follows. Section 2 introduces kernel ridge regression together with its approximation based on random projection. Our main results are presented in Section 3: Section 3.1 introduces one primary assumption on random projection; Sections 3.2 and 3.3 study testing consistency and power behaviors in terms of the projection dimension s and the smoothing parameter &#955;, with specific situations considered in Section 3.5; Section 3.6 proves the lower bound on s given in Section 3.5 to be sharp. An adaptive testing procedure is developed in Section 4.  2 . For a matrix A &#8712; R m&#215;n , its operator norm is defined as</p><p>A random variable X is said to be sub-Gaussian if there exists a constant &#963; 2 &gt; 0 such that for any t &#8805; 0, P[|X| &#8805; t] &#8804; 2 exp(-t 2 /(2&#963; 2 )). The sub-Gaussian norm of X is defined as X &#968;2 = inf{t &gt; 0 : E exp(X 2 /t 2 ) &#8804; 2}. We will use c, c 1 , c 2 , C to denote generic absolute constants, whose values may vary from line to line.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">KERNEL RIDGE REGRESSION VIA RANDOM PROJECTION</head><p>In this section, we review kernel ridge regression and its variant based on random projection. Suppose that we have n i.i.d. observations {(x i , y i )} n i=1 from <ref type="bibr">(1)</ref>, where the covariances x i are samples of X from a distribution P X with domain X . Let X be a closed subset of R a , a is fixed,P X a strictly positive Borel measure on X . We recall that a Borel measure P X on X is said to be strictly positive if the measure of every nonempty open subset in X is positive, an example being the Lebesgue measure in R a .</p><p>Throughout assume that f &#8712; H, where H &#8834; L 2 (P X ) is a reproducing kernel Hilbert space (RKHS) associated with an inner product &#8226;, &#8226; H and a reproducing kernel function K(&#8226;, &#8226;) : X &#215; X &#8594; R. K is a symmetric positive definite kernel on X satisfying: for any finite set of points {x i } n i=1 in X and real numbers {a i } n i=1 that n i,j=1 a i a j K(x i , x j ) &#8805; 0. Assume further that K is a continuous function on X &#215; X and X X K(x, t)dP X (x)dP X (t) &lt; &#8734;. Then by Mercer's theorem, K has the following spectral expansion:</p><p>where &#181; 1 &#8805; &#181; 2 &#8805; &#8226; &#8226; &#8226; &#8805; 0 is a sequence of ordered eigenvalues and the eigenfunctions {&#966; i } &#8734; i=1 form a basis in L 2 (P X ). We refer the reader to the standard sources <ref type="bibr">[34]</ref>, <ref type="bibr">[35]</ref> for more details on RKHSs and their properties.</p><p>Moreover, for any i, j &#8712; N,</p><p>Throughout this paper, assume that &#966; i 's are uniformly bounded, a common condition in literature, e.g., <ref type="bibr">[36]</ref>, and &#181; i 's satisfy certain tail sum property.</p><p>Assumption A1 is satisfied in two types of commonly used kernels, categorized by the eigenvalue decay rates. The first is &#181; i i -2m for a constant m &gt; 0, called as polynomial decay kernel (PDK) of order m. Examples of kernels in this class include the m th order Sobolev spaces for some fixed integer m &#8805; 1 with Lebesgue measure on a bounded domain; see <ref type="bibr">[35]</ref>.</p><p>The second is &#181; i exp(-&#947;i p ) for constants &#947;, p &gt; 0, called as exponential decay kernel (EDK) of order p. Examples of EDK include the Gaussian kernel, which for the Lebesgue measure satisfies such a bound with p = 1 (compact domain) or p = 2 (real line); see <ref type="bibr">[37]</ref>. Verification of Assumption A1 with concrete examples is deferred to Section S.5.3 in Supplementary.</p><p>Recall the KRR estimator f n from (3). By representer theorem, it has an expression</p><p>where &#969; = ( &#969; 1 , . . . , &#969; n ) is a real vector determined by &#969; = argmin</p><p>,j&#8804;n , and I &#8712; R n&#215;n is identity. This standard procedure requires storing (K 2 , K, Ky) and inverting K + &#955;I, which requires O(n 2 ) memory usage and O(n 3 ) floating operations.</p><p>The above computational and storage constraints become severe for a large sample size, and thus motivate the random projection approach proposed by <ref type="bibr">[7]</ref>. Specifically, &#969; in ( <ref type="formula">5</ref>) is substituted with S &#946;, where &#946; &#8712; R s and S is an s &#215; n real-valued random matrix; see Section 3.1. Then, &#946; is solved as:</p><p>Hence, the resulting estimator of f becomes</p><p>which requires computing and storing (SK 2 S , SKS , SKy), along with inverting an s &#215; s matrix. The cost in the pre-processing step to compute the kernel approximation normally takes O(sn 2 ), and can be easily reduced to O(n 2 (log s)) for suitably chosen random matrices (see <ref type="bibr">[30]</ref>), which can be further reduced to O(n 2 (log s)/t) by using t clusters in a parallel fashion. Furthermore, the memory usage and floating operations are reduced to O(s 2 ) and O(s 3 ), respectively, when s = o(n). On the other hand, s cannot be too small in order to maintain sufficient data information for achieving statistical optimality. Critical lower bounds for s will be derived in Section 3.6.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">MAIN RESULTS</head><p>Consider the nonparametric testing problem <ref type="bibr">(2)</ref>. For convenience, assume f 0 = 0, i.e., we will test</p><p>In general, testing f = f 0 (for an arbitrary known <ref type="bibr">(8)</ref> has no loss of generality. Based on f R , we propose the following distancebased test statistic:</p><p>In the subsequent sections, we will derive the null limit distribution of T n,&#955; (Theorems 3.3), and further provide a sufficient and necessary condition in terms of s such that T n,&#955; is minimax optimal (Section 3.6). As a byproduct, we derive a critical bound in terms of s such that f R is minimax optimal. Proof of such results rely on an exact analysis on the kernel and projection matrices which requires an accurate estimate of the tail sum of the empirical eigenvalues by Lemma 3.1. Our results hold for a general choice of projection matrix which will be discussed in Section 3. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Choice of Projection Matrix</head><p>Consider the singular value decomposition K = U DU , where U = (U 1 , U 2 ) with U 1 consisting of the first s &#955; columns of U and U 2 consisting of the rest n -s &#955; columns,</p><p>Here { &#181; i } n i=1 are empirical eigenvalues in decreasing order. For any &#955; &gt; 0, s &#955; (or s &#955; ) is defined to be the number of &#181; i 's (or &#181; i 's) greater than &#955;, i.e.,</p><p>We have the following assumption on the population eigenvalues through s &#955; . Assumption A2. s &#955; diverges as &#955; &#8594; 0.</p><p>Assumption A2 is satisfied in various classes of kernels, including PDK and EDK introduced in Section 3.5. In the Supplementary S.5.3, we verify Assumption A2 with two concrete examples:</p><p>An accurate upper bound for the tail sum of empirical eigenvalues n i= s &#955;+1 &#181; i is needed for studying nonparametric testing and estimation. However, this bound was often assumed to hold in the kernel learning literature, e.g., <ref type="bibr">[38]</ref>, <ref type="bibr">[39]</ref>. The application of concentration inequalities of individual eigenvalues ( <ref type="bibr">[40]</ref>, <ref type="bibr">[41]</ref>) only provides a very loose bound due to accumulative errors. Recently, the local Rademacher complexity theory ( <ref type="bibr">[32]</ref>) was employed by <ref type="bibr">[7]</ref> to derive a more accurate upper bound that is useful in studying nonparametric estimation. However, this upper bound no longer works for testing problems, due to the improper size of the function class defining Rademacher average. We establish the upper bounds, i.e., Lemma 3.1, for a range of tail sums of empirical eigenvalues in terms of population quantities s &#955; and &#181; s &#955; , with known orders. This result can be applied to both nonparametric estimation and testing, and may be of independent interest. Lemma 3.1. If 1/n &lt; &#955; &#8594; 0, then with probability at least 1 -4e -s &#955; , n i= s &#955;+1 &#181; i s &#955; &#181; s &#955; . Clearly, Lemma 3.1 is a sample analog to the tail sum assumption for &#181; i in Assumption A1. The proof of Lemma 3.1 is based on the development of a generalized function class and its associated local Rademacher complexity theory as explained in Appendix 7.1.</p><p>The following definition of "K-satisfiability" describes a class of matrices that preserve the principal components of the kernel matrix. Definition 1. (K-satisfiability) A matrix S &#8712; R s&#215;n is said to be K-satisfiable if there exists a constant c &gt; 0 such that</p><p>By Definition 1, a K-satisfiable S will make (SU 1 ) SU 1 "nearly" identity as well as down-weight the tail eigenvalues. Such a matrix will be able to extract the principle information from the kernel matrix. A special case of the above "K-satisfiability" condition was studied in <ref type="bibr">[7]</ref> by fixing &#955; as the optimal estimation rate. However, by choosing a range of &#955; as threshold to select the leading eigenvalues, our general form of "K-satisfiability" condition allows us to study estimation and testing in a unified framework.</p><p>Besides, we need the following definition which will make the statement of our assumptions more precise and concise. Definition 2. An event E is said to be of (a, b)-type for a, b &#8712; (0, &#8734;],</p><p>Definition 2 describes events whose probabilities have exponential type lower bounds. It is easy to see that, if E is of (a, b)-type, then P(E)</p><p>In particular, E is of (&#8734;, &#8734;)-type if and only if E occurs almost surely.</p><p>Throughout the rest of this paper, assume the following condition on S. Assumption A3. (a) s &#8805; qs &#955; for a sufficiently large constant q &gt; 0. (b) There exist c 1 , c 2 &#8712; (0, &#8734;] such that the event "S is Ksatisfiable" is of (c 1 s, c 2 s &#955; )-type.</p><p>Assumption A3 (a) requires a sufficient amount of random projections to preserve data information. Assumption A3 (b) requires S to be K-satisfiable with high probability which holds in a broad range of situations such as matrix of sub-Gaussian entries (Example 3) and certain data dependent matrix (Example 4). Example 3. Let S be an s &#215; n random matrix of entries S ij / &#8730; s, i = 1, . . . , s, j = 1, . . . , n, where S ij are independent (not necessarily identically distributed) sub-Gaussian variables. Examples of such sub-Gaussian variables include Gaussian variables, bounded variables such as Bernoulli, multinomial, uniform, variables with strongly log-concave density (see <ref type="bibr">[42]</ref>), or mixtures of sub-Gaussian variables. The following lemma shows that Assumption A3 (b) holds in all these situations. Lemma 3.2. Let S ij : 1 &#8804; i &#8804; s, 1 &#8804; j &#8804; n be independent sub-Gaussian of mean zero and variance one, and &#955; &#8712; (1/n, 1). If s &#8805; qs &#955; for a sufficiently large constant q, then Assumption A3 (b) holds for The proof of Lemma 3.2 relies on the bound of the tail sums of empirical eigenvalues in Lemma 3.1. We point out that obtaining the eigen-decomposition in Example 4 is as burdensome as computing the matrix inverse, which is not preferred in practice. Rather, the purpose of this example is to directly illustrate one situation that Assumption A3 can be satisfied.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Testing Consistency</head><p>In this section, we derive the null limit distribution of (standardized) T n,&#955; as standard Gaussian, and then extend our result to the case of composite hypothesis testing. Theorem 3.3. Suppose that &#955; &#8594; 0 and s &#8594; &#8734; as n &#8594; &#8734;.</p><p>Suppose Assumption A2 is satisfied. Then under H 0 , we have</p><p>Here,</p><p>Theorem 3.3 holds once s diverges (no matter how slowly). Theorem 3.3 implies the following testing rule at significance level &#945;:</p><p>where z 1-&#945;/2 is the 100 &#215; (1 -&#945;/2)th percentile of N (0, 1).</p><p>As an important consequence of Theorem 3.3, we comment that the optimal estimation rate in <ref type="bibr">[7]</ref> can also be obtained as a by-product, that is</p><p>where r 2 n,&#955; = &#955; + &#181; n,&#955; . The proof of ( <ref type="formula">12</ref>) is sketched as follows. Suppose that f 0 &#8712; H is the "true" function in <ref type="bibr">(1)</ref>.</p><p>where E is the expectation w.r.t. = ( 1 , &#8226; &#8226; &#8226; , n ) with i the random noise in the regression function defined in <ref type="bibr">(1)</ref>. By direct examinations, it can be shown that</p><p>This completes the proof of <ref type="bibr">(12)</ref>. The above discussions are summarized in the following corollary.</p><p>Corollary 3.4. Suppose that 1/n &lt; &#955; &lt; 1 and Assumption A1-A3 holds. Then with probability approaching one, it holds that</p><p>where r 2 n,&#955; = &#955; + &#181; n,&#955; and C is an absolute constant. From Corollary 3.4, the best upper bound can be obtained through balancing &#955; and &#181; n,&#955; . Denote &#955; &#8224; the optimizer. This in turn provides a lower bound s &#8224; for s according to <ref type="bibr">(10)</ref>, i.e., s &#8224; = s &#955; &#8224; . In Section 3.5, we will show that the upper bound under &#955; &#8224; is minimax optimal, and further provide explicit orders for s &#8224; in concrete settings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Power Analysis</head><p>In this section, we investigate the power of T n,&#955; under a sequence of local alternatives by assuming f &#8712; B = {f &#8712; H : f H &#8804; C} for some existing constant C.</p><p>For any generic 0-1 valued testing rule &#966; where &#966; = 0 if H 0 is preferred and 1 otherwise, define the total error Err(&#966;, d n ) of &#966; under a separation rate d n &gt; 0 as</p><p>Notice E(&#966; | H 0 is true) is the probability of making a type I error and E(1 -&#966; | H 1 is true) is the probability of making a type II error, and the total error represents the maximum possible type I error and type II error. The separation rate d n is used to measure the distance between the null and the alternative hypotheses. Intuitively, the smaller d n is, the harder it is to distinguish the alternative hypothesis from the null. For any &#949; &#8712; (0, 1), define the minimax separation rate d * n (&#949;) as</p><p>where the infimum in <ref type="bibr">(15)</ref> is taken over all 0-1 valued testing rules based on samples ((x 1 , y 1 ), . . . , (x n , y n )). d * n (&#949;) characterizes the smallest separation between the null and local alternatives such that there exists a testing approach with a total error of at most &#949;. <ref type="bibr">[24]</ref> [25] established the minimax separation rate and revealed its difference with optimal estimation rate. Their work are derived based on Gaussian sequence model with focus on the Sobolev class or the Besov class with polynomial decaying eigenvalues. Recently, <ref type="bibr">[26]</ref> established the minimax nonparametric testing rate under a general eigen-decaying framework including the polynomial decay and exponential decay kernels. Next, we show that our proposed Wald-type test can achieve the minimax separation rate under appropriate &#955; and s.</p><p>For any f &#8712; H, define the squared separation rate</p><p>In the following Theorem 3.5, we show that T n,&#955; can achieve high power provided that s diverges fast enough and the local alternative is separated from the null by at least an amount of d n,&#955; . It is sufficient to minimize the separation rate d n,&#955; to achieve optimal testing. We show that our test can achieve the minimax rate of testing by selecting &#955; to balance the trade-off between the bias of f R and the standard derivation of T n,&#955; shown in <ref type="bibr">(16)</ref>. Theorem 3.5. Suppose that 1/n &lt; &#955; &#8594; 0 as n &#8594; &#8734;, Assumption A1-A2 are satisfied, and Assumption A3 holds for c 1 , c 2 &#8712; (0, &#8734;]. Then for any &#949; &gt; 0, there exist positive constants C &#949; and N &#949; such that, with probability greater than 1 -e -c1s -e -c2s &#955; ,</p><p>where d n,&#955; := &#955; + &#963; n,&#955; and B = {f &#8712; H : f H &#8804; C} for some existing constant C and P f (&#8226;|x, S) is the conditional probability measure under f given x, S.</p><p>In view of Theorem 3.5, to maximize the power of T n,&#955; , one needs to minimize d n,&#955; = &#955; + &#963; n,&#955; through balancing &#955; and &#963; n,&#955; . Denote &#955; * the optimizer. The lower bound s * for s is obtained via <ref type="bibr">(10)</ref>, i.e., s * = s &#955; * . The explicit forms of &#955; * and s * varies for different reproducing kernels, and lead to specific optimal testing rate, depending on their eigendecay rate.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Parametric versus nonparametric fits</head><p>In practice, it is often of interest to test certain structure of f , e.g., linearity,</p><p>where L(X ) is the class of linear functions over X &#8838; R a . Testing H linear 0 can be easily converted into simple hypothesis testing problem. Intuitively, if the parametric structure is right, then its residuals should be patternless and independent of input features. So we can apply non-parametric smoothing to the parametric residuals and see if their pattern is approximately zero everywhere; see Section 4.2 in <ref type="bibr">[25]</ref>.</p><p>Denote f * n as the least square estimator of f , f R as the randomly projected KRR estimator, and f 0 is a hypothesized "true" parameter with unknown value. Write</p><p>It can be shown that T Suppose Assumption A2 is satisfied. Then under H linear 0 , we have</p><p>Here, &#181; * n,&#955; := E H0 {T * n,&#955; |x, S} = tr(&#8710; 2 )/n, &#963; * 2 n,&#955; := Var H0 {T * n,&#955; |x, S} = 2tr(&#8710; 4 )/n 2 To characterize the power of the composite hypothesis testing problem in <ref type="bibr">(17)</ref>, we define the test error under a separation rate d n &gt; 0,</p><p>where &#966; * is any desion rule for hypothesis testing problem in <ref type="bibr">(17)</ref> and P L (X )f is the projection of f on L(X ). Notice that f -P L (X )f = 0 under the null hypothesis; the magnitude of ||f -P L (X )f || 2 n characterize how far the f is deviated from a linear function. Since the plugin estimate f * n approaches P L(X ) f with 1/n rate, the separation rate for the decision rule for &#966;</p><p>) is the same as &#966; n,&#955; given in Theorem 3.5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5">Examples</head><p>Next, we derive the lower bounds for s to achieve optimal estimation and testing in two featured examples: PDK and EDK, based on the main results obtained in Corollary 3.4 and Theorem 3.5. It is easy to check that Assumption A1 and A2 hold for these two examples; see Section S.5.3 in Supplementary.</p><p>Theorem 3.7. For the two kinds of eigenvalue decaying rates, suppose Assumption A3 holds. Suppose that 1/n &lt; &#955; &#8594; 0 as n &#8594; &#8734;, then with probability approaches 1, it holds that &#181; n,&#955; s &#955; /n and &#963; 2 n,&#955;</p><p>Furthermore, we have the following optimal estimation and testing rates by properly choosing the tuning parameters and the lower bound of projection dimension:</p><p>&#8226; Polynomially decaying kernel (with &#181; i i -2m ) -When &#955; n -2m 2m+1 and s n</p><p>4m+1 and s n 2 4m+1 with m &gt; 3/2, T n,&#955; achieves the minimax optimal rate of testing n -2m 4m+1 .</p><p>&#8226; Exponentionally decaying kernel (with &#181; i exp(-&#947;i p ))</p><p>1 , T n,&#955; achieves the minimax optimal rate of testing n -1 2 (log n)</p><p>.</p><p>The derivation of optimal estimation and testing rates is attributed to the accurate characterization of &#181; n,&#955; and &#963; n,&#955; by employing Lemma 3.1 to bound the tail sums of empirical eigenvalues.</p><p>Plugging in &#181; n,&#955; s &#955; /n to Corollary 3.4, we have f R with the convergence rate r 2 n,&#955; = &#955; + s &#955; /n. As shown in <ref type="bibr">(13)</ref>, &#955; is the squared bias of f R , and s &#955; /n quantifies the variance of f R . Hence, the optimal estimation rate r &#8224;2 n,&#955; is achieved via the bias-variance tradeoff as follows</p><p>We denote the choice of &#955; and lower bound of s to achieve the optimal estimation rate as &#955; &#8224; and s &#8224; respectively. To find the lower bound for s in achieving optimal testing, by Theorem 3.5, d 2 n,&#955; &#955; + &#8730; s &#955; /n, then the optimal separation rate d * n can be achieved by another type of tradeoff, i.e., the squared bias of f R v.s. the standard derivation of T n,&#955; , as follows</p><p>We use &#955; * and s * to represent the optimal &#955; and s &#955; to achieve d * n 2 . Take PDK as an example, plugging in &#181; i i -2m to the definition of s &#955; in (10), we have s &#955; &#955; -1 2m . Then Theorem 3.7 can be directly achieved based on the above two types of tradeoffs in estimation and testing. The results for EDK can be achieved similarly. We conclude our findings of this section in the following Table <ref type="table">1</ref>.</p><p>It is worth emphasizing that &#955; &#8224; , s &#8224; are different from &#955; * , s * due to different types of trade-off discussed above, indicating a fundamental difference between estimation and testing ( <ref type="bibr">[24]</ref>, <ref type="bibr">[25]</ref>). Figure <ref type="figure">2</ref> summarizes the two different types of trade-off to achieve the minimax rate in estimation and testing.</p><p>In the following Theorem 3.8, we show that the upper bound between f R and f n can fall below the statistical error f n -f 0 n by further increasing the projection dimension.</p><p>1. In fact, for EDK, s * (log n -1 2p log log n) 1/p . For simplicity, we keep the main term s * (log n)  This result improves upon existing approximation error bound in <ref type="bibr">[7]</ref> that is of the same order as the statistic error. When s &gt; s &#8224; , the difference between f R and f n is ignorable compared with the difference between f n and f 0 . Theorem 3.8. Let &#948; &gt; 0 satisfying &#948; &#8804; &#955;. Define s = argmin{j : &#181; j &lt; &#948; }. Suppose s &#8805; cs , then with probability approaching 1,</p><p>Furthermore, when &#948; &#955; and &#955; &#8594; 0 as n &#8594; &#8734;,</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.6">Sharpness of s &#8224; and s *</head><p>In this section, we will show that s * and s &#8224; derived in PDK and EDK are actually sharp. For technical convenience, define</p><p>Our first result is about the sharpness of s &#8224; . Theorem 3.9 shows that when s s &#8224; , there exists a true function f such that f R -f 2 n is substantially slower than the optimal estimation rate. Our proof is constructive in the sense we construct the above true function as n i=1 K(x i , &#8226;)w i with w i being selected from the orthogonal complement of a subsapce properly generated by S and K; see Appendix 7.2 for details. Theorem 3.9. Suppose s = o(s &#8224; ). Then for any s &#215; n random matrix S satisfying Assumption A3, with probability greater than 1 -e -cn&#948;n , it holds that</p><p>where c is a constant independent of n.</p><p>Our second result is about the sharpness of s * . Theorem 3.10 shows that when s s * , there exists a local alternative f that is not detectable by T n,&#955; even when it is separated from zero by d * n . In this case, the asymptotic testing power is actually smaller than &#945;. The proof of Theorem 3.10 is similar as that of Theorem 3.9, except that a different true function is constructed; see Appendix 7.3 for details. </p><p>where c is a constant independent of n. Recall 1 -&#945; is the significance level.</p><p>In view of Theorems 3.5 and 3.10, we observe a subtle phase transition phenomenon for testing signals as shown in Figure <ref type="figure">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">ADAPTIVE TESTING</head><p>In this section, we focus on the case of PDK as a leading example, and construct an adaptive testing procedure that does not require any exact prior knowledge on m except for m &#8805; 2. The adaptive procedure is proven to achieve the minimax rate of testing established by <ref type="bibr">[44]</ref> (up to an iterative-logarithmic term).</p><p>Consider an RKHS generated by a PDK of order m * &#8805; 2, i.e., H = H m * . To reflect the role of m, we modify all previous notation by adding a subscript m. For example, let K m (&#8226;, &#8226;) be the reproducing kernel function associated with H m , and K m = 1 n [K m (x i , x j )] 1&#8804;i,j&#8804;n be the corresponding empirical kernel matrix. Let S m be an s m &#215; n projection matrix. We will construct the corresponding f R,m (&#8226;) based on <ref type="bibr">(7)</ref> under S m and &#955; m . Here</p><p>and the corresponding projection dimension s m is an integer satisfying</p><p>where q &gt; 0 is a sufficiently large constant.</p><p>Given each m, the sketched KRR estimator f R,m = &#8710; m y, where</p><p>Denote m n (log n) d0 for a constant d 0 &#8712; (0, 1/2). Based on T n,m , our adaptive testing procedure is constructed as follows.</p><p>Step 1. For any 2 &#8804; m &#8804; m n &#8594; &#8734;, standardize T n,m as</p><p>.</p><p>Step 2. Calculate &#964; * n = max 1&#8804;m&#8804;mn &#964; m .</p><p>Step 3. Find</p><p>By allowing m n &#8594; &#8734;, the unknown m * will be eventually covered over a sequence of test statistics. Under the null hypothesis (8), T n,m = 1 n &#916; 2 m , and thus &#964; m is of a standardized quadratic form. Then, &#964; * n is the maxima of a sequence of dependent &#964; m 's. Based on a recent Gaussian approximation result in <ref type="bibr">[31]</ref>, i.e., Lemma S.4 (stated in Supplementary), we prove in the following Theorem 4.1 that the null limit distribution of &#964; n,mn is some extreme value distribution.</p><p>Theorem 4.1. Suppose that m n (log n) d0 for a constant d 0 &#8712; (0, 1/2), and, for 2 &#8804; m &#8804; m n , S m satisfies Assumption A3 (b) with projection dimension s m . Then, under H 0 in (8), for any &#945; &#8712; (0, 1), it holds that</p><p>Our next result states that the above adaptive testing procedure is asymptotically minimax optimal. Specifically, Theorem 4.2 shows that &#964; n,mn achieves high power if the local alternative is separated from zero by an order &#948;(n, m * ) defined as</p><p>And, <ref type="bibr">[44]</ref> showed that &#948;(n, m * ) is minimax optimal rate for adaptive testing.</p><p>Theorem 4.2. Suppose that m n (log n) d0 for a constant d 0 &#8712; (0, 1/2), and S m satisfies Assumption A3 (b) with projection dimension s m . Then, for any &#949; &gt; 0, there exist positive constants C &#949; , N &#949; for any n &#8805; N &#949; , with probability approaching 1,</p><p>In the end, we point out that the lower bound for s m given in <ref type="bibr">(20)</ref> is slightly smaller than the sharp lower bound for s derived in the non-adaptive case; see Table <ref type="table">1</ref>. This is not surprising since the corresponding minimax rate &#948;(n, m * ), i.e., <ref type="bibr">(23)</ref>, is larger than the non-adaptive rate, i.e., n -2m * /(4m * +1) .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">NUMERICAL STUDY</head><p>In this section, we examine the performance of the proposed testing procedure through simulation studies in Sections 5.1 and 5.   &#8764; N (0, 1) and c is a constant. To fit the model, we consider the periodic Sobolev kernel with eigenvalues &#956; 2i = &#956; 2i-1 = (2&#960;i) -2m for i &#8805; 1; see <ref type="bibr">[46]</ref> for details. Set n = 2 9 , 2 10 , 2 11 , 2 12 , and H 0 : f = 0. The significance level was chosen as 0.05 and the Gaussian random projection matrix was applied in this setting.</p><p>We examined the empirical performance of the distancebased test (DT) T n,&#955; , and adaptive test (AT) &#964; n,mn . For DT, the projection dimension s was chosen as 2n &#947; for &#947; = 1/(4m+1), 2/(4m+1), 3/(4m+1), with m = 2 corresponding to cubic splines. For AT, the projection dimensions s m was chosen as 2n &#947; (log log n)</p><p>We propose a data-adaptive smoothing parameter slection rule to guarantee testing optimality. Specifically, we choose the optimal smoothing parameter &#955; * satisfying</p><p>where &#963; n,&#955; is defined in Theorem 3.3 as &#963; n,&#955; = 2trace(&#916; 4 )/n 2 with &#916; = KS (SK 2 S + &#955;SKS ) -1 SK as a projection version of the classical smoothing matrix.</p><p>Empirical size was evaluated at c = 0, and power was evaluated at c = 0.01, 0.02, 0.03. Both size and power were calculated based on 500 independent replications. Figure <ref type="figure">3</ref> shows that the size of both DT and AT approach the nominal level 0.05 under various choices of (s, n), demonstrating the validity of the proposed testing procedure. Figure <ref type="figure">4</ref> displays the power of DT and AT. Under various choices of c and &#947;, it is not surprising to see from Figure <ref type="figure">4</ref> (a), (c), and (e) that the power of DT approaches one as n or c increases. Rather, a key observation is that the power cannot be further improved as &#947; grows beyond the critical point 2/(4m + 1) when c &#8805; 0.02. This is consistent with our theoretical result; see Theorem 3.10. Similar patterns have been observed for the power of AT in Figure <ref type="figure">4</ref> (b), (d), and (f ). Of course, the power of AT is usually lower than that of DT under the same setup, especially when the signal strength is weak. This is the price paid for adaptivity. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Simulation Study II: EDK</head><p>In this section, we consider a multivariate case and test H 0 : f = 0. Data were generated from where (x i1 , x i2 , x i3 ) follows from N (&#956;, I 3 ) with &#956; = (0, 0, 0), i &#8764; N (0, 1), and c &#8712; {0, 0.05, 0.1, 0.15}. Specifically, we chose the Gaussian kernel</p><p>We considered sample sizes n = 2 9 to n = 2 12 and sketch dimensions s = 1.2 log(n), 1.2(log n) 3/2 , 1.2(log n) 2 . For each pair (n, s), experiments were independently repeated 500 times for calculating the size and power. Interpretations for Figure <ref type="figure">5</ref> about the size and power are similar to those for Figures <ref type="figure">3</ref> and<ref type="figure">4</ref>. Interestingly, we observe that the power increases dramatically as &#947; increases from 1 to 1.5, while becomes stable near one as &#947; &#8805; 1.5. This is consistent with Theorem 3.7. Figure <ref type="figure">6</ref> demonstrates the significant computational advantage of DT (corresponding to &#947; &lt; 1) over the testing procedure based on standard KRR (corresponding to &#947; = 1).</p><p>In the supplementary, we conduct additional synthetic experiments under the same simulation setup as Sections 5.1 and 5.2 except for using the Bernoulli random matrix. As shown in Figures S.1-S.3, the interpretations remain the same.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">DISCUSSION</head><p>The main contribution of this paper is to apply random projection to nonparametric testing, and propose the first "sharp" result regarding projection dimension to guarantee minimax optimal testing respectively for a general class of random projections. In practice, many other random projections also satisfy K-satisfiability, for example, the randomized orthogonal system (ROS) sketches introduced   in <ref type="bibr">[47]</ref>, <ref type="bibr">[7]</ref>, by randomly sampling and rescaling the rows of a fixed orthonormal matrix. For the ROS examples, Assumption A3 also holds with extra logarithm factor in projection dimension, see Lemma S.8 in supplementary. Stochastic approximation is another computationally efficient method for nonparametric learning. A representative approach is the stochastic gradient descent (SGD) algorithm. In SGD, the total step size within n steps iteration plays the role of the regularization to avoid overfitting, and the SGD estimator ( <ref type="bibr">[16]</ref>) can achieve the optimality provided that the total step size has the same order of the effective dimension which is represented by s &#955; in our work. <ref type="bibr">[48]</ref> established an early stopping rule for gradient descent algorithm to achieve optimal nonparametric testing. The result shows that the total step size in gradient descent plays the same role as 1/&#955; in classic KRR, and the optimal testing rate can be achieved by the same "bias-standard deviation" tradeoff while the bias and the standard deviation are represented as functions of the total step sizes. In sketched KRR, we showed that the projection dimension is required to be greater than the effective dimension s &#955; to guarantee high power performance. Therefore, although the sketched KRR and SGD are two different computationally efficient approaches, they are connected through the statistical effective dimension.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">PROOF OF MAIN RESULTS</head><p>In this section, we present the main proofs of Lemma 3.1, sharpness properties including Theorem 3.9, Theorem 3.10.</p><p>Proofs for the rest of Lemmas and Theorems can be found in the Supplementary.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1">Proof of Lemma 3.1</head><p>Before the proof of Lemma 3.1, we first state some definitions and preliminary lemmas. Define &#954; &#955; = s &#955; n&#955; , for any &#955; &gt; 0. In fact, &#954; &#955; is the variance-to-bias ratio, where &#955; and s &#955; /n correspond to (squared-)bias and variance of f R , respectively; see Corollary 3.4 and Lemma S.3 for details. Consider a bundle of function classes indexed by &#954; &#955; :</p><p>Define a generalized version of local Rademacher complexity function</p><p>where &#963; 1 , . . . , &#963; n are independent Rademacher random variables, i.e., P(&#963; i = 1) = P(</p><p>When &#954; &#955; 1, &#936; &#955; (&#8226;) and &#936; &#955; (&#8226;) become the original local Rademacher complexity functions introduced in <ref type="bibr">[32]</ref>. Note that &#954; &#955; 1 actually corresponds to the optimal bias vs. variance trade-off required for estimation. Rather, a different type of trade-off is needed for optimal testing as revealed by <ref type="bibr">[22]</ref>, <ref type="bibr">[24]</ref>, which corresponds to a different choice of &#954; &#955; in F &#955; as demonstrated in Section 3.</p><p>In the following Lemma 5 we represent the generalized local Rademacher complexity function and its empirical version by a function of eigenvalues and &#954; &#955; . In Lemma 6, we further show that both &#936; &#955; and &#936; &#955; possess unique (positive) fixed points. This fixed point property is crucial in proving Lemma 3.1. We defer the proof of Lemma 5 and Lemma 6 to Section S.5.2 in the Supplementary.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma 5.</head><p>(a) Suppose &#181; 1 &gt; 1/n. For any &#955; &gt; 1/n, it holds that</p><p>(b) For any &#955; &gt; 0, it holds that</p><p>Lemma 6. There exist uniquely positive r &#955; and r &#955; such that &#936; &#955; (r &#955; ) = r &#955; and &#936; &#955; ( r &#955; ) = r &#955; . Furthermore, if &#955; &gt; 1/n, then r &#955; s &#955; /n, and there exists an absolute constant c &gt; 0 such that, with probability at least 1 -e -cs &#955; , r &#955; s &#955; /n.</p><p>We are ready to prove Lemma 3.1.</p><p>Proof Plugging these fixed points into <ref type="bibr">(26)</ref> and <ref type="bibr">(27)</ref> in Lemma 5, we have</p><p>By Lemma 6, we have r &#955; s &#955; /n, leading to r &#955; /&#954; &#955; &#955;; for the empirical version, with probability at least 1e -cs &#955; , r &#955; s &#955; /n leading to r &#955; /&#954; &#955; &#955;. Recall that s &#955; = argmin{i : &#181; i &#8804; &#955;} -1. Then by <ref type="bibr">(29)</ref>, with probability at least 1 -e -cs &#955; , where &#945; = (SK 2 S + &#955;SKS ) -1 SKy. Since the vector (SU D 1/2 ) &#945; belongs to the column space of D 1/2 U S &#8712; R n&#215;s , and dim(D 1/2 U S ) = s. Then</p><p>By choosing &#946; * orthogonal to the span of U S , so that D 1/2 &#946; * is orthogonal to D 1/2 U S , and then we have A 3 = 0. Applying the minimax characterization of eigenvalues of the n &#215; n matrix D 1/2 , we have with probability greater than 1 -e -cn&#948;n , when s s &#8224; ,</p><p>where c is a constant. Proof Without loss of generality, here we consider H 0 : f = f 0 with f 0 = 0. We need to prove that when projection dimension s is too small, for any random matrix S, there exists true function f , such that f H &#8804; C, f n &#8805; d * , our testing rule still cannot detect it. We show the existence of such true function f as follows. We construct the true f (&#8226;) = where S = SU . Denote &#8710; = (D S( SD 2 S + &#955; SD S) -1 SD 3/2 ) , then T 1 = n &#8710; D 1/2 &#946; * 2 2 . Notice that the span{&#8710;} &#8834; span{D 3/2 S }, 3/2 S ) = s. There always exists &#946; * &#8712; R n&#215;1 such that D 1/2 &#946; * is orthogonal to the column space of D 3/2 S , so that &#8710; D 1/2 &#946; * = 0 leading to T 1 = 0. Furthermore, based on the above argument, we have T 3 = 0. Then we have T n,&#955; -&#181; n,&#955; &#963; n,&#955; = T 2 /n -&#181; n,&#955; &#963; n,&#955; d -&#8594; N (0, 1).</p><p>On the other hand, the signal of such f satisfies </p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0"><p>n , where f R is a randomly sketched KRR estimator, and &#8226; n is the empirical norm. The sketched KRR estimator f R (<ref type="bibr">[7]</ref>) enjoys both theoretical support and computational efficiency, especially Authorized licensed use limited to: New Jersey Institute of Technology. Downloaded on August 01,2021 at 18:34:42 UTC from IEEE Xplore. Restrictions apply.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_1"><p>Authorized licensed use limited to: New Jersey Institute of Technology. Downloaded on August 01,2021 at 18:34:42 UTC from IEEE Xplore. Restrictions apply.</p></note>
		</body>
		</text>
</TEI>
