<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>A Quasi-Monte Carlo Data Structure for Smooth Kernel Evaluations</title></titleStmt>
			<publicationStmt>
				<publisher>Society for Industrial and Applied Mathematics</publisher>
				<date>01/01/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10529469</idno>
					<idno type="doi"></idno>
					
					<author>Moses Charikar</author><author>Michael Kapralov</author><author>Erik Waingarten</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[In the kernel density estimation (KDE) problem one is given a kernel K(x, y) and a dataset P of points in a high-dimensional Euclidean space, and must prepare a small-space data structure that can quickly answer density queries: given a point q, output a (1 + ε)-approximation to µ := (1/|P|) Σ_{p ∈ P} K(p, q). The classical approach to KDE (and the more general problem of matrix-vector multiplication for kernel matrices) is the celebrated fast multipole method of Greengard and Rokhlin [1983]. The fast multipole method combines a basic space partitioning approach with a multidimensional Taylor expansion, which yields a ≈ log^d(n/ε) query time (exponential in the dimension d). A recent line of work initiated by Charikar and Siminelakis [2017] achieved polynomial dependence on d via a combination of random sampling and randomized space partitioning, with Backurs et al. [2018] giving an efficient data structure with query time ≈ polylog(1/µ)/ε² for smooth kernels. The quadratic dependence on ε, inherent to the sampling (i.e., Monte Carlo) methods above, is prohibitively expensive for small ε. This is a classical issue addressed by quasi-Monte Carlo methods in numerical analysis. The high-level idea in quasi-Monte Carlo methods is to replace random sampling with a discrepancy-based approach, an idea recently applied to coresets for KDE by Phillips and Tai [2020]. The work of Phillips and Tai gives a space-efficient data structure with query complexity ≈ 1/(εµ). This is polynomially better in 1/ε, but exponentially worse in 1/µ. In this work we show how to get the best of both worlds: we give a data structure with ≈ polylog(1/µ)/ε query time for smooth kernel KDE. Our main insight is a new way to combine discrepancy theory with randomized space partitioning inspired by, but significantly more efficient than, that of the fast multipole methods. We hope that our techniques will find further applications to linear algebra for kernel matrices.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>In the kernel evaluation problem, an algorithm receives as input a dataset P of n points p_1, …, p_n ∈ R^d and must preprocess it into a small-space data structure that allows one to quickly approximate, given a query q ∈ R^d, the quantity K(P, q) := Σ_{p ∈ P} K(p, q), (1.1) where K(p, q) is the kernel function. In this paper, we study positive definite (p.d.) radial kernels K. In particular, we consider kernel functions K : R^d × R^d → R_{≥0} given by a decreasing function of the distance, i.e., there exists a function G : R_{≥0} → R_{≥0} such that K(p, q) = G(‖p − q‖₂²). The restriction that K is positive definite means that for any collection of points x_1, …, x_m ∈ R^d, the m × m kernel matrix whose (i, j)-entry is K(x_i, x_j) is positive definite. Some examples of prominent p.d. radial kernels include the Cauchy kernel (or, more generally, the rational quadratic kernel, for any β ≥ 1) and the Student-t kernel, given respectively by K(x, y) = 1/(1 + ‖x − y‖₂²)^β and K(x, y) = 1/(1 + ‖x − y‖₂^t). (1.2) A variety of kernels are used in applications <ref type="bibr">[STC+04,</ref><ref type="bibr">RW06]</ref>, and kernel methods are a fundamental approach with numerous applications in machine learning, statistics and data analysis [FG96, SS01, JKPV11, SZK14, GPPV+14, ACMP15, GB17].</p><p>For example, kernel density estimation is a classic tool in non-parametric statistics, where a kernel function is applied to extrapolate a function specified on a discrete set of points to the entire space. This is used in algorithms for mode estimation <ref type="bibr">[GSM03]</ref>, outlier detection <ref type="bibr">[SZK14]</ref>, density-based clustering <ref type="bibr">[RW10]</ref> and other problems. 
Kernel methods (e.g., regression <ref type="bibr">[SSPG16]</ref>) have also been applied to objects specified by point clouds or distributions <ref type="bibr">[GFKS02]</ref>, requiring summing up the kernel function between all pairs of points across two sets. Another application is kernel mean estimation using an empirical average (see <ref type="bibr">[MFS+17]</ref>, Section 3.4.2).</p><p>We are interested in fast and space-efficient approximation algorithms for kernel evaluations. An estimator Z is a (1 + ε)-approximation to K(P, q) if (1 − ε) · K(P, q) ≤ Z ≤ (1 + ε) · K(P, q). The kernel density estimation problem has received a lot of attention over the years, with a number of powerful algorithmic ideas leading to different precision/space/query-time tradeoffs. The focus of this work is algorithms for the so-called "high-dimensional" regime: we study algorithms whose complexity depends at most polynomially, as opposed to exponentially, on the underlying dimension.</p><p>Prior Work: Fast Multipole Methods. The celebrated fast multipole method <ref type="bibr">[BG97]</ref> can be used to obtain efficient data structures for kernel evaluation (see also the Barnes-Hut algorithm <ref type="bibr">[BH86]</ref>). However, this approach suffers from an exponential dependence on the dimensionality of the input data points: it provides a (1 + ε)-approximation to the kernel evaluation value using space ≈ n log^d(n/ε) and query time ≈ log^d(n/ε).</p><p>In particular, the exponential dependence on the dimensionality is due to a (deterministic) space partitioning procedure (essentially building a quadtree) which is central to the fast multipole method. More generally, this deficiency is shared by other tree-based methods for kernel density estimation [GM01, GM03, YDGD03, LMG06, RLMG09]. 
Methods based on polynomial approximations have recently been used to obtain fast algorithms for kernel graphs <ref type="bibr">[ACSS20]</ref> as well as attention mechanisms in transformers <ref type="bibr">[AS23]</ref>; all of these approaches are only efficient in relatively low dimensions.</p><p>Prior Work: Sampling-based approaches (Monte Carlo methods). A recent line of work [CS17a, CS19, BCIS18b, BIW19, CKNS20a] sought to design sublinear query-time algorithms for kernel density estimation while avoiding exponential dependencies on the dimension (thereby allowing these methods to scale to high-dimensional spaces). These works parametrize the query time of the data structure in terms of the value µ = K(P, q)/n, and the goal is to achieve query times which are significantly faster than O(d/(ε²µ)), which is what one achieves via uniform random sampling. Surprisingly, <ref type="bibr">[CS17b]</ref> showed how, for several kernels, one can reduce the query time to O(d/(ε²√µ)) by using the Locality-Sensitive Hashing framework of Indyk and Motwani <ref type="bibr">[IM98]</ref>. Furthermore, <ref type="bibr">[BCIS18b]</ref> (and also <ref type="bibr">[CKNS20a]</ref>) developed the approach further, and showed that for "smooth" kernels, i.e., kernels with polynomial decay, a multiplicative approximation can be obtained with just a polylogarithmic dependence on 1/µ. Specifically, they achieve a (1 + ε)-approximation using space ≈ n · polylog(1/µ)/ε² and query time ≈ polylog(1/µ)/ε². (1.3)</p><p>Prior Work: Quasi-Monte Carlo (i.e., discrepancy-based) methods. In the approaches above, the focus has always been on the dependence on µ, with the additional factor of 1/ε² a consequence of the sampling-based approach. More broadly, this quadratic dependence on 1/ε is unsurprising. 
It is common, in approaches via random sampling, to design an estimator and utilize Chebyshev's inequality to upper bound the probability that the estimator deviates from its expectation. On the other hand, in applications where the 1/ε² dependence (coming from random sampling) is prohibitively expensive, a common approach in numerical analysis is to use quasi-Monte Carlo methods.</p><p>At a high level, these techniques are based on discrepancy theory. They seek to find a "random-like" set of samples (such that the estimator will approximate the expectation), and at the same time, minimize the "random deviations" that one expects from random sampling. In the context of kernel density estimation, this idea was used in the beautiful work of Phillips and Tai <ref type="bibr">[PT20]</ref>, who used discrepancy theory to build a small coreset for kernel density estimation. As we detail in Section 2, standard subsampling approaches correspond to choosing the colors uniformly at random, but <ref type="bibr">[PT20]</ref> show that a coloring with much lower discrepancy can be constructed using Banaszczyk's theorem. Repeatedly 'subsampling' using these low-discrepancy colorings, they give a data structure with space ≈ poly(d)/(εµ) and query time ≈ poly(d)/(εµ). (1.4)</p><p>This gives yet another way to improve on the O(d/(ε²µ)) query time that one achieves via uniform random sampling. 2 In the work of <ref type="bibr">[PT20]</ref>, the focus is on the dependence on ε, and the dependence on 1/µ remains linear. 3</p><p>Our contribution. In this work, we show how to achieve the best of both (1.3) and (1.4) quantitatively. Namely, we show how one may use randomized partitioning techniques (like those in <ref type="bibr">[BCIS18a,</ref><ref type="bibr">CKNS20a]</ref>) to obtain a poly-logarithmic dependence on 1/µ for smooth kernels, and at the same time, a linear dependence on 1/ε. 
The resulting data structure obtains space complexity ≈ n · poly(d log(1/µ))/ε and query time ≈ poly(d log(1/µ))/ε. At a qualitative level, our work gives new structural results and algorithmic techniques for both the sampling-based and quasi-Monte Carlo-based approaches, and we are hopeful that these techniques will prove useful for fast algorithms in related tasks for kernels in high-dimensional spaces.</p><p>From the perspective of the sampling-based methods, our work shows that the quadratic dependence on 1/ε is not intrinsic to randomized approaches for high dimensions, and that quasi-Monte Carlo (discrepancy-based) techniques can be used to design kernel density estimators in high dimensions. From the perspective of the quasi-Monte Carlo methods, our work shows that, if one allows randomized data structures, then randomized space partitioning can give exponential improvements on the µ-dependence of discrepancy for smooth kernels: from poly(d)/(εµ) to roughly poly(d log(1/µ))/ε. In what follows, we formally state our results and some open problems, and give an outline of our techniques.</p><p>Our results. Our focus is on developing fast data structures for "smooth" p.d. radial kernels. We reproduce the definition of a "smooth" kernel from <ref type="bibr">[BCIS18a]</ref> below, and then we state our main result. As we will soon see, the smoothness condition will become important in the technical details. After discussing the main result, we give a few open problems and highlight a few concrete challenges involved in obtaining improved ε-dependencies for non-smooth kernels (like the Gaussian kernel).</p><p>Remark 1.1. We remark that the definition above encompasses kernels with polynomial decay, such as, for example, the rational quadratic kernel and its variants with polynomial decay (1.2). 
While certainly less popular than the Gaussian kernel, the rational quadratic kernel is commonly listed among the standard covariance functions (i.e., kernels) in the literature on Gaussian processes. See, e.g., Section 4.2.1 of <ref type="bibr">[RW06]</ref>, as well as Section 5.4 of the same book, which in particular presents settings where the rational quadratic covariance assumption leads to improvements over the Gaussian.</p><p>Theorem 1.1. (Main result (Informal; see Theorem 4.1)) For every ε ∈ (0, 1) and µ &gt; 0 there exists a randomized data structure for kernel evaluation of (L, t)-smooth p.d. radial kernels K with polynomial preprocessing time, space ≈ n · (d log(n∆/(εµ)))^O(t)/ε, and query time ≈ (d log(n∆/(εµ)))^O(t)/ε, which outputs a multiplicative (1 ± ε)-approximation to kernel evaluations whenever the kernel evaluation is at least µn, where ∆ is the aspect ratio of the points. This means that exponential dependencies on d should be avoided (as they incur super-polynomial factors in n), but arbitrary polynomial dependencies are allowed.</p></div>
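<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the baseline that Theorem 1.1 improves upon concrete, the following minimal sketch (illustrative only; the function names and parameters are our own, not the paper's data structure) contrasts exact kernel evaluation for the 2-Student kernel with the uniform-sampling Monte Carlo estimator, whose O(1/(ε²µ)) sample count carries the quadratic ε-dependence discussed above.</p><p>
```python
import math
import random

def student2_kernel(p, q):
    # 2-Student (Cauchy) kernel K(p, q) = 1 / (1 + ||p - q||_2^2), in (0, 1].
    d2 = sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    return 1.0 / (1.0 + d2)

def kde_exact(P, q):
    # Exact density mu = K(P, q) / n; Theta(n * d) time per query.
    return sum(student2_kernel(p, q) for p in P) / len(P)

def kde_monte_carlo(P, q, eps, mu_lower, seed=0):
    # Uniform-sampling baseline: with m ~ 1/(eps^2 * mu_lower) samples,
    # Chebyshev's inequality gives a (1 +/- eps)-approximation with
    # constant probability; note the quadratic dependence on 1/eps.
    rng = random.Random(seed)
    m = math.ceil(1.0 / (eps ** 2 * mu_lower))
    return sum(student2_kernel(rng.choice(P), q) for _ in range(m)) / m
```
</p><p>Halving ε quadruples the sample count here; the data structures discussed in this paper replace the 1/ε² with a 1/ε at the price of preprocessing.</p></div>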
<div xmlns="http://www.tei-c.org/ns/1.0"><head>3</head><p>In fact, the data structure of <ref type="bibr">[PT20]</ref> has a Las Vegas guarantee; it uses randomness to build the data structure, but then guarantees that all queries are correct.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>4</head><p>The dependence on the aspect ratio seems necessary when making no further assumptions on smooth kernels. This dependence can be removed under a mild assumption on the decay of the kernel (which holds for the Cauchy and Student-t kernels listed as examples).</p><p>The main conceptual contribution behind the result is a framework for combining discrepancy techniques with randomized space partitioning. To the best of our knowledge, ours is the first work that succeeds in combining these lines of work. At a technical level, the two main innovations that we introduce are (a) a strong bound on the discrepancy of the kernel matrix for smooth kernels under a geometric separability assumption akin to the one used in fast multipole methods, and (b) a way to partition a dataset (and potential queries) into a small number of pieces that ensures separation.</p><p>Dependence on the aspect ratio. We note that a dependence on the aspect ratio of the dataset is also present in <ref type="bibr">[BCIS18a]</ref>, and seems necessary for general smooth kernels. For instance, a natural approach to removing it would be to (a) remove data points that are poly(1/(εµ)) far from the query point from consideration and (b) discretize points without significantly changing the kernel values while ensuring that non-equal points have a minimum distance; (a) and (b) effectively upper bound the aspect ratio by upper bounding the maximum distance and lower bounding the minimum distance. However, this does not work without a closer look at the specific kernel function K. For example, the counting kernel K(x, y) = 1 for all x, y would mean one cannot do (a). For (b), consider the kernel K(x, y) = 1/(1 + σ²‖x − y‖₂²), where we let σ go to infinity. This kernel is smooth as per our definition for every σ, independent of L. At the same time, the value of the kernel starts dropping sharply at ‖x − y‖₂ ≈ 1/σ, so one should not give a discretization independent of σ. 
At the same time, the dependence on the aspect ratio can be removed for specific kernels such as the Cauchy or the Student-t kernel using the approach outlined above.</p><p>Dependence on (d log(n∆/(εµ)))^O(t). Our techniques naturally incur a polynomial dependence on d log(n∆/(εµ)), where ∆ denotes the aspect ratio, and the specific power of the exponent will be O(t) for (L, t)-smooth kernels K. For the Cauchy kernel or Student-t kernel, with L and t being O(1), the additional factors are poly-logarithmic. Interestingly, the dependence on the smoothness t in <ref type="bibr">[BCIS18a]</ref> is 2^O(t), which becomes important once t = Θ(log n).</p><p>1.1 Future Directions and Open Problems We hope that our techniques, which allow one to combine discrepancy methods with randomized space partitioning, will prove instrumental in resolving other exciting questions in numerical linear algebra with kernel matrices. We mention two prominent questions here.</p><p>• Kernel density estimation for the Gaussian kernel. Does there exist a data structure for kernel density estimation for the Gaussian kernel (i.e., K(p, q) = exp(−‖p − q‖₂²/2)) with polynomial space complexity and query time poly(d)/(ε · µ^0.99)? The Gaussian kernel is not "smooth", so the approach from this paper would degrade to (d log(1/µ))^Ω(d)/ε. A specific hurdle is that our partitioning techniques only ensure an Ω(1/√d) relative separation of the query and dataset (here 'relative' refers to the size of a bounding Euclidean ball; see Section 2 for more details). However, a natural extension of our data-dependent embeddings to the Gaussian kernel incurs an exponential dependence on the inverse of this relative separation; see Remark 3.1 for more details.</p><p>• Fast multipole methods with a polynomial dependence on dimension. 
In fast multipole methods one approximates a kernel matrix by first using a crude (ℓ∞ ball carving) space partitioning to partition space into bounded regions, and then Taylor-expands the kernel around the centers of these regions to approximate interactions between well-separated data points (see, e.g., <ref type="bibr">[BG97]</ref>). Both steps incur a loss exponential in the dimension: the former because a constant-factor separation is ensured in the crude ℓ∞ ball carving used, which requires breaking an ℓ∞ ball into d^Ω(d) balls of a factor √d smaller radius; the latter because a Taylor expansion has a number of terms exponential in the dimension of the dataset. The latter exponential loss was recently overcome by the work of <ref type="bibr">[AKK+20a]</ref>, who showed how to sketch the polynomial kernel (i.e., the Taylor expansion) with only a polynomial dependence on d. However, the approach of <ref type="bibr">[AKK+20a]</ref> suffers from a polynomial dependence on the radius of the dataset, as they do not supply a corresponding space partitioning primitive. Our space partitioning methods are able to exploit only an Ω(1/√d) relative separation, and only result in a polynomial dependence on d: can our techniques be used together with <ref type="bibr">[AKK+20a]</ref> to optimally sketch kernel matrices?</p><p>2 Technical Overview In this section, we give an overview of the techniques involved. In order to highlight our main contributions, it will be useful to consider the case of the 2-Student kernel K : R^d × R^d → [0, 1] for concreteness. This kernel is given by K(x, y) = 1/(1 + ‖x − y‖₂²). One receives a dataset P ⊂ R^d of points and seeks to process them into a data structure to support kernel evaluation queries. 
Namely, a query q ∈ R^d arrives, and we want to estimate Σ_{p ∈ P} K(p, q) up to a factor of 1 ± ε, and similarly to [CS17a, CS19, BCIS18b, BIW19, CKNS20a] we parametrize our time and space complexity in terms of µ = (1/n) Σ_{p ∈ P} K(p, q). We start by explaining two prior approaches which will be important ingredients in our scheme: first, a simple random sampling approach, and then the prior work of <ref type="bibr">[PT20]</ref>, which shows how to use discrepancy theory to obtain an improved dependence on ε. Then, we overview our approach. First, we show how we may improve on the discrepancy bounds when datasets are "well-separated" from a query, and then how to algorithmically utilize the improved discrepancy bounds in the worst case.</p><p>Random sampling as repeated coloring. It is not hard to see that, since kernel values are always between 0 and 1, a uniformly random sample from P of size O(1/(ε²µ)) will approximate the kernel evaluation of any q with probability at least 0.9. More generally, one may take samples from an unbiased estimator of (1/n) Σ_{p ∈ P} K(p, q), and show that the estimate is good by bounding the variance of the estimator. If proceeding with this plan, the quadratic dependence on 1/ε is a consequence of using Chebyshev's inequality, since the probability that the estimate is off by more than εµ (the quantity we want to minimize by taking more samples) becomes at most the variance divided by ε²µ². To facilitate comparison with discrepancy-based approaches, we now sketch an alternative derivation of this result. For this purpose it will be useful to instead think of repeatedly sampling data points in P with probability 1/2 and analyzing how errors in the corresponding KDE estimates accumulate.</p><p>Suppose that we would like to subsample the dataset P to a subset P′ containing about half of the points in P while preserving the KDE value. 
It is convenient to think of this process as coloring the points in P with a coloring χ : P → {−1, 1}, and then letting the 'subsampled' dataset P′ contain the points that were colored 1, say: P′ := {p ∈ P : χ(p) = 1}. We thus would like to find a coloring χ of P such that (2.5) |K(P, q) − 2 · K(P′, q)| is small. The left-hand side of (2.5) can be expressed as |Σ_{p ∈ P} χ(p) · K(p, q)| =: disc_K(P, χ, q), the discrepancy of the coloring χ with respect to query q. Choosing P′ to be a uniformly random subset of P containing every data point independently with probability 1/2 amounts to a uniformly random coloring χ of P, and a simple calculation shows that disc_K(P, χ, q) ≲ √(K(P, q)) for every kernel bounded by 1, with constant probability. Substituting back into (2.5) and taking the normalizing factor of 1/|P| into account, we get that the error introduced by subsampling is below εµ as long as √(K(P, q))/|P| = √(µ/|P|) ≤ εµ, i.e., |P| ≥ 1/(ε²µ). This means that we can keep subsampling 5 while |P| ≥ 1/(ε²µ). This recovers the bound from Chebyshev's inequality. 6</p><p>Better colorings via Banaszczyk's theorem <ref type="bibr">[PT20]</ref>. Instead of selecting the coloring randomly, [PT20] note that since K is a p.d. kernel with K(p, q) ≤ 1 for all p, q, there exists an embedding φ such that for every p, q ∈ R^d one has ‖φ(p)‖₂ ≤ 1 as well as K(p, q) = ⟨φ(p), φ(q)⟩. The existence of a coloring χ such that (2.6) disc_K(P, χ, q) ≤ O(γ₂(K)) for all q then follows by Banaszczyk's theorem; see Theorem 3.1. 7 Here γ₂(K) is the γ₂-norm of the kernel matrix K, which provides a commonly used route for upper bounding discrepancy; see Section 3 for more details. Substituting this bound into (2.5), we get that the error introduced by subsampling is below εµ as long as (2.7) |P| ≥ 1/(εµ). This means that we can keep subsampling while |P| ≥ 1/(εµ). This is a quadratic improvement on the ε-dependence from random sampling, but falls far short of our goal of polylog(1/µ)/ε. Furthermore, the bound on discrepancy provided by (2.6) is tight in general.</p><p>Our approach: discrepancy bounds for well-separated datasets. 
In order to get around the tightness of the above bound for general datasets, we show that every dataset P can be decomposed into a small number of datasets that are 'nice' with respect to any fixed query q ∈ R^d. This decomposition is independent of the query, and relies on randomized space partitioning in high dimensions akin to locality-sensitive hashing. We then design specialized feature embeddings for each element in this decomposition to establish a significantly stronger upper bound on the discrepancy of the corresponding sub-dataset with respect to q.</p><p>Our geometric assumptions are inspired by those in fast multipole methods <ref type="bibr">[BG97]</ref>. In fast multipole methods, the space R^d is deterministically recursively partitioned with ℓ∞-balls of geometrically decreasing diameter. The benefit of using ℓ∞-balls is that they tile R^d, and whenever p and q belong to two different ℓ∞-balls which are not adjacent (share a corner) at a particular radius, p and q are separated by at least the diameter. The downside is that, since Euclidean separation ultimately matters, one must decrease the radius by at least a constant (and, in fact, at least a √d) factor at every level, and this leads to an exponential dependence on the dimension, as the tree encoding the recursive partition has degree exp(Ω(d)). Our recursive partitioning will be randomized and use ℓ2-balls of randomly chosen radii. With such a partitioning scheme one can ensure a weaker separation, namely a relative Ω(1/√d) separation. We show, however, that this suffices! Specifically, we show strong discrepancy bounds when either (1) the dataset P lies in a spherical shell and the query q lies in a slightly larger spherical shell (see Fig. <ref type="figure">1</ref>) or 5 One can verify that the total error induced by a sequence of recolorings is dominated by the error introduced in the last step; see Section 3 for more details.</p></div>
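<div xmlns="http://www.tei-c.org/ns/1.0"><p>The subsampling-as-coloring loop described above can be sketched as follows. This toy version (the function names and the degenerate-coloring fallback are our own) uses uniformly random colorings, so it only illustrates the ≈ 1/(ε²µ) stopping rule; the low-discrepancy colorings of <ref type="bibr">[PT20]</ref> and of this work are what improve the threshold.</p><p>
```python
import random

def halve_by_coloring(P, rng):
    # Color each point -1/+1 uniformly and keep one color class; a random
    # coloring has discrepancy ~ sqrt(K(P, q)) for kernels bounded by 1.
    kept = [p for p in P if rng.choice((-1, 1)) == 1]
    if not kept or len(kept) == len(P):
        kept = P[: max(1, len(P) // 2)]  # degenerate coloring: fall back
    return kept

def subsample_coreset(P, eps, mu_lower, seed=0):
    # Keep halving while |P| >= 1/(eps^2 * mu): each halving then adds
    # error below eps * mu, and the total error is dominated by the error
    # of the last halving step (cf. footnote 5 and Section 3).
    rng = random.Random(seed)
    P = list(P)
    while len(P) >= 1.0 / (eps ** 2 * mu_lower) and len(P) > 1:
        P = halve_by_coloring(P, rng)
    return P
```
</p><p>Each halving doubles the implicit weight of the surviving points, so K(P, q) is estimated by (|P|/|P′|) · K(P′, q) on the returned subset P′; swapping the random coloring for a Banaszczyk-type coloring is exactly what lowers the stopping threshold from ≈ 1/(ε²µ) toward ≈ 1/(εµ).</p></div>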
<div xmlns="http://www.tei-c.org/ns/1.0"><head>6</head><p>Of course, this derivation also uses Chebyshev's inequality in bounding the discrepancy of a random coloring; however, as we show next, it readily generalizes to settings where the coloring is obtained by a more careful method than a uniformly random choice.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>7</head><p>Banaszczyk's theorem gives a distribution over (random) colorings which achieves an O(1) discrepancy for each q with very high probability, so in general, there is an O(√(log m)) factor, where m denotes the number of rows, i.e., queries, of the kernel matrix which one wishes to support. Figure 1: The point c ∈ R^d is the center of the ball, which may be assumed to be the origin after a translation, and we consider two well-separated shells at radii between [r_min, r_in] and [r_out, r_max] centered at c, where r_in &lt; r_out. We consider dataset points which lie in the inner shell, at distance between r_min and r_in from c, and queries which lie in the outer shell, at distance between r_out and r_max from c. This is the case we consider throughout the technical overview; the symmetric case when the dataset is inside a low-radius shell and the query is outside will be analogous.</p><p>(2) the query q lies inside a spherical shell and the dataset P lies in a slightly larger spherical shell (the opposite of (1)).</p><p>This in particular is where we crucially use the assumption that the kernel is smooth; see Remark 3.1 in Section 3 for more details.</p><p>In what follows we give a more detailed outline of how our feature embeddings are constructed, using the 2-Student kernel as a running example. We first define a basic feature embedding, then explain how to apply discrepancy theory to achieve good colorings via the γ₂-norm of the embedding, and then describe our modified feature embeddings that achieve strong discrepancy bounds in settings (1) and (2) above.</p><p>Feature Embeddings for Smooth Kernels. The crucial property of positive definite kernels K is that they may be represented as inner products in a (potentially infinite-dimensional) feature space. 
Consider the following explicit construction for the 2-Student kernel, which proceeds by taking the inverse Laplace transform of 1/(1 + ‖x − y‖₂²) and a Taylor expansion of e^x: 1/(1 + ‖x − y‖₂²) = ∫_{t=0}^{∞} e^{−t(1+‖x−y‖₂²)} dt = ∫_{t=0}^{∞} e^{−t} · e^{−t‖x‖₂²} · e^{−t‖y‖₂²} · Σ_{k≥0} (2t)^k ⟨x, y⟩^k / k! dt. We may now consider an embedding Φ which takes as input a vector x ∈ R^d and outputs the (infinite-dimensional) function Φ(x) whose inputs are a number t ∈ [0, ∞) and an index k ∈ Z_{≥0}, and sets Φ(x)(t, k) := e^{−t/2} · e^{−t‖x‖₂²} · √((2t)^k / k!) · x^{⊗k}. This representation has the benefit that for all x, y ∈ R^d, K(x, y) = ⟨Φ(x), Φ(y)⟩, where ⟨Φ(x), Φ(y)⟩ = ∫_{t=0}^{∞} Σ_{k≥0} ⟨Φ(x)(t, k), Φ(y)(t, k)⟩ dt.</p><p>Discrepancy on the Feature Space. The key to applying discrepancy minimization algorithms is understanding the so-called γ₂-norm of the kernel matrix, since this governs the discrepancy that we may achieve (and hence the number of times that we may halve the dataset). Consider the kernel matrix A = (K(q, p)) whose rows are indexed by a set of possible queries Q and whose columns are indexed by dataset points P. The γ₂-norm of A is the minimum, over all factorizations of A into UV, where U is a |Q| × d′ matrix and V is a d′ × |P| matrix, of the maximum row norm of U times the maximum column norm of V. By placing Φ(q) as the rows of U for each potential query in Q and placing Φ(p) as the columns of V for each dataset point in P, we obtain that γ₂(A) ≤ max_{q∈Q} ‖Φ(q)‖₂ · max_{p∈P} ‖Φ(p)‖₂. By construction, ‖Φ(q)‖₂² = ⟨Φ(q), Φ(q)⟩ = K(q, q), which equals 1 for any q (and similarly for p); this gives the bound on the γ₂-norm used by <ref type="bibr">[PT20]</ref>. As we show below, however, significantly stronger upper bounds on the γ₂-norm can be obtained if the dataset is well-separated from the query in an appropriate way; this, coupled with a new hashing-based procedure for reducing to the well-separated setting, leads to our improved bounds.</p><p>Modifying the Feature Embedding. 
Suppose that every dataset point p in P is contained within a shell of inner radius r_min and outer radius r_in, and that every query q in Q which we will consider is contained within a shell of inner radius r_out (which is larger than r_in) and outer radius r_max (the symmetric case, when the query is inside a shell close to the origin and the dataset points are in a shell far from the origin, will be analogous). See Fig. <ref type="figure">1</ref> for an illustration.</p><p>In Section 3.1, we show that this configuration gives an improved bound on the γ₂-norm of the kernel matrix. Indeed, the fact that q has norm at least r_out and every p in P has norm at most r_in guarantees that the points p and q are not too close to each other, i.e., (2.9) ‖p − q‖₂ ≥ r_out − r_in. For example, if we could support a small additive error η &gt; 0 (for which we will later incur only a logarithmic dependence, so we will set η to εµ/n), then it suffices to "cut off" the feature embedding at t₀ := ln(1/η)/(r_out − r_in)², because for any such "well-separated" pair of points p, q ∈ R^d, ∫_{t=t₀}^{∞} e^{−t(‖p−q‖₂²+1)} dt ≤ e^{−t₀(r_out−r_in)²} ∫_{t=t₀}^{∞} e^{−t} dt ≤ η. Once we introduce this change, we will exploit the fact that ‖p‖₂ ∈ [r_min, r_in] and ‖q‖₂ ∈ [r_out, r_max] in order to modify the embedding to Φ′ such that the norm of Φ′(p) will not increase too much, but the norm of Φ′(q) will decrease by a significant amount. Overall, we will show that ‖Φ′(p)‖₂ · ‖Φ′(q)‖₂, which gives an upper bound on the γ₂-norm of the matrix, will be much smaller. In particular, for a setting of ρ &gt; 1, we introduce the change (2.10), and Φ′ certifies an improved bound on the γ₂-norm of an additive η-perturbation of the kernel matrix. 
In particular, we upper bound the product ‖Φ′(p)‖₂ · ‖Φ′(q)‖₂ while using the fact that ‖p‖₂ ∈ [r_min, r_in] and ‖q‖₂ ∈ [r_out, r_max].</p><p>We point the reader to Section 3.1 with G(t) = (1 + t)^{−1}, where we show the bound (2.11) on this product. The above bound gives us the upper bound on the discrepancy that we will achieve, and this dictates how many times we may halve the dataset while incurring at most εµ error. Importantly, since the query q and every dataset point p considered (inside the shell) satisfy (2.9), we have a lower bound on what each point p from the shell contributes to the kernel evaluation for a query q ∈ Q: µ ≥ min_{p∈P, q∈Q} K(p, q) ≥ G((r_in + r_max)²), since r_in + r_max is an upper bound on the maximum distance from q to P. We will be able to ensure that r_max = O(r_out), r_in/r_min = O(1), and that r_out − r_in ≥ Ω(r_out/√d) (we expand on why in the next subsection). Since we have r_in/r_min = O(1) and (r_out − r_in)² ≥ r_out²/d, when we use the discrepancy bound in (2.11), each step of halving incurs error O(ln(1/η) · r_out²/d), and similarly to the discussion around (2.7), we may continue decreasing the dataset. Even though we have specialized the discussion to the 2-Student kernel, we use a characterization of positive definite radial kernels due to Schoenberg in order to use the above embeddings in general. We note that the smoothness assumption comes in the following way: the γ₂-norm bound depends on how r_min, r_in and r_out relate to each other, which factors into the number of times we may halve the dataset while incurring at most εµ in error. This must be compared to K(p, q) (which depends on ‖p − q‖₂) so that the additive error can be absorbed into ε · K(P, q)/|P|. (See also Remark 3.1.) Remark 2.1. 
Taking a broader perspective on kernel methods in high dimensions, we are not aware of any prior work which adapts the feature embeddings to the specific dataset for improved algorithms. As an example, sampling techniques for the Nyström method <ref type="bibr">[MM17]</ref>, random Fourier features [RR08, AKM+17], and sketching methods [ANW14, ACW17a, ACW17b, AKK+20b] always consider the feature embedding φ : ℝ^d → L₂ which gives ⟨φ(p), φ(q)⟩ = K(p, q) for all p, q ∈ ℝ^d. Our approach, of dividing the dataset and adapting the feature embeddings to the various parts of the dataset, fits nicely within a recent line of work on "data-dependent" techniques for high-dimensional datasets [AINR14, AR15, ALRW17, ANN+18, CKNS20b, CJLW22], and we are hopeful that such techniques are applicable in other algorithmic contexts.</p><p>Maintaining Separation for Coresets. It remains to show that we can build a data structure which always constructs coresets while guaranteeing a separation between queries and dataset points. The idea is to proceed via ball carving with randomly chosen radii. Suppose, for instance, that all dataset points P lie within a ball of radius R &gt; 0, and let c denote the center of that ball. We consider three cases, corresponding to where the (unknown) query q may lie in comparison to the ball. The first two cases are relatively simple to handle; most of the work involves the last case.</p><p>Case 1 (easy): q is much farther than R/ε from c. Then, since any p lies within distance R of c, one may use the triangle inequality to conclude that the kernel evaluations K(q, p) and K(q, c) agree up to 1 ± ε. In this case, a data structure need only remember the center c and the size of the dataset |P|, so that it can output |P| · K(q, c) to approximate K(P, q). See Definition 4.1 and Claim 4.1.</p><p>Case 2 (relatively easy): q is farther than 3R from c but within R/ε of it. In this case, we can utilize the coreset construction. 
Indeed, we "guess" the distance from q to c (for which there are at most O(log(1/ε)) many choices). Let R′ ≥ 3R be a guessed distance such that ‖q − c‖₂ ≈ 3R′. Pick a point c′ drawn uniformly at random from B₂(c, R′/2) and use it as our new "center." Every point in P lies within a ball of radius R′/2 + R ≤ 3R′/2 around c′ (by the triangle inequality), and at distance at least R′/100 from c′ (with high probability for d = ω(log n)). In addition, the query is at distance at least 5R′/2 from c′. Setting r_min = R′/100, r_in = 3R′/2, and r_out = 5R′/2, we can always guarantee an upper bound of O(1/(R′)²) on the γ₂-norm. In addition, every kernel value K(q, p) is comparably large, so the repeated halving technique of <ref type="bibr">[PT20]</ref> gives the desired coreset. This is handled in Section 4.3.</p><p>Case 3 (the difficult case): q is within 3R of c; see Fig. <ref type="figure">2</ref> for an illustration. Here, we want to partition the space such that, as in the fast multipole method, we can guarantee some amount of separation, but without the exponential dependence incurred by partitioning into cubes (as is usually done in fast multipole methods). Consider a point p from P and a query q such that ‖p − q‖₂ = Ω(R). We will consider the following randomized partition. First, we sample a random point c′ within distance O(R) of c, and then we sample a (random) radius r on the order of R. The hope is that p falls inside the ball around c′ of radius r, and that q falls outside the ball. In order to apply the improved discrepancy bound for well-separated datasets, we must also fit a shell of inner radius r_in and outer radius r_out, where r_out − r_in = Ω(R/√d). 
We point the reader to Lemma 4.1, where we show that this partitioning procedure separates each p ∈ P with ‖p − q‖₂ = Ω(R) with probability at least Ω(1/√d); hence, after repeating O(√d log n) times, we are guaranteed to have separated every dataset point p at distance Ω(R) from q.</p><p>In the data structure, we sample a ball and consider the points inside the inner shell and the points outside the outer shell. We construct a coreset for the points inside the inner shell, and we use these coresets to evaluate queries which land outside the outer shell (to guarantee separation). Whenever a query lands outside the outer shell, the points inside the inner shell are "captured," and the coreset approximates their contribution. For these queries, it remains to recurse on the dataset points which were not captured by the coreset. Similarly, we construct a coreset of the dataset points outside the outer shell in order to evaluate queries which fall inside the inner shell (again, to guarantee separation), and for such queries we recurse on the dataset points which were not captured. This recursive partitioning scheme is carried out in Algorithm 4.2.</p><p>Note that a small technical issue arises, since the points which fall between radius r_in and radius r_out from c′ are replicated. While this may naively blow up the space of the data structure, the probability that any dataset point falls inside this shell can be made small. This is done by decreasing r_out − r_in to αR/√d, at the cost of an increase in the coreset size.</p><p>Organization. In the rest of the paper we first present the formal construction and analysis of our improved coresets for well-separated points in Section 3. 
We then present the details of our reduction to the case of well-separated datasets, and the final algorithm, in Section 4.</p><p>3 Structural Result: Improved Coresets for Well-Separated Shells. Given a p.d. kernel K, the "kernel trick" refers to the fact that there exists a feature embedding φ mapping ℝ^d into a much larger, possibly infinite-dimensional space ℝ^{d′} where K(x, y) = ⟨φ(x), φ(y)⟩ for all x, y ∈ ℝ^d. As <ref type="bibr">[PT20]</ref> show, whenever K(x, x) = 1 for all x, the existence of such a feature embedding φ, together with discrepancy minimization algorithms, is useful for constructing coresets for kernel evaluation.</p><p>The approach proceeds as follows: one receives a dataset P ⊆ ℝ^d and wants to support kernel evaluation queries for a finite set of queries Q ⊆ ℝ^d.⁸ Consider the |Q| × |P| kernel matrix A = (a_qp) whose (q, p)-entry is K(q, p). Notice that A · 1 ∈ ℝ^{|Q|} (where 1 ∈ ℝ^{|P|} is the all-ones vector) is the vector of kernel evaluations at each of the queries in Q. If one finds a vector σ ∈ {−1, 1}^{|P|} with ‖A · σ‖_∞ ≤ α, then, partitioning P into P⁺ and P⁻ according to whether p ∈ P has σ_p = 1 (in which case it belongs to P⁺, and to P⁻ otherwise), at least one of the subsets has size at most |P|/2, and for any q ∈ Q one may approximate the kernel evaluation by considering only that subset.</p><p>⁸ <ref type="bibr">[PT20]</ref> show how to discretize Q so that one only needs to consider exp(d)-sized query sets, but since we allow queries to fail with a small probability, the discretization will not be an issue in this work.</p><p>Formally, the guarantee holds for S being either of the sets P⁺ and P⁻. In essence, one decreases the size of the dataset by a factor of 2 and incurs an additive error of α on the kernel evaluation. One can bound α using discrepancy theory. Definition 3.1. 
(γ₂-norm of a matrix) For a matrix A ∈ ℝ^{m×n}, the γ₂-norm of A is given by γ₂(A) = min_{U,V : A = UV} (maximum row norm of U) · (maximum column norm of V).</p><p>The final ingredient is showing that γ₂(A) ≤ 1, which is a simple consequence of the existence of φ. In particular, consider the |Q| × d′ matrix U whose rows correspond to φ(q) for q ∈ Q, and the d′ × |P| matrix V whose columns correspond to φ(p) for p ∈ P. Then A = UV since φ is a feature embedding of K(·, ·), and the maximum row norm of U and column norm of V is at most 1, because ‖φ(q)‖₂² = K(q, q) = 1 and similarly for p. In this section, we do two things.</p><p>1. First, we explicitly construct new feature embeddings for p.d. radial kernels K : ℝ^d × ℝ^d → [0, 1]. For parameters 0 &lt; r_in &lt; r_out, as well as an additive error ξ &gt; 0 (think of ξ as extremely small, since we will depend only logarithmically on 1/ξ), our new feature embedding φ will only map vectors z ∈ ℝ^d which satisfy ‖z‖₂ ∉ [r_in, r_out], i.e., which avoid the shell of inner radius r_in and outer radius r_out around the origin. Whenever ‖y‖₂ ≤ r_in and ‖x‖₂ ≥ r_out, ⟨φ(x), φ(y)⟩ will agree with K(x, y) up to ±ξ, yet ‖φ(x)‖₂ · ‖φ(y)‖₂ will be smaller than 1.</p><p>Then, we use the technique of <ref type="bibr">[PT20]</ref> to repeatedly halve the dataset and construct the coreset whenever a query and the dataset are separated by a shell of inner radius r_in and outer radius r_out, and in addition lie between distance r_min and r_max from the origin. One (minor) difference is that we utilize the self-balancing walk of [ALS21] instead of algorithmic versions of Banaszczyk's theorem. This loses a logarithmic factor in the size of the coreset, but has the benefit of being very simple.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Second, the coreset construction via repeated halving</head></div>
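The factorization certificate behind the γ₂(A) ≤ 1 bound can be illustrated directly: any factorization A = UV certifies that γ₂(A) is at most the maximum row norm of U times the maximum column norm of V. A minimal sketch with toy unit-norm feature vectors (our own illustration; these random vectors stand in for the paper's embeddings φ(q), φ(p)):

```python
import math, random

random.seed(0)

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

D = 8
# toy unit-norm feature vectors: rows of U play phi(q), columns of V play phi(p)
U = [unit([random.gauss(0, 1) for _ in range(D)]) for _ in range(5)]
V = [unit([random.gauss(0, 1) for _ in range(D)]) for _ in range(7)]

# "kernel matrix" A = U V with entries given by the inner products
A = [[sum(uq[k] * vp[k] for k in range(D)) for vp in V] for uq in U]

max_row = max(math.sqrt(sum(x * x for x in uq)) for uq in U)
max_col = max(math.sqrt(sum(x * x for x in vp)) for vp in V)

# the factorization certifies gamma_2(A) is at most max_row * max_col = 1
assert 1e-9 > abs(max_row * max_col - 1.0)
# consequently every entry, an inner product of unit vectors, is bounded by 1
assert all(1.0 + 1e-9 > abs(a) for row in A for a in row)
```

The improved embeddings of Theorem 3.2 shrink exactly the quantity max_row · max_col for well-separated inputs.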
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">New Feature Embeddings</head><p>In what follows, for any two thresholds 0 &lt; r_in &lt; r_out, we let Shell(r_in, r_out) ⊆ ℝ^d be the set of vectors x ∈ ℝ^d with r_in &lt; ‖x‖₂ &lt; r_out. The main theorem of this section is the following.</p><p>Theorem 3.2. Let G : ℝ≥0 → [0, 1] be such that K(x, y) = G(‖x − y‖₂²) is a p.d. kernel for every dimension d. For any two thresholds 0 &lt; r_in &lt; r_out and ξ &gt; 0, there exists a map φ : ℝ^d \ Shell(r_in, r_out) → L₂ such that all x, y ∈ ℝ^d with 0 &lt; ‖y‖₂ ≤ r_in &lt; r_out ≤ ‖x‖₂ satisfy |⟨φ(x), φ(y)⟩ − K(x, y)| ≤ ξ, while the product ‖φ(x)‖₂ · ‖φ(y)‖₂ is strictly smaller than 1.</p><p>We will prove Theorem 3.2 in the next subsection, but we note below that it directly implies an improvement on the γ₂-norm of kernel matrices of p.d. kernels (up to a small additive error ξ). In particular, Theorem 3.2 implies that for any sets P, Q ⊆ ℝ^d where P ⊆ B₂(0, r_in) \ B₂(0, r_min) and Q ⊆ ℝ^d \ B₂(0, r_out) (or vice versa), there exists a |Q| × |P| matrix Ã which is entry-wise ξ-close to the |Q| × |P| kernel matrix (K(q, p))_{q∈Q, p∈P}, and γ₂(Ã) is bounded by the product ‖φ(x)‖₂ · ‖φ(y)‖₂ guaranteed by Theorem 3.2.</p><p>Remark 3.1. (Using Smoothness to Relate γ₂(Ã) to K(P, q)/|P|) Here one can see where smoothness of the kernel becomes essential. As per (2.7), one can decrease the size of P while the error incurred from discrepancy, which is given by γ₂(Ã)/|P|, is smaller than ε · K(P, q)/|P|. Note that when p ∈ P and q ∈ Q are always at distance at most 2r_max, we know K(p, q) ≥ G(4r_max²). Thus, we can continue decreasing the dataset while (1/|P|) · γ₂(Ã) is significantly smaller than ε · G(4r_max²). The smoothness comes in when computing the ratio of G(·)'s, since smoothness allows us to relate G(r₁²)/G(r₂²) in terms of r₂/r₁. 
In particular, this is why extending our approach to kernels with faster decay requires interesting new ideas: for the Gaussian kernel, for example, the right-hand side of the bound above is significantly larger than K(P, q)/|P|, as the Θ(1/√d) factor stemming from the separation that our partitioning scheme ensures affects the exponent.</p><p>3.2 Proof of Theorem 3.2. We construct the map φ explicitly. In order to do so, we first recall Schoenberg's characterization of p.d. radial kernels, as well as the Hausdorff–Bernstein–Widder theorem, which we use in constructing φ. We point to Chapter 7 of <ref type="bibr">[Wen04]</ref> for an extensive treatment of these topics. Schoenberg's characterization states that K(x, y) = G(‖x − y‖₂²) is p.d. for every dimension d if and only if the function G is completely monotone on ℝ≥0, i.e., G ∈ C^∞(ℝ≥0) and (−1)^ℓ G^(ℓ)(t) ≥ 0 for all ℓ ∈ {0} ∪ ℕ and t ≥ 0.</p><p>With this in hand, we show how to map each point φ(x) and φ(y) when ‖x‖₂ ≥ r_out and ‖y‖₂ ≤ r_in. First, we let u_x, v_y : [0, t₀] × ({0} ∪ ℕ) → ℝ^∗ (where ℝ^∗ consists of the union of all finite-length tuples ∪_{j≥1} ℝ^j) be functions obtained by expanding the Gaussian e^{−t‖x−y‖₂²} into tensor powers of x and y, weighted by ρ and 1/ρ, respectively. The map φ(x) takes the collection of functions {u_x(·, k) : [0, t₀] → ℝ^{d_k} : k ≥ 1} and "concatenates" them. In particular, we let H denote the Hilbert space of functions g : [0, t₀] × ({0} ∪ ℕ) → ℝ^∗, with the inner product integrating against the measure µ. Then, φ(x) and φ(y) are the functions where φ(x)(t, k, i) = u_x(t, k)_i and φ(y)(t, k, i) = v_y(t, k)_i. We note that the novelty in the above definitions is introducing the ρ and 1/ρ factors in u_x and v_y. In the absence of the ρ- and 1/ρ-factors, the proof we give would recover feature embeddings of unit norm. By introducing these factors, we are able to exploit the fact that x and y have different norms. In particular, these terms cancel out when computing ⟨u_x, v_y⟩, but affect the norms ‖u_x‖₂² and ‖v_y‖₂². 
For instance, we have (3.12) ⟨φ(x), φ(y)⟩ = ∫_0^{t₀} e^{−t‖x−y‖₂²} µ(dt). By Theorem 3.4 and the cutoff at t₀, we have |⟨φ(x), φ(y)⟩ − K(x, y)| ≤ ξ, where we used the fact that ‖x − y‖₂ ≥ r_out − r_in by the triangle inequality, together with the tail bound on the truncated integral. Since G(·) is decreasing (because G is completely monotone by assumption, so the derivative of G is always non-positive) and 1/ρ &lt; 1, we obtain the desired upper bound on ‖φ(x)‖₂², and the final upper bound on ‖φ(y)‖₂² follows analogously.</p><p>Throughout the rest of this section, suppose the kernel K(x, y) = G(‖x − y‖₂²) is p.d. for every dimension d and, in addition, is (L, t)-smooth. We will need one preliminary theorem, which follows from the (online) discrepancy minimization algorithm of <ref type="bibr">[ALS21]</ref>. Then, we state and prove Lemma 3.1, which is the main lemma of this section.</p><p>Theorem 3.5. (Self-Balancing Walk <ref type="bibr">[ALS21]</ref>) For any n, d ∈ ℕ, there exists a randomized algorithm which receives as input a set of vectors V = {v₁, . . . , v_n} ⊆ ℝ^d and a parameter δ &gt; 0. The algorithm outputs a (random) subset V′ ⊆ V such that, for any vector u ∈ ℝ^d, with probability at least 1 − δ, |⟨u, Σ_{v∈V′} 2v − Σ_{v∈V} v⟩| ≤ 240π log(n/δ) · ‖u‖₂ · max_i ‖v_i‖₂. Furthermore, the algorithm does not require explicit access to V; it only requires oracle access to the inner products {⟨v_i, v_j⟩}_{i,j∈[n]}.</p><p>Proof. The statement of Theorem 3.5 above does not explicitly appear in <ref type="bibr">[ALS21]</ref>, but readily follows from the proof of Theorem 1.1 in their paper. In particular, <ref type="bibr">[ALS21]</ref> give a randomized algorithm, Balance, which receives as input a sequence of vectors v₁, . . . , v_n ∈ ℝ^d of norm at most 1, and a failure probability δ.⁹ The algorithm produces a sequence of (random) vectors w₀ = 0, w₁, . . . 
, w_n ∈ ℝ^d such that, for any vector u ∈ ℝ^d with ‖u‖₂ ≤ 1,¹⁰ we have |⟨w_n, u⟩| ≤ 240π log(n/δ) with probability at least 1 − δ. In addition, as long as |⟨w_i, v_{i+1}⟩| ≤ c = 30 log(n/δ) for all i ∈ [n], there is a setting of signs σ ∈ {−1, 1}^n such that every i ∈ [n] satisfies w_i = w_{i−1} + σ_i v_i. As in their proof of Theorem 1.1, one may take a union bound over the n steps and conclude that |⟨w_n, u⟩| ≤ c except with probability at most δ. Theorem 3.5 as stated above simply performs this argument with u′ = u/‖u‖₂ and v′_i = v_i / max_i ‖v_i‖₂ to obtain the signs σ ∈ {−1, 1}^n, and sets V′ to be those indices i ∈ [n] where σ_i = 1.</p><p>The main result of this section is Lemma 3.1. For L, t ≥ 1, let K : ℝ^d × ℝ^d → [0, 1] be a p.d. radial kernel which is (L, t)-smooth. There exists a randomized algorithm which receives as input a subset X ⊆ ℝ^d, four thresholds 0 &lt; r_min ≤ r_in &lt; r_out ≤ r_max, and three parameters ε, ξ, δ ∈ (0, 1). The algorithm outputs a random subset S ⊆ X and a number T ≥ 0 which satisfy the following conditions.</p><p>• For any q ∈ ℝ^d, if ‖q‖₂ ∈ [r_out, r_max] and ‖x‖₂ ∈ [r_min, r_in] for every x ∈ X, or ‖q‖₂ ∈ [r_min, r_in] and ‖x‖₂ ∈ [r_out, r_max] for every x ∈ X, then with probability at least 1 − δ the set S (rescaled by 2^T) approximates the kernel evaluation K(X, q) up to the stated error.</p><p>• The size of the subset S is at most |X|/2^T.</p><p>Proof. Let φ : ℝ^d \ Shell(r_min, r_out) → L₂ be the map of Theorem 3.2 instantiated with the parameter ξ. The algorithm proceeds in iterations, going through ℓ ∈ {0, . . . , T} and specifying sets V_ℓ, initialized with V₀ = X.</p><p>9</p><p>We consider the minor modification to their algorithm where the second condition of Line 4, that ‖w_{i−1}‖_∞ ≤ c, is dropped; this allows us to set c = 30 log(n/δ), as they do in their proof of Theorem 1.2. 
We note that the reason the check ‖w_{i−1}‖_∞ ≤ c is present in their algorithm is that, in online discrepancy, one wants to ensure that every coordinate of every partial sum of vectors is small, whereas we only care about the final vector.</p></div>
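A simplified sketch of the self-balancing walk underlying Theorem 3.5 (our own illustration, with arbitrary toy parameters: the coordinate-wise clipping check discussed above is omitted, and the constant c is only indicative):

```python
import math, random

random.seed(1)

def balance(vectors, c):
    # Biased sign choices keep the running sum w short: the sign of each v is
    # drawn with probability shifted against the current inner product <w, v>/c.
    d = len(vectors[0])
    w = [0.0] * d
    chosen = []
    for i, v in enumerate(vectors):
        inner = sum(w[k] * v[k] for k in range(d))
        p_plus = 0.5 * (1.0 - max(-1.0, min(1.0, inner / c)))
        sign = 1 if p_plus > random.random() else -1
        for k in range(d):
            w[k] += sign * v[k]
        if sign == 1:
            chosen.append(i)
    return chosen, w

n, d = 10000, 5
vecs = []
for _ in range(n):
    v = [random.gauss(0, 1) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    vecs.append([x / norm for x in v])

c = 30 * math.log(n)
chosen, w = balance(vecs, c)
# the signed sum stays short (roughly sqrt(c * d)), far below the sqrt(n)
# scale one would expect from uniformly random signs
assert math.sqrt(n) > math.sqrt(sum(x * x for x in w))
assert n > len(chosen) > 0
```

Note that the walk only ever touches the vectors through inner products, which is the oracle-access property used in the proof of Lemma 3.1.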
<div xmlns="http://www.tei-c.org/ns/1.0"><head>10</head><p>The specific constant of 240π log(n/δ) follows their bound of 4Lc, since L = 2π and c = 30 log(n/δ).</p><p>• For ℓ ∈ [T], the set V_ℓ is set to the smaller of V′_{ℓ−1} and V_{ℓ−1} \ V′_{ℓ−1}, where V′_{ℓ−1} is the output of the algorithm from Theorem 3.5 with input V_{ℓ−1} and failure probability set to δ/T.¹¹ By definition of the procedure above, it is simple to see that |V_T| ≤ |X|/2^T. In order to show the approximation bound, we expand the kernel evaluation over the successive sets V_ℓ and apply, at each step of the halving, the upper bound of Theorem 3.5 together with the bound on ‖φ(x)‖₂ · ‖φ(q)‖₂ from Theorem 3.2. Furthermore, we notice that ‖x − q‖₂ ≤ 2r_max for every x ∈ X, which means K(x, q) ≥ G(4r_max²). Using the fact that K is (L, t)-smooth, we may always upper bound (3.14) by a quantity which is smaller than ε for the setting of T in the statement, and this results in a set S of size at most |X|/2^T.</p><p>The above lemma readily implies the following lemma, which we state so that we can directly invoke it in the subsequent sections. Specifically, Lemma 3.1 produces a coreset S, and the data structure stores the coreset points, the parameter T, and a translation vector c ∈ ℝ^d.</p><p>Lemma 3.2. Let n, d ∈ ℕ, and suppose that for L, t ≥ 1, K : ℝ^d × ℝ^d → [0, 1] is a p.d. radial kernel which is (L, t)-smooth. There exist two randomized algorithms with the following guarantees: ¹¹ Even though the vectors φ(x) are infinite-dimensional, there are only finitely many of them. In addition, Theorem 3.5 has no dependence on the dimensionality d. Thus, it suffices to work implicitly with the subspace spanned by {φ(x) : x ∈ X} and φ(q).</p><p>• ProcessCaptured(X, c, r_min, r_in, r_out, r_max, ε, ξ, δ) receives as input a set X ⊆ ℝ^d of size at most n, a point c ∈ ℝ^d, thresholds r_min &lt; r_in &lt; r_out ≤ r_max ∈ ℝ≥0, error parameters ε, ξ ∈ (0, 1), and failure probability δ ∈ (0, 1). 
We are promised that one of the following two conditions holds: every x ∈ X satisfies ‖x − c‖₂ ∈ [r_min, r_in] (3.15), or every x ∈ X satisfies ‖x − c‖₂ ∈ [r_out, r_max] (3.16). The algorithm outputs a pointer to a data structure v.</p><p>• QueryCaptured(q, v, c, r_min, r_in, r_out, r_max, ε, ξ, δ) receives as input a query q ∈ ℝ^d, a pointer to a data structure v, a point c ∈ ℝ^d, thresholds r_min &lt; r_in &lt; r_out &lt; r_max ∈ ℝ≥0, error parameters ε, ξ ∈ (0, 1), and failure probability δ ∈ (0, 1). The algorithm outputs a value η ∈ ℝ≥0.</p><p>For any query q ∈ ℝ^d, if ‖q − c‖₂ ∈ [r_out, r_max] and (3.15) holds, or ‖q − c‖₂ ∈ [r_min, r_in] and (3.16) holds, the following occurs with probability at least 1 − δ over the randomness in the algorithm. We execute ProcessCaptured(X, c, r_min, r_in, r_out, r_max, ε, ξ, δ), let v denote the pointer to the data structure it outputs, and let η be the output of QueryCaptured(q, v, c, r_min, r_in, r_out, r_max, ε, ξ, δ). Then:</p><p>• Correctness: The estimate η ∈ ℝ≥0 that we output approximates the kernel evaluation K(X, q) up to a 1 ± ε factor.</p><p>• Time and Space Complexity: The algorithm ProcessCaptured(X, c, r_min, r_in, r_out, r_max, ε, ξ, δ) takes at most poly(nd) time to output a data structure v.¹² The total space of v, as well as the running time of QueryCaptured(q, v, c, r_min, r_in, r_out, r_max, ε, ξ, δ), is, up to a constant factor, at most the coreset-size bound of Lemma 3.1.</p><p>Remark 3.2. Lemma 3.2 assumes black-box access to dot products in the embedded space. This can typically be achieved by obtaining an analytic expression for the measure µ and integrating as per (3.12). For example, if K(x, y) = 1/(1 + ‖x − y‖₂²), then G(t) = 1/(1 + t) in Theorem 3.4 and one has µ(dt) = e^{−t} dt. Then ⟨φ(x), φ(y)⟩ can be evaluated per (3.12) at polylogarithmic cost.</p><p>4 Algorithmic Result: Data Structure for Evaluating the Coresets. 4.1 A Ball Carving Hash. In this section, we present a ball carving hash function. 
We let d ∈ ℕ be the dimensionality of the space (which we view as a non-constant parameter), and we consider a dataset P ⊆ ℝ^d of n points. The goal is to provide a randomized partition of the dataset such that, whenever we use the coreset data structure of [PT20], we do so for points P (and potential queries Q) whose kernel matrix has a smaller γ₂-norm. Hence, this section does not concern the specific kernel K; it simply constructs a ball-carving hash function.</p><p>Lemma 4.1. There exist absolute constants c₁, c₂ &gt; 0 such that, for any α ∈ (0, c₂) and R &gt; 0, the following holds. There exists a distribution D supported on pairs (c, r) ∈ ℝ^d × ℝ≥0 which specify a function h_{c,r} : ℝ^d → {0, 1, ∗} given by h_{c,r}(x) = 1 if ‖x − c‖₂ &lt; r − αR/√d, h_{c,r}(x) = 0 if ‖x − c‖₂ &gt; r + αR/√d, and h_{c,r}(x) = ∗ otherwise.</p><p>The distribution satisfies the three guarantees:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>12</head><p>It is important here that the algorithm can efficiently compute ⟨φ(x), φ(y)⟩ for any x, y ∈ X for the feature embeddings of Theorem 3.2. See Remark 3.2 for more details.</p><p>• Separate Far Points: For any two points x, y ∈ B₂(0, R) with ‖x − y‖₂ ≥ R/100, the probability over (c, r) ∼ D that h_{c,r}(x), h_{c,r}(y) ∈ {0, 1} and h_{c,r}(x) ≠ h_{c,r}(y) is at least c₁/√d.</p><p>• Avoid Boundary: For any point x ∈ B₂(0, R), the probability over (c, r) ∼ D that h_{c,r}(x) = ∗ is at most α/√d.</p><p>• Far From Center: For any point x ∈ ℝ^d, with probability at least 1 − 2^{−Ω(d)} over the draw of (c, r) ∼ D, we have ‖x − c‖₂ ≥ R/100 and ‖c‖₂ ≤ 2R.</p><p>Proof. The distribution D samples a point c and a threshold r by letting c ∼ N(0, R² · I_d/d) and r ∼ U[0, 3R]. The third item is the simplest: (i) by anti-concentration of N(0, R² · I_d/d), the probability mass on any ball of radius R/100 is at most 2^{−Ω(d)}, and (ii) by concentration, ‖c‖₂ ≤ 2R with probability at least 1 − 2^{−Ω(d)}.</p><p>The second item is also simple to argue: for any fixed c ∈ ℝ^d, h_{c,r}(x) = ∗ only when |r − ‖x − c‖₂| ≤ αR/√d. Therefore, the probability over the draw of r ∼ U[0, 3R] that this occurs is at most α/√d. We now argue the first item, that h_{c,r} tends to separate far points. Consider a fixed setting of p, q ∈ B₂(0, R), and for a sample (c, r) ∼ D, let Δ := |‖p − c‖₂ − ‖q − c‖₂|.</p><p>Then, the event h_{c,r}(p) ≠ h_{c,r}(q) occurs whenever the following two events hold:</p><p>• Event A: For γ = 1/2, we have p, q ∈ B₂(c, (2 + γ)R), and</p><p>• Event B: The threshold r lies within a specific sub-interval of [0, 3R] of length Δ − 2αR/√d.</p><p>Notice that Event A holds with high probability, because we may apply the triangle inequality and the fact that p, q ∈ B₂(0, R). 
If Event A fails, then the center point c must have Euclidean norm larger than (1 + γ)R, so Pr[Event A fails] ≤ e^{−dγ²/8}, using a standard concentration inequality for χ²(d) random variables (see Example 2.5 in [Wai19]). Therefore, we have Pr[Events A and B hold] ≥ Pr[Event B holds] − Pr[Event A fails] ≥ Pr[Event B holds] − e^{−dγ²/8}. (4.18) So it remains to lower bound the probability over (c, r) ∼ D that h_{c,r}(p) ≠ h_{c,r}(q), which is at least E[Δ − 2αR/√d]/(3R). (4.19) We may lower bound the expectation of Δ by considering whether or not Event A holds. In particular, for any setting of the randomness, we may always lower bound (4.20) Δ ≥ |2⟨p − q, c⟩ + ‖p‖₂² − ‖q‖₂²| / (2(2 + γ)R) − 1{Event A fails} · (R + ‖c‖₂). The reason is the following: if Event A holds, then ‖p − c‖₂ + ‖q − c‖₂ ≤ 2(2 + γ)R and we obtain the desired lower bound; if Event A fails, then, since ‖p‖₂, ‖q‖₂ ≤ R, the difference of the distances is at least −(R + ‖c‖₂), so subtracting 1{Event A fails} · (R + ‖c‖₂) ensures that the right-hand side of (4.20) is negative. Hence, we may take expectations in (4.20) and bound each term. For the first term, by 2-stability of the Gaussian distribution, 2⟨p − q, c⟩ is distributed as N(0, 4‖p − q‖₂² · R²/d), and anti-concentration of the Gaussian distribution lower bounds its expected magnitude by Ω(‖p − q‖₂ · R/√d). Recall that we have already upper bounded the probability that Event A fails, and, finally, we can similarly bound the last expectation, E[1{Event A fails} · (R + ‖c‖₂)], using the fact that γ = 1/2. Substituting (4.19) into (4.18), we obtain the claimed Ω(1/√d) separation probability, since ‖p − q‖₂ ≥ R/100, α is a small enough constant, and d is at least a large enough constant.</p></div>
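The sampling procedure in the proof is easy to simulate. The following sketch (our own, with arbitrary toy parameters) draws (c, r) ∼ D as above and empirically checks that a pair at distance R is separated with probability well above the Ω(1/√d) guarantee of Lemma 4.1:

```python
import math, random

random.seed(2)

def sample_ball(d, R):
    # (c, r) ~ D: Gaussian center c ~ N(0, R^2 I_d / d), threshold r ~ U[0, 3R]
    c = [random.gauss(0, R / math.sqrt(d)) for _ in range(d)]
    return c, random.uniform(0, 3 * R)

def h(x, c, r, R, alpha, d):
    dist = math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))
    if alpha * R / math.sqrt(d) >= abs(dist - r):
        return "*"                      # too close to the boundary
    return 0 if dist > r else 1         # outside / inside the carved ball

d, R, alpha = 64, 1.0, 0.1
p = [0.0] * d
q = [R / math.sqrt(d)] * d              # ||p - q|| = R: a "far" pair
trials, separated = 4000, 0
for _ in range(trials):
    c, r = sample_ball(d, R)
    hp, hq = h(p, c, r, R, alpha, d), h(q, c, r, R, alpha, d)
    if hp != hq and "*" != hp and "*" != hq:
        separated += 1
# Lemma 4.1 promises separation with probability Omega(1/sqrt(d))
assert separated / trials > 0.1 / math.sqrt(d)
```

Shrinking alpha makes the "∗" boundary region thinner, which is exactly the trade-off used to control point replication in Section 4.2.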
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Data Structure</head><p>In this section, we use Lemma 4.1 to preprocess the dataset by recursive hashing. We consider a dataset P ⊆ ℝ^d of n points, an unknown query q ∈ ℝ^d, and a parameter Φ &gt; 1 which denotes the maximum aspect ratio of P ∪ {q}, i.e., Φ = max_{x,y∈P∪{q}} ‖x − y‖₂ / min_{x≠y∈P∪{q}} ‖x − y‖₂.</p><p>We assume that the dimensionality satisfies ω(log n log log Φ) ≤ d; the assumption is without loss of generality, as we may append coordinates which are set to 0.</p><p>Theorem 4.1. There are two randomized algorithms, Preprocess and Query, with the following interface. • Preprocess(P, ε, ξ) receives as input a dataset P ⊆ ℝ^d of at most n points, and two error parameters ε, ξ ∈ (0, 1). The algorithm outputs a pointer to a data structure u.</p><p>• Query(q, u, ε, ξ) receives as input a query q ∈ ℝ^d, a pointer to a data structure u generated by Preprocess, and two error parameters ε, ξ ∈ (0, 1).</p><p>Suppose that P ∪ {q} has aspect ratio at most Φ; then we satisfy the following guarantees with probability at least 0.9.</p><p>• Correctness: If u is the output of Preprocess(P, ε, ξ) and η ∈ ℝ≥0 is the output of Query(q, u, ε, ξ), then (1 − ε) · K(P, q) ≤ η ≤ (1 + ε) · K(P, q).</p><p>• Space Complexity: The total space of a data structure u produced by Preprocess(P, ε, ξ) is at most, up to a constant factor, nd times factors polylogarithmic in dnΦ/ε.</p><p>• Query Time: Query(q, u, ε, ξ) takes time at most, up to a constant factor, the sizes of the coresets accessed times factors polylogarithmic in dnΦ/ε.</p><p>• Preprocessing Time: Assuming access to oracles for computing ⟨φ(x), φ(y)⟩ for x, y ∈ ℝ^d and φ constructed from Theorem 3.2, the algorithm Preprocess(P, ε, ξ) runs in time poly(nd) · log Φ · ln(1/ξ)/ε.</p><p>Remark 4.1. 
(Simplifications of Notation and Error Probability) In the descriptions of the algorithms Preprocess(P, ε, ξ) and Query(q, u, ε, ξ), we make multiple calls to ProcessCaptured and QueryCaptured with various parameter settings, and we adopt the following simplifications. First, we drop the dependence on ε and ξ in the description below, as these remain constant throughout the algorithm. Second, we set, at the outset, the error probability δ in calls to ProcessCaptured and QueryCaptured to 1/poly(dnΦ/ε). Thus, in the presentation below, we (i) simplify the notation by dropping the dependence on δ in calls to ProcessCaptured and QueryCaptured, and (ii) essentially assume that ProcessCaptured and QueryCaptured are deterministic algorithms. Indeed, the theorem that we seek to prove, Theorem 4.1, allows for polylog(dnΦ/ε) dependencies, and Lemma 3.2 has poly-logarithmic dependence on 1/δ. Since the preprocessing time of Preprocess(P, ε, ξ) and the query time of Query(q, u, ε, ξ) are at most polynomial in nd log Φ, setting δ ≤ 1/poly(dnΦ) allows one to union bound over all executions of ProcessCaptured and QueryCaptured so that they are all correct with high probability.</p><p>The remainder of this section gives the proof of Theorem 4.1. The preprocessing algorithm that we present below can be thought of as consisting of two main parts, in which we prepare for an (unknown) query q ∈ ℝ^d. The data structure is stored as a binary tree whose nodes hold additional information. Specifically, a call to Preprocess(P) instantiates a node u, and certain attributes are stored in the node; the notation for these is u.attr, for some "attribute" attr. Each non-leaf node u has at most two children, which are stored in u.Child(0) and u.Child(1). 
The formal description appears in Algorithm 4.2.</p><p>High-Level Structure of Algorithm 4.2. The first thing the algorithm does is enclose the dataset P in an (approximately) minimum enclosing ball around a center point u.cen of radius u.rad. This is done in Line 3, and can be done in O(nd) time, since it suffices to obtain a constant-factor approximation (for instance, picking an arbitrary point p ∈ P and letting u.cen = p and u.rad = max_{p′∈P} ‖p − p′‖₂ suffices). If Line 5 is triggered, then there is at most one distinct point (which is stored as u.cen), and since the algorithm stores |P| in u.size, it can compute the kernel contribution of P. The parameter u.R is set to 2 · u.rad, and the algorithm prepares for two cases: when the (unknown) query q ∈ ℝ^d is "close" to u.cen (within distance at most u.R), and when the query q ∈ ℝ^d is "far" from u.cen.</p><p>• Query Close: In this case, we prepare for a query q satisfying ‖q − u.cen‖₂ ≤ u.R, so that we may instantiate Lemma 4.1 with u.cen playing the role of the origin and u.R the role of R. We can consider the hash function as dividing ℝ^d into three parts: (i) within distance u.r_in of u.newcen (with high probability, the distance is also at least u.r_min), (ii) within the shell around u.newcen of inner radius u.r_in and outer radius u.r_out, and (iii) outside the ball around u.newcen of radius u.r_out (where we also have an upper bound of u.r_max on the distance to u.newcen). We build two coresets with ProcessCaptured for regions (i) and (iii). Then, we recursively preprocess the points which are not captured by the coresets and store these data structures as the children, u.Child(0) and u.Child(1).</p><p>• Query Far: In this case, we prepare for a query q which satisfies ‖q − u.cen‖₂ &gt; u.R. 
Since every point p ∈ P lies within distance u.rad of u.cen and u.R = 2 · u.rad, we can already guarantee a separation of at least u.rad between q and any p ∈ P. Formally, we handle this in Line 15 with the subroutine PreprocessFar (which we specify shortly). This case does not recursively call Preprocess.</p><p>As we formally show below, recursively applying the ball-carving hash function results in calls to Preprocess with datasets of decreasing radii. This is because Lemma 4.1 guarantees that far points are separated with at least some probability. Eventually the radius becomes 0 and Line 5 ends the recursion.</p><p>Algorithm 1 Preprocessing of a dataset P ⊆ ℝ^d into a data structure.</p><p>1: procedure Preprocess(P )</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>2:</head><p>Initialize a node u.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>3:</head><p>Let MEB(P) ⊆ ℝ^d denote an (approximately) minimum enclosing ball of P.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>9:</head><p>We define the following subsets of P (which do not necessarily partition P): Run ProcessCaptured(Cap(b), u.newcen, u.r_min, u.r_in, u.r_out, u.r_max).</p><p>12:</p><p>Store the output in u.InnerBall(b).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>13:</head><p>if P \ Cap(b) ≠ ∅ then 14:</p><p>Execute Preprocess(P \ Cap(b)) and store the output in u.Child(b).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>15:</head><p>Run PreprocessFar(P, u.cen, u.rad, u.R) and store the output in u.FarDS.</p><p>16: return u.</p><p>Algorithm 2 Querying a data structure rooted at u with a point q ∈ ℝ^d.</p><p>1: procedure Query(q, u)</p><p>We evaluate the hash function h_{u.c,u.r}(q − u.cen), and set b ∈ {0, 1, ∗} to be its output.</p><p>4: if b = ∗ then 5:</p><p>return "fail."</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>6:</head><p>Execute QueryCaptured(q, u.InnerBall(b), u.newcen, u.r_min, u.r_in, u.r_out, u.r_max).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>7:</head><p>Execute Query(q, u.Child(b)).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>8:</head><p>return the sum of the outputs of Line 6 and Line 7.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>9:</head><p>if ‖q − u.cen‖₂ &gt; u.R then 10:</p><p>return QueryFar(q, u.FarDS, u.cen, u.R).</p></div>
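<div xmlns="http://www.tei-c.org/ns/1.0"><p>The constant-factor minimum enclosing ball used in Line 3 of Algorithm 1 admits a very short implementation. The following sketch is ours, not the authors' code (the name approx_meb is hypothetical); it implements the 2-approximation described above in O(nd) time:</p>

```python
import math

def approx_meb(P):
    """Constant-factor (2-approximate) minimum enclosing ball.

    Picks an arbitrary point of P as the center u.cen and sets
    u.rad = max_{p' in P} ||u.cen - p'||_2, in O(nd) time.
    """
    cen = P[0]                               # an arbitrary p in P
    rad = max(math.dist(cen, p) for p in P)  # farthest distance from cen
    return cen, rad
```

<p>Every point of P lies within distance rad of cen by construction, and if the optimal enclosing ball has radius r*, then cen and the farthest point both lie in that ball, so the triangle inequality gives rad ≤ 2r*.</p></div>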
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">The PreprocessFar and QueryFar Algorithms</head><p>In this section, we give the descriptions of PreprocessFar (Algorithm 3) and QueryFar (Algorithm 4). These are meant to capture the cases where the query lies substantially far from the dataset. The approach is straightforward: we partition the (query) space R^d into geometrically increasing balls (up to a certain point), and maintain a coreset for each of these balls. The remaining piece is to bound how many balls we need.</p><p>Definition 4.1. Let K : R^d × R^d → [0, 1] be a p.d. radial kernel. For any r ≥ 0 and ε, ξ ∈ (0, 1), let R(r, ε, ξ) &gt; 0 denote the minimum over all R ≥ 0 such that, for all datasets P ⊆ R^d with P ⊆ B₂(0, r), and for all queries q ∈ R^d with ‖q‖₂ ≥ R, (1 − ε)·K(q, 0) − ξ ≤ K(q, p) ≤ (1 + ε)·K(q, 0) + ξ for every p ∈ P.</p><p>Claim 4.1. We always have R(r, ε, ξ) = O(r ln(1/ξ)/ε).</p><p>Proof. By Theorem 3.4 there exists a non-negative finite Borel measure μ such that (4.21) K(p, q) = ∫₀^∞ e^{−t‖p−q‖₂²} μ(dt) for all p, q ∈ R^d. We consider the parameter R = Cr ln(1/ξ)/ε for a large enough constant C, and show that enforcing ‖q‖₂ ≥ R suffices. Note that R ≥ 2r, so for any p ∈ B₂(0, r) and any query q ∈ R^d with ‖q‖₂ ≥ R, we always have ‖p − q‖₂ ≥ ‖q‖₂ − ‖p‖₂ ≥ ‖q‖₂/2. Consider first the case that t ≥ 0 satisfies t‖q‖₂² ≥ 4 ln(1/ξ). Then, we have (4.22) e^{−t‖p−q‖₂²} ≤ e^{−t‖q‖₂²/4} ≤ ξ.
On the other hand, if t ≥ 0 is such that t‖q‖₂² ≤ 4 ln(1/ξ), then t ≤ 4 ln(1/ξ)/‖q‖₂² ≤ 4ε²/(C²r² ln(1/ξ)) and t‖q‖₂ ≤ 4ε/(Cr), which implies the following two inequalities: e^{−t‖p−q‖₂²}/e^{−t‖q‖₂²} = e^{−t‖p‖₂² + 2t⟨p,q⟩} ≤ e^{2tr‖q‖₂} ≤ e^{8ε/C} ≤ 1 + ε/2, and e^{−t‖p‖₂² + 2t⟨p,q⟩} ≥ e^{−tr² − 2tr‖q‖₂} ≥ e^{−4ε²/(C² ln(1/ξ)) − 8ε/C} ≥ 1 − ε/2.</p><p>Putting both cases together, whenever ‖q‖₂ ≥ R we obtain the claimed upper bound on K(q, p), and analogously the lower bound. Re-arranging terms gives the desired inequality.</p><p>Algorithm 3 Preprocessing of a dataset P ⊆ R^d within a ball when the query will be far.</p><p>1: procedure PreprocessFar(P, c, r_in, r)</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>2:</head><p>Initialize a node u, and sample a point c′ ∼ B₂(0, r_in/3) and let</p><p>3: for h ∈ {0, . . . , ⌈log₂(R(u.r, ε, ξ)/u.r)⌉} do</p><p>4: Execute ProcessCaptured(P, u.cen, u.r_min, u.r_in, 2^h·u.r, 2^{h+1}·u.r),</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>5:</head><p>Store the output in u.OuterBall(h).</p><p>6:</p><p>return u.</p><p>Algorithm 4 Querying of a dataset P ⊆ R^d within a ball when the query is far.</p><p>1: procedure QueryFar(q, u, c, r)</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>2:</head><p>Compute r = ‖q − u.cen‖₂ (note we will always have r ≥ u.r).</p><p>3:</p><p>return QueryCaptured(q, u.OuterBall(h), u.r_min, u.r_in, 2^h·u.r, 2^{h+1}·u.r).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>6:</head><p>if h &gt; ⌈log₂(R(u.r, ε, ξ)/u.r)⌉ then 7:</p><p>return u.size · K(q, u.cen).</p><p>Lemma 4.2. Let n, d ∈ N, and suppose that, for L, t ≥ 1, K : R^d × R^d → [0, 1] is a p.d. radial kernel which is (L, t)-smooth. There are two randomized algorithms with the following guarantees:</p><p>• PreprocessFar(P, c, r_in, r) receives as input a set P ⊆ R^d of size at most n, a point c, and two thresholds r_in and r, where r ≥ 2·r_in. We are promised that every p ∈ P satisfies ‖p − c‖₂ ≤ r_in. The algorithm outputs a pointer to a data structure u.</p><p>• QueryFar(q, u, c, r) receives as input a query q ∈ R^d, and a pointer to a data structure (generated from PreprocessFar). We are promised that the query q satisfies ‖q − c‖₂ ≥ r, and the algorithm returns a value η ∈ R≥0.</p><p>We satisfy the following guarantees with probability at least 1 − 1/poly(ndΦ/ε):</p><p>• Correctness: If u is the output of PreprocessFar(P, c, r_in, r) and η is the output of QueryFar(q, u, c, r), then (1 − ε)</p><p>Proof. The purpose of Line 2 is to find an appropriate center which guarantees some separation between the query and the dataset, and ensures that no dataset point is too close to the center. In particular, since c′ ∼ B₂(0, r_in/2) is drawn randomly and d = ω(log n log log Φ), it is not too hard to check that with high probability, the new center u.newcen and thresholds u.r_min &lt; u.r_in &lt; u.r satisfy (i) u.r_min ≥ u.r_in/4, (ii) every p ∈ P satisfies ‖p − u.cen‖₂ ∈ [u.r_min, u.r_in], and (iii) every q ∈ R^d with ‖q − c‖₂ ≥ r also satisfies ‖q − u.cen‖₂ ≥ u.r.
Thus, the calls to ProcessCaptured(P, u.cen, u.r_min, u.r_in, 2^h·u.r, 2^{h+1}·u.r) and QueryCaptured(q, u.OuterBall(h), u.r_min, u.r_in, 2^h·u.r, 2^{h+1}·u.r) satisfy the correctness guarantees for queries q with ‖q − u.cen‖₂ ∈ [2^h·u.r, 2^{h+1}·u.r]. If ‖q − u.cen‖₂ ≥ R(u.r, ε, ξ), then by Definition 4.1, the entire kernel contribution of P is approximated by K(q, u.cen)·u.size. The bounds on the space and query time follow from Lemma 3.2, plugging in r_max = 2^{h+1}·u.r, r_out = 2^h·u.r, r_in = u.r_in, and r_min = u.r_min.</p><p>With those remarks set, we now analyze the space complexity of Preprocess(P). Since we will later bound the space complexity of ProcessCaptured, we are mostly concerned with the size of the binary tree rooted at u produced by Preprocess(P). The one challenging aspect in bounding the size of the tree is that the two sets P \ Cap(0) and P \ Cap(1) are not necessarily disjoint. In particular, if p ∈ P happens to satisfy h_{c,r}(p − u.cen) = ⊥, then p is contained in both sets. This means that the tree may be super-linear in size (if we are not careful about the setting of α). We first bound the depth of the tree.</p><p>Proof. We will show that the following occurs with high probability, which in turn implies the desired bound on the depth. Let u be any node of the tree and P_u ⊆ P such that u was the output of Preprocess(P_u). Then, over the randomness of the next s levels down the tree, we consider any path u = u_0, u_1, . . . , u_s down the tree where u_ℓ is the child of u_{ℓ−1}, and we have u_s.rad ≤ u.rad/2. Suppose that we set s so that the above event occurs with probability at least 1 − o(1/log Φ) (so we can union bound over O(log Φ) such events).
Then, every s levels of the tree, the values of u.rad decrease by a factor of 2, and by the definition of the aspect ratio Φ, there are at most O(s log Φ) recursive levels before Line 6 always returns.</p><p>We now show that we may set s = Θ(√d log(n log Φ)) in the above argument. Consider fixing any two points p₁, p₂ ∈ P_u. Let T denote the random sub-tree rooted at u which recursively contains children v generated from calls Preprocess(P_v) in Line 14 where p₁, p₂ ∈ P_v. We notice that v.rad is always at most 2·u.rad (because P_v ⊆ P_u, and the factor of 2 occurs because we may choose a different center for MEB(P_u) and MEB(P_v)), so that the probability that p₁ and p₂ remain together for s consecutive levels without the radius halving is o(1). At the same time, the depth of the tree is at most O(√d log(n log Φ) log Φ) = o(√d/α) with high probability. Taking a union bound, we reach the end of the tree and do not output "fail" with high probability.</p><p>The other necessary event to check is that, if v is a node on the root-to-leaf path in an execution of Query(q, u), and the call which initialized v was Preprocess(P_v), then every p ∈ P_v satisfies ‖p − v.newcen‖₂ ≥ v.R/100 and ‖q − v.newcen‖₂ ≥ v.R/100. For any fixed node, the probability that this occurs is at least 1 − (n + 1)·2^{−Ω(d)}, so we can again union bound over the first O(√d log(nΦ) log Φ) levels of the tree when d = ω(log n log log Φ).</p><p>Finally, we note that in any execution of Query(q, u), the output is given by a sum of at most |P| executions of QueryCaptured, so that the additive errors accumulate to at most |P|.</p></div>
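<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two cases in the proof of Claim 4.1 can be sanity-checked numerically for a single Gaussian component e^{−t‖p−q‖₂²} of the integral (4.21). The sketch below is ours (the constants d = 5, C = 100, and the sampling scheme are arbitrary test choices, not values from the paper): for queries at distance R = C r ln(1/ξ)/ε, components with large t fall below ξ, while components with small t stay within a 1 ± ε/2 factor of e^{−t‖q‖₂²}.</p>

```python
import math, random

# Hypothetical test constants (ours): dimension, dataset radius r,
# accuracy eps, additive error xi, and the constant C from the proof.
d, r, eps, xi, C = 5, 1.0, 0.1, 0.01, 100.0
R = C * r * math.log(1 / xi) / eps          # R = C r ln(1/xi) / eps

def gauss(t, x):
    """A single Gaussian component e^{-t ||x||_2^2} of (4.21)."""
    return math.exp(-t * sum(v * v for v in x))

random.seed(0)
for _ in range(100):
    # a dataset point p with ||p||_2 <= r and a query at distance R
    p = [random.uniform(-r, r) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in p))
    if norm > r:
        p = [v * r / norm for v in p]
    q = [0.0] * d
    q[0] = R
    pq = [a - b for a, b in zip(p, q)]

    # Case 1: t ||q||^2 >= 4 ln(1/xi)  =>  e^{-t ||p-q||^2} <= xi
    t1 = 5 * math.log(1 / xi) / R ** 2
    assert gauss(t1, pq) <= xi

    # Case 2: t ||q||^2 <= 4 ln(1/xi)  =>  ratio within 1 +- eps/2
    t2 = math.log(1 / xi) / R ** 2
    ratio = gauss(t2, pq) / gauss(t2, q)
    assert 1 - eps / 2 <= ratio <= 1 + eps / 2
```

<p>Since the kernel is a non-negative mixture of such components, the two per-component bounds are exactly what the proof integrates against μ(dt) to conclude.</p></div>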
		</body>
		</text>
</TEI>
