<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Active Linear Regression for Lp Norms and Beyond</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>October 2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10403837</idno>
					<idno type="doi">10.1109/FOCS54457.2022.00076</idno>
					<title level='j'>IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS 2022)</title>

					<author>Cameron Musco</author><author>Christopher Musco</author><author>David P. Woodruff</author><author>Taisuke Yasuda</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We study active sampling algorithms for linear regression, which aim to query only a small number of entries of a target vector and output a near minimizer to the objective function. For ℓ_p norm regression for any 0 < p < ∞, we give an algorithm based on Lewis weight sampling which outputs a (1 + ε)-approximate solution using just Õ(d/ε^2) queries to b for p ∈ (0, 1), Õ(d/ε) queries for p ∈ (1, 2), and Õ(d^{p/2}/ε^p) queries for p ∈ (2, ∞). For p ∈ (0, 2), our bounds are optimal up to logarithmic factors, thus settling the query complexity for this range of p. For p ∈ (2, ∞), our dependence on d is optimal, while our dependence on ε is off by at most a single ε factor, up to logarithmic factors. Our result resolves an open question of Chen and Dereziński, who gave near-optimal bounds for the ℓ_1 norm, but required at least d^2/ε^2 samples for ℓ_p regression with p ∈ (1, 2), and gave no bounds for p ∈ (2, ∞) or p ∈ (0, 1). We also provide the first total sensitivity upper bound for loss functions with at most degree p polynomial growth. This improves a recent result of Tukan, Maalouf, and Feldman. By combining this with our techniques for ℓ_p regression, we obtain the first active regression algorithms for such loss functions, including the important cases of the Tukey and Huber losses. This answers another question of Chen and Dereziński. Our sensitivity bounds also give improvements to a variety of previous results using sensitivity sampling, including Orlicz norm subspace embeddings, robust subspace approximation, and dimension reduction for smoothed ℓ_p norms. Finally, our active sampling results give the first sublinear time algorithms for Kronecker product regression under every ℓ_p norm. Previous results required reading the entire b vector in the kernel feature space.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>We consider a classic active learning problem: given a design matrix A ∈ R^{n×d} and query access to entries of an unknown target (measurement) vector b ∈ R^n, how can we compute an approximate minimizer of the regression problem min_{x∈R^d} ‖Ax − b‖ while querying as few entries of b as possible? This problem arises in applications where labeled data is expensive: viewing a single entry of b might require running a survey, physical experiment, or time-intensive computer simulation <ref type="bibr">[44]</ref>, <ref type="bibr">[45]</ref>. (Cameron Musco's work on this project was supported in part by NSF Grants 2046235 and 1763618, along with an Adobe Research Grant. Christopher Musco was supported by NSF Grant 2045590. David P. Woodruff and Taisuke Yasuda were supported by ONR grant N00014-18-1-2562 and a Simons Investigator Award.)<ref type="foot">1</ref> Extended abstract; full version available at <ref type="url">https://arxiv.org/abs/2111.04888</ref>.</p><p>Concretely, we study the following problem for general vector norms ‖·‖:<ref type="foot">2</ref></p><p>Problem I.1. For A ∈ R^{n×d}, b ∈ R^n, and accuracy parameter 0 &lt; ε ≤ 1, find x̃ ∈ R^d satisfying ‖Ax̃ − b‖ ≤ (1 + ε) min_{x∈R^d} ‖Ax − b‖, while reading as few of the entries {b(1), . . . , b(n)} of the target vector b as possible.<ref type="foot">3</ref></p><p>Notably, the formulation of Problem I.1 makes no assumptions on A and b. For example, we do not assume that there exists a ground truth x* such that Ax* − b is bounded in magnitude, or follows some distribution (e.g., has random Gaussian entries). 
Under these stronger assumptions, much is known about the problem, which has been studied for decades in the statistics literature on "optimal design of experiments", as well as in machine learning <ref type="bibr">[9]</ref>, <ref type="bibr">[33]</ref>, <ref type="bibr">[44]</ref>.</p><p>In contrast, progress on the assumption-free version of the problem has only come in recent years, thanks to advances in random matrix theory and randomized numerical linear algebra. This is for good reason: solving Problem I.1 inherently requires choosing which entries of b to query in a randomized way: an adversary can easily "fool" any deterministic algorithm by concentrating error in Ax -b on the indices of b that will be deterministically queried.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Prior Work</head><p>Euclidean Norm. Problem I.1 is fully understood when the error is measured in the ℓ_2 norm, ‖w‖_2 = (Σ_{i=1}^n |w_i|^2)^{1/2}, i.e., for least squares regression. The typical approach is to subsample and reweight rows (i.e., constraints) of the regression problem and to let x̃ be the minimizer of this sampled problem, which only involves a fraction of the entries in b. I.e., letting S ∈ R^{m×n} be a sampling matrix with m &lt; n rows (S has one non-zero entry per row), set x̃ = arg min_x ‖SAx − Sb‖. When constraints are selected with probability proportional to the statistical leverage scores of A's rows, Problem I.1 can be solved with O(d/ε · log d) samples, and thus O(d/ε · log d) queries to b <ref type="bibr">[21]</ref>, <ref type="bibr">[46]</ref>, <ref type="bibr">[53]</ref>.<ref type="foot">4</ref> (²Our work will also extend to other loss functions of the form Σ_{i=1}^n M([Ax − b]_i) that are not necessarily norms. ³In principle, entries of b can be read adaptively, i.e., we can select indices to query based on the results of other queries. However, the benefits of adaptivity appear limited: most methods for solving Problem I.1, including those studied in this paper, are non-adaptive.) Using tools from spectral graph sparsification <ref type="bibr">[6]</ref>, <ref type="bibr">[35]</ref>, Chen and Price recently improved the leverage score sampling result to O(d/ε), which is optimal <ref type="bibr">[12]</ref>.</p><p>In practice, methods based on leverage score sampling (also known as "coherence motivated sampling") have found many applications. They are widely used in high-dimensional function fitting problems arising in the solution of parametric partial differential equations, where even mild assumptions on A and b are undesirable <ref type="bibr">[16]</ref>, <ref type="bibr">[17]</ref>, <ref type="bibr">[30]</ref>. 
Methods for solving Problem I.1 in the ℓ_2 norm also yield robust methods for interpolating sparse Fourier functions, bandlimited and multiband functions, and for data-efficient kernel learning <ref type="bibr">[4]</ref>, <ref type="bibr">[11]</ref>, <ref type="bibr">[23]</ref>.</p><p>Other Norms. Much less was known about Problem I.1 beyond the ℓ_2 norm until recent work of Chen and Dereziński <ref type="bibr">[10]</ref>, which proves an upper bound of O(d/ε^2 · log d) queries for the ℓ_1 norm. This result is tight up to the log d factor. A similar result is obtained in <ref type="bibr">[43]</ref>. Chen and Dereziński also prove a result for ℓ_p norms, ‖w‖_p = (Σ_{i=1}^n |w_i|^p)^{1/p}, with p ∈ (1, 2), requiring O(d^2/ε^2 · log d) samples. As for the ℓ_2 norm, the results for ℓ_1 and ℓ_p are obtained by subsampling rows of the regression problem independently at random. However, instead of sampling with probabilities proportional to the leverage scores, <ref type="bibr">[10]</ref>, <ref type="bibr">[43]</ref> employ a natural generalization of these scores known as the ℓ_p Lewis weights <ref type="bibr">[18]</ref>. They left open the question of whether a linear-in-d dependence is possible for 1 &lt; p &lt; 2, and any bounds at all for p &gt; 2.</p><p>Beyond norms, if b is a {−1, 1} label vector, and the error is measured via the logistic loss, Munteanu et al. <ref type="bibr">[42]</ref> show that poly(d, μ, 1/ε) samples suffice, where μ is a complexity measure of A. This bound has recently been tightened to Õ(dμ^2/ε^2) [40], using Lewis weight sampling.<ref type="foot">5</ref> For other loss functions, such as the Tukey loss and Huber's M-estimators for robust regression <ref type="bibr">[27]</ref>, we are not aware of any known results solving Problem I.1. Chen and Dereziński also pose the open question of obtaining active regression bounds for other loss functions, in particular the Tukey and Huber losses, which are important in practice.</p></div>
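To make the ℓ_2 sample-and-solve approach above concrete, here is a minimal sketch (function names and the sampling budget m are our own illustrative choices, not from the paper): subsample rows with probability proportional to leverage scores, query only the sampled entries of b, and solve the reweighted least squares problem.

```python
import numpy as np

def leverage_scores(A):
    """Statistical leverage scores: squared row norms of an
    orthonormal basis for the column space of A."""
    Q, _ = np.linalg.qr(A)
    return np.sum(Q**2, axis=1)

def active_least_squares(A, query_b, m, seed=None):
    """Sample m rows with probability proportional to leverage scores,
    query only those entries of b, and solve the reweighted subproblem."""
    rng = np.random.default_rng(seed)
    tau = leverage_scores(A)
    prob = tau / tau.sum()
    idx = rng.choice(len(prob), size=m, p=prob)
    # Reweight so the sampled objective is an unbiased estimate of ||Ax - b||_2^2.
    w = 1.0 / np.sqrt(m * prob[idx])
    SA = w[:, None] * A[idx]
    Sb = w * np.array([query_b(i) for i in idx])  # only m queries to b
    x, *_ = np.linalg.lstsq(SA, Sb, rcond=None)
    return x
```

With m on the order of d/ε (up to log factors, per the bounds cited above), the sampled solution is a (1 + ε)-approximate minimizer with good probability.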
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Our Contributions a) ℓ_p Active Regression.:</head><p>Our first main result is a new algorithm for solving Problem I.1 for the ℓ_p norm for any 0 &lt; p &lt; ∞.<ref type="foot">foot_2</ref> While near-optimal bounds are known for p ∈ {1, 2} <ref type="bibr">[10]</ref>, <ref type="bibr">[12]</ref>, <ref type="bibr">[43]</ref>, the problem is far from settled for all other p. Previously, active ℓ_p regression for p &gt; 2 and 0 &lt; p &lt; 1 had no known nontrivial algorithms with (1 + ε) relative error, and the only known approach was to read all n entries of b and solve the problem using offline results. A natural question is whether a sublinear query complexity is possible in these regimes. For p ∈ (1, 2), <ref type="bibr">[10]</ref> gave an algorithm making O(d^2/ε^2 · log d) queries, thus achieving the first sublinear query complexity. One of their main open questions is whether the dependence on d can be improved to linear. Our main result answers all of these questions.</p><p>Theorem I.2 (Main Result for Active ℓ_p Regression). Given 0 &lt; p &lt; ∞, A ∈ R^{n×d}, and query access to b ∈ R^n, there is an algorithm that solves Problem I.1 for the ℓ_p norm with probability 99/100 which makes m queries to b, where m = Õ(d/ε^2) for p ∈ (0, 1), m = Õ(d/ε) for p ∈ (1, 2), and m = Õ(d^{p/2}/ε^p) for p ∈ (2, ∞).</p><p>We complement our algorithmic result with various new lower bounds which show the tightness of our algorithm. For p ∈ (0, 2), our dependence on d and ε in the query complexity is simultaneously tight up to polylogarithmic factors; we show an Ω(d/ε^2) lower bound for p ∈ (0, 1) and an Ω(d/ε) lower bound for p ∈ (1, 2). For p &gt; 2, our dependence on d is tight due to a lower bound of Ω(d^{p/2}) which we show, while our ε dependence is off by at most a factor of ε, due to an Ω(ε^{1−p}) lower bound for the one-dimensional ℓ_p power means problem in Theorem 3 of <ref type="bibr">[19]</ref>. 
Note that our active regression lower bounds for p ∈ (0, 2) improve this previous power means lower bound.</p><p>Notably, we achieve a linear dependence on ε for p ∈ (1, 2), which is perhaps surprising given that all previously known approaches to dimension reduction for ℓ_p regression relied on preserving the ℓ_p norm of all vectors in a subspace up to (1 ± ε) factors <ref type="bibr">[18]</ref>, which requires Ω(d/ε^2) dimensions <ref type="bibr">[37]</ref>. It also demonstrates a separation in the query complexity between p ≤ 1 and 1 &lt; p &lt; 2, due to a lower bound of Ω(d/ε^2) for p = 1 <ref type="bibr">[10]</ref>, <ref type="bibr">[43]</ref> as well as for p ∈ (0, 1), which we show.</p><p>Note that Theorem I.2 is stated to solve Problem I.1 with constant probability, 99/100. In general, we show how to obtain success probability 1 − δ with dependence on δ that is only polylogarithmic in 1/δ. In fact, we show that any algorithm that simply samples rows of the regression problem and solves the sampled problem must suffer a 1/δ^{p−1} dependence. Indeed, such a loss is seen in the algorithm of <ref type="bibr">[10]</ref> for p ∈ (1, 2). Thus, a success probability boosting routine, as we give in our work, is required to obtain an O(log(1/δ)) dependence.</p><p>b) Sensitivity Bounds and Active Regression for General Losses.: We show that our approach to solving Problem I.1 for ℓ_p norms generalizes to a broad class of loss functions known as M-estimators <ref type="bibr">[15]</ref>, which take the form Σ_{i=1}^n M([Ax − b]_i). The only properties that we require are that (1) we can compute a constant factor approximation to Problem I.1, and (2) the loss function obeys approximate variants of the triangle inequality.</p><p>To the best of our knowledge, the only prior result achieving sensitivity bounds for general loss functions is <ref type="bibr">[52]</ref>. 
However, this work makes use of Löwner-John ellipsoids, which leads to practically inefficient algorithms, and loses a factor of √d in the total sensitivity due to the ellipsoidal rounding. As our second main result, we develop new sensitivity bounds for M-estimators that significantly simplify and improve this result.</p><p>Theorem I.3 (Main Result for Sensitivity Bounds). Let A ∈ R^{n×d} and let M be an M-estimator loss with at most degree p growth. There is an algorithm which, with probability at least 99/100, computes M-sensitivity upper bounds which sum to at most O(d^{1∨(p/2)} log^2 n + τ)<ref type="foot">foot_3</ref> in time at most Õ(nnz(A) + n·d^{O(1)}/τ).</p><p>Our approach to sensitivity bounds relies only on hashing and the computation of ℓ_p Lewis weights <ref type="bibr">[18]</ref>, <ref type="bibr">[24]</ref>, and avoids the computation of Löwner-John ellipsoids. This allows for input sparsity time algorithms, and answers an open question of <ref type="bibr">[52]</ref> on avoiding Löwner-John ellipsoids in the computation of sensitivities. Note that our dependence on d matches the sensitivity bounds for the ℓ_p loss and is thus tight. We also show that the dependence on n is necessary for loss functions such as the Huber and Tukey losses. Furthermore, our algorithm can be turned into a non-algorithmic proof that the sensitivities sum to at most O(d^{1∨(p/2)} log n) for these M-estimators; this is in fact tight for the Tukey loss by our lower bound of Ω(d log n). Thus, we obtain the first tight bounds on the sum of sensitivities for losses other than ℓ_p. 
Overall, we make significant progress on generalizing the theory of matrix approximation beyond ℓ_p losses to handle general M-estimators, a direction that has recently received much attention <ref type="bibr">[13]</ref>-<ref type="bibr">[15]</ref>, <ref type="bibr">[26]</ref>, <ref type="bibr">[51]</ref>, <ref type="bibr">[52]</ref>.</p><p>Combined with our active regression techniques, our sensitivity bounds yield active regression algorithms for general loss functions, including the Huber and Tukey losses, answering an open question of Chen and Dereziński <ref type="bibr">[10]</ref>. Note that prior to our work, no sublinear query complexity was known for any M-estimator regression, besides the ℓ_2 and ℓ_1 losses.</p><p>Furthermore, our new sensitivity bounds imply significant improvements in previous results using sensitivity sampling beyond active regression, including Orlicz norm subspace embeddings <ref type="bibr">[50]</ref> and robust subspace approximation <ref type="bibr">[14]</ref>. We believe that our general technique will find further applications, and leave this as an open question. c) Subspace Embeddings for Orlicz Norms.: Orlicz norms can be viewed as scale-invariant extensions of M-estimators, and have recently attracted attention as a general class of norms that admit efficient dimensionality reduction results <ref type="bibr">[3]</ref>, <ref type="bibr">[50]</ref>. In particular, <ref type="bibr">[50]</ref> apply sensitivity sampling to obtain subspace embeddings for Orlicz norms, which yield a small weighted subset Ã of rows of a matrix A ∈ R^{n×d} such that ‖Ãx‖ = (1 ± ε)‖Ax‖<ref type="foot">foot_4</ref> for all x ∈ R^d. However, the number of rows required by <ref type="bibr">[50]</ref> is a large polynomial in d, and the result is also restricted to Orlicz norms of at most quadratic growth. 
We show that by applying our new sensitivity bounds, we can obtain subspace embeddings for Orlicz norms with d^{2∨(p/2+1)} poly(log n, ε^{−1}) rows, for any Orlicz norm with a polynomial growth bound of degree p. d) Robust Subspace Approximation.: The robust subspace approximation problem generalizes the classical low rank approximation problem of finding a rank-k projection X minimizing ‖AX − A‖_F, by replacing the Frobenius norm with an extension of M-estimators to matrix norms. <ref type="bibr">[14]</ref> showed the first dimensionality reduction results for this problem for a general class of M-estimators of at most quadratic growth, via a recursive sampling scheme using the sensitivity sampling framework. However, due to the use of looser sensitivity bounds, they suffer an undesirable factor of (log n)^{O(log k)} in their sample complexities. Our new sensitivity bounds allow us to remove this factor, giving a dimension reduction result into a poly(k, log n, ε^{−1}) × poly(k, log n, ε^{−1}) instance. We also extend their method beyond quadratic growth, to any degree p polynomial growth.</p><p>e) Active Regression for the Huber Loss.: Our active regression result for general M-estimators discussed above is loose by a factor of d in the sample complexity, compared to our ℓ_p active regression results. This is attributed to the use of our net argument for general M-estimators, whereas our ℓ_p active regression results can make use of the more sophisticated chaining arguments of <ref type="bibr">[7]</ref>, <ref type="bibr">[34]</ref>, <ref type="bibr">[48]</ref>. A natural question is whether this gap can be closed.</p><p>We consider the important special case of the Huber loss, which is defined as follows:</p><p>Definition I.4 (Huber loss <ref type="bibr">[31]</ref>). The Huber loss of width τ ≥ 0 is defined as H(y) = y^2/(2τ) for |y| ≤ τ and H(y) = |y| − τ/2 otherwise, and the Huber norm<ref type="foot">foot_6</ref> is defined as ‖y‖_H := Σ_{i=1}^n H(y(i)). 
The Huber loss is "arguably one of the most widely used M-estimators" <ref type="bibr">[15]</ref>, owing its popularity to the convexity and differentiability properties of ℓ_2, which allow for efficient algorithms (see, e.g., <ref type="bibr">[41]</ref>), in combination with the robustness properties of ℓ_1 <ref type="bibr">[29]</ref>. This makes it widely applicable in practical big data settings (see, e.g., <ref type="bibr">[5]</ref> for a list of popular software packages implementing Huber regression, as well as references that make use of Huber regression). Variations on Huber regression have also recently been shown to enjoy theoretical guarantees in the robust statistics literature (see, e.g., <ref type="bibr">[38]</ref>, <ref type="bibr">[39]</ref> and references therein).</p><p>For the Huber loss, we show that it is indeed possible to leverage chaining techniques to obtain improved sample complexity bounds for active regression. We improve beyond the d^2 bound obtained by our general M-estimator algorithm as applied to the Huber loss, and obtain a sample complexity of O(d^{4−2√2} poly(log n, ε^{−1})) queries to b, where 4 − 2√2 ≈ 1.17157. For this result, we use the chaining techniques of <ref type="bibr">[7]</ref>, which provide a more flexible alternative to <ref type="bibr">[34]</ref>, but require more technical effort to adapt to the active setting.</p></div>
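To make the Huber loss concrete, here is a small sketch under one common normalization (quadratic below the width τ, linear above, matched continuously at |y| = τ); the exact normalization in the paper's full version may differ.

```python
import numpy as np

def huber(y, tau=1.0):
    """Huber loss, one common normalization:
    H(y) = y^2 / (2*tau)  for |y| <= tau,
    H(y) = |y| - tau/2    otherwise.
    The two branches agree at |y| = tau (both equal tau/2)."""
    y = np.abs(np.asarray(y, dtype=float))
    return np.where(y <= tau, y**2 / (2.0 * tau), y - tau / 2.0)

def huber_norm(y, tau=1.0):
    """The 'Huber norm' ||y||_H: the coordinate-wise sum of Huber losses."""
    return float(huber(y, tau).sum())
```

The quadratic branch gives ℓ_2-like differentiability near zero, while the linear branch gives ℓ_1-like robustness to large residuals, which is exactly the combination the text attributes to Huber's popularity.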
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Theorem I.5 (Main Result for Huber Active Regression).</head><p>Let A ∈ R^{n×d} and b ∈ R^n. There is an algorithm which, with probability at least 99/100, returns an x̃ satisfying ‖Ax̃ − b‖_H ≤ (1 + ε) min_{x∈R^d} ‖Ax − b‖_H using O(d^{4−2√2} poly(log n, ε^{−1})) queries to b.</p><p>Our techniques also yield a subspace embedding result, which constructs a weighted subset of O(d^{4−2√2} poly(log n, ε^{−1})) rows of A whose Huber loss uniformly (1 ± ε)-approximates that of A. Previously, the best known dimension reduction bound for Huber regression, even in the non-active setting, was d^4, due to <ref type="bibr">[15]</ref>.</p><p>Furthermore, this is, to the best of our knowledge, the first example of a loss function other than ℓ_p which achieves a sensitivity sampling bound better than d^2, despite the fact that such results have been sought in many works <ref type="bibr">[13]</ref>-<ref type="bibr">[15]</ref>, <ref type="bibr">[28]</ref>, <ref type="bibr">[50]</ref>, <ref type="bibr">[52]</ref>. The reason for this is that d^2 is a natural barrier for sensitivity sampling, attributed to one factor of d from the sum of sensitivities and another factor of d from carrying out a union bound over a net of exp(d) vectors. For ℓ_p norms, the arguments of <ref type="bibr">[7]</ref>, <ref type="bibr">[48]</ref> and their subsequent improvements avoid this problem by using a more sophisticated chaining argument. However, these arguments use the structure of ℓ_p spaces in crucial ways, such as isometric changes of density using Lewis weights <ref type="bibr">[32]</ref>, and do not generalize easily to other loss functions.</p><p>It is an interesting open question to determine whether our dimension reduction bound for the Huber loss can be improved all the way down to d.</p><p>f) Dimension Reduction for Gamma Functions for Faster ℓ_p Regression.: One particularly important application of sampling-based dimension reduction for loss functions beyond ℓ_p losses is, perhaps surprisingly, in the design of algorithms for ℓ_p regression itself. 
The work of <ref type="bibr">[8]</ref> introduces gamma functions γ_p, generalizations of the Huber loss which behave quadratically near the origin and like |x|^p away from the origin, in the context of algorithms for ℓ_p regression. Subsequently, <ref type="bibr">[2]</ref> obtained even faster algorithms by using constant factor approximations of γ_p regression as a subroutine, in which the γ_p loss is minimized over a subspace. Dimension reduction for this loss function has been a crucial ingredient for recent results in fast algorithms for ℓ_p regression <ref type="bibr">[1]</ref>, <ref type="bibr">[28]</ref>. In particular, <ref type="bibr">[1]</ref> highlighted the open question of designing sparsification methods for γ_p functions for p ∈ (1, 2), and <ref type="bibr">[28]</ref> designed a sampling algorithm which samples Õ(d^3) rows. By generalizing our dimension reduction techniques for the Huber loss, we obtain an algorithm which samples at most O(d^{4−2√2} poly(log n, ε^{−1})) rows for any p ∈ [1, 2), improving to O(d poly(log n, ε^{−1})) rows as p → 2 (see Figure <ref type="figure">1</ref> for the trade-off curve).</p><p>g) Kronecker Product Regression.: Beyond applications in data-efficient regression, Theorem I.2 implies the first sublinear time algorithm for Kronecker product regression in any ℓ_p norm, where explicitly constructing the vector b is a computational bottleneck. In q-th order Kronecker product regression, one is given matrices A_1, A_2, . . . , A_q, where A_i ∈ R^{n_i×d_i}, as well as a vector b ∈ R^{n_1 n_2 ··· n_q}, and the goal is to solve min_x ‖(A_1 ⊗ A_2 ⊗ ··· ⊗ A_q)x − b‖_p, where ⊗ denotes the Kronecker product. Typically ∏_{i=1}^q d_i is much less than ∏_{i=1}^q n_i, and the goal is to obtain algorithms that do not explicitly form A_1 ⊗ A_2 ⊗ ··· ⊗ A_q or b, which would be too expensive. 
Our results yield the first algorithm for Kronecker product regression, for every p ≥ 1, whose running time does not depend on nnz(b), whereas previous results had a linear dependence on nnz(b), which can be as large as ∏_{i=1}^q n_i <ref type="bibr">[22]</ref>.</p><p>Theorem I.6. Let q ≥ 1 and p ≥ 1 be constant, and let ε &gt; 0.</p></div>
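The ℓ_p Lewis weights underlying Theorem I.2's sampling can, for p &lt; 4, be approximated by a simple fixed-point iteration (as analyzed in the Lewis weights literature cited above, e.g. [18]; the iteration count here is our own illustrative choice). At a fixed point the weights sum to d, generalizing the fact that leverage scores sum to the rank; for p = 2 the weights are exactly the leverage scores.

```python
import numpy as np

def lewis_weights(A, p, iters=50):
    """Approximate ell_p Lewis weights via the fixed-point iteration
        w_i <- (a_i^T (A^T W^{1-2/p} A)^{-1} a_i)^{p/2},
    where W = diag(w).  This iteration converges for p < 4."""
    n, d = A.shape
    w = np.full(n, d / n)  # uniform initialization
    for _ in range(iters):
        W = w ** (1.0 - 2.0 / p)
        G = A.T @ (W[:, None] * A)                    # A^T W^{1-2/p} A
        Ginv = np.linalg.inv(G)
        tau = np.einsum('ij,jk,ik->i', A, Ginv, A)    # a_i^T G^{-1} a_i
        w = np.maximum(tau, 1e-15) ** (p / 2.0)
    return w
```

Sampling rows with probabilities proportional to these weights is the ℓ_p analogue of the leverage score sampling discussed for ℓ_2.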
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Kronecker product regression can be solved up to a (1 + ε)-factor with constant probability, in time independent of nnz(b).</head><p>1) ℓ_p Active Regression: Our algorithm for solving Problem I.1 uses a novel variation on the "sample-and-solve" approach. In particular, we randomly select a row sampling matrix S ∈ R^{m×n} and return x̃ = arg min_x ‖SAx − Sb‖, which only requires querying m entries of b (those that appear in Sb). To get tight bounds for ℓ_p regression, we select S using ℓ_p Lewis weight sampling, a generalization of leverage score sampling for ℓ_2.</p><p>It can be shown that the ℓ_p Lewis weights upper bound the ℓ_p sensitivities of A, a measure of importance for the rows of A. The ℓ_p sensitivity of the i-th row of A is defined as s_i^p(A) = sup_{Ax ≠ 0} |[Ax](i)|^p / ‖Ax‖_p^p, where [Ax](i) denotes the i-th entry of the vector Ax; it captures how large the i-th entry of any Ax ∈ span(A) can be, relative to the ℓ_p norm. A standard scalar Bernstein bound shows that if S samples rows with probabilities that upper bound the sensitivities, then ‖SAx‖_p^p = (1 ± ε)‖Ax‖_p^p with high probability, for each fixed x ∈ R^d. An ε-net argument can extend this to a "for all" claim.</p><p>a) Prior Approaches to ℓ_p Active Regression: While the above ideas give an approach for standard ℓ_p regression, this bound does not suffice for active ℓ_p regression. To solve Problem I.1, we actually want that ‖S(Ax − b)‖_p^p = (1 ± ε)‖Ax − b‖_p^p for any x. Will S provide such a guarantee? The main problem, as discussed in <ref type="bibr">[10]</ref>, <ref type="bibr">[43]</ref>, is that the translation by b may introduce outliers, i.e., entries with high sensitivity which are not captured by the sensitivity scores of A. As shown by <ref type="bibr">[10]</ref>, <ref type="bibr">[43]</ref>, in the case of ℓ_1, the special structure of the loss function provides a solution. Indeed, by the triangle inequality, |‖S(Ax − b)‖_1 − ‖S(Ax* − b)‖_1| ≤ ‖SA(x − x*)‖_1, where x* is the optimal solution. 
This fact can be used to show that sampling by the sensitivities of A preserves the differences between the cost of any x and the optimal x*. However, such a proof cannot work for p ≠ 1, in which case we do not have such a nice inequality. For p ∈ (1, 2), <ref type="bibr">[10]</ref> take the approach of bounding the residual error terms from the above approach using a Taylor approximation, but this leads to a sample complexity of at least d^2.</p><p>b) Our Solution: Partitions by Sensitivity: Instead of relying on the technique of "cancelling out the outliers", we take a conceptually different approach. We proceed in two stages: we (1) first find a constant factor solution x_c such that ‖Ax_c − b‖_p^p ≤ O(1) · min_x ‖Ax − b‖_p^p using an idea of <ref type="bibr">[20]</ref> and replace b by the residual vector b − Ax_c, and then (2) conceptually partition the target vector b into two sets of coordinates: the coordinates i ∈ [n] that are small enough to be comparable to the sensitivity s_i^p(A), and those that are much larger. That is, we consider the coordinates i with |b(i)|^p / ‖b‖_p^p ≤ C · s_i^p(A) for some C &gt; 0, and all other coordinates. For the former set of coordinates, one can check that the Bernstein bound still applies, and S does preserve ‖Ax − b‖_p^p when restricted to these coordinates. On the other hand, for the latter set of coordinates, we show that no vector of the form Ax can both be close to b(i) in its i-th entry and still close to the remainder of b: the i-th entry is simply too large in magnitude. In particular, to have [Ax](i) close to b(i), we would require ‖Ax‖_p^p to be much larger than ‖b‖_p^p, which, by our preprocessing step, is on the order of the optimal cost min_x ‖Ax − b‖_p^p. Via the triangle inequality, this implies that Ax must be far from an optimal solution. Thus, we can argue that any near-optimal solution to min_x ‖Ax − b‖_p does not need to fit b(i) for |b(i)|^p / ‖b‖_p^p much larger than s_i^p(A). 
We can effectively ignore the contribution of these rows.</p><p>Another technical challenge remains: to obtain an optimal dimension dependence, we need a refined ε-net argument to make "for all" statements about x ∈ R^d. To do so, we adapt the chaining arguments of Bourgain, Lindenstrauss, and Milman <ref type="bibr">[7]</ref> and Ledoux and Talagrand <ref type="bibr">[34]</ref> to the active regression setting, avoiding the standard ε-net and union bound argument used by, e.g., <ref type="bibr">[47]</ref>. Although both <ref type="bibr">[7]</ref> and <ref type="bibr">[34]</ref> provide such approaches, we adapt the (slightly) more complex recursive Lewis weight sampling algorithm of <ref type="bibr">[34]</ref> in order to obtain tighter dependencies on ε. The streamlined proof of <ref type="bibr">[34]</ref> also adapts nicely to the active regression setting with minimal changes to the original argument. We note that we will later also need to adapt the much more involved argument of <ref type="bibr">[7]</ref> to handle the Huber loss; in that case, the proof of <ref type="bibr">[7]</ref> allows for more fine-grained control in bounding the sensitivity sampling algorithm, but requires a more complex argument based on carefully partitioning the coordinates of the target vector b into sensitivity weight classes. Aside from our new application of <ref type="bibr">[7]</ref>, <ref type="bibr">[34]</ref>, we hope that by translating the arguments of <ref type="bibr">[7]</ref>, <ref type="bibr">[34]</ref> into the language of theoretical computer science and matrix approximation, they will find further applications in randomized algorithm design.</p><p>We note that our algorithm is quite a bit more involved than the simple scheme of sampling proportionally to Lewis weights and solving. This is for good reason. 
Not only is it unclear that such an approach works at all; we show that for any p &gt; 1, any algorithm which simply samples reweighted rows and solves the sampled system must have a polynomial dependence on 1/δ in the query complexity, while our algorithm achieves a log(1/δ) dependence by solving residual problems of a constant factor solution. Thus, our two-stage approach is necessary to achieve our δ dependence. Furthermore, the best known analysis of a simple "one-shot" Lewis weight sampling scheme suffers in its ε dependence for p &gt; 2, where the one-shot approach is only known to give an Õ(d^{p/2}/ε^5) bound for subspace embeddings <ref type="bibr">[7]</ref>, <ref type="bibr">[18]</ref>, whose losses translate to losses for our active regression algorithms as well, while the recursive approach achieves Õ(d^{p/2}/ε^2) <ref type="bibr">[34]</ref>. While the <ref type="bibr">[34]</ref> result is an existential result, we analyze the <ref type="bibr">[34]</ref> proof to turn it into a randomized algorithm with logarithmic dependence on the failure rate δ, which achieves the best known dependence on d, ε, and δ, up to logarithmic factors. We further modify this subspace embedding result for active regression, to optimize our ε dependence.</p><p>c) Optimized ε Dependence for p ∈ (1, 2): For p ∈ (1, 2), the above argument gives a bound of Õ(d/ε^2). While the linear dependence on d is optimal, confirming a conjecture of <ref type="bibr">[10]</ref>, the quadratic dependence on ε is in fact not optimal. We show how to improve our bound to Õ(d/ε), which requires additional ideas. We first use strong convexity to show that a (1 + γ)-approximate solution x̃ must be close to the optimal solution x* itself. Then, by using that x̃ is close to the optimal solution, we show an improved bound on the difference in the objective values of x̃ and x*, i.e., that x̃ actually has an approximation ratio better than (1 + γ). 
We then iterate this argument until we obtain a (1 + ε)-approximation using Õ(d/ε) queries, at which point we can no longer obtain improvements.</p><p>The chaining argument used in this proof, while similar to the previous proofs, has a different geometry than the previous chaining arguments, and requires additional ideas. Our upper bound is tight up to polylogarithmic factors, due to a lower bound we show. This improves an ℓ_p power means lower bound of <ref type="bibr">[19]</ref>, who only showed a lower bound of Ω(ε^{1−p}) queries. Unfortunately, we are unable to port our algorithmic techniques to the ℓ_p power means problem in high dimensions, due to difficulties in adapting their chaining argument.</p><p>2) Sensitivity Bounds: The notion of ℓ_p sensitivities, as discussed above, naturally generalizes to loss functions that take the form of coordinate-wise sums. Consider a loss function M and an n × d matrix A. Then, the sensitivity of the i-th coordinate with respect to the loss function M is defined as s_i^M(A) = sup_x M([Ax](i)) / Σ_{j=1}^n M([Ax](j)). It is well-established that sensitivities provide a general framework for sampling rows of A that approximate A well under the loss function M <ref type="bibr">[25]</ref>. While a rich literature exists for ℓ_p <ref type="bibr">[18]</ref>, <ref type="bibr">[20]</ref>, <ref type="bibr">[49]</ref>, little was known about the approximation of sensitivities for general loss functions until <ref type="bibr">[52]</ref>, which used Löwner-John ellipsoids to obtain sensitivity bounds for a general family of near-convex losses. 
However, computing Löwner-John ellipsoids takes time that is a large polynomial in n and d, which is impractical for large datasets, and <ref type="bibr">[52]</ref> raise the open question of obtaining general sensitivity bounds without this expensive subroutine.</p><p>Our approach to new sensitivity bounds significantly generalizes the approach of <ref type="bibr">[13]</ref>, whose algorithm can be seen as a way to use hashing and Lewis weights to compute sensitivities for the Tukey loss, but whose analysis heavily uses the properties of the Tukey loss.</p><p>Suppose that a coordinate i ∈ [n] has M-sensitivity α ∈ (0, 1], that is,</p><p>sup_{Ax ≠ 0} M([Ax](i)) / ∑_{j=1}^n M([Ax](j)) = α, and let y = Ax witness this supremum; assume for simplicity that ∑_{j=1}^n M(y_j) = 1. Note then that there can be at most 1/α entries j ∈ [n] of y that have coordinate value M(y_j) ≥ M(y_i) = α. Then, if we randomly hash the n coordinates into O(1/α) buckets, with constant probability, coordinate i will be isolated from every other entry with M(y_j) ≥ M(y_i) = α. Now if M is monotonic, this means that y_i is the largest coordinate in its hash bucket. Furthermore, the sum of the M-mass of all of the other coordinates in i's hash bucket is only an α fraction of the total M-mass, so entry i carries a constant fraction of the M-mass in its bucket. In this case, it can be shown that entry i must in fact carry a constant fraction of the ℓ_2 mass inside its hash bucket, if M is a function of at most quadratic growth. This is because when we switch the error metric from M to ℓ_2, the largest entry has the largest increase in its normalized contribution. 
This means that row i must have an ℓ_2 leverage score of Ω(1) in this hash bucket.</p><p>This leads to the following algorithm: (1) hash the n coordinates into O(1/α) buckets, (2) compute ℓ_2 leverage scores within each bucket, and (3) assign an M-sensitivity of α to any coordinate that has ℓ_2 leverage score Ω(1). In each of the O(1/α) buckets, we find at most O(d) coordinates with leverage score Ω(1), so we assign an M-sensitivity of α to at most O(d/α) coordinates, for a total sensitivity contribution of O(d). By repeating this for O(log n) guesses of α in powers of 2, we obtain a total sensitivity bound of O(d log n). The constant-probability events in the hashing process can be boosted to probability 1 − 1/poly(n) by repeating the procedure O(log n) times, which increases the total sensitivity to roughly O(d log^2 n).</p><p>By sampling according to these sensitivities and applying a union bound over a net, we obtain the first active regression algorithms for general loss functions. Note that this result is made possible by a combination of both our new sensitivity bounds for M-estimators and our new active regression techniques as discussed in Section I-C1. Furthermore, we demonstrate other applications of our sensitivity bound result, showing how to improve Orlicz norm subspace embeddings and robust subspace approximation.</p></div>
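To make the three-step procedure above concrete, the following minimal sketch (our illustration, not the paper's implementation) hashes rows into buckets, computes exact ℓ_2 leverage scores per bucket via a QR factorization, and records α as a sensitivity upper bound for rows whose in-bucket leverage score clears a constant threshold. The cutoff of 1/2 and the repetition count are illustrative stand-ins for the Ω(1) and O(log n) constants in the text.

```python
import numpy as np

def leverage_scores(A):
    # Exact l2 leverage scores of A's rows: squared row norms of Q from
    # a thin QR factorization of A.
    Q, _ = np.linalg.qr(A)
    return np.sum(Q * Q, axis=1)

def sensitivity_upper_bounds(A, threshold=0.5, seed=0):
    # For each guess alpha = 2^-k, hash the n rows into ~1/alpha buckets,
    # compute l2 leverage scores within each bucket, and assign alpha as an
    # M-sensitivity upper bound to rows whose in-bucket leverage score is
    # at least `threshold`. Repeating the hashing O(log n) times boosts the
    # constant-probability isolation event.
    rng = np.random.default_rng(seed)
    n, d = A.shape
    levels = int(np.ceil(np.log2(n))) + 1
    s = np.zeros(n)
    for k in range(levels):
        alpha = 2.0 ** (-k)
        n_buckets = int(np.ceil(1.0 / alpha))
        for _ in range(levels):  # repetition for probability boosting
            buckets = rng.integers(0, n_buckets, size=n)
            for b in range(n_buckets):
                idx = np.flatnonzero(buckets == b)
                if idx.size == 0:
                    continue
                heavy = idx[leverage_scores(A[idx]) >= threshold]
                s[heavy] = np.maximum(s[heavy], alpha)
    return s
```

Since the ℓ_2 leverage scores of any matrix sum to at most d, each bucket contributes at most O(d) flagged rows per scale, which is what caps the total assigned sensitivity at roughly O(d log^2 n) across scales and repetitions.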
<div xmlns="http://www.tei-c.org/ns/1.0"><head>3) Subspace Embeddings and Active Regression for the Huber Loss:</head><p>As discussed previously, we tackle the question of leveraging the theory of <ref type="bibr">[7]</ref> nets in order to obtain sample complexities for the Huber loss beyond d^2. Our algorithmic framework for active regression is based on the earlier idea of partitioning the entries of b by sensitivity and then applying sensitivity sampling, so we focus on the problem of preserving the Huber norm using an improved sensitivity sampling technique. Note that unlike the ℓ_p losses, the Huber loss is not scale-invariant. Furthermore, perhaps the largest obstacle to designing row sampling algorithms for the Huber loss going beyond standard ε-net arguments is that there is no analogue of the chaining constructions of <ref type="bibr">[7]</ref>, <ref type="bibr">[48]</ref> for the Huber loss. This can also be attributed to the fact that the Huber loss is not scale-invariant, which precludes an isometric change-of-density type theorem for the Huber loss as done in <ref type="bibr">[36]</ref>, <ref type="bibr">[48]</ref>. We show how to overcome these obstacles in the following discussion.</p><p>a) A Sharp Huber Inequality.: Our algorithmic framework follows the Huber algorithm of <ref type="bibr">[15]</ref>, a recursive sampling algorithm that reduces the number of rows from n to roughly n^{1/2} d^2 in each recursive application. To show this result, <ref type="bibr">[15]</ref> first show in their Lemma 2.1 that the Huber norm is within a factor of O(n^{1/2}) of the smaller of the ℓ_1 and ℓ_2 norms: </p><p>It can be shown that the above lemma implies that the Huber sensitivities are within a factor of O(n^{1/2}) of the sum of the ℓ_1 and ℓ_2 sensitivities. This motivates the idea of sampling the rows of A with probability proportional to the sum of the ℓ_1 and ℓ_2 Lewis weights, oversampled by a factor of O(n^{1/2}). 
This is indeed how <ref type="bibr">[15]</ref> proceeds.</p><p>The recursion n → n^{1/2} d^2 solves to a final row count of around d^4, which is quadratically worse than the d^2 of our general loss function result using our new sensitivity upper bounds and our general framework. To improve this further, note that two improvements can be made to the above argument. First, by using the Huber inequality in a different way, we can use it in conjunction with the <ref type="bibr">[7]</ref> net bounds, which reduces the row count in one recursive application to roughly n^{1/2} d rather than n^{1/2} d^2. This reduces the overall row count to d^2 after solving the recursion, but still does not beat our general-purpose sensitivity sampling algorithm, despite the use of the <ref type="bibr">[7]</ref> nets. The second improvement is that the Huber inequality as proved in <ref type="bibr">[15]</ref> is in fact loose by a polynomial factor in n, and can be improved to the following: Lemma I.8 (Huber Inequality version 2). Let y ∈ R^n. Then,</p><p>This lemma is tight up to constant factors (consider the vector with one coordinate equal to n^{1/3} and n − 1 coordinates equal to n^{−1/3}), and gives a recursion of roughly n → n^{1/3} d, giving</p><p>rows, which shaves a factor of approximately √d over the naïve Bernstein bound over a net.</p><p>b) Storing Large Huber Sensitivities.: In order to further improve upon this bound, we crucially make use of our improved sensitivity bounds and a generalized version of the above Huber inequality, parameterized by an upper bound on the size of the entries of y. Lemma I.9 (Huber Inequality version 3). Let y ∈ R^n and let</p><p>Then, for some constant c &gt; 0, at least one of the following bounds holds:</p><p>By directly including the rows of A with Huber sensitivity at least γ, we exactly preserve the Huber norm on those rows, i.e., on [n] \ T, for every y. 
On the remaining coordinates inside T, we then have an improved Huber inequality, which implies an improved sampling bound. By balancing the number of rows which we directly include, which is roughly d/γ, against the sampling bound inside T, which is roughly (1/γ + (γn)^β)d, we arrive at the following (Lemma I.10): </p><p>for some c &gt; 0 and β = 3 − 2√2 ≈ 0.17157, at least one of the following bounds holds:</p><p>In fact, we prove a generalized bound for the ℓ_2-ℓ_q loss for any q ∈ (0, 2) in Lemma I.11. The interval p ∈ [1, 2] can be discretized in increments of 1/log n, so with O(log n) applications of <ref type="bibr">[7]</ref> nets, we can always find a p within an additive 1/log n of the optimal p for every net vector y, which only affects Lemma I.10 by constant factors whenever y has entries bounded by poly(n). By proceeding as previously discussed, we arrive at our final bound of</p><p>d) Extensions to the ℓ_2-ℓ_q Loss.: We generalize our results to the ℓ_2-ℓ_q loss for q ∈ (0, 2). As q ranges from 0 to 1 to 2, the ℓ_2-ℓ_q loss interpolates between the Tukey, Huber, and ℓ_2 losses up to constant factors, and provides a natural generalization of these loss functions.</p><p>Lemma I.11 (ℓ_2-ℓ_q Inequality). Let q ∈ (0, 2) and define</p><p>Then, for some constant c &gt; 0, at least one of the following bounds holds:</p><p>where β = (1/(2/q − 1)) · (2/q + 1)^{−2/q}.</p><p>For q ∈ [1, 2), the 1/γ^{2/q−1} distortion in Equation (<ref type="formula">2</ref>) is smaller than the 1/γ factor incurred from keeping M-sensitivities at least γ, so we can balance the parameters as 1/γ = (γn)^β as before, which leads to a recursion that gives us a bound of n = d^{1+β} poly((log n)/ε). 
For q ∈ (0, 1), the 1/γ^{2/q−1} distortion in Equation (<ref type="formula">2</ref>) is worse than 1/γ, which means we must instead balance 1/γ^{2/q−1} = (γn)^β, or γ = n^{−β/(2/q−1+β)}, which gives a worse bound of the form d^c poly((log n)/ε) for an exponent c &gt; 1 + β depending on q.</p><p>This is better than a d^2 bound as long as q ≥ (√5 − 1)^2/8 ≈ 0.19098.</p><p>Fig. <ref type="figure">1</ref>: Dependence on d for the active regression sample complexity for the ℓ_2-ℓ_q loss. Similar bounds apply to subspace embeddings as well.</p></div>
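The tightness example from the Huber discussion above (one coordinate equal to n^{1/3}, the remaining n − 1 coordinates equal to n^{−1/3}) can be checked numerically. The sketch below is our illustration and assumes a particular normalization: the Huber norm as a coordinate-wise sum of standard Huber losses with threshold 1, compared against min(‖y‖_1, ‖y‖_2^2); the paper's constants and exact statement may differ.

```python
import numpy as np

def huber_norm(y, tau=1.0):
    # Coordinate-wise sum of standard Huber losses with threshold tau
    # (an assumed normalization; constants in the paper may differ).
    a = np.abs(y)
    return np.sum(np.where(a <= tau, a * a / (2 * tau), a - tau / 2))

rescaled = []
for n in [10**3, 10**4, 10**5, 10**6]:
    y = np.full(n, n ** (-1.0 / 3))
    y[0] = n ** (1.0 / 3)
    # Distortion between min(||y||_1, ||y||_2^2) and the Huber norm,
    # rescaled by the conjectured n^(1/3) rate.
    distortion = min(np.sum(np.abs(y)), np.sum(y * y)) / huber_norm(y)
    rescaled.append(distortion / n ** (1.0 / 3))
```

The rescaled values stay within a small constant range while n grows by three orders of magnitude, consistent with the claim that the n^{1/3} factor in the sharpened inequality cannot be improved on this family.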
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Conclusions and Future Directions</head><p>In this work, we study the sample complexity of active linear regression for both the ℓ_p norm and general M-estimator losses.</p><p>For the ℓ_p norm, we provide optimal algorithms and lower bounds for p ∈ (0, 2), with Θ(d/ε^2) samples for p ∈ (0, 1) and Θ(d/ε) samples for p ∈ (1, 2). For p &gt; 2, we provide an upper bound of Õ(d^{p/2}/ε^p), which is optimal in the d dependence and off by at most a single factor of 1/ε in the ε dependence, up to polylogarithmic factors. Our algorithms provide the first nontrivial bounds, i.e., sample complexity less than n, for p ∈ (0, 1) ∪ (2, ∞), while for p ∈ (1, 2), we significantly improve upon the Õ(d^2/ε^2) upper bound of <ref type="bibr">[10]</ref> and answer their main open question. We obtain these results via a two-stage algorithm and a novel sensitivity partitioning technique for every p, as well as an iterative improvement argument via strong convexity and Lewis bases to improve the ε dependence for p ∈ (1, 2). Our result is the first to achieve a linear dependence on 1/ε for dimension reduction for ℓ_p regression for p ∈ (1, 2).</p><p>Next, we obtain a new sensitivity bounding algorithm that achieves optimal total sensitivity for M-estimators of at most polynomial growth, runs in input sparsity time, and avoids the use of Löwner-John ellipsoids. This answers an open question of <ref type="bibr">[52]</ref> and makes significant progress in the general direction of matrix approximation beyond ℓ_p losses. 
By combining this with our new active regression techniques, we obtain active regression algorithms for general M-estimator losses, including the Tukey and Huber losses, which answers an open question of <ref type="bibr">[10]</ref>.</p><p>For the important special case of the Huber loss, we introduce new techniques which bound Huber sensitivities by sums of ℓ_p Lewis weights, which allows us to take advantage of chaining arguments for ℓ_p in order to obtain an active regression algorithm making at most O(d^{4−2√2} poly(log n, ε^{−1})) queries. Our techniques also give subspace embeddings with the same number of rows. This is the first dimension reduction result for losses other than ℓ_p that approximates a d-dimensional subspace with fewer than d^2 dimensions. It improves over a previous bound of d^4 for the Huber loss, which held only for subspace embeddings, and not active regression, in <ref type="bibr">[15]</ref>.</p><p>Finally, our results and techniques have applications to a wide variety of related problems. Our lower bounds for active regression give improved lower bounds for the sublinear power means problem <ref type="bibr">[19]</ref>; our new sensitivity bounding techniques sharpen and generalize previous results on Orlicz norm subspace embeddings <ref type="bibr">[51]</ref> and robust subspace approximation <ref type="bibr">[14]</ref>; and our techniques for dimension reduction for the Huber loss give improved bounds for sparsification of γ_p functions, with applications to fast algorithms for ℓ_p regression <ref type="bibr">[1]</ref>, <ref type="bibr">[28]</ref>. We believe that our techniques will be applicable further, and hope to see more uses in future work.</p><p>We conclude with questions that are still left open by our work. Perhaps the most pressing is to resolve the query complexity of active ℓ_p regression for p &gt; 2: our upper bound is Õ(d^{p/2}/ε^p), while the lower bound is Ω(d^{p/2} + ε^{1−p}). 
Closing this gap would be interesting. Our bounds are also loose by a factor of log(1/δ) for all p &gt; 0, while we can get an optimal dependence on δ if we assume knowledge of the optimal value and sacrifice a factor of ε. A natural question is whether one can achieve a simultaneously optimal dependence on d, ε, and δ, up to logarithmic factors, and without assumptions. Another gap to close is the query complexity of Huber regression, or more generally M-estimator regression, even for just the d dependence: our upper bound is Õ(d^{4−2√2} poly log n) for constant ε, while only a trivial lower bound of Ω(d) is known.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_0"><p>All query complexity bounds in this section are stated for solving Problem I.1 with high constant probability, e.g., probability 99/100. In later sections we include an explicit dependence on a failure probability δ.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_1"><p>Throughout, &#213; is used to suppress polylogarithmic factors in the argument.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_2"><p>Note that for p &#8712; (0, 1), &#8226; p is not a norm, but we refer to it as a norm by a standard abuse of notation.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_2"><p>Note that for p ∈ (0, 1), ‖·‖_p is not a norm, but we refer to it as a norm by a standard abuse of notation.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_4"><p>For a, b ≥ 0, a ± b denotes a number c such that a − b ≤ c ≤ a + b.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_6"><p>Again, this is a standard abuse of notation, and the Huber norm is not an actual norm.</p></note>
		</body>
		</text>
</TEI>
