We study the problem of robust multivariate polynomial regression: let p\colon\mathbb{R}^n\to\mathbb{R} be an unknown n-variate polynomial of degree at most d in each variable. We are given as input a set of random samples (\mathbf{x}_i,y_i) \in [-1,1]^n \times \mathbb{R} that are noisy versions of (\mathbf{x}_i,p(\mathbf{x}_i)). More precisely, each \mathbf{x}_i is sampled independently from some distribution \chi on [-1,1]^n, and for each i independently, y_i is arbitrary (i.e., an outlier) with probability at most \rho < 1/2, and otherwise satisfies |y_i-p(\mathbf{x}_i)|\leq\sigma. The goal is to output a polynomial \hat{p}, of degree at most d in each variable, within an \ell_\infty-distance of at most O(\sigma) from p. Kane, Karmalkar, and Price [FOCS'17] solved this problem for n=1. We generalize their results to the n-variate setting, showing an algorithm that achieves a sample complexity of O_n(d^n\log d), where the hidden constant depends on n, if \chi is the n-dimensional Chebyshev distribution. The sample complexity is O_n(d^{2n}\log d), if the samples are drawn from the uniform distribution instead. The approximation error is guaranteed to be at most O(\sigma), and the run-time depends on \log(1/\sigma). In the setting where each \mathbf{x}_i and y_i are known up to N bits of precision, the run-time's dependence on N is linear. We also show that our sample complexities are optimal in terms of d^n. Furthermore, we show that it is possible to have the run-time be independent of 1/\sigma, at the cost of a higher sample complexity.
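To make the setup concrete, here is a minimal, illustrative sketch (not the algorithm analyzed above): it fits a polynomial of degree at most d in each variable in the tensor-product Chebyshev basis by minimizing the ℓ1 loss over the samples, a standard convex surrogate that tolerates a constant fraction of outliers. The basis choice, the linear-programming formulation, and the use of scipy.optimize.linprog are assumptions made only for this sketch.

```python
# Illustrative l1-regression sketch for robust multivariate polynomial fitting.
# Not the paper's algorithm; basis and solver are assumptions for illustration.
import numpy as np
from itertools import product
from scipy.optimize import linprog

def chebyshev_features(X, d):
    """Tensor-product Chebyshev basis T_{k_1}(x_1)...T_{k_n}(x_n), 0 <= k_j <= d."""
    n = X.shape[1]
    # Univariate Chebyshev values T_0..T_d at each coordinate: shape (m, n, d+1).
    T = np.stack([np.polynomial.chebyshev.chebvander(X[:, j], d) for j in range(n)], axis=1)
    feats = []
    for ks in product(range(d + 1), repeat=n):
        feats.append(np.prod([T[:, j, k] for j, k in enumerate(ks)], axis=0))
    return np.column_stack(feats)            # m x (d+1)^n design matrix

def robust_poly_fit_l1(X, y, d):
    """Minimize sum_i |y_i - <c, phi(x_i)>| over coefficients c via a linear program."""
    Phi = chebyshev_features(X, d)
    m, B = Phi.shape
    # Variables: coefficients c (B entries) and slacks t (m entries), t_i >= |y_i - Phi_i c|.
    c_obj = np.concatenate([np.zeros(B), np.ones(m)])
    A_ub = np.block([[Phi, -np.eye(m)], [-Phi, -np.eye(m)]])
    b_ub = np.concatenate([y, -y])
    res = linprog(c_obj, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (B + m))
    return res.x[:B]                          # Chebyshev coefficients of the fit
```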
Memory-Sample Tradeoffs for Linear Regression with Small Error
We consider the problem of performing linear regression over a stream of d-dimensional examples, and show that any algorithm that uses a subquadratic amount of memory exhibits a slower rate of convergence than can be achieved without memory constraints. Specifically, consider a sequence of labeled examples (a_1,b_1), (a_2,b_2), \ldots, with a_i drawn independently from a d-dimensional isotropic Gaussian, and where b_i = \langle a_i, x \rangle + \eta_i, for a fixed x in R^d with ||x|| = 1 and with independent noise \eta_i drawn uniformly from the interval [-2^{-d/5},2^{-d/5}]. We show that any algorithm with at most d^2/4 bits of memory requires at least \Omega(d \log \log \frac{1}{\epsilon}) samples to approximate x to \ell_2 error \epsilon with probability of success at least 2/3, for \epsilon sufficiently small as a function of d. In contrast, for such \epsilon, x can be recovered to error \epsilon with probability 1-o(1) with memory O\left(d^2 \log(1/\epsilon)\right) using d examples. This represents the first nontrivial lower bound for regression with super-linear memory, and may open the door to strong memory/sample tradeoffs for continuous optimization.
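As a sanity check on the memory-rich baseline in the last two sentences, the sketch below simply stores d examples and solves the resulting d-by-d linear system, which recovers x up to the small label noise. The constants and the choice of solver are assumptions for illustration, not details from the paper.

```python
# Memory-unconstrained baseline: store d examples and solve A x_hat = b.
import numpy as np

def memory_rich_recovery(A, b):
    """Recover x from d stored examples (rows of A) and labels b, using Theta(d^2) memory."""
    return np.linalg.solve(A, b)

d = 20
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                                 # ||x|| = 1
A = rng.standard_normal((d, d))                        # a_i drawn from an isotropic Gaussian
eta = rng.uniform(-2.0 ** (-d / 5), 2.0 ** (-d / 5), d)  # small uniform label noise
b = A @ x + eta                                        # b_i = <a_i, x> + eta_i
x_hat = memory_rich_recovery(A, b)
print(np.linalg.norm(x_hat - x))                       # small l2 error
```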
- PAR ID: 10098922
- Date Published:
- Journal Name: Symposium on Theory of Computing (STOC)
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
A $$(1\pm\epsilon)$$-sparsifier of a hypergraph $$G(V, E)$$ is a (weighted) subgraph that preserves the value of every cut to within a $$(1\pm\epsilon)$$-factor. It is known that every hypergraph with $$n$$ vertices admits a $$(1 \pm \epsilon)$$-sparsifier with $$\tilde{O}(n/\epsilon^{2})$$ hyperedges. In this work, we explore the task of building such a sparsifier by using only linear measurements (a linear sketch) over the hyperedges of $$G$$, and provide nearly-matching upper and lower bounds for this task. Specifically, we show that there is a randomized linear sketch of size $$\tilde{O}(nr\log(m)/\epsilon^{2})$$ bits which with high probability contains sufficient information to recover a $$(1\pm\epsilon)$$ cut-sparsifier with $$\tilde{O}(n/\epsilon^{2})$$ hyperedges for any hypergraph with at most $$m$$ edges, each of which has arity bounded by $$r$$. This immediately gives a dynamic streaming algorithm for hypergraph cut sparsification with an identical space complexity, improving on the previous best known bound of $$\tilde{O}(nr^{2}\log^{4}(m)/\epsilon^{2})$$ bits of space (Guha, McGregor, and Tench, PODS 2015). We complement our algorithmic result above with a nearly-matching lower bound. We show that for every $$\epsilon\in(0,1)$$, one needs $$\Omega(nr\log(m/n)/\log(n))$$ bits to construct a $$(1\pm\epsilon)$$-sparsifier via linear sketching, thus showing that our linear sketch achieves an optimal dependence on both $$r$$ and $$\log(m)$$. The starting point for our improved algorithm is importance sampling of hyperedges based on the new notion of $$k$$-cut strength introduced in the recent work of Quanrud (SODA 2024). The natural algorithm based on this concept leads to $$\log m$$ levels of sampling where errors can potentially accumulate, and this accounts for the $$\mathrm{polylog}(m)$$ losses in the sketch size of the natural algorithm. We develop a more intricate analysis of the accumulation of error to show that most levels do not contribute to the error and the actual loss is only $$\mathrm{polylog}(n)$$. Combined with careful preprocessing (and analysis), this enables us to get rid of all extraneous $$\log m$$ factors in the sketch size, but the quadratic dependence on $$r$$ remains. This dependence originates from the use of correlated $$\ell_{0}$$-samplers to recover a large number of low-strength edges in a hypergraph simultaneously by looking at neighborhoods of individual vertices. In graphs, this leads to the discovery of $$\Omega(n)$$ edges in a single shot, whereas in hypergraphs it may potentially only reveal $$O(n/r)$$ new edges, thus requiring $$\Omega(r)$$ rounds of recovery. To remedy this, we introduce a new technique of random fingerprinting of hyperedges, which effectively eliminates the correlations created by large-arity hyperedges and leads to a scheme for recovering hyperedges of low strength with an optimal dependence on $$r$$. Putting all these ingredients together yields our linear sketching algorithm. Our lower bound is established by a reduction from the universal relation problem in the one-way communication setting.
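For readers unfamiliar with the object being compressed, the following small helpers spell out the $$(1\pm\epsilon)$$ cut-sparsifier guarantee for weighted hypergraphs. The list-of-(hyperedge, weight) representation and the helper names are assumptions made for this sketch; they are not part of the linear-sketching algorithm above.

```python
# Illustrative definition check for hypergraph cut sparsifiers.
def cut_value(hyperedges, S):
    """Total weight of hyperedges with at least one endpoint in S and one outside S."""
    S = set(S)
    total = 0.0
    for edge, w in hyperedges:
        if any(v in S for v in edge) and any(v not in S for v in edge):
            total += w
    return total

def is_cut_sparsifier(G, H, cuts, eps):
    """Check that H preserves each given cut of G to within a (1 +/- eps) factor."""
    return all(
        (1 - eps) * cut_value(G, S) <= cut_value(H, S) <= (1 + eps) * cut_value(G, S)
        for S in cuts
    )

# Example: a 4-vertex hypergraph and a reweighted candidate checked on two cuts.
G = [((1, 2, 3), 1.0), ((2, 3, 4), 1.0), ((1, 4), 1.0)]
H = [((1, 2, 3), 2.0), ((1, 4), 1.0)]
print(is_cut_sparsifier(G, H, cuts=[{1}, {1, 2}], eps=0.5))  # True
```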
Abstract We present a new elementary algorithm that takes time $$O_\epsilon\left( x^{3/5} (\log x)^{8/5+\epsilon} \right)$$ and space $$O\left( x^{3/10} (\log x)^{13/10} \right)$$ (measured bitwise) for computing $$M(x) = \sum_{n \le x} \mu(n),$$ where $$\mu(n)$$ is the Möbius function. This is the first improvement in the exponent of $$x$$ for an elementary algorithm since 1985. We also show that it is possible to reduce space consumption to $$O(x^{1/5} (\log x)^{5/3})$$ by the use of (Helfgott in: Math Comput 89:333–350, 2020), at the cost of letting time rise to the order of $$x^{3/5} (\log x)^2 \log \log x$$.
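For contrast with the elementary algorithm above, here is a naive baseline (a simple sketch, not the paper's method) that computes $$M(x)$$ with a linear sieve of the Möbius function, using $$O(x)$$ time and space:

```python
# Baseline: Mertens function M(x) = sum_{n <= x} mu(n) via a linear Moebius sieve.
def mertens(x: int) -> int:
    mu = [1] * (x + 1)
    is_comp = [False] * (x + 1)
    primes = []
    for i in range(2, x + 1):
        if not is_comp[i]:
            primes.append(i)
            mu[i] = -1
        for p in primes:
            if i * p > x:
                break
            is_comp[i * p] = True
            if i % p == 0:
                mu[i * p] = 0      # p^2 divides i*p, so mu vanishes
                break
            mu[i * p] = -mu[i]     # one extra distinct prime factor flips the sign
    return sum(mu[1:x + 1])

print(mertens(100))  # M(100) = 1
```

Already for moderate $$x$$, the $$\Theta(x)$$ space of this sieve is the binding constraint, which is precisely what the $$x^{3/10}$$ (and $$x^{1/5}$$) space bounds above improve upon.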
Abstract We explore why many recently proposed robust estimation problems are efficiently solvable, even though the underlying optimization problems are non-convex. We study the loss landscape of these robust estimation problems, and identify the existence of 'generalized quasi-gradients'. Whenever these quasi-gradients exist, a large family of no-regret algorithms is guaranteed to approximate the global minimum; this includes the commonly used filtering algorithm. For robust mean estimation of distributions under bounded covariance, we show that any first-order stationary point of the associated optimization problem is an approximate global minimum if and only if the corruption level $$\epsilon < 1/3$$. Consequently, any optimization algorithm that approaches a stationary point yields an efficient robust estimator with breakdown point $$1/3$$. With carefully designed initialization and step size, we improve this to $$1/2$$, which is optimal. For other tasks, including linear regression and joint mean and covariance estimation, the loss landscape is more rugged: there are stationary points arbitrarily far from the global minimum. Nevertheless, we show that generalized quasi-gradients exist and construct efficient algorithms. These algorithms are simpler than previous ones in the literature, and for linear regression we improve the estimation error from $$O(\sqrt{\epsilon})$$ to the optimal rate of $$O(\epsilon)$$ for small $$\epsilon$$, assuming certified hypercontractivity. For mean estimation with near-identity covariance, we show that a simple gradient descent algorithm achieves breakdown point $$1/3$$ and iteration complexity $$\tilde{O}(d/\epsilon^2)$$.
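As a concrete instance of the filtering approach mentioned above, the sketch below implements a simplified soft-filtering robust mean estimator for near-identity covariance: it repeatedly downweights points with large scores along the top eigenvector of the weighted empirical covariance. The stopping threshold, the soft downweighting rule, and all constants are assumptions for illustration, not the paper's exact algorithm or guarantees.

```python
# Simplified soft-filtering sketch for robust mean estimation under near-identity covariance.
import numpy as np

def filter_mean(X, eps, max_iter=50):
    n, d = X.shape
    w = np.ones(n) / n                                 # weights on the samples
    for _ in range(max_iter):
        mu = w @ X / w.sum()                           # current weighted mean
        centered = X - mu
        cov = (centered.T * w) @ centered / w.sum()    # weighted empirical covariance
        eigvals, eigvecs = np.linalg.eigh(cov)
        lam, v = eigvals[-1], eigvecs[:, -1]           # top eigenpair
        if lam <= 1 + 10 * eps:                        # covariance close to identity: stop
            break
        scores = (centered @ v) ** 2                   # outlier scores along v
        w = w * (1 - scores / scores.max())            # soft filter: downweight large scores
    return w @ X / w.sum()
```

The abstract's point is that procedures of this form can be analyzed through generalized quasi-gradients rather than through convexity of the underlying objective.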