Title: Outlier Robust Multivariate Polynomial Regression
We study the problem of robust multivariate polynomial regression: let p: ℝⁿ → ℝ be an unknown n-variate polynomial of degree at most d in each variable. We are given as input a set of random samples (𝐱_i,y_i) ∈ [-1,1]ⁿ × ℝ that are noisy versions of (𝐱_i,p(𝐱_i)). More precisely, each 𝐱_i is sampled independently from some distribution χ on [-1,1]ⁿ, and for each i independently, y_i is arbitrary (i.e., an outlier) with probability at most ρ < 1/2, and otherwise satisfies |y_i-p(𝐱_i)| ≤ σ. The goal is to output a polynomial p̂, of degree at most d in each variable, within an 𝓁_∞-distance of at most O(σ) from p. Kane, Karmalkar, and Price [FOCS'17] solved this problem for n = 1. We generalize their results to the n-variate setting, showing an algorithm that achieves a sample complexity of O_n(dⁿ log d), where the hidden constant depends on n, if χ is the n-dimensional Chebyshev distribution. The sample complexity is O_n(d^{2n} log d) if the samples are drawn from the uniform distribution instead. The approximation error is guaranteed to be at most O(σ), and the run-time depends on log(1/σ). In the setting where each 𝐱_i and y_i are known up to N bits of precision, the run-time's dependence on N is linear. We also show that our sample complexities are optimal in terms of dⁿ. Furthermore, we show that it is possible to have the run-time be independent of 1/σ, at the cost of a higher sample complexity.
Award ID(s):
2022446
PAR ID:
10633993
Author(s) / Creator(s):
Editor(s):
Chan, Timothy; Fischer, Johannes; Iacono, John; Herman, Grzegorz
Publisher / Repository:
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Date Published:
Volume:
308
ISSN:
1868-8969
ISBN:
978-3-95977-338-6
Page Range / eLocation ID:
12:1-12:17
Subject(s) / Keyword(s):
Robust Statistics; Polynomial Regression; Sample Efficient Learning; Theory of computation → Continuous optimization
Format(s):
Medium: X; Other: application/pdf
Size(s):
17 pages; 1098877 bytes
Right(s):
Creative Commons Attribution 4.0 International license; info:eu-repo/semantics/openAccess
Sponsoring Org:
National Science Foundation
More Like this
  1. Aichholzer, Oswin; Wang, Haitao (Ed.)
    We present new results on 2- and 3-hop spanners for geometric intersection graphs. These include improved upper and lower bounds for 2- and 3-hop spanners for many geometric intersection graphs in ℝ^d. For example, we show that the intersection graph of n balls in ℝ^d admits a 2-hop spanner of size O^*(n^{3/2 - 1/(2(2⌊d/2⌋ + 1))}), and the intersection graph of n fat axis-parallel boxes in ℝ^d admits a 2-hop spanner of size O(n log^{d+1} n). Furthermore, we show that the intersection graph of general semi-algebraic objects in ℝ^d admits a 3-hop spanner of size O^*(n^{3/2 - 1/(2(2D-1))}), where D is a parameter associated with the description complexity of the objects. For such families (or more specifically, for tetrahedra in ℝ³), we provide a lower bound of Ω(n^{4/3}). For 3-hop spanners of axis-parallel boxes in ℝ^d, we provide an upper bound of O(n log^{d-1} n) and a lower bound of Ω(n (log n / log log n)^{d-2}).
  2. Multivariate multipoint evaluation is the problem of evaluating a multivariate polynomial, given as a coefficient vector, simultaneously at multiple evaluation points. In this work, we show that there exists a deterministic algorithm for multivariate multipoint evaluation over any finite field F that outputs the evaluations of an m-variate polynomial of degree less than d in each variable at N points in time (d^m + N)^{1+o(1)} · poly(m, d, log |F|) for all m ∈ ℕ and all sufficiently large d ∈ ℕ. A previous work of Kedlaya and Umans (FOCS 2008, SICOMP 2011) achieved the same time complexity when the number of variables m is at most d^{o(1)} and had left the problem of removing this condition as an open problem. A recent work of Bhargava, Ghosh, Kumar and Mohapatra (STOC 2022) answered this question when the underlying field is not too large and has characteristic less than d^{o(1)}. In this work, we remove this constraint on the number of variables over all finite fields, thereby answering the question of Kedlaya and Umans over all finite fields. Our algorithm relies on a non-trivial combination of ideas from three seemingly different previously known algorithms for multivariate multipoint evaluation, namely the algorithms of Kedlaya and Umans, that of Björklund, Kaski and Williams (IPEC 2017, Algorithmica 2019), and that of Bhargava, Ghosh, Kumar and Mohapatra, together with a result of Bombieri and Vinogradov from analytic number theory about the distribution of primes in an arithmetic progression. We also present a second algorithm for multivariate multipoint evaluation that is completely elementary and, in particular, avoids the use of the Bombieri–Vinogradov Theorem. However, it requires a mild assumption that the field size is bounded by an exponential tower in d of bounded height.
  3. Cormode, Graham; Shekelyan, Michael (Ed.)
    We are given a set 𝒵 = {(R_1,s_1), …, (R_n,s_n)}, where each R_i is a range in ℝ^d, such as a rectangle or a ball, and s_i ∈ [0,1] denotes its selectivity. The goal is to compute a small-size discrete data distribution 𝒟 = {(q₁,w₁),…, (q_m,w_m)}, where q_j ∈ ℝ^d and w_j ∈ [0,1] for each 1 ≤ j ≤ m, and ∑_{1≤j≤m} w_j = 1, such that 𝒟 is the most consistent with 𝒵, i.e., err_p(𝒟,𝒵) = 1/n ∑_{i=1}ⁿ |s_i - ∑_{j=1}^m w_j⋅1(q_j ∈ R_i)|^p is minimized. In a database setting, 𝒵 corresponds to a workload of range queries over some table, together with their observed selectivities (i.e., the fraction of tuples returned), and 𝒟 can be used as a compact model for approximating the data distribution within the table without accessing the underlying contents. In this paper, we obtain both upper and lower bounds for this problem. In particular, we show that the problem of finding the best data distribution from selectivity queries is NP-complete. On the positive side, we describe a Monte Carlo algorithm that constructs, in time O((n+δ^{-d}) δ^{-2} polylog n), a discrete distribution 𝒟̃ of size O(δ^{-2}) such that err_p(𝒟̃,𝒵) ≤ min_𝒟 err_p(𝒟,𝒵)+δ (for p = 1,2,∞), where the minimum is taken over all discrete distributions. We also establish conditional lower bounds, which strongly indicate the infeasibility of relative approximations as well as the removal of the exponential dependency on the dimension for additive approximations. This suggests that significant improvements to our algorithm are unlikely.
  4. We provide an efficient algorithm for the classical problem, going back to Galton, Pearson, and Fisher, of estimating, with arbitrary accuracy, the parameters of a multivariate normal distribution from truncated samples. Truncation of samples from a d-variate normal N(μ,Σ) means a sample is revealed only if it falls in some subset S ⊆ ℝ^d; otherwise the sample is hidden, and the number of hidden samples in proportion to the revealed ones is also hidden. We show that the mean μ and covariance matrix Σ can be estimated with arbitrary accuracy in polynomial time, as long as we have oracle access to S and S has non-trivial measure under the unknown d-variate normal distribution. Additionally, we show that without oracle access to S, any non-trivial estimation is impossible.