skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Computing Data Distribution from Query Selectivities
We are given a set 𝒵 = {(R_1,s_1), …, (R_n,s_n)}, where each R_i is a range in ℝ^d, such as rectangle or ball, and s_i ∈ [0,1] denotes its selectivity. The goal is to compute a small-size discrete data distribution 𝒟 = {(q₁,w₁),…, (q_m,w_m)}, where q_j ∈ ℝ^d and w_j ∈ [0,1] for each 1 ≤ j ≤ m, and ∑_{1≤j≤m} w_j = 1, such that 𝒟 is the most consistent with 𝒵, i.e., err_p(𝒟,𝒵) = 1/n ∑_{i = 1}ⁿ |s_i - ∑_{j=1}^m w_j⋅1(q_j ∈ R_i)|^p is minimized. In a database setting, 𝒵 corresponds to a workload of range queries over some table, together with their observed selectivities (i.e., fraction of tuples returned), and 𝒟 can be used as compact model for approximating the data distribution within the table without accessing the underlying contents. In this paper, we obtain both upper and lower bounds for this problem. In particular, we show that the problem of finding the best data distribution from selectivity queries is NP-complete. On the positive side, we describe a Monte Carlo algorithm that constructs, in time O((n+δ^{-d}) δ^{-2} polylog n), a discrete distribution 𝒟̃ of size O(δ^{-2}), such that err_p(𝒟̃,𝒵) ≤ min_𝒟 err_p(𝒟,𝒵)+δ (for p = 1,2,∞) where the minimum is taken over all discrete distributions. We also establish conditional lower bounds, which strongly indicate the infeasibility of relative approximations as well as removal of the exponential dependency on the dimension for additive approximations. This suggests that significant improvements to our algorithm are unlikely.  more » « less
Award ID(s):
2402823
PAR ID:
10616164
Author(s) / Creator(s):
; ; ;
Editor(s):
Cormode, Graham; Shekelyan, Michael
Publisher / Repository:
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Date Published:
Volume:
290
ISSN:
1868-8969
ISBN:
978-3-95977-312-6
Page Range / eLocation ID:
18:1-18:20
Subject(s) / Keyword(s):
selectivity queries discrete distributions Multiplicative Weights Update eps-approximation learnable functions depth problem arrangement Theory of computation → Computational geometry
Format(s):
Medium: X Size: 20 pages; 929722 bytes Other: application/pdf
Size(s):
20 pages 929722 bytes
Right(s):
Creative Commons Attribution 4.0 International license; info:eu-repo/semantics/openAccess
Sponsoring Org:
National Science Foundation
More Like this
  1. Mulzer, Wolfgang; Phillips, Jeff M (Ed.)
    Let P be a set of m points in ℝ², let Σ be a set of n semi-algebraic sets of constant complexity in ℝ², let (S,+) be a semigroup, and let w: P → S be a weight function on the points of P. We describe a randomized algorithm for computing w(P∩σ) for every σ ∈ Σ in overall expected time O^*(m^{2s/(5s-4)}n^{(5s-6)/(5s-4)} + m^{2/3}n^{2/3} + m + n), where s > 0 is a constant that bounds the maximum complexity of the regions of Σ, and where the O^*(⋅) notation hides subpolynomial factors. For s ≥ 3, surprisingly, this bound is smaller than the best-known bound for answering m such queries in an on-line manner. The latter takes O^*(m^{s/(2s-1)}n^{(2s-2)/(2s-1)} + m + n) time. Let Φ: Σ × P → {0,1} be the Boolean predicate (of constant complexity) such that Φ(σ,p) = 1 if p ∈ σ and 0 otherwise, and let Σ_Φ P = {(σ,p) ∈ Σ× P ∣ Φ(σ,p) = 1}. Our algorithm actually computes a partition ℬ_Φ of Σ_Φ P into bipartite cliques (bicliques) of size (i.e., sum of the sizes of the vertex sets of its bicliques) O^*(m^{2s/(5s-4)}n^{(5s-6)/(5s-4)} + m^{2/3}n^{2/3} + m + n). It is straightforward to compute w(P∩σ) for all σ ∈ Σ from ℬ_Φ. Similarly, if η: Σ → S is a weight function on the regions of Σ, ∑_{σ ∈ Σ: p ∈ σ} η(σ), for every point p ∈ P, can be computed from ℬ_Φ in a straightforward manner. We also mention a few other applications of computing ℬ_Φ. 
    more » « less
  2. Given a metric space ℳ = (X,δ), a weighted graph G over X is a metric t-spanner of ℳ if for every u,v ∈ X, δ(u,v) ≤ δ_G(u,v) ≤ t⋅ δ(u,v), where δ_G is the shortest path metric in G. In this paper, we construct spanners for finite sets in metric spaces in the online setting. Here, we are given a sequence of points (s₁, …, s_n), where the points are presented one at a time (i.e., after i steps, we have seen S_i = {s₁, … , s_i}). The algorithm is allowed to add edges to the spanner when a new point arrives, however, it is not allowed to remove any edge from the spanner. The goal is to maintain a t-spanner G_i for S_i for all i, while minimizing the number of edges, and their total weight. Under the L₂-norm in ℝ^d for arbitrary constant d ∈ ℕ, we present an online (1+ε)-spanner algorithm with competitive ratio O_d(ε^{-d} log n), improving the previous bound of O_d(ε^{-(d+1)}log n). Moreover, the spanner maintained by the algorithm has O_d(ε^{1-d}log ε^{-1})⋅ n edges, almost matching the (offline) optimal bound of O_d(ε^{1-d})⋅ n. In the plane, a tighter analysis of the same algorithm provides an almost quadratic improvement of the competitive ratio to O(ε^{-3/2}logε^{-1}log n), by comparing the online spanner with an instance-optimal spanner directly, bypassing the comparison to an MST (i.e., lightness). As a counterpart, we design a sequence of points that yields a Ω_d(ε^{-d}) lower bound for the competitive ratio for online (1+ε)-spanner algorithms in ℝ^d under the L₁-norm. Then we turn our attention to online spanners in general metrics. Note that, it is not possible to obtain a spanner with stretch less than 3 with a subquadratic number of edges, even in the offline setting, for general metrics. We analyze an online version of the celebrated greedy spanner algorithm, dubbed ordered greedy. With stretch factor t = (2k-1)(1+ε) for k ≥ 2 and ε ∈ (0,1), we show that it maintains a spanner with O(ε^{-1}logε^{-1})⋅ n^{1+1/k} edges and O(ε^{-1}n^{1/k}log² n) lightness for a sequence of n points in a metric space. We show that these bounds cannot be significantly improved, by introducing an instance that achieves an Ω(1/k⋅ n^{1/k}) competitive ratio on both sparsity and lightness. Furthermore, we establish the trade-off among stretch, number of edges and lightness for points in ultrametrics, showing that one can maintain a (2+ε)-spanner for ultrametrics with O(ε^{-1}logε^{-1})⋅ n edges and O(ε^{-2}) lightness. 
    more » « less
  3. Given a metric space ℳ = (X,δ), a weighted graph G over X is a metric t-spanner of ℳ if for every u,v ∈ X, δ(u,v) ≤ δ_G(u,v) ≤ t⋅ δ(u,v), where δ_G is the shortest path metric in G. In this paper, we construct spanners for finite sets in metric spaces in the online setting. Here, we are given a sequence of points (s₁, …, s_n), where the points are presented one at a time (i.e., after i steps, we have seen S_i = {s₁, … , s_i}). The algorithm is allowed to add edges to the spanner when a new point arrives, however, it is not allowed to remove any edge from the spanner. The goal is to maintain a t-spanner G_i for S_i for all i, while minimizing the number of edges, and their total weight. Under the L₂-norm in ℝ^d for arbitrary constant d ∈ ℕ, we present an online (1+ε)-spanner algorithm with competitive ratio O_d(ε^{-d} log n), improving the previous bound of O_d(ε^{-(d+1)}log n). Moreover, the spanner maintained by the algorithm has O_d(ε^{1-d}log ε^{-1})⋅ n edges, almost matching the (offline) optimal bound of O_d(ε^{1-d})⋅ n. In the plane, a tighter analysis of the same algorithm provides an almost quadratic improvement of the competitive ratio to O(ε^{-3/2}logε^{-1}log n), by comparing the online spanner with an instance-optimal spanner directly, bypassing the comparison to an MST (i.e., lightness). As a counterpart, we design a sequence of points that yields a Ω_d(ε^{-d}) lower bound for the competitive ratio for online (1+ε)-spanner algorithms in ℝ^d under the L₁-norm. Then we turn our attention to online spanners in general metrics. Note that, it is not possible to obtain a spanner with stretch less than 3 with a subquadratic number of edges, even in the offline setting, for general metrics. We analyze an online version of the celebrated greedy spanner algorithm, dubbed ordered greedy. With stretch factor t = (2k-1)(1+ε) for k ≥ 2 and ε ∈ (0,1), we show that it maintains a spanner with O(ε^{-1}logε^{-1})⋅ n^{1+1/k} edges and O(ε^{-1}n^{1/k}log² n) lightness for a sequence of n points in a metric space. We show that these bounds cannot be significantly improved, by introducing an instance that achieves an Ω(1/k⋅ n^{1/k}) competitive ratio on both sparsity and lightness. Furthermore, we establish the trade-off among stretch, number of edges and lightness for points in ultrametrics, showing that one can maintain a (2+ε)-spanner for ultrametrics with O(ε^{-1}logε^{-1})⋅ n edges and O(ε^{-2}) lightness. 
    more » « less
  4. Morin, Pat; Oh, Eunjin (Ed.)
    Let S be a set of n points in ℝ^d, where d ≥ 2 is a constant, and let H₁,H₂,…,H_{m+1} be a sequence of vertical hyperplanes that are sorted by their first coordinates, such that exactly n/m points of S are between any two successive hyperplanes. Let |A(S,m)| be the number of different closest pairs in the {(m+1) choose 2} vertical slabs that are bounded by H_i and H_j, over all 1 ≤ i < j ≤ m+1. We prove tight bounds for the largest possible value of |A(S,m)|, over all point sets of size n, and for all values of 1 ≤ m ≤ n. As a result of these bounds, we obtain, for any constant ε > 0, a data structure of size O(n), such that for any vertical query slab Q, the closest pair in the set Q ∩ S can be reported in O(n^{1/2+ε}) time. Prior to this work, no linear space data structure with sublinear query time was known. 
    more » « less
  5. Bodlaender, Hans L (Ed.)
    Given a set P ⊂ ℝ^d of n points, with diameter Δ, and a parameter δ ∈ (0,1), it is known that there is a partition of P into sets P_1, …, P_t, each of size O(1/δ²), such that their convex hulls all intersect a common ball of radius δΔ. We prove that a random partition, with a simple alteration step, yields the desired partition, resulting in a (randomized) linear time algorithm (i.e., O(dn)). We also provide a deterministic algorithm with running time O(dn log n). Previous proofs were either existential (i.e., at least exponential time), or required much bigger sets. In addition, the algorithm and its proof of correctness are significantly simpler than previous work, and the constants are slightly better. We also include a number of applications and extensions using the same central ideas. For example, we provide a linear time algorithm for computing a "fuzzy" centerpoint, and prove a no-dimensional weak ε-net theorem with an improved constant. 
    more » « less