Computing Data Distribution from Query Selectivities

Agarwal, Pankaj K; Raychaudhury, Rahul; Sintos, Stavros; Yang, Jun

doi:10.4230/LIPIcs.ICDT.2024.18

We are given a set 𝒵 = {(R_1,s_1), …, (R_n,s_n)}, where each R_i is a range in ℝ^d, such as a rectangle or a ball, and s_i ∈ [0,1] denotes its selectivity. The goal is to compute a small-size discrete data distribution 𝒟 = {(q_1,w_1), …, (q_m,w_m)}, where q_j ∈ ℝ^d and w_j ∈ [0,1] for each 1 ≤ j ≤ m, and ∑_{1≤j≤m} w_j = 1, such that 𝒟 is the most consistent with 𝒵, i.e., err_p(𝒟,𝒵) = (1/n) ∑_{i=1}^n |s_i - ∑_{j=1}^m w_j⋅1(q_j ∈ R_i)|^p is minimized. In a database setting, 𝒵 corresponds to a workload of range queries over some table, together with their observed selectivities (i.e., the fraction of tuples returned), and 𝒟 can be used as a compact model for approximating the data distribution within the table without accessing the underlying contents. In this paper, we obtain both upper and lower bounds for this problem. In particular, we show that the problem of finding the best data distribution from selectivity queries is NP-complete. On the positive side, we describe a Monte Carlo algorithm that constructs, in time O((n+δ^{-d}) δ^{-2} polylog n), a discrete distribution 𝒟̃ of size O(δ^{-2}), such that err_p(𝒟̃,𝒵) ≤ min_𝒟 err_p(𝒟,𝒵)+δ (for p = 1,2,∞), where the minimum is taken over all discrete distributions. We also establish conditional lower bounds, which strongly indicate the infeasibility of relative approximations as well as of removing the exponential dependency on the dimension for additive approximations. This suggests that significant improvements to our algorithm are unlikely.
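As an illustration of the objective only (not of the paper's algorithm), the following minimal sketch evaluates err_p(𝒟,𝒵) for a given discrete distribution, assuming the ranges are axis-aligned boxes. The function name `err_p`, the array layout, and the toy workload are hypothetical choices made for this example; treating p = ∞ as the maximum deviation over the queries is likewise an assumed reading, since the formula above is stated for finite p.

```python
import numpy as np

def err_p(points, weights, boxes, selectivities, p=1):
    """Evaluate err_p(D, Z) for axis-aligned box ranges (illustrative sketch).

    points:        (m, d) array of support points q_j
    weights:       (m,) array of weights w_j, assumed to sum to 1
    boxes:         (n, 2, d) array; boxes[i, 0] is the lower corner of R_i,
                   boxes[i, 1] the upper corner (closed boxes, an assumption)
    selectivities: (n,) array of observed selectivities s_i in [0, 1]
    p:             1, 2, or np.inf (np.inf is read here as the max deviation)
    """
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    boxes = np.asarray(boxes, dtype=float)
    s = np.asarray(selectivities, dtype=float)

    lo = boxes[:, None, 0, :]  # shape (n, 1, d)
    hi = boxes[:, None, 1, :]  # shape (n, 1, d)
    # inside[i, j] is True exactly when q_j lies in R_i.
    inside = np.all((points[None] >= lo) & (points[None] <= hi), axis=2)

    estimated = inside @ weights      # sum_j w_j * 1(q_j in R_i), shape (n,)
    deviation = np.abs(s - estimated)
    if np.isinf(p):
        return float(deviation.max())
    return float(np.mean(deviation ** p))

# Toy workload in the unit square: three box queries with made-up selectivities.
points = [[0.25, 0.25], [0.75, 0.75]]
weights = [0.5, 0.5]
boxes = [
    [[0.0, 0.0], [0.5, 0.5]],  # contains only the first point
    [[0.0, 0.0], [1.0, 1.0]],  # contains both points
    [[0.5, 0.5], [1.0, 1.0]],  # contains only the second point
]
selectivities = [0.5, 1.0, 0.25]

e1 = err_p(points, weights, boxes, selectivities, p=1)
e2 = err_p(points, weights, boxes, selectivities, p=2)
einf = err_p(points, weights, boxes, selectivities, p=np.inf)
```

In this toy instance the first two queries are matched exactly and the third is off by 0.25, so only the third contributes to the error. Other range types such as balls would only change how the `inside` matrix is computed.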

Award ID(s): 2402823

PAR ID: 10616164

Author(s) / Creator(s): Agarwal, Pankaj K; Raychaudhury, Rahul; Sintos, Stavros; Yang, Jun

Editor(s): Cormode, Graham; Shekelyan, Michael

Publisher / Repository: Schloss Dagstuhl – Leibniz-Zentrum für Informatik

Date Published: 2024-01-01

Volume: 290

ISSN: 1868-8969

ISBN: 978-3-95977-312-6

Page Range / eLocation ID: 18:1-18:20

Subject(s) / Keyword(s): selectivity queries; discrete distributions; Multiplicative Weights Update; eps-approximation; learnable functions; depth problem; arrangement; Theory of computation → Computational geometry

Format(s): Medium: X; Size: 20 pages, 929722 bytes; Other: application/pdf

Size(s): 20 pages; 929722 bytes

Right(s): Creative Commons Attribution 4.0 International license; info:eu-repo/semantics/openAccess

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper: https://doi.org/10.4230/LIPIcs.ICDT.2024.18
