Learning-based Support Estimation in Sublinear Time

Eden, Talya; Indyk, Piotr; Narayanan, Shyam; Rubinfeld, Ronitt; Silwal, Sandeep; Wagner, Tal

Citation Details

We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems and linguistics. A line of research spanning the last decade resulted in algorithms that estimate the support up to $\pm\varepsilon n$ from a sample of size $O(\log^2(1/\varepsilon) \cdot n / \log n)$, where $n$ is the data set size. Unfortunately, this bound is known to be tight, limiting further improvements to the complexity of this problem. In this paper we consider estimation algorithms augmented with a machine-learning-based predictor that, given any element, returns an estimate of its frequency. We show that if the predictor is correct up to a constant approximation factor, then the sample complexity can be reduced significantly, to $\log(1/\varepsilon) \cdot n^{1-\Theta(1/\log(1/\varepsilon))}$. We evaluate the proposed algorithms on a collection of data sets, using the neural-network-based estimators from Hsu et al. (ICLR'19) as predictors. Our experiments demonstrate substantial (up to 3x) improvements in estimation accuracy compared to the state-of-the-art algorithm.
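To make the setting concrete, the following is a minimal Python sketch of how a frequency predictor can help a sample-based support estimate. It is not the algorithm from the paper: it uses a simple inverse-probability (Horvitz-Thompson style) weighting and assumes sampling uniformly with replacement. The function name `estimate_support` and the `predicted_freq` callable are hypothetical names introduced only for this illustration.

```python
import random


def estimate_support(sample, predicted_freq, n):
    """Inverse-probability estimate of the number of distinct elements.

    sample: elements drawn uniformly with replacement from the data set
    predicted_freq: hypothetical predictor mapping an element to its
        estimated number of occurrences in the data set
    n: data set size
    """
    s = len(sample)
    total = 0.0
    for x in set(sample):
        # Clamp the prediction to a valid count in [1, n].
        f = min(max(float(predicted_freq(x)), 1.0), float(n))
        # Probability that x shows up at least once among s draws.
        p_seen = 1.0 - (1.0 - f / n) ** s
        # Each observed element stands in for 1 / p_seen distinct elements.
        total += 1.0 / p_seen
    return total


if __name__ == "__main__":
    random.seed(0)
    # Synthetic data set (an assumption for the demo): element i occurs i + 1 times.
    true_counts = {i: i + 1 for i in range(200)}
    data = [x for x, c in true_counts.items() for _ in range(c)]
    sample = random.choices(data, k=len(data) // 10)
    est = estimate_support(sample, lambda x: true_counts[x], len(data))
    print(f"true support: {len(true_counts)}, estimate: {est:.1f}")
```

With an exact predictor this weighting is unbiased for the support size; with a predictor that is only correct up to a constant factor, the weights can be off, which is the regime the paper's algorithms are designed to handle.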

Award ID(s): 2006664; 2022448; 1740751

NSF-PAR ID: 10275616

Author(s) / Creator(s): Eden, Talya; Indyk, Piotr; Narayanan, Shyam; Rubinfeld, Ronitt; Silwal, Sandeep; Wagner, Tal

Date Published: 2021-01-01

Journal Name: International Conference on Learning Representations

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Conference Paper:
The DOI is not currently available.
