We introduce a framework for statistical estimation that leverages knowledge of how samples are collected but makes no distributional assumptions on the data values. Specifically, we consider a population of elements [n]={1,...,n} with corresponding data values x1,...,xn. We observe the values for a "sample" set A \subset [n] and wish to estimate some statistic of the values for a "target" set B \subset [n] where B could be the entire set. Crucially, we assume that the sets A and B are drawn according to some known distribution P over pairs of subsets of [n]. A given estimation algorithm is evaluated based on its "worstcase, expected error" where the expectation is with respect to the distribution P from which the sample A and target sets B are drawn, and the worstcase is with respect to the data values x1,...,xn. Within this framework, we give an efficient algorithm for estimating the target mean that returns a weighted combination of the sample values–where the weights are functions of the distribution P and the sample and target sets A, Band show that the worstcase expected error achieved by this algorithm is at most a multiplicative pi/2 factor worse than the optimal of such algorithms.more »
Learningbased Support Estimation in Sublinear Time
We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems and linguistics. A line of research spanning the last decade resulted in algorithms that estimate the support up to ±εn from a sample of size O(log2(1/ε)·n/logn), where n is the data set size. Unfortunately, this bound is known to be tight, limiting further improvements to the complexity of this problem. In this paper we consider estimation algorithms augmented with a machinelearningbased predictor that, given any element, returns an estimation of its frequency. We show that if the predictor is correct up to a constant approximation factor, then the sample complexity can be reduced significantly, to log(1/ε)·n1−Θ(1/log(1/ε)).We evaluate the proposed algorithms on a collection of data sets, using the neuralnetwork based estimators from Hsu et al, ICLR’19 as predictors. Our experiments demonstrate substantial (up to 3x) improvements in the estimation accuracy compared to the state of the art algorithm.
 Publication Date:
 NSFPAR ID:
 10275616
 Journal Name:
 International Conference on Learning Representations
 Sponsoring Org:
 National Science Foundation
More Like this


There has been a flurry of recent literature studying streaming algorithms for which the input stream is chosen adaptively by a blackbox adversary who observes the output of the streaming algorithm at each time step. However, these algorithms fail when the adversary has access to the internal state of the algorithm, rather than just the output of the algorithm. We study streaming algorithms in the whitebox adversarial model, where the stream is chosen adaptively by an adversary who observes the entire internal state of the algorithm at each time step. We show that nontrivial algorithms are still possible. We first give a randomized algorithm for the L1heavy hitters problem that outperforms the optimal deterministic MisraGries algorithm on long streams. If the whitebox adversary is computationally bounded, we use cryptographic techniques to reduce the memory of our L1heavy hitters algorithm even further and to design a number of additional algorithms for graph, string, and linear algebra problems. The existence of such algorithms is surprising, as the streaming algorithm does not even have a secret key in this model, i.e., its state is entirely known to the adversary. One algorithm we design is for estimating the number of distinct elements in amore »

We give two new quantum algorithms for solving semidefinite programs (SDPs) providing quantum speedups. We consider SDP instances with m constraint matrices, each of dimension n, rank at most r, and sparsity s. The first algorithm assumes an input model where one is given access to an oracle to the entries of the matrices at unit cost. We show that it has run time O~(s^2 (sqrt{m} epsilon^{10} + sqrt{n} epsilon^{12})), with epsilon the error of the solution. This gives an optimal dependence in terms of m, n and quadratic improvement over previous quantum algorithms (when m ~~ n). The second algorithm assumes a fully quantum input model in which the input matrices are given as quantum states. We show that its run time is O~(sqrt{m}+poly(r))*poly(log m,log n,B,epsilon^{1}), with B an upper bound on the tracenorm of all input matrices. In particular the complexity depends only polylogarithmically in n and polynomially in r. We apply the second SDP solver to learn a good description of a quantum state with respect to a set of measurements: Given m measurements and a supply of copies of an unknown state rho with rank at most r, we show we can find in time sqrt{m}*poly(logmore »

We study the classic set cover problem from the perspective of sublinear algorithms. Given access to a collection of m sets over n elements in the query model, we show that sublinear algorithms derived from existing techniques have almost tight query complexities. On one hand, first we show an adaptation of the streaming algorithm presented in [17] to the sublinear query model, that returns an αapproximate cover using Õ(m(n/k)^1/(α–1) + nk) queries to the input, where k denotes the value of a minimum set cover. We then complement this upper bound by proving that for lower values of k, the required number of queries is , even for estimating the optimal cover size. Moreover, we prove that even checking whether a given collection of sets covers all the elements would require Ω(nk) queries. These two lower bounds provide strong evidence that the upper bound is almost tight for certain values of the parameter k. On the other hand, we show that this bound is not optimal for larger values of the parameter k, as there exists a (1 + ε)approximation algorithm with Õ(mn/kε^2) queries. We show that this bound is essentially tight for sufficiently small constant ε, by establishing amore »

We study the problem of estimating the value of sums of the form Sp≜∑(xip) when one has the ability to sample xi≥0 with probability proportional to its magnitude. When p=2, this problem is equivalent to estimating the selectivity of a selfjoin query in database systems when one can sample rows randomly. We also study the special case when {xi} is the degree sequence of a graph, which corresponds to counting the number of pstars in a graph when one has the ability to sample edges randomly. Our algorithm for a (1±ε)multiplicative approximation of Sp has query and time complexities O(mloglognϵ2S1/pp). Here, m=∑xi/2 is the number of edges in the graph, or equivalently, half the number of records in the database table. Similarly, n is the number of vertices in the graph and the number of unique values in the database table. We also provide tight lower bounds (up to polylogarithmic factors) in almost all cases, even when {xi} is a degree sequence and one is allowed to use the structure of the graph to try to get a better estimate. We are not aware of any prior lower bounds on the problem of join selectivity estimation. For the graph problem,more »