Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to nonfederal websites. Their policies may differ from this site.

We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems and linguistics. A line of research spanning the last decade resulted in algorithms that estimate the support up to ±εn from a sample of size O(log2(1/ε)·n/logn), where n is the data set size. Unfortunately, this bound is known to be tight, limiting further improvements to the complexity of this problem. In this paper we consider estimation algorithms augmented with a machinelearningbased predictor that, given any element, returns an estimation of its frequency. We show that if the predictor is correct up to a constant approximation factor, then the sample complexity can be reduced significantly, to log(1/ε)·n1−Θ(1/log(1/ε)).We evaluate the proposed algorithms on a collection of data sets, using the neuralnetwork based estimators from Hsu et al, ICLR’19 as predictors. Our experiments demonstrate substantial (up to 3x) improvements in the estimation accuracy compared to the state of the art algorithm.more » « less

Counting and uniformly sampling motifs in a graph are fundamental algorithmic tasks with numerous applications across multiple fields. Since these problems are computationally expensive, recent efforts have focused on devising sublineartime algorithms for these problems. We consider the model where the algorithm gets a constant size motif H and query access to a graph G, where the allowed queries are degree, neighbor, and pair queries, as well as uniform edge sample queries. In the sampling task, the algorithm is required to output a uniformly distributed copy of H in G (if one exists), and in the counting task it is required to output a good estimate to the number of copies of H in G. Previous algorithms for the uniform sampling task were based on a decomposition of H into a collection of odd cycles and stars, denoted D∗(H) = {Ok1 , ...,Okq , Sp1 , ..., Spℓ19 }. These algorithms were shown to be optimal for the case where H is a clique or an oddlength cycle, but no other lower bounds were known. We present a new algorithm for sampling arbitrary motifs which, up to poly(log n) factors, for any motif H whose decomposition contains at least two components or at least one star, is always preferable. The main ingredient leading to this improvement is an improved uniform algorithm for sampling stars, which might be of independent interest, as it allows to sample vertices according to the pth moment of the degree distribution. We further show how to use our sampling algorithm to get an approximate counting algorithm, with essentially the same complexity. Finally, we prove that this algorithm is decompositionoptimal for decompositions that contain at least one odd cycle. That is, we prove that for any decomposition D that contains at least one odd cycle, there exists a motif HD 30 with decomposition D, and a family of graphs G, so that in order to output a uniform copy of H in a uniformly chosen graph in G, the number of required queries matches our upper bound. These are the first lower bounds for motifs H with a nontrivial decomposition, i.e., motifs that have more than a single component in their decomposition.more » « less

null (Ed.)We study fast algorithms for computing basic properties of an n x n positive semidefinite kernel matrix K corresponding to n points x_1,...,x_n in R^d. In particular, we consider the estimating the sum of kernel matrix entries, along with its top eigenvalue and eigenvector. These are some of the most basic problems defined over kernel matrices. We show that the sum of matrix entries can be estimated up to a multiplicative factor of 1+epsilon in time sublinear in n and linear in d for many popular kernel functions, including the Gaussian, exponential, and rational quadratic kernels. For these kernels, we also show that the top eigenvalue (and a witnessing approximate eigenvector) can be approximated to a multiplicative factor of 1+epsilon in time subquadratic in n and linear in d. Our algorithms represent significant advances in the best known runtimes for these problems. They leverage the positive definiteness of the kernel matrix, along with a recent line of work on efficient kernel density estimation.more » « less

Random dimensionality reduction is a versatile tool for speeding up algorithms for highdimensional problems. We study its application to two clustering problems: the facility location problem, and the singlelinkage hierarchical clustering problem, which is equivalent to computing the minimum spanning tree. We show that if we project the input pointset 𝑋 onto a random 𝑑=𝑂(𝑑𝑋)dimensional subspace (where 𝑑𝑋 is the doubling dimension of 𝑋), then the optimum facility location cost in the projected space approximates the original cost up to a constant factor. We show an analogous statement for minimum spanning tree, but with the dimension 𝑑 having an extra loglog𝑛 term and the approximation factor being arbitrarily close to 1. Furthermore, we extend these results to approximating solutions instead of just their costs. Lastly, we provide experimental results to validate the quality of solutions and the speedup due to the dimensionality reduction. Unlike several previous papers studying this approach in the context of 𝑘means and 𝑘medians, our dimension bound does not depend on the number of clusters but only on the intrinsic dimensionality of 𝑋.more » « less

We present a sublinear time algorithm that allows one to sample multiple edges from a distribution that is pointwise ϵclose to the uniform distribution, in an amortizedefficient fashion. We consider the adjacency list query model, where access to a graph G is given via degree and neighbor queries. The problem of sampling a single edge in this model has been raised by Eden and Rosenbaum (SOSA 18). Let n and m denote the number of vertices and edges of G, respectively. Eden and Rosenbaum provided upper and lower bounds of Θ∗(n/ √ m) for sampling a single edge in general graphs (where O ∗(·) suppresses poly(1/ϵ) and poly(log n) dependencies). We ask whether the query complexity lower bound for sampling a single edge can be circumvented when multiple samples are required. That is, can we get an improved amortized persample cost if we allow a preprocessing phase? We answer in the affirmative. We present an algorithm that, if one knows the number of required samples q in advance, has an overall cost that is sublinear in q, namely, O∗(√ q · (n/ √ m)), which is strictly preferable to O∗(q · (n/ √ m)) cost resulting from q invocations of the algorithm by Eden and Rosenbaum. Subsequent to a preliminary version of this work, Tětek and Thorup (arXiv, preprint) proved that this bound is essentially optimal.more » « less