Title: Data Structures for Density Estimation
We study statistical/computational tradeoffs for the following density estimation problem: given k distributions v_1, ..., v_k over a discrete domain of size n, and sampling access to a distribution p, identify v_i that is “close” to p. Our main result is the first data structure that, given a sublinear (in n) number of samples from p, identifies v_i in time sublinear in k. We also give an improved version of the algorithm of (Acharya et al., 2018) that reports v_i in time linear in k. The experimental evaluation of the latter algorithm shows that it achieves a significant reduction in the number of operations needed to achieve a given accuracy compared to prior work.
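To make the problem statement concrete, here is a minimal sketch of a naive baseline in Python. It is not the data structure or the improved algorithm described in the abstract: it scans all k candidates and all n domain elements, so it is linear in both. It assumes "close" means small L1 distance to the empirical distribution of the samples, which the abstract does not specify. The function names are hypothetical and chosen for illustration only.

```python
from collections import Counter


def empirical_distribution(samples, n):
    """Return the empirical distribution of samples over the domain {0, ..., n-1}."""
    if not samples:
        raise ValueError("at least one sample is required")
    counts = Counter(samples)
    m = len(samples)
    return [counts.get(x, 0) / m for x in range(n)]


def l1_distance(p, q):
    """L1 distance between two distributions given as equal-length lists."""
    return sum(abs(a - b) for a, b in zip(p, q))


def closest_candidate(candidates, samples):
    """Return the index i of the candidate v_i nearest (in L1) to the empirical
    distribution of the samples. Assumption: L1 is the closeness measure.
    Cost is O(k * n): a linear scan, not a sublinear-time query."""
    n = len(candidates[0])
    p_hat = empirical_distribution(samples, n)
    return min(range(len(candidates)),
               key=lambda i: l1_distance(candidates[i], p_hat))


if __name__ == "__main__":
    # Three candidate distributions over a domain of size 4.
    candidates = [
        [0.5, 0.5, 0.0, 0.0],
        [0.25, 0.25, 0.25, 0.25],
        [0.0, 0.0, 0.5, 0.5],
    ]
    samples = [2, 3, 3, 2]  # samples drawn from the unknown p
    print(closest_candidate(candidates, samples))
```

A baseline like this needs enough samples for the empirical distribution to be accurate and touches every candidate on each query. The abstract's results target exactly those two costs: the number of samples (sublinear in n) and the query time (sublinear in k).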
Award ID(s):
2022448 2006798
NSF-PAR ID:
10430248
Author(s) / Creator(s):
; ; ; ; ;
Publisher / Repository:
Proceedings of Machine Learning Research
Date Published:
Journal Name:
International Conference on Machine Learning
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Tauman Kalai, Yael (Ed.)
    The edit distance between strings classically assigns unit cost to every character insertion, deletion, and substitution, whereas the Hamming distance only allows substitutions. In many real-life scenarios, insertions and deletions (abbreviated indels) appear frequently but significantly less so than substitutions. To model this, we consider substitutions being cheaper than indels, with cost 1/a for a parameter a ≥ 1. This basic variant, denoted ED_a, bridges classical edit distance (a = 1) with Hamming distance (a → ∞), leading to interesting algorithmic challenges: Does the time complexity of computing ED_a interpolate between that of Hamming distance (linear time) and edit distance (quadratic time)? What about approximating ED_a? We first present a simple deterministic exact algorithm for ED_a and further prove that it is near-optimal assuming the Orthogonal Vectors Conjecture. Our main result is a randomized algorithm computing a (1+ε)-approximation of ED_a(X,Y), given strings X,Y of total length n and a bound k ≥ ED_a(X,Y). For simplicity, let us focus on k ≥ 1 and a constant ε > 0; then, our algorithm takes Õ(n/a + ak³) time. Unless a = Õ(1), in which case ED_a resembles the standard edit distance, and for the most interesting regime of small enough k, this running time is sublinear in n. We also consider a very natural version that asks to find a (k_I, k_S)-alignment, i.e., an alignment with at most k_I indels and k_S substitutions. In this setting, we give an exact algorithm and, more importantly, an Õ((nk_I)/k_S + k_S k_I³)-time (1,1+ε)-bicriteria approximation algorithm. The latter solution is based on the techniques we develop for ED_a for a = Θ(k_S/k_I), and its running time is again sublinear in n whenever k_I ≪ k_S and the overall distance is small enough. These bounds are in stark contrast to unit-cost edit distance, where state-of-the-art algorithms are far from achieving (1+ε)-approximation in sublinear time, even for a favorable choice of k. 
  2. We extend recent results regarding finding shortest unique substrings (SUSs) to obtain new time-space tradeoffs for this problem and the generalization of finding k-mismatch SUSs. Our new results include the first algorithm for finding a k-mismatch SUS in sublinear space, which we obtain by extending an algorithm by Senanayaka (2019) and combining it with a result on sketching by Gawrychowski and Starikovskaya (2019). We first describe how, given a text T of length n and m words of workspace, with high probability we can find an SUS of length L in O(n(L/m) log L) time using random access to T, or in O(n(L/m) log²(L) log log σ) time using O((L/m) log² L) sequential passes over T. We then describe how, for constant k, with high probability, we can find a k-mismatch SUS in O(n^{1+ϵ} L/m) time using O(n^ϵ L/m) sequential passes over T, again using only m words of workspace. Finally, we also describe a deterministic algorithm that takes O(nτ log σ log n) time to find an SUS using O(n/τ) words of workspace, where τ is a parameter.
  3. Given a set $P$ of $n$ points in the plane, we consider the problem of computing the number of points of $P$ in a query unit disk (i.e., all query disks have the same radius). We show that the main techniques for simplex range searching in the plane can be adapted to this problem. For example, by adapting Matoušek's results, we can build a data structure of $O(n)$ space in $O(n^{1+\delta})$ time (for any $\delta>0$) so that each query can be answered in $O(\sqrt{n})$ time; alternatively, we can build a data structure of $O(n^2/\log^2 n)$ space with $O(n^{1+\delta})$ preprocessing time (for any $\delta>0$) and $O(\log n)$ query time. Our techniques lead to improvements for several other classical problems in computational geometry.
      1. Given a set of $n$ unit disks and a set of $n$ points in the plane, the batched unit-disk range counting problem is to compute for each disk the number of points in it. Previous work [Katz and Sharir, 1997] solved the problem in $O(n^{4/3}\log n)$ time. We give a new algorithm of $O(n^{4/3})$ time, which is optimal as it matches an $\Omega(n^{4/3})$-time lower bound. For small $\chi$, where $\chi$ is the number of pairs of unit disks that intersect, we further improve the algorithm to $O(n^{2/3}\chi^{1/3}+n^{1+\delta})$ time, for any $\delta>0$.
      2. The above result immediately leads to an $O(n^{4/3})$ time optimal algorithm for counting the intersecting pairs of circles for a set of $n$ unit circles in the plane. The previous best algorithms solve the problem in $O(n^{4/3}\log n)$ deterministic time [Katz and Sharir, 1997] or in $O(n^{4/3}\log^{2/3} n)$ expected time by a randomized algorithm [Agarwal, Pellegrini, and Sharir, 1993].
      3. Given a set $P$ of $n$ points in the plane and an integer $k$, the distance selection problem is to find the $k$-th smallest distance among all pairwise distances of $P$. The problem can be solved in $O(n^{4/3}\log^2 n)$ deterministic time [Katz and Sharir, 1997] or in $O(n\log n+n^{2/3}k^{1/3}\log^{5/3}n)$ expected time by a randomized algorithm [Chan, 2001]. Our new randomized algorithm runs in $O(n\log n +n^{2/3}k^{1/3}\log n)$ expected time.
      4. Given a set $P$ of $n$ points in the plane, the discrete $2$-center problem is to compute two smallest congruent disks whose centers are in $P$ and whose union covers $P$. An $O(n^{4/3}\log^5 n)$-time algorithm was known [Agarwal, Sharir, and Welzl, 1998]. Our techniques yield a deterministic algorithm of $O(n^{4/3}\log^{10/3} n\cdot (\log\log n)^{O(1)})$ time and a randomized algorithm of $O(n^{4/3}\log^3 n\cdot (\log\log n)^{1/3})$ expected time.
  4. We present a sublinear time algorithm that allows one to sample multiple edges from a distribution that is pointwise ϵ-close to the uniform distribution, in an amortized-efficient fashion. We consider the adjacency list query model, where access to a graph G is given via degree and neighbor queries. The problem of sampling a single edge in this model has been raised by Eden and Rosenbaum (SOSA 18). Let n and m denote the number of vertices and edges of G, respectively. Eden and Rosenbaum provided upper and lower bounds of Θ*(n/√m) for sampling a single edge in general graphs (where O*(·) suppresses poly(1/ϵ) and poly(log n) dependencies). We ask whether the query complexity lower bound for sampling a single edge can be circumvented when multiple samples are required. That is, can we get an improved amortized per-sample cost if we allow a preprocessing phase? We answer in the affirmative. We present an algorithm that, if one knows the number of required samples q in advance, has an overall cost that is sublinear in q, namely, O*(√q · (n/√m)), which is strictly preferable to the O*(q · (n/√m)) cost resulting from q invocations of the algorithm by Eden and Rosenbaum. Subsequent to a preliminary version of this work, Tětek and Thorup (arXiv, preprint) proved that this bound is essentially optimal.
  5. We present a sublinear time algorithm that allows one to sample multiple edges from a distribution that is pointwise ε-close to the uniform distribution, in an amortized-efficient fashion. We consider the adjacency list query model, where access to a graph G is given via degree and neighbor queries. The problem of sampling a single edge in this model has been raised by Eden and Rosenbaum (SOSA 18). Let n and m denote the number of vertices and edges of G, respectively. Eden and Rosenbaum provided upper and lower bounds of Θ^*(n/√ m) for sampling a single edge in general graphs (where O^*(⋅) suppresses poly(1/ε) and poly(log n) dependencies). We ask whether the query complexity lower bound for sampling a single edge can be circumvented when multiple samples are required. That is, can we get an improved amortized per-sample cost if we allow a preprocessing phase? We answer in the affirmative. We present an algorithm that, if one knows the number of required samples q in advance, has an overall cost that is sublinear in q, namely, O^*(√ q ⋅(n/√ m)), which is strictly preferable to O^*(q⋅ (n/√ m)) cost resulting from q invocations of the algorithm by Eden and Rosenbaum. Subsequent to a preliminary version of this work, Tětek and Thorup (arXiv, preprint) proved that this bound is essentially optimal. 