Title: Towards a Decomposition-Optimal Algorithm for Counting and Sampling Arbitrary Motifs in Sublinear Time
We consider the problem of sampling and approximately counting an arbitrary given motif H in a graph G, where access to G is given via degree, neighbor, and pair queries, as well as uniform edge sample queries. Previous algorithms for these tasks were based on a decomposition of H into a collection of odd cycles and stars, denoted D^*(H) = {O_{k₁},...,O_{k_q}, S_{p₁},...,S_{p_𝓁}}. These algorithms were shown to be optimal for the case where H is a clique or an odd-length cycle, but no other lower bounds were known. We present a new algorithm for sampling arbitrary motifs which, up to poly(log n) factors, is always at least as good as the previous algorithms, and for most graphs G is strictly better. The main ingredient leading to this improvement is an improved uniform algorithm for sampling stars, which might be of independent interest, as it allows sampling vertices according to the p-th moment of the degree distribution. Finally, we prove that this algorithm is decomposition-optimal for decompositions that contain at least one odd cycle. These are the first lower bounds for motifs H with a nontrivial decomposition, i.e., motifs that have more than a single component in their decomposition.
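As a rough illustration of the query model and of what "sampling vertices according to the p-th moment of the degree distribution" means, here is a minimal Python sketch. It is a naive rejection-sampling baseline, not the algorithm of the paper. The class `GraphOracle`, the function `sample_vertex_by_degree_power`, and the assumption that an upper bound `d_max` on the maximum degree is known are all hypothetical choices made for this example, and the graph is assumed to be simple.

```python
import random


class GraphOracle:
    """Toy in-memory stand-in for the query model (hypothetical helper class)."""

    def __init__(self, edges):
        self.edges = list(edges)
        self.adj = {}
        for u, v in self.edges:
            self.adj.setdefault(u, []).append(v)
            self.adj.setdefault(v, []).append(u)

    def degree(self, v):
        # Degree query.
        return len(self.adj[v])

    def neighbor(self, v, i):
        # Neighbor query: the i-th neighbor of v (0-indexed).
        return self.adj[v][i]

    def pair(self, u, v):
        # Pair query: is {u, v} an edge?
        return v in self.adj[u]

    def sample_edge(self, rng):
        # Uniform edge sample query.
        return rng.choice(self.edges)


def sample_vertex_by_degree_power(g, p, d_max, rng):
    """Return a vertex v with probability proportional to degree(v) ** p.

    Assumes d_max >= the maximum degree (an assumption of this sketch).
    """
    while True:
        u, v = g.sample_edge(rng)
        # A random endpoint of a uniform edge is x with probability d(x) / 2m.
        x = u if rng.random() < 0.5 else v
        # Accepting with probability (d(x) / d_max) ** (p - 1) reweights to d(x) ** p.
        if rng.random() < (g.degree(x) / d_max) ** (p - 1):
            return x
```

The expected number of loop iterations in this baseline grows with `d_max`, which is one reason a more careful star-sampling procedure is of interest.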
Award ID(s):
2006664
NSF-PAR ID:
10338043
Journal Name:
RANDOM 2021
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Counting and uniformly sampling motifs in a graph are fundamental algorithmic tasks with numerous applications across multiple fields. Since these problems are computationally expensive, recent efforts have focused on devising sublinear-time algorithms for them. We consider the model where the algorithm gets a constant-size motif H and query access to a graph G, where the allowed queries are degree, neighbor, and pair queries, as well as uniform edge sample queries. In the sampling task, the algorithm is required to output a uniformly distributed copy of H in G (if one exists), and in the counting task it is required to output a good estimate of the number of copies of H in G. Previous algorithms for the uniform sampling task were based on a decomposition of H into a collection of odd cycles and stars, denoted D^*(H) = {O_{k₁},...,O_{k_q}, S_{p₁},...,S_{p_𝓁}}. These algorithms were shown to be optimal for the case where H is a clique or an odd-length cycle, but no other lower bounds were known. We present a new algorithm for sampling arbitrary motifs which, up to poly(log n) factors, is always preferable for any motif H whose decomposition contains at least two components or at least one star. The main ingredient leading to this improvement is an improved uniform algorithm for sampling stars, which might be of independent interest, as it allows sampling vertices according to the p-th moment of the degree distribution. We further show how to use our sampling algorithm to get an approximate counting algorithm with essentially the same complexity. Finally, we prove that this algorithm is decomposition-optimal for decompositions that contain at least one odd cycle. That is, we prove that for any decomposition D that contains at least one odd cycle, there exists a motif H_D with decomposition D, and a family of graphs G, so that in order to output a uniform copy of H in a uniformly chosen graph in G, the number of required queries matches our upper bound. These are the first lower bounds for motifs H with a nontrivial decomposition, i.e., motifs that have more than a single component in their decomposition.
  2. We present a sublinear time algorithm that allows one to sample multiple edges from a distribution that is pointwise ε-close to the uniform distribution, in an amortized-efficient fashion. We consider the adjacency list query model, where access to a graph G is given via degree and neighbor queries. The problem of sampling a single edge in this model has been raised by Eden and Rosenbaum (SOSA 18). Let n and m denote the number of vertices and edges of G, respectively. Eden and Rosenbaum provided upper and lower bounds of Θ^*(n/√m) for sampling a single edge in general graphs (where O^*(⋅) suppresses poly(1/ε) and poly(log n) dependencies). We ask whether the query complexity lower bound for sampling a single edge can be circumvented when multiple samples are required. That is, can we get an improved amortized per-sample cost if we allow a preprocessing phase? We answer in the affirmative. We present an algorithm that, if one knows the number of required samples q in advance, has an overall cost that is sublinear in q, namely, O^*(√q ⋅ (n/√m)), which is strictly preferable to the O^*(q ⋅ (n/√m)) cost resulting from q invocations of the algorithm by Eden and Rosenbaum. Subsequent to a preliminary version of this work, Tětek and Thorup (arXiv, preprint) proved that this bound is essentially optimal.
  3. We present a sublinear time algorithm that allows one to sample multiple edges from a distribution that is pointwise ϵ-close to the uniform distribution, in an amortized-efficient fashion. We consider the adjacency list query model, where access to a graph G is given via degree and neighbor queries. The problem of sampling a single edge in this model has been raised by Eden and Rosenbaum (SOSA 18). Let n and m denote the number of vertices and edges of G, respectively. Eden and Rosenbaum provided upper and lower bounds of Θ^*(n/√m) for sampling a single edge in general graphs (where O^*(·) suppresses poly(1/ϵ) and poly(log n) dependencies). We ask whether the query complexity lower bound for sampling a single edge can be circumvented when multiple samples are required. That is, can we get an improved amortized per-sample cost if we allow a preprocessing phase? We answer in the affirmative. We present an algorithm that, if one knows the number of required samples q in advance, has an overall cost that is sublinear in q, namely, O^*(√q · (n/√m)), which is strictly preferable to the O^*(q · (n/√m)) cost resulting from q invocations of the algorithm by Eden and Rosenbaum. Subsequent to a preliminary version of this work, Tětek and Thorup (arXiv, preprint) proved that this bound is essentially optimal.
  4. We present a new technique for efficiently removing almost all short cycles in a graph without unintentionally removing its triangles. Consequently, triangle finding problems do not become easy even in almost k-cycle free graphs, for any constant k ≥ 4. Triangle finding is at the base of many conditional lower bounds in P, mainly for distance computation problems, and the existence of many 4- or 5-cycles in a worst-case instance had been the obstacle towards resolving major open questions. Hardness of approximation: Are there distance oracles with m^{1+o(1)} preprocessing time and m^{o(1)} query time that achieve a constant approximation? Existing algorithms with such desirable time bounds only achieve super-constant approximation factors, while only 3−ε factors were conditionally ruled out (Pătraşcu, Roditty, and Thorup; FOCS 2012). We prove that no O(1) approximations are possible, assuming the 3-SUM or APSP conjectures. In particular, we prove that k-approximations require Ω(m^{1+1/ck}) time, which is tight up to the constant c. The lower bound holds even for the offline version where we are given the queries in advance, and extends to other problems such as dynamic shortest paths. The 4-Cycle problem: An infamous open question in fine-grained complexity is to establish any surprising consequences from a subquadratic or even linear-time algorithm for detecting a 4-cycle in a graph. This is arguably one of the simplest problems with neither a near-linear time algorithm nor a conditional lower bound. We prove that Ω(m^{1.1194}) time is needed for k-cycle detection for all k ≥ 4, unless we can detect a triangle in √n-degree graphs in O(n^{2−δ}) time; a breakthrough that is not known to follow even from optimal matrix multiplication algorithms.
  5. We study the problem of estimating the value of sums of the form S_p ≜ ∑_i \binom{x_i}{p} when one has the ability to sample x_i ≥ 0 with probability proportional to its magnitude. When p = 2, this problem is equivalent to estimating the selectivity of a self-join query in database systems when one can sample rows randomly. We also study the special case when {x_i} is the degree sequence of a graph, which corresponds to counting the number of p-stars in a graph when one has the ability to sample edges randomly. Our algorithm for a (1±ε)-multiplicative approximation of S_p has query and time complexities O(m log log n / (ε² S_p^{1/p})). Here, m = ∑_i x_i / 2 is the number of edges in the graph, or equivalently, half the number of records in the database table. Similarly, n is the number of vertices in the graph and the number of unique values in the database table. We also provide tight lower bounds (up to polylogarithmic factors) in almost all cases, even when {x_i} is a degree sequence and one is allowed to use the structure of the graph to try to get a better estimate. We are not aware of any prior lower bounds on the problem of join selectivity estimation. For the graph problem, prior work which assumed the ability to sample only vertices uniformly gave algorithms with matching lower bounds (Gonen et al. in SIAM J Comput 25:1365–1411, 2011). With the ability to sample edges randomly, we show that one can achieve faster algorithms for approximating the number of star subgraphs, bypassing the lower bounds in this prior work. For example, in the regime where S_p ≤ n and p = 2, our upper bound is Õ(n/S_p^{1/2}), in contrast to their Ω(n/S_p^{1/3}) lower bound when no random edge queries are available. In addition, we consider the problem of counting the number of directed paths of length two when the graph is directed. This problem is equivalent to estimating the selectivity of a join query between two distinct tables. We prove that the general version of this problem cannot be solved in sublinear time.
However, when the ratio between in-degree and out-degree is bounded—or equivalently, when the ratio between the number of occurrences of values in the two columns being joined is bounded—we give a sublinear time algorithm via a reduction to the undirected case. 
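
To make the sum S_p in item 5 concrete, the following Python sketch shows a plain importance-sampling estimator for S_p = ∑_i \binom{x_i}{p} when values can be drawn with probability proportional to x_i. It is only a textbook baseline under stated assumptions, not the algorithm from that abstract: the function name `estimate_star_count` is hypothetical, the total ∑_i x_i (equal to 2m for a degree sequence) is assumed to be known, and proportional sampling is simulated from an in-memory list rather than obtained through edge queries.

```python
import random
from math import comb


def estimate_star_count(xs, p, num_samples, rng):
    """Unbiased estimate of S_p = sum(comb(x, p) for x in xs).

    Each draw picks x with probability x / total, so weighting comb(x, p) / x
    by total gives an estimator whose expectation is exactly S_p.
    """
    total = sum(xs)  # assumed known; equals 2m when xs is a degree sequence
    positive = [x for x in xs if x > 0]  # zero entries are never sampled
    draws = rng.choices(positive, weights=positive, k=num_samples)
    return total * sum(comb(x, p) / x for x in draws) / num_samples
```

For a degree sequence, calling `estimate_star_count(degrees, 2, k, random.Random())` approximates the number of 2-stars (paths of length two); the number of samples needed for a good relative error depends on how skewed the sequence is.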