We consider the problem of sampling and approximately counting an arbitrary given motif H in a graph G, where access to G is given via queries: degree, neighbor, and pair, as well as uniform edge sample queries. Previous algorithms for these tasks were based on a decomposition of H into a collection of odd cycles and stars, denoted D^*(H) = {O_{k₁},...,O_{k_q}, S_{p₁},...,S_{p_𝓁}}. These algorithms were shown to be optimal for the case where H is a clique or an oddlength cycle, but no other lower bounds were known.
We present a new algorithm for sampling arbitrary motifs which, up to poly(log n) factors, is always at least as good, and for most graphs G is strictly better. The main ingredient leading to this improvement is an improved uniform algorithm for sampling stars, which might be of independent interest, as it allows to sample vertices according to the pth moment of the degree distribution.
Finally, we prove that this algorithm is decompositionoptimal for decompositions that contain at least one odd cycle. These are the first lower bounds for motifs H with a nontrivial decomposition, i.e., motifs that have more than a single component in their decomposition.
more »
« less
Towards a DecompositionOptimal Algorithm for Counting and Sampling Arbitrary Motifs in Sublinear Time
Counting and uniformly sampling motifs in a graph are fundamental algorithmic tasks with numerous applications across multiple fields. Since these problems are computationally expensive, recent efforts have focused on devising sublineartime algorithms for these problems. We consider the model where the algorithm gets a constant size motif H and query access to a graph G, where the allowed queries are degree, neighbor, and pair queries, as well as uniform edge sample queries. In the sampling task, the algorithm is required to output a uniformly distributed copy of H in G (if one exists), and in the counting task it is required to output a good estimate to the number of copies of H in G.
Previous algorithms for the uniform sampling task were based on a decomposition of H into a collection of odd cycles and stars, denoted D∗(H) = {Ok1 , ...,Okq , Sp1 , ..., Spℓ19 }. These algorithms were shown to be optimal for the case where H is a clique or an oddlength cycle, but no other lower bounds were known.
We present a new algorithm for sampling arbitrary motifs which, up to poly(log n) factors, for any motif H whose decomposition contains at least two components or at least one star, is always preferable. The main ingredient leading to this improvement is an improved uniform algorithm for sampling stars, which might be of independent interest, as it allows to sample vertices according to the pth moment of the degree distribution. We further show how to use our sampling algorithm to get an approximate counting algorithm, with essentially the same complexity.
Finally, we prove that this algorithm is decompositionoptimal for decompositions that contain at least one odd cycle. That is, we prove that for any decomposition D that contains at least one odd cycle, there exists a motif HD 30 with decomposition D, and a family of graphs G, so that in order to output a uniform copy of H in a uniformly chosen graph in G, the number of required queries matches our upper bound. These are the first lower bounds for motifs H with a nontrivial decomposition, i.e., motifs that have more than a single component in their decomposition.
more »
« less
 NSFPAR ID:
 10279839
 Date Published:
 Journal Name:
 Random
 ISSN:
 11988193
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


We study the problem of estimating the value of sums of the form Sp≜∑(xip) when one has the ability to sample xi≥0 with probability proportional to its magnitude. When p=2, this problem is equivalent to estimating the selectivity of a selfjoin query in database systems when one can sample rows randomly. We also study the special case when {xi} is the degree sequence of a graph, which corresponds to counting the number of pstars in a graph when one has the ability to sample edges randomly. Our algorithm for a (1±ε)multiplicative approximation of Sp has query and time complexities O(mloglognϵ2S1/pp). Here, m=∑xi/2 is the number of edges in the graph, or equivalently, half the number of records in the database table. Similarly, n is the number of vertices in the graph and the number of unique values in the database table. We also provide tight lower bounds (up to polylogarithmic factors) in almost all cases, even when {xi} is a degree sequence and one is allowed to use the structure of the graph to try to get a better estimate. We are not aware of any prior lower bounds on the problem of join selectivity estimation. For the graph problem, prior work which assumed the ability to sample only vertices uniformly gave algorithms with matching lower bounds (Gonen et al. in SIAM J Comput 25:1365–1411, 2011). With the ability to sample edges randomly, we show that one can achieve faster algorithms for approximating the number of star subgraphs, bypassing the lower bounds in this prior work. For example, in the regime where Sp≤n, and p=2, our upper bound is O~(n/S1/2p), in contrast to their Ω(n/S1/3p) lower bound when no random edge queries are available. In addition, we consider the problem of counting the number of directed paths of length two when the graph is directed. This problem is equivalent to estimating the selectivity of a join query between two distinct tables. We prove that the general version of this problem cannot be solved in sublinear time. However, when the ratio between indegree and outdegree is bounded—or equivalently, when the ratio between the number of occurrences of values in the two columns being joined is bounded—we give a sublinear time algorithm via a reduction to the undirected case.more » « less

We present a sublinear time algorithm that allows one to sample multiple edges from a distribution that is pointwise εclose to the uniform distribution, in an amortizedefficient fashion. We consider the adjacency list query model, where access to a graph G is given via degree and neighbor queries. The problem of sampling a single edge in this model has been raised by Eden and Rosenbaum (SOSA 18). Let n and m denote the number of vertices and edges of G, respectively. Eden and Rosenbaum provided upper and lower bounds of Θ^*(n/√ m) for sampling a single edge in general graphs (where O^*(⋅) suppresses poly(1/ε) and poly(log n) dependencies). We ask whether the query complexity lower bound for sampling a single edge can be circumvented when multiple samples are required. That is, can we get an improved amortized persample cost if we allow a preprocessing phase? We answer in the affirmative. We present an algorithm that, if one knows the number of required samples q in advance, has an overall cost that is sublinear in q, namely, O^*(√ q ⋅(n/√ m)), which is strictly preferable to O^*(q⋅ (n/√ m)) cost resulting from q invocations of the algorithm by Eden and Rosenbaum. Subsequent to a preliminary version of this work, Tětek and Thorup (arXiv, preprint) proved that this bound is essentially optimal.more » « less

We present a sublinear time algorithm that allows one to sample multiple edges from a distribution that is pointwise ϵclose to the uniform distribution, in an amortizedefficient fashion. We consider the adjacency list query model, where access to a graph G is given via degree and neighbor queries. The problem of sampling a single edge in this model has been raised by Eden and Rosenbaum (SOSA 18). Let n and m denote the number of vertices and edges of G, respectively. Eden and Rosenbaum provided upper and lower bounds of Θ∗(n/ √ m) for sampling a single edge in general graphs (where O ∗(·) suppresses poly(1/ϵ) and poly(log n) dependencies). We ask whether the query complexity lower bound for sampling a single edge can be circumvented when multiple samples are required. That is, can we get an improved amortized persample cost if we allow a preprocessing phase? We answer in the affirmative. We present an algorithm that, if one knows the number of required samples q in advance, has an overall cost that is sublinear in q, namely, O∗(√ q · (n/ √ m)), which is strictly preferable to O∗(q · (n/ √ m)) cost resulting from q invocations of the algorithm by Eden and Rosenbaum. Subsequent to a preliminary version of this work, Tětek and Thorup (arXiv, preprint) proved that this bound is essentially optimal.more » « less

We study counting problems for several types of orientations of chordal graphs: sourcesinkfree orientations, sinkfree orientations, acyclic orientations, and bipolar orientations, and, for the latter two, we also present lineartime uniform samplers. Counting sinkfree, acyclic, or bipolar orientations are known to be #Pcomplete for general graphs, motivating our study on a restricted, yet wellstudied, graph class. Our main focus is sourcesinkfree orientations, a natural restricted version of sinkfree orientations related to strong orientations, which we introduce in this work. These orientations are intriguing, since despite their similarity, currently known FPRAS and sampling techniques (such as Markov chains or sinkpopping) that apply to sinkfree orientations do not seem to apply to sourcesinkfree orientations. We present fast polynomialtime algorithms counting these orientations on chordal graphs. Our approach combines dynamic programming with inclusionexclusion (going two levels deep for sourcesinkfree orientations and one level for sinkfree orientations) throughout the computation. Dynamic programming counting algorithms can be typically used to produce a uniformly random sample. However, due to the negative terms of the inclusionexclusion, the typical approach to obtain a polynomialtime sampling algorithm does not apply in our case. Obtaining such an almost uniform sampling algorithm for sourcesinkfree orientations in chordal graphs remains an open problem. Little is known about counting or sampling of acyclic or bipolar orientations, even on restricted graph classes. We design efficient (lineartime) exact uniform sampling algorithms for these orientations on chordal graphs. These algorithms are a byproduct of our counting algorithms, but unlike in other works that provide dynamicprogrammingbased samplers, we produce a random orientation without computing the corresponding count, which leads to a faster running time than the counting algorithm (since it avoids manipulation of large integers).more » « less