skip to main content

Title: Triangle and Four Cycle Counting with Predictions in Graph Streams
We propose data-driven one-pass streaming algorithms for estimating the number of triangles and four cycles, two fundamental problems in graph analytics that are widely studied in the graph data stream literature. Recently, Hsu et al. (2019a) and Jiang et al. (2020) applied machine learning techniques in other data stream problems, using a trained oracle that can predict certain properties of the stream elements to improve on prior “classical” algorithms that did not use oracles. In this paper, we explore the power of a “heavy edge” oracle in multiple graph edge streaming models. In the adjacency list model, we present a one-pass triangle counting algorithm improving upon the previous space upper bounds without such an oracle. In the arbitrary order model, we present algorithms for both triangle and four cycle estimation with fewer passes and the same space complexity as in previous algorithms, and we show several of these bounds are optimal. We analyze our algorithms under several noise models, showing that the algorithms perform well even when the oracle errs. Our methodology expands upon prior work on “classical” streaming algorithms, as previous multi-pass and random order streaming algorithms can be seen as special cases of our algorithms, where the first pass or random order was used to implement the heavy edge oracle. Lastly, our experiments demonstrate advantages of the proposed method compared to state-of-the-art streaming algorithms.  more » « less
Award ID(s):
2006664 2022448 2006798
Author(s) / Creator(s):
; ; ; ; ; ; ;
Date Published:
Journal Name:
Tenth International Conference on Learning Representations (ICLR 2022)
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We study graph computations in an enhanced data streaming setting, where a space-bounded client reading the edge stream of a massive graph may delegate some of its work to a cloud service. We seek algorithms that allow the client to verify a purported proof sent by the cloud service that the work done in the cloud is correct. A line of work starting with Chakrabarti et al. (ICALP 2009) has provided such algorithms, which we call schemes, for several statistical and graph-theoretic problems, many of which exhibit a tradeoff between the length of the proof and the space used by the streaming verifier. This work designs new schemes for a number of basic graph problems - including triangle counting, maximum matching, topological sorting, and single-source shortest paths - where past work had either failed to obtain smooth tradeoffs between these two key complexity measures or only obtained suboptimal tradeoffs. Our key innovation is having the verifier compute certain nonlinear sketches of the input stream, leading to either new or improved tradeoffs. In many cases, our schemes, in fact, provide optimal tradeoffs up to logarithmic factors. Specifically, for most graph problems that we study, it is known that the product of the verifier’s space cost v and the proof length h must be at least Omega(n^2) for n-vertex graphs. However, matching upper bounds are only known for a handful of settings of h and v on the curve h*v = ~Theta(n^2). For example, for counting triangles and maximum matching, schemes with costs lying on this curve are only known for (h = ~O(n²), v = ~O(1)), (h = ~O(n), v = ~O(n)), and the trivial (h = ~O(1), v = ~O(n²)). A major message of this work is that by exploiting nonlinear sketches, a significant "portion" of costs on the tradeoff curve h*v=n^2 can be achieved. 
    more » « less
  2. We revisit the much-studied problem of space-efficiently estimating the number of triangles in a graph stream, and extensions of this problem to counting fixed-sized cliques and cycles, obtaining a number of new upper and lower bounds. For the important special case of counting triangles, we give a $4$-pass, $(1\pm\varepsilon)$-approximate, randomized algorithm that needs at most $\widetilde{O}(\varepsilon^{-2}\cdot m^{3/2}/T)$ space, where $m$ is the number of edges and $T$ is a promised lower bound on the number of triangles. This matches the space bound of a very recent algorithm (McGregor et al., PODS 2016), with an arguably simpler and more general technique. We give an improved multi-pass lower bound of $\Omega(\min\{m^{3/2}/T, m/\sqrt{T}\})$, applicable at essentially all densities $\Omega(n) \le m \le O(n^2)$. We also prove other multi-pass lower bounds in terms of various structural parameters of the input graph. Together, our results resolve a couple of open questions raised in recent work (Braverman et al., ICALP 2013). Our presentation emphasizes more general frameworks, for both upper and lower bounds. We give a sampling algorithm for counting arbitrary subgraphs and then improve it via combinatorial means in the special cases of counting odd cliques and odd cycles. Our results show that these problems are considerably easier in the cash-register streaming model than in the turnstile model, where previous work had focused (Manjunath et al., ESA 2011; Kane et al., ICALP 2012). We use Tur{\'a}n graphs and related gadgets to derive lower bounds for counting cliques and cycles, with triangle-counting lower bounds following as a corollary. 
    more » « less
  3. We consider the problem of space-efficiently estimating the number of simplices in a hypergraph stream. This is the most natural hypergraph generalization of the highly-studied problem of estimating the number of triangles in a graph stream. Our input is a k-uniform hypergraph H with n vertices and m hyperedges, each hyperedge being a k-sized subset of vertices. A k-simplex in H is a subhypergraph on k+1 vertices X such that all k+1 possible hyperedges among X exist in H. The goal is to process the hyperedges of H, which arrive in an arbitrary order as a data stream, and compute a good estimate of T_k(H), the number of k-simplices in H. We design a suite of algorithms for this problem. As with triangle-counting in graphs (which is the special case k = 2), sublinear space is achievable but only under a promise of the form T_k(H) ≥ T. Under such a promise, our algorithms use at most four passes and together imply a space bound of O(ε^{-2} log δ^{-1} polylog n ⋅ min{(m^{1+1/k})/T, m/(T^{2/(k+1)})}) for each fixed k ≥ 3, in order to guarantee an estimate within (1±ε)T_k(H) with probability ≥ 1-δ. We also give a simpler 1-pass algorithm that achieves O(ε^{-2} log δ^{-1} log n⋅ (m/T) (Δ_E + Δ_V^{1-1/k})) space, where Δ_E (respectively, Δ_V) denotes the maximum number of k-simplices that share a hyperedge (respectively, a vertex), which generalizes a previous result for the k = 2 case. We complement these algorithmic results with space lower bounds of the form Ω(ε^{-2}), Ω(m^{1+1/k}/T), Ω(m/T^{1-1/k}) and Ω(mΔ_V^{1/k}/T) for multi-pass algorithms and Ω(mΔ_E/T) for 1-pass algorithms, which show that some of the dependencies on parameters in our upper bounds are nearly tight. Our techniques extend and generalize several different ideas previously developed for triangle counting in graphs, using appropriate innovations to handle the more complicated combinatorics of hypergraphs. 
    more » « less
  4. Memory-hard functions (MHFs) are a key cryptographic primitive underlying the design of moderately expensive password hashing algorithms and egalitarian proofs of work. Over the past few years several increasingly stringent goals for an MHF have been proposed including the requirement that the MHF have high sequential space-time (ST) complexity, parallel space-time complexity, amortized area-time (aAT) complexity and sustained space complexity. Data-Independent Memory Hard Functions (iMHFs) are of special interest in the context of password hashing as they naturally resist side-channel attacks. iMHFs can be specified using a directed acyclic graph (DAG) $G$ with $N=2^n$ nodes and low indegree and the complexity of the iMHF can be analyzed using a pebbling game. Recently, Alwen et al. [CCS'17] constructed an DAG called DRSample which has aAT complexity at least $\Omega\left( N^2/\log N\right)$. Asymptotically DRSample outperformed all prior iMHF constructions including Argon2i, winner of the password hashing competition (aAT cost $\mathcal{O}\left(N^{1.767}\right)$), though the constants in these bounds are poorly understood. We show that the the greedy pebbling strategy of Boneh et al. [ASIACRYPT'16] is particularly effective against DRSample e.g., the aAT cost is $\mathcal{O}\left( N^2/\log N\right)$. In fact, our empirical analysis {\em reverses} the prior conclusion of Alwen et al. that DRSample provides stronger resistance to known pebbling attacks for practical values of $N \leq 2^{24}$. We construct a new iMHF candidate (DRSample+BRG) by using the bit-reversal graph to extend DRSample. We then prove that the construction is asymptotically optimal under every MHF criteria, and we empirically demonstrate that our iMHF provides the best resistance to {\em known} pebbling attacks. For example, we show that any parallel pebbling attack either has aAT cost $\omega(N^2)$ or requires at least $\Omega(N)$ steps with $\Omega(N/\log N)$ pebbles on the DAG. This makes our construction the first practical iMHF with a strong sustained space-complexity guarantee and immediately implies that any parallel pebbling has aAT complexity $\Omega(N^2/\log N)$. We also prove that any sequential pebbling (including the greedy pebbling attack) has aAT cost $\Omega\left( N^2\right)$ and, if a plausible conjecture holds, any parallel pebbling has aAT cost $\Omega(N^2 \log \log N/\log N)$ --- the best possible bound for an iMHF. We implement our new iMHF and demonstrate that it is just as fast as Argon2. Along the way we propose a simple modification to the Argon2 round function which increases an attacker's aAT cost by nearly an order of magnitude without increasing running time on a CPU. Finally, we give a pebbling reduction which proves that in the parallel random oracle model (PROM) the cost of evaluating an iMHF like Argon2i or DRSample+BRG is given by the pebbling cost of the underlying DAG. Prior pebbling reductions assumed that the iMHF round function concatenates input labels before hashing and did not apply to practical iMHFs such as Argon2i, DRSample or DRSample+BRG where input labels are instead XORed together. 
    more » « less
  5. We initiate the study of biologically-inspired spiking neural networks from the perspective of streaming algorithms. Like computers, human brains face memory limitations, which pose a significant obstacle when processing large scale and dynamically changing data. In computer science, these challenges are captured by the well-known streaming model, which can be traced back to Munro and Paterson `78 and has had significant impact in theory and beyond. In the classical streaming setting, one must compute a function f of a stream of updates 𝒮 = {u₁,…,u_m}, given restricted single-pass access to the stream. The primary complexity measure is the space used by the algorithm. In contrast to the large body of work on streaming algorithms, relatively little is known about the computational aspects of data processing in spiking neural networks. In this work, we seek to connect these two models, leveraging techniques developed for streaming algorithms to better understand neural computation. Our primary goal is to design networks for various computational tasks using as few auxiliary (non-input or output) neurons as possible. The number of auxiliary neurons can be thought of as the "space" required by the network. Previous algorithmic work in spiking neural networks has many similarities with streaming algorithms. However, the connection between these two space-limited models has not been formally addressed. We take the first steps towards understanding this connection. On the upper bound side, we design neural algorithms based on known streaming algorithms for fundamental tasks, including distinct elements, approximate median, and heavy hitters. The number of neurons in our solutions almost match the space bounds of the corresponding streaming algorithms. As a general algorithmic primitive, we show how to implement the important streaming technique of linear sketching efficiently in spiking neural networks. On the lower bound side, we give a generic reduction, showing that any space-efficient spiking neural network can be simulated by a space-efficient streaming algorithm. This reduction lets us translate streaming-space lower bounds into nearly matching neural-space lower bounds, establishing a close connection between the two models. 
    more » « less