skip to main content


Title: Partitioning Communication Streams Into Graph Snapshots
We present EASEE (Edge Advertisements into Snapshots using Evolving Expectations) for partitioning streaming communication data into static graph snapshots. Given streaming communication events (A talks to B), EASEE identifies when events suffice for a static graph (a snapshot ). EASEE uses combinatorial statistical models to adaptively find when a snapshot is stable, while watching for significant data shifts – indicating a new snapshot should begin. If snapshots are not found carefully, they poorly represent the underlying data – and downstream graph analytics fail: We show a community detection example. We demonstrate EASEE's strengths against several real-world datasets, and its accuracy against known-answer synthetic datasets. Synthetic datasets' results show that (1) EASEE finds known-answer data shifts very quickly; and (2) ignoring these shifts drastically affects analytics on resulting snapshots. We show that previous work misses these shifts. Further, we evaluate EASEE against seven real-world datasets (330 K to 2.5B events), and find snapshot-over-time behaviors missed by previous works. Finally, we show that the resulting snapshots' measured properties (e.g., graph density) are altered by how snapshots are identified from the communication event stream. In particular, EASEE's snapshots do not generally “densify” over time, contradicting previous influential results that used simpler partitioning methods.  more » « less
Award ID(s):
1916084
NSF-PAR ID:
10388359
Author(s) / Creator(s):
; ; ; ; ; ;
Date Published:
Journal Name:
IEEE Transactions on Network Science and Engineering
ISSN:
2334-329X
Page Range / eLocation ID:
1 to 18
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Balanced graph partitioning is a critical step for many large-scale distributed computations with relational data. As graph datasets have grown in size and density, a range of highly-scalable balanced partitioning algorithms have appeared to meet varied demands across different domains. As the starting point for the present work, we observe that two recently introduced families of iterative partitioners---those based on restreaming and those based on balanced label propagation (including Facebook's Social Hash Partitioner)---can be viewed through a common modular framework of design decisions. With the help of this modular perspective, we find that a key combination of design decisions leads to a novel family of algorithms with notably better empirical performance than any existing highly-scalable algorithm on a broad range of real-world graphs. The resulting prioritized restreaming algorithms employ a constraint management strategy based on multiplicative weights, borrowed from the restreaming literature, while adopting notions of priority from balanced label propagation to optimize the ordering of the streaming process. Our experimental results consider a range of stream orders, where a dynamic ordering based on what we call ambivalence is broadly the most performative in terms of the cut quality of the resulting balanced partitions, with a static ordering based on degree being nearly as good. 
    more » « less
  2. There has been a growing interest in the graph-streaming setting where a continuous stream of graph updates is mixed with graph queries. In principle, purely-functional trees are an ideal fit for this setting as they enable safe parallelism, lightweight snapshots, and strict serializability for queries. However, directly using them for graph processing leads to significant space overhead and poor cache locality. This paper presents C-trees, a compressed purely-functional search tree data structure that significantly improves on the space usage and locality of purely-functional trees. We design theoretically-efficient and practical algorithms for performing batch updates to C-trees, and also show that we can store massive dynamic real-world graphs using only a few bytes per edge, thereby achieving space usage close to that of the best static graph processing frameworks. To study the applicability of our data structure, we designed Aspen, a graph-streaming framework that extends the interface of Ligra with operations for updating graphs. We show that Aspen is faster than two state-of-the-art graph-streaming systems, Stinger and LLAMA, while requiring less memory, and is competitive in performance with the state-of-the-art static graph frameworks, Galois, GAP, and Ligra+. With Aspen, we are able to efficiently process the largest publicly-available graph with over two hundred billion edges in the graph-streaming setting using a single commodity multicore server with 1TB of memory. 
    more » « less
  3. We give an $\widetilde{O}(\sqrt{n})$-space single-pass 0.483-approximation streaming algorithm for estimating the maximum directed cut size (Max-DICUT) in a directed graph on n vertices. This improves over an $O(\log n)$-space $4 / 9 < 0.45$ approximation algorithm due to Chou, Golovnev, and Velusamy (FOCS 2020), which was known to be optimal for $o(\sqrt{n})$-space algorithms. Max-DICUT is a special case of a constraint satisfaction problem (CSP). In this broader context, we give the first CSP for which algorithms with $\widetilde{O}(\sqrt{n})$- space can provably outperform $o(\sqrt{n})$- space algorithms. The key technical contribution of our work is development of the notions of a first-order snapshot of a (directed) graph and of estimates of such snapshots. These snapshots can be used to simulate certain (non-streaming) Max-DICUT algorithms, including the “oblivious” algorithms introduced by Feige and Jozeph (Algorithmica, 2015), who showed that one such algorithm Previous work of the authors (SODA 2023) studied the restricted case of bounded-degree graphs, and observed that in this setting, it is straightforward to estimate the snapshot with $\ell_{1}$ errors and this suffices to simulate oblivious algorithms. But for unbounded-degree graphs, even defining an achievable and sufficient notion of estimation is subtle. We describe a new notion of snapshot estimation and prove its sufficiency using careful smoothing techniques, and then develop an algorithm which sketches such an estimate via a delicate process of intertwined vertex- and edge-subsampling. Prior to our work, the only streaming algorithms for any CSP on general instances were based on generalizations of the $O(\log n)$-space algorithm for Max-DICUT, and can roughly be characterized as based on “zeroth” order snapshots. Our work thus opens the possibility of a new class of algorithms for approximating CSPs by demonstrating that more sophisticated snapshots can outperform cruder ones in the case of Max-DICUT. 
    more » « less
  4. Today’s large-scale services (e.g., video streaming platforms, data centers, sensor grids) need diverse real-time summary statistics across multiple subpopulations of multidimensional datasets. However, state-of-the-art frameworks do not offer general and accurate analytics in real time at reasonable costs. The root cause is the combinatorial explosion of data subpopulations and the diversity of summary statistics we need to monitor simultaneously. We present Hydra, an efficient framework for multidimensional analytics that presents a novel combination of using a “sketch of sketches” to avoid the overhead of monitoring exponentially-many subpopulations and universal sketching to ensure accurate estimates for multiple statistics. We build Hydra as an Apache Spark plugin and address practical system challenges to minimize overheads at scale. Across multiple real-world and synthetic multidimensional datasets, we show that Hydra can achieve robust error bounds and is an order of magnitude more efficient in terms of operational cost and memory footprint than existing frameworks (e.g., Spark, Druid) while ensuring interactive estimation times. 
    more » « less
  5. Recent advances in Graph Neural Networks (GNNs) have changed the landscape of modern graph analytics. The complexity of GNN training and the scalability challenges have also sparked interest from the systems community, with efforts to build systems that provide higher efficiency and schemes to reduce costs. However, we observe that many such systems basically reinvent the wheel of much work done in the database world on scalable graph analytics engines. Further, they often tightly couple the scalability treatments of graph data processing with that of GNN training, resulting in entangled complex problems and systems that often do not scale well on one of those axes.

    In this paper, we ask a fundamental question: How far can we push existing systems for scalable graph analytics and deep learning (DL) instead of building custom GNN systems? Are compromises inevitable on scalability and/or runtimes? We propose Lotan, the first scalable and optimized data system for full-batch GNN training withdecoupled scalingthat bridges the hitherto siloed worlds of graph analytics systems and DL systems. Lotan offers a series of technical innovations, including re-imagining GNN training as query plan-like dataflows, execution plan rewriting, optimized data movement between systems, a GNN-centric graph partitioning scheme, and the first known GNN model batching scheme. We prototyped Lotan on top of GraphX and PyTorch. An empirical evaluation using several real-world benchmark GNN workloads reveals a promising nuanced picture: Lotan significantly surpasses the scalability of state-of-the-art custom GNN systems, while often matching or being only slightly behind on time-to-accuracy metrics in some cases. We also show the impact of our system optimizations. Overall, our work shows that the GNN world can indeed benefit from building on top of scalable graph analytics engines. Lotan's new level of scalability can also empower new ML-oriented research on ever-larger graphs and GNNs. 

    more » « less