Title: G2-AIMD: A Memory-Efficient Subgraph-Centric Framework for Efficient Subgraph Search on GPUs
Finding the subgraphs of a big graph that satisfy certain conditions (a.k.a. subgraph search) is useful in many applications such as community detection and subgraph matching. These problems often generate a search-space tree whose size is exponential in the size of the input graph. GPUs with thousands of cores are a natural choice to speed up subgraph search, but existing GPU solutions either conduct BFS on the search-space tree, which leads to memory overflow due to intermediate-subgraph explosion, or conduct DFS on the search-space tree, which is memory-efficient but can be two orders of magnitude slower than a BFS solution. In this paper, we present G2-AIMD, a subgraph-centric framework for efficient subGraph Search on GPUs, which enjoys the efficiency of BFS on the search-space tree while avoiding intermediate-subgraph explosion, using novel system designs such as adaptive chunk-size adjustment and host-memory subgraph buffering inspired by the additive-increase/multiplicative-decrease (AIMD) algorithm in TCP congestion control. G2-AIMD provides a convenient subgraph-centric programming interface that facilitates the implementation of subgraph search algorithms on top, so that they enjoy the above performance merits. G2-AIMD also supports multi-GPU execution, where each GPU only needs to load a fraction of the input graph. To demonstrate the efficiency and scalability of G2-AIMD, two algorithms were implemented on top with additional optimization techniques; they significantly outperform existing GPU solutions.
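To make the AIMD analogy concrete, here is a minimal host-side sketch of the control logic in Python. The names and constants (ChunkSizeController, increase_step, decrease_factor) are assumptions for exposition, not the framework's actual interface: the chunk of subgraphs expanded per GPU round grows additively while device memory suffices and shrinks multiplicatively when memory pressure is detected.

```python
# Illustrative sketch only: all names and constants are assumptions,
# not G2-AIMD's actual API.
class ChunkSizeController:
    def __init__(self, initial=1024, increase_step=1024,
                 decrease_factor=0.5, minimum=64):
        self.chunk_size = initial                # subgraphs expanded per round
        self.increase_step = increase_step       # additive increase on success
        self.decrease_factor = decrease_factor   # multiplicative decrease
        self.minimum = minimum

    def on_round_succeeded(self):
        # Device memory sufficed: probe for more parallelism.
        self.chunk_size += self.increase_step

    def on_memory_pressure(self):
        # Intermediate subgraphs (nearly) overflowed device memory:
        # back off multiplicatively, as TCP does on packet loss.
        self.chunk_size = max(self.minimum,
                              int(self.chunk_size * self.decrease_factor))
```

On a back-off, the subgraphs that no longer fit on the device would be spilled to the host-memory buffer the abstract mentions, to be re-expanded in a later round.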
Award ID(s):
2229394
PAR ID:
10498840
Publisher / Repository:
IEEE
Date Published:
Journal Name:
40th IEEE International Conference on Data Engineering
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Subgraph search problems such as maximal clique enumeration and subgraph matching generate a search-space tree that serial backtracking algorithms traverse in a depth-first, recursive manner. Since Jenkins et al. reported the backtracking paradigm to be sub-optimal for GPU acceleration, breadth-first traversal of the search-space tree has been widely adopted by GPU algorithms. However, breadth-first traversal produces many intermediate subgraphs that exhaust GPU device memory. Recent works revive the depth-first backtracking paradigm for GPU acceleration, where each warp is a basic processing unit with its own stack in device memory for subgraph backtracking. However, these works adopt complicated load-balancing methods that incur significant overhead, and they use hardcoded, ad-hoc stack space that may lead to inaccuracy when the allocated space is insufficient. In this paper, we use subgraph matching as a case study to propose novel depth-first GPU solutions that address the above problems. Our approach, called T-DFS, decomposes the computation into independent tasks that process search-space subtrees and are managed by an efficient lock-free circular task queue. Tasks are distributed to different warps for parallel processing, and a novel timeout mechanism eliminates straggler tasks to ensure load balancing. We also support flexible, fine-grained dynamic memory allocation for stack spaces to avoid the stack-allocation pitfalls of existing works. Extensive experiments on real graphs show that T-DFS significantly outperforms existing depth-first GPU solutions for the subgraph matching application.
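The timeout-driven splitting described above can be illustrated with a serial Python sketch. The actual system uses a lock-free circular queue in GPU device memory shared by warps; run_all, expand, and TIME_BUDGET here are hypothetical names for exposition.

```python
# Serial simulation of the scheduling idea, not T-DFS's actual code.
import collections
import time

TIME_BUDGET = 0.01  # seconds a task may run before it is split (assumed value)

def run_all(initial_tasks, expand):
    """Process search-space subtrees; split any task that overruns its budget.

    expand(task, deadline) explores the subtree until it finishes or until
    time.monotonic() passes deadline, returning (done, leftover_subtasks).
    """
    queue = collections.deque(initial_tasks)
    while queue:
        task = queue.popleft()
        done, leftovers = expand(task, time.monotonic() + TIME_BUDGET)
        if not done:
            # Timed out: re-enqueue the unexplored subtrees so other
            # workers can pick them up, eliminating straggler tasks.
            queue.extend(leftovers)
```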
  2. GPUs are critical for compute-intensive applications, yet emerging workloads such as recommender systems, graph analytics, and data analytics often exceed GPU memory capacity. Existing solutions allow GPUs to use CPU DRAM or SSDs as external memory, and the GPU-centric approach enables GPU threads to directly issue NVMe requests, further avoiding CPU intervention. However, current GPU-centric approaches adopt synchronous I/O, forcing threads to stall during long communication delays. We propose AGILE, a lightweight asynchronous GPU-centric I/O library that eliminates deadlock risks and integrates a flexible HBM-based software cache. AGILE overlaps computation and I/O, improving performance by up to 1.88× across workloads with diverse computation-to-communication ratios. Compared to BaM on DLRM, AGILE achieves up to 1.75× speedup through efficient design and overlapping; on graph applications, AGILE reduces software cache overhead by up to 3.12× and NVMe I/O overhead by up to 2.85×; AGILE also lowers per-thread register usage by up to 1.32×.
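The computation/I/O overlap that AGILE exploits can be sketched in a few lines of Python: prefetch the next block on a background thread while computing on the current one. load_block and compute_on are hypothetical stand-ins; the real library issues asynchronous NVMe requests directly from GPU threads.

```python
# Illustrative pipelining sketch only, not AGILE's actual mechanism.
from concurrent.futures import ThreadPoolExecutor

def pipeline(block_ids, load_block, compute_on):
    """Overlap I/O and compute: load block i+1 while computing on block i."""
    with ThreadPoolExecutor(max_workers=1) as io:
        future = io.submit(load_block, block_ids[0])   # issue the first read
        for next_id in block_ids[1:]:
            data = future.result()                     # wait for in-flight I/O
            future = io.submit(load_block, next_id)    # start the next read early
            compute_on(data)                           # compute overlaps the read
        compute_on(future.result())                    # process the last block
```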
  3. In this paper, we propose a novel method, GSM, to compute graph matching (subgraph isomorphism) on GPUs. Unlike previous formulations of graph matching, our approach is BFS-based and thus can be mapped onto GPUs in a massively parallel fashion. Our implementation uses the Gunrock programming model, and we evaluate it against previous state-of-the-art work in runtime and memory consumption. We sustain a peak traversed-edges-per-second (TEPS) rate of nearly 10 GTEPS. Our algorithm is the most scalable and parallel among all existing GPU implementations and also outperforms all existing CPU distributed implementations. This work specifically focuses on applying our implementation to the triangle counting problem of the Subgraph Isomorphism Graph Challenge, demonstrating a geometric-mean speedup of 3.84× over the 2018 champion.
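The BFS-based formulation can be illustrated with a small serial Python sketch (adjacency dicts of sets; this is illustrative, not the Gunrock-based implementation): all partial matches at one level are extended by the next query vertex in a single bulk step, which is what maps naturally onto massively parallel hardware.

```python
# Illustrative breadth-first subgraph matching sketch, serial Python.
def match_bfs(query_order, query_adj, data_adj):
    """Enumerate all embeddings of the query graph in the data graph.

    query_adj / data_adj: dicts mapping each vertex to the set of its
    neighbors. query_order must list query vertices so that each one
    (after the first) has at least one earlier neighbor.
    """
    partial = [{}]  # level 0: a single empty partial match
    for u in query_order:
        next_level = []
        for m in partial:
            matched = [m[q] for q in query_adj[u] if q in m]
            if matched:
                # Candidates must be adjacent to every already-matched neighbor.
                cands = set.intersection(*(data_adj[v] for v in matched))
            else:
                cands = set(data_adj)  # first query vertex: try every data vertex
            for v in cands - set(m.values()):  # enforce injectivity
                next_level.append({**m, u: v})
        partial = next_level  # intermediate matches: this level can explode in size
    return partial
```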
  4. On a GPU cluster, the ratio of high computing power to communication bandwidth makes scaling breadth-first search (BFS) on a scale-free graph extremely challenging. By separating high- and low-out-degree vertices, we present an implementation with scalable computation and a model for scalable communication for BFS and direction-optimized BFS. Our communication model uses global reduction for high-degree vertices and point-to-point transmission for low-degree vertices. Leveraging the characteristics of degree separation, we reduce the graph size to one third of the conventional edge-list representation. With several other optimizations, we observe linear weak scaling as we increase the number of GPUs, achieving 259.8 GTEPS on a scale-33 Graph500 RMAT graph with 124 GPUs on the latest CORAL early-access system. (Proceedings of the 31st IEEE International Parallel and Distributed Processing Symposium)
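The degree-separated communication model can be sketched as a routing decision per frontier vertex (Python; the threshold and owner function are illustrative assumptions): high-degree vertices go into a set that is globally reduced across ranks, while low-degree vertices are sent point-to-point to the single rank that owns them.

```python
# Illustrative routing sketch only; threshold and ownership are assumed.
DEGREE_THRESHOLD = 1000  # assumed cutoff between "high" and "low" degree

def route_frontier(frontier, out_degree, owner, num_ranks):
    """Split a BFS frontier into a globally-reduced set and per-rank sends."""
    global_set = set()                      # high-degree: reduced across all ranks
    p2p = [[] for _ in range(num_ranks)]    # low-degree: point-to-point buffers
    for v in frontier:
        if out_degree[v] >= DEGREE_THRESHOLD:
            global_set.add(v)               # every rank likely needs this vertex
        else:
            p2p[owner(v)].append(v)         # only the owning rank needs it
    return global_set, p2p
```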
  5. Over the last two decades, frameworks for distributed-memory parallel computation, such as MapReduce, Hadoop, Spark, and Dryad, have gained significant popularity with the growing prevalence of large network datasets. The Massively Parallel Computation (MPC) model is the de-facto standard for theoretically studying graph algorithms in these frameworks. Subgraph counting is one such fundamental problem in analyzing massive graphs, with the main algorithmic challenges centering on designing methods that are both scalable and accurate. Given a graph G = (V, E) with n vertices, m edges, and T triangles, our first result is an algorithm that outputs a (1+ε)-approximation to T, with asymptotically optimal round and total space complexity, provided any S ≥ max{√m, n²/m} space per machine and assuming T = Ω(√(m/n)). Our result gives a quadratic improvement on the bound on T over previous works. We also provide a simple extension of our result to counting any subgraph of size k for constant k ≥ 1. Our second result is an O_δ(log log n)-round algorithm for exactly counting the number of triangles, whose total space usage is parameterized by the arboricity α of the input graph. We extend this result to exactly counting k-cliques for any constant k. Finally, we prove that a recent result of Bera, Pashanasangi and Seshadhri (ITCS 2020) for exactly counting all subgraphs of size at most 5 can be implemented in the MPC model in Õ_δ(√(log n)) rounds, O(n^δ) space per machine, and O(mα³) total space. In addition to our theoretical results, we simulate our triangle counting algorithms on real-world graphs obtained from the Stanford Network Analysis Project (SNAP) database. Our results show that our approximate and exact counting algorithms exhibit improvements in round complexity and approximation ratio, respectively, compared to two previous widely used algorithms for these problems.
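For reference, the sequential baseline behind these distributed triangle-counting algorithms is the standard oriented neighbor-intersection count, shown here as a minimal Python sketch: each triangle {a, b, c} is counted exactly once by orienting every edge from lower to higher (degree, id) rank.

```python
# Standard exact triangle counting by oriented neighbor intersection.
def count_triangles(adj):
    """adj: dict mapping each vertex to the set of its neighbors."""
    rank = {v: (len(adj[v]), v) for v in adj}  # total order on vertices
    # Keep only edges oriented toward higher-ranked endpoints.
    fwd = {v: {u for u in adj[v] if rank[u] > rank[v]} for v in adj}
    # For each oriented edge (v, u), common forward neighbors close a triangle.
    return sum(len(fwd[v] & fwd[u]) for v in fwd for u in fwd[v])
```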