Title: Fast BFS-Based Triangle Counting on GPUs
In this paper, we propose a novel method, GSM, to compute graph matching (subgraph isomorphism) on GPUs. Unlike previous formulations of graph matching, our approach is BFS-based and thus can be mapped onto GPUs in a massively parallel fashion. Our implementation uses the Gunrock programming model, and we evaluate its runtime and memory consumption against previous state-of-the-art work. We sustain a peak traversed-edges-per-second (TEPS) rate of nearly 10 GTEPS. Our algorithm is the most scalable and parallel among all existing GPU implementations and also outperforms all existing CPU distributed implementations. This work specifically focuses on leveraging our implementation on the triangle counting problem for the Subgraph Isomorphism Graph Challenge, demonstrating a geometric mean speedup over the 2018 champion of 3.84x.
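For readers unfamiliar with the triangle counting problem named in the abstract, the following is a minimal sequential sketch in Python. It is not GSM and does not use Gunrock or a GPU; it swaps in the common set-intersection formulation, where each edge is oriented from its lower-ranked to its higher-ranked endpoint so that every triangle is counted exactly once. The function name `count_triangles` and the (degree, id) ranking are assumptions made for illustration.

```python
from collections import defaultdict


def count_triangles(edges):
    """Count triangles in an undirected simple graph given as (u, v) pairs.

    Illustrative CPU sketch only; not the paper's GPU method.
    """
    # Build adjacency sets, ignoring self-loops and duplicate edges.
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    # Rank vertices by (degree, id) and keep only neighbors of higher rank.
    # Each triangle then appears exactly once, at its lowest-ranked vertex.
    rank = {v: (len(adj[v]), v) for v in adj}
    forward = {v: {w for w in adj[v] if rank[w] > rank[v]} for v in adj}

    total = 0
    for u in forward:
        for v in forward[u]:
            # Common higher-ranked neighbors of u and v close a triangle.
            total += len(forward[u] & forward[v])
    return total
```

Vertex ids are assumed to be mutually comparable (for example, all integers), because ties in degree are broken by id.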
Award ID(s):
1740333 1629657
NSF-PAR ID:
10171721
Journal Name:
Proceedings of the IEEE High Performance Extreme Computing Conference
Sponsoring Org:
National Science Foundation
More Like this
  1. Computing strongly connected components (SCC) is among the most fundamental problems in graph analytics. Given the large size of today's real-world graphs, parallel SCC implementation is increasingly important. SCC is challenging in the parallel setting and is particularly hard on large-diameter graphs. Many existing parallel SCC implementations can be even slower than Tarjan's sequential algorithm on large-diameter graphs.

    To tackle this challenge, we propose an efficient parallel SCC implementation using a new parallel reachability approach. Our solution is based on a novel idea referred to as vertical granularity control (VGC). It breaks the synchronization barriers to increase parallelism and hide scheduling overhead. To use VGC in our SCC algorithm, we also design an efficient data structure called the parallel hash bag. It uses parallel dynamic resizing to avoid redundant work in maintaining frontiers (vertices processed in a round).

    We implement the parallel SCC algorithm by Blelloch et al. (J. ACM, 2020) using our new parallel reachability approach. We compare our implementation to state-of-the-art systems, including GBBS, iSpan, Multi-step, and our highly optimized implementation of Tarjan's (sequential) algorithm, on 18 graphs, including social, web, k-NN, and lattice graphs. On a machine with 96 cores, our implementation is the fastest on 16 out of 18 graphs. On average (geometric mean) over all graphs, our SCC is 6.0× faster than the best previous parallel code (GBBS), 12.8× faster than Tarjan's sequential algorithm, and 2.7× faster than the best existing implementation on each graph.

    We believe that our techniques are of independent interest. We also apply our parallel hash bag and VGC scheme to other graph problems, including connectivity and least-element lists (LE-lists). Our implementations improve the performance of the state-of-the-art parallel implementations for these two problems.

     
  2. Finding the subgraphs of a big graph that satisfy certain conditions (a.k.a. subgraph search) is useful in many applications such as community detection and subgraph matching. These problems often generate a search-space tree whose size is exponential in the size of the input graph. GPUs with thousands of cores are a natural choice to speed up subgraph search, but existing GPU solutions either conduct BFS on the search-space tree, which leads to memory overflow due to intermediate subgraph-size explosion, or conduct DFS on the search-space tree, which is memory-efficient but can be 2 orders of magnitude slower than a BFS solution. In this paper, we present G2-AIMD, a subgraph-centric framework for efficient subGraph Search on GPUs, which enjoys the efficiency of BFS on the search-space tree while avoiding intermediate subgraph-size explosion through novel system designs such as adaptive chunk-size adjustment and host-memory subgraph buffering, inspired by the additive-increase/multiplicative-decrease (AIMD) algorithm in TCP congestion control. G2-AIMD provides a convenient subgraph-centric programming interface to facilitate the implementation of subgraph search algorithms on top, so that they enjoy the above performance merits. G2-AIMD also supports multi-GPU execution where each GPU only needs to load a fraction of the input graph. To demonstrate the efficiency and scalability of G2-AIMD, two algorithms were implemented on top with additional optimization techniques, and they significantly outperform the existing GPU solutions.
  3. The big graph database provides strong modeling capabilities and efficient querying for complex applications. Subgraph isomorphism, which efficiently finds exact matches of a query graph in the database, is a challenging problem. Current subgraph isomorphism approaches are mostly based on the pruning strategy proposed by Ullmann. These techniques have two significant drawbacks: first, they are unable to efficiently handle complex queries, and second, their implementations need large indexes that require large memory resources. In this paper, we describe a new subgraph isomorphism approach, the HyGraph algorithm, that is efficient both in querying and in the memory required for index creation. We compare the HyGraph algorithm with two popular existing approaches, GraphQL and Cypher, using complexity measures and experimentally using three big graph data sets: (1) a country-level population database, (2) a simulated bank database, and (3) a publicly available World Cup big graph database. It is shown that the HyGraph solution performs as well as or significantly better than competing algorithms for the query operations on these big databases, making it an excellent candidate for subgraph isomorphism queries in real scenarios.
  4. Consider an algorithm performing a computation on a huge random object (for example a random graph or a "long" random walk). Is it necessary to generate the entire object prior to the computation, or is it possible to provide query access to the object and sample it incrementally "on-the-fly" (as requested by the algorithm)? Such an implementation should emulate the random object by answering queries in a manner consistent with an instance of the random object sampled from the true distribution (or close to it). This paradigm is useful when the algorithm is sub-linear and thus, sampling the entire object up front would ruin its efficiency. Our first set of results focuses on undirected graphs with independent edge probabilities, i.e. each edge is chosen as an independent Bernoulli random variable. We provide a general implementation for this model under certain assumptions. Then, we use this to obtain the first efficient local implementations for the Erdős-Rényi G(n,p) model for all values of p, and the Stochastic Block model. As in previous local-access implementations for random graphs, we support Vertex-Pair and Next-Neighbor queries. In addition, we introduce a new Random-Neighbor query. Next, we give the first local-access implementation for All-Neighbors queries in the (sparse and directed) Kleinberg’s Small-World model. Our implementations require no pre-processing time, and answer each query using O(poly(log n)) time, random bits, and additional space. Next, we show how to implement random Catalan objects, specifically focusing on Dyck paths (balanced random walks on the integer line that are always non-negative). Here, we support Height queries to find the location of the walk, and First-Return queries to find the time when the walk returns to a specified location. This in turn can be used to implement Next-Neighbor queries on random rooted ordered trees, and Matching-Bracket queries on random well-bracketed expressions (the Dyck language).
    Finally, we introduce two features to define a new model that: (1) allows multiple independent (and even simultaneous) instantiations of the same implementation to be consistent with each other without the need for communication, and (2) allows us to generate a richer class of random objects that do not have a succinct description. Specifically, we study uniformly random valid q-colorings of an input graph G with maximum degree Δ. This is in contrast to prior work in the area, where the relevant random objects are defined as a distribution with O(1) parameters (for example, n and p in the G(n,p) model). The distribution over valid colorings is instead specified via a "huge" input (the underlying graph G) that is far too large to be read by a sub-linear time algorithm. Instead, our implementation accesses G through local neighborhood probes, and is able to answer queries to the color of any given vertex in sub-linear time for q ≥ 9Δ, in a manner that is consistent with a specific random valid coloring of G. Furthermore, the implementation is memory-less, and can maintain consistency with non-communicating copies of itself.
  5. There is more than a decade-long history of using static analysis to find bugs in systems such as Linux. Most of the existing static analyses developed for these systems are simple checkers that find bugs based on pattern matching. Despite the presence of many sophisticated interprocedural analyses, few of them have been employed to improve checkers for systems code due to their complex implementations and poor scalability. In this article, we revisit the scalability problem of interprocedural static analysis from a “Big Data” perspective. That is, we turn sophisticated code analysis into Big Data analytics and leverage novel data processing techniques to solve this traditional programming language problem. We propose Graspan, a disk-based parallel graph system that uses an edge-pair centric computation model to compute dynamic transitive closures on very large program graphs. We develop two backends for Graspan, namely, Graspan-C running on CPUs and Graspan-G on GPUs, and present their designs in the article. Graspan-C can analyze large-scale systems code on any commodity PC, while, if GPUs are available, Graspan-G can be readily used to achieve orders of magnitude speedup by harnessing a GPU’s massive parallelism. We have implemented fully context-sensitive pointer/alias and dataflow analyses on Graspan. An evaluation of these analyses on large codebases, such as Linux and Apache Hadoop, written in multiple languages demonstrates that their Graspan implementations are language-independent, scale to millions of lines of code, and are much simpler than their original implementations. Moreover, we show that these analyses can be used to uncover many real-world bugs in large-scale systems code.
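Item 2 in the list above refers to the additive-increase/multiplicative-decrease (AIMD) rule from TCP congestion control as the inspiration for adaptive chunk-size adjustment. The Python sketch below shows only the generic AIMD update rule, not G2-AIMD's actual policy: the function name `next_chunk_size`, the `overflowed` signal, and all default parameter values are hypothetical and chosen purely for illustration.

```python
def next_chunk_size(current, overflowed, step=64, factor=0.5,
                    min_size=1, max_size=4096):
    """Return the next chunk size under a generic AIMD rule.

    All parameters are hypothetical; they are not taken from G2-AIMD.
    """
    if overflowed:
        # Multiplicative decrease: back off quickly after memory pressure.
        return max(min_size, int(current * factor))
    # Additive increase: grow cautiously while the last chunk fit in memory.
    return min(max_size, current + step)
```

The asymmetry is the point of the design: growth is slow and linear, while back-off is fast and geometric, so the chunk size hovers just below the level that triggers overflow.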