Title: HBMax: Optimizing Memory Efficiency for Parallel Influence Maximization on Multicore Architectures
Influence maximization aims to select the k most influential vertices, or seeds, in a network, where influence is defined by a given diffusion process. Although computing an optimal seed set is NP-hard, efficient approximation algorithms exist. However, even state-of-the-art parallel implementations are limited by a sampling step that incurs a large memory footprint, which in turn limits the reachable problem size and the approximation quality. In this work, we study the memory footprint of the sampling process that collects reverse reachability information in the IMM (Influence Maximization via Martingales) algorithm over large real-world social networks. We present a memory-efficient optimization approach (called HBMax) based on Ripples, a state-of-the-art multi-threaded parallel influence maximization solution. HBMax uses a portion of the reverse reachable (RR) sets collected by the algorithm to learn the characteristics of the graph. It then compresses the intermediate reverse reachability information with Huffman coding or bitmap coding, and queries the partially decoded data, or the compressed data directly, to preserve the memory savings obtained through compression. On a NUMA architecture, we scale our solution up to 64 CPU cores and reduce the memory footprint by up to 82.1% with an average 6.3% speedup (the encoding overhead is offset by the performance gain from memory reduction) without loss of accuracy. For the largest tested graph, Twitter7 (with 1.4 billion edges), HBMax achieves a 5.9× compression ratio and a 2.2× speedup.
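To make the compress-then-query idea concrete, below is a minimal Python sketch (not HBMax's implementation) of an RR-set store that keeps dense sets as bitmaps and sparse sets as sorted ID lists, and answers coverage queries directly on the encoded data. The names (`RRSetStore`, `dense_threshold`) are hypothetical.

```python
# A minimal sketch of storing reverse reachable (RR) sets in a compressed
# form and answering membership queries without full decompression.
from bisect import bisect_left

class RRSetStore:
    """Stores each RR set either as a bitmap (dense sets) or as a sorted
    vertex-ID list (sparse sets), mirroring the dense/sparse trade-off
    behind bitmap coding."""

    def __init__(self, num_vertices, dense_threshold=0.05):
        self.n = num_vertices
        self.threshold = dense_threshold
        self.sets = []  # entries: ("bitmap", bytearray) or ("sparse", [ids])

    def add(self, rr_set):
        rr_set = sorted(set(rr_set))
        if len(rr_set) > self.threshold * self.n:
            bits = bytearray((self.n + 7) // 8)   # one bit per vertex
            for v in rr_set:
                bits[v >> 3] |= 1 << (v & 7)
            self.sets.append(("bitmap", bits))
        else:
            self.sets.append(("sparse", rr_set))

    def covers(self, set_idx, vertex):
        """Membership query directly on the encoded data: neither
        representation needs to be decoded as a whole."""
        kind, data = self.sets[set_idx]
        if kind == "bitmap":
            return bool(data[vertex >> 3] & (1 << (vertex & 7)))
        i = bisect_left(data, vertex)
        return i < len(data) and data[i] == vertex

# IMM-style greedy seed selection repeatedly asks which vertex covers the
# most not-yet-covered RR sets; `covers` supports that query on the store.
store = RRSetStore(num_vertices=10, dense_threshold=0.5)
store.add([1, 2, 3, 4, 5, 6])   # 6 of 10 vertices: dense -> bitmap
store.add([7, 9])               # sparse -> sorted ID list
print(store.covers(0, 3), store.covers(1, 8))  # True False
```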
Award ID(s):
2034169 1948447 2303820
NSF-PAR ID:
10377028
Journal Name:
The 31st International Conference on Parallel Architectures and Compilation Techniques (PACT 2022)
Page Range / eLocation ID:
412 to 425
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Influence Maximization (IM) is a crucial problem in data science. The goal is to find a fixed-size set of highly influential seed vertices on a network to maximize the influence spread along the edges. While IM is NP-hard on commonly used diffusion models, a greedy algorithm can achieve a (1 - 1/e)-approximation by repeatedly selecting the vertex with the highest marginal gain in influence as the next seed. However, we observe two performance issues in existing work that prevent it from scaling to today's large-scale graphs: space-inefficient memoization to estimate marginal gains, and a time-inefficient seed selection process due to a lack of parallelism.

    This paper significantly improves the scalability of IM using two key techniques. The first is a sketch-compression technique for the independent cascade model on undirected graphs, which combines the simulation and sketching approaches to achieve a time-space tradeoff. The second is a set of new data structures for parallel seed selection. Using these new approaches, we implemented PaC-IM: Parallel and Compressed IM.

    We compare PaC-IM with state-of-the-art parallel IM systems on a 96-core machine with 1.5 TB of memory. PaC-IM can process the ClueWeb graph, with 978M vertices and 75B edges, in about 2 hours. On average, across all tested graphs, our uncompressed version is 5-18× faster and about 1.4× more space-efficient than existing parallel IM systems. Using compression further saves 3.8× space with only 70% time overhead on average.
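As an illustration of the simulation/sketching time-space tradeoff (not PaC-IM's actual data structures), the Python sketch below memoizes only every k-th sampled sketch and regenerates the others on demand from their random seeds. All names (`CompressedSketches`, `keep_every`) are hypothetical.

```python
# A minimal sketch of trading recomputation time for sketch storage space.
import random

def simulate_sketch(graph, p, seed):
    """One independent-cascade 'sketch': a random live-edge sampling of the
    graph, here returned as the list of edges that survived."""
    rng = random.Random(seed)
    return [(u, v) for u, nbrs in graph.items() for v in nbrs
            if rng.random() < p]

class CompressedSketches:
    def __init__(self, graph, p, num_sketches, keep_every=4):
        self.graph, self.p = graph, p
        self.num_sketches = num_sketches
        # Store only a fraction of the sketches; the deterministic seed
        # lets us rebuild any of the others exactly.
        self.stored = {i: simulate_sketch(graph, p, i)
                       for i in range(0, num_sketches, keep_every)}

    def get(self, i):
        if i in self.stored:
            return self.stored[i]                       # memoized: space cost
        return simulate_sketch(self.graph, self.p, i)   # recompute: time cost

graph = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
cs = CompressedSketches(graph, p=0.5, num_sketches=8, keep_every=4)
print(len(cs.get(0)), len(cs.get(3)))  # stored sketch vs. recomputed sketch
```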

     
  2. Today's high-performance computing (HPC) applications produce vast volumes of data that are challenging to store and transfer efficiently during execution, so data compression is becoming a critical technique for mitigating the storage burden and data movement cost. Huffman coding is arguably the most efficient entropy coding algorithm in information theory and serves as a fundamental step in many modern compression algorithms such as DEFLATE. At the same time, today's HPC applications increasingly rely on accelerators such as GPUs on supercomputers, yet Huffman encoding suffers from low throughput on GPUs, creating a significant bottleneck in the overall data processing.

    In this paper, we propose and implement an efficient Huffman encoding approach for modern GPU architectures, which addresses two key challenges: (1) how to parallelize the entire Huffman encoding algorithm, including codebook construction, and (2) how to fully utilize the high memory bandwidth of modern GPU architectures. The detailed contribution is four-fold. (1) We develop an efficient parallel codebook construction on GPUs that scales effectively with the number of input symbols. (2) We propose a novel reduction-based encoding scheme that can efficiently merge codewords on GPUs. (3) We optimize the overall GPU performance by leveraging state-of-the-art CUDA APIs such as Cooperative Groups. (4) We evaluate our Huffman encoder thoroughly using six real-world application datasets on two advanced GPUs and compare it with our multi-threaded CPU Huffman encoder.

    Experiments show that our solution improves the encoding throughput by up to 5.0× and 6.8× on an NVIDIA RTX 5000 and a V100, respectively, over the state-of-the-art GPU Huffman encoder, and by up to 3.3× over the multi-threaded encoder on two 28-core Xeon Platinum 8280 CPUs.
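For reference, here is a minimal sequential Python version of the codebook construction and encoding steps that the paper parallelizes. It shows the logic only, with none of the GPU-specific optimizations.

```python
# Sequential Huffman codebook construction and encoding, to make the terms
# "codebook construction" and "codeword merging" concrete.
import heapq
from collections import Counter

def build_codebook(data):
    """Build a prefix code from symbol frequencies via the classic
    merge-two-lightest-trees construction."""
    freq = Counter(data)
    # Heap entries: (frequency, tie-breaker, tree), where a tree is either
    # a symbol or a (left, right) pair. The tie-breaker keeps comparisons
    # away from the tree payloads.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    if count == 1:                      # degenerate single-symbol input
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (t1, t2)))
        count += 1
    codebook = {}
    def walk(tree, prefix):             # assign 0/1 along the tree paths
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codebook[tree] = prefix
    walk(heap[0][2], "")
    return codebook

def encode(data, codebook):
    # Encoding concatenates variable-length codewords; the paper's
    # reduction-based scheme merges these chunks in parallel on the GPU.
    return "".join(codebook[s] for s in data)

data = "abracadabra"
cb = build_codebook(data)
print(cb, encode(data, cb))
```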
  3. Computing strongly connected components (SCC) is among the most fundamental problems in graph analytics. Given the large size of today's real-world graphs, parallel SCC implementation is increasingly important. SCC is challenging in the parallel setting and is particularly hard on large-diameter graphs. Many existing parallel SCC implementations can be even slower than Tarjan's sequential algorithm on large-diameter graphs.

    To tackle this challenge, we propose an efficient parallel SCC implementation using a new parallel reachability approach. Our solution is based on a novel idea referred to as vertical granularity control (VGC). It breaks the synchronization barriers to increase parallelism and hide scheduling overhead. To use VGC in our SCC algorithm, we also design an efficient data structure called the parallel hash bag. It uses parallel dynamic resizing to avoid redundant work in maintaining frontiers (vertices processed in a round).

    We implement the parallel SCC algorithm by Blelloch et al. (J. ACM, 2020) using our new parallel reachability approach. We compare our implementation to state-of-the-art systems, including GBBS, iSpan, Multi-step, and our highly optimized Tarjan's (sequential) algorithm, on 18 graphs, including social, web, k-NN, and lattice graphs. On a machine with 96 cores, our implementation is the fastest on 16 out of 18 graphs. On average (geometric mean) over all graphs, our SCC is 6.0× faster than the best previous parallel code (GBBS), 12.8× faster than Tarjan's sequential algorithm, and 2.7× faster than the best existing implementation on each graph.

    We believe that our techniques are of independent interest. We also apply our parallel hash bag and VGC scheme to other graph problems, including connectivity and least-element lists (LE-lists). Our implementations improve the performance of the state-of-the-art parallel implementations for these two problems.
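To ground the reachability-based approach, the following sequential Python sketch shows the classic forward-backward SCC decomposition that the parallel implementation builds on. The VGC and hash-bag machinery that make the searches fast in parallel are not modeled here.

```python
# Forward-backward SCC: the SCC of a pivot is the intersection of its
# forward- and backward-reachable sets, and the rest of the graph splits
# into three independent subproblems (recursed on concurrently in the
# parallel algorithm).
from collections import deque

def reach(adj, src, allowed):
    seen, frontier = {src}, deque([src])
    while frontier:                      # each round expands one frontier
        u = frontier.popleft()
        for v in adj.get(u, ()):
            if v in allowed and v not in seen:
                seen.add(v)
                frontier.append(v)
    return seen

def scc(adj, radj, vertices):
    out = []
    def go(verts):
        if not verts:
            return
        pivot = next(iter(verts))
        fwd = reach(adj, pivot, verts)   # forward reachability
        bwd = reach(radj, pivot, verts)  # backward reachability
        out.append(fwd & bwd)            # the pivot's SCC
        go(fwd - bwd); go(bwd - fwd); go(verts - fwd - bwd)
    go(set(vertices))
    return out

adj  = {0: [1], 1: [2], 2: [0, 3], 3: [4], 4: [3]}
radj = {1: [0], 2: [1], 0: [2], 3: [2, 4], 4: [3]}
print(scc(adj, radj, [0, 1, 2, 3, 4]))  # [{0, 1, 2}, {3, 4}]
```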

     
  4. Viral marketing on social networks, also known as Influence Maximization (IM), aims to select k users to promote a target item by maximizing the total spread of their influence. However, most previous work on IM does not explore users' dynamic perception of promoted items during the process. In this paper, by exploiting a knowledge graph (KG) to capture dynamic user perception, we formulate the problem of Influence Maximization based on Dynamic Personal Perception (IMDPP), which considers user preferences and social influence reflecting the impact of relevant item adoptions. We prove the hardness of IMDPP and design an approximation algorithm, named Dynamic perception for seeding in target markets (Dysim), which explores the concepts of dynamic reachability, target markets, and substantial influence to select and promote a sequence of relevant items. We evaluate the performance of Dysim against state-of-the-art approaches using real social networks with real KGs. The experimental results show that Dysim achieves at least 6× the influence spread of the state-of-the-art approaches on large datasets.
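As background for the "influence spread" objective that IM-style formulations (here extended with dynamic perception) maximize, below is a minimal Monte Carlo spread estimator under the independent cascade model. This is generic IM machinery, not Dysim itself; the edge probability `p` and trial count are illustrative.

```python
# Monte Carlo estimate of expected influence spread under independent cascade.
import random

def spread(adj, seeds, p=0.1, trials=200, seed=1):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            u = frontier.pop()
            for v in adj.get(u, ()):
                # each edge gets one activation chance with probability p
                if v not in active and rng.random() < p:
                    active.add(v)
                    frontier.append(v)
        total += len(active)
    return total / trials   # average number of activated users

adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(spread(adj, seeds=[0], p=0.5))
```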
  5. Context-free language reachability (CFL-reachability) is a fundamental framework for program analysis. A large variety of static analyses can be formulated as CFL-reachability problems, which determine whether specific source-sink pairs in an edge-labeled graph are connected by a reachable path, i.e., a path whose edge labels form a string accepted by the given CFL. Computing CFL-reachability is expensive: the fastest algorithm exhibits slightly subcubic time complexity with respect to the input graph size. Improving the scalability of CFL-reachability is of practical interest, but reducing the time complexity is inherently difficult.

    In this paper, we focus on improving the scalability of CFL-reachability from a more practical perspective: reducing the input graph size. Our idea arises from the existence of trivial edges, i.e., edges that do not affect any reachable path in CFL-reachability. We observe that two nodes joined by trivial edges can be folded (merged into one node, with all the edges joining them removed) without affecting the CFL-reachability result. By studying the characteristics of recursive state machines (RSMs), an alternative form of CFLs, we propose an approach to identify foldable node pairs without the need to verify the underlying reachable paths (which would be equivalent to solving the CFL-reachability problem itself). In particular, given a CFL-reachability problem instance with an input graph G and an RSM, based on the correspondence between paths in G and state transitions in the RSM, we propose a graph folding principle that can determine whether two adjacent nodes are foldable by examining only their incoming and outgoing edges.

    On top of the graph folding principle, we propose an efficient graph folding algorithm, GF, whose time complexity is linear in the number of nodes in the input graph. Our evaluations on two clients (alias analysis and value-flow analysis) show that GF significantly accelerates RSM/CFL-reachability by reducing the input graph size. On average, for value-flow analysis, GF removes 60.96% of the nodes and 42.67% of the edges of the input graphs, obtaining a speedup of 4.65× and a memory usage reduction of 57.35%. For alias analysis, GF removes 38.93% of the nodes and 35.61% of the edges, obtaining a speedup of 3.21× and a memory usage reduction of 65.19%.
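To illustrate what folding does mechanically, here is a minimal Python sketch that merges node pairs flagged by a `foldable` predicate using a union-find and rebuilds the edge set over representatives. The predicate here is a stand-in for the paper's RSM-based folding principle, which decides foldability from the incoming and outgoing edges of adjacent nodes.

```python
# Graph folding over an edge-labeled graph: merge foldable adjacent node
# pairs and drop the (trivial) edges that joined them.

class UnionFind:
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def fold(edges, foldable):
    """edges: iterable of (u, label, v). Merge every pair flagged by
    `foldable` and rebuild the edges over the merged representatives."""
    uf = UnionFind()
    for u, label, v in edges:
        if foldable(u, label, v):
            uf.union(u, v)
    folded = set()
    for u, label, v in edges:
        ru, rv = uf.find(u), uf.find(v)
        if ru != rv or not foldable(u, label, v):
            folded.add((ru, label, rv))   # keep non-trivial edges only
    return folded

# Example: treat 'e'-labeled edges as trivial and fold them away.
edges = [("a", "e", "b"), ("b", "x", "c"), ("c", "e", "d")]
print(fold(edges, lambda u, l, v: l == "e"))  # {('b', 'x', 'd')}
```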