Influence Maximization (IM) is a crucial problem in data science. The goal is to find a fixed-size set of highly influential seed vertices that maximizes the expected spread of influence through the network under a given diffusion model.
This paper significantly improves the scalability of IM using two key techniques. The first is a
We compare
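Since the abstract in this record is truncated, here is a minimal sketch of the classical greedy baseline for the IM problem it defines, under an independent-cascade diffusion model with Monte Carlo spread estimation. This is orientation only, not the paper's method; the toy graph, the propagation probability `p`, and the function names are illustrative assumptions.

```python
import random

def simulate_spread(adj, seeds, p, trials=200):
    """Estimate the expected influence spread of `seeds` under independent cascade."""
    total = 0
    for _ in range(trials):
        active = set(seeds)
        frontier = list(seeds)
        while frontier:
            nxt = []
            for u in frontier:
                for v in adj.get(u, ()):
                    if v not in active and random.random() < p:
                        active.add(v)
                        nxt.append(v)
            frontier = nxt
        total += len(active)
    return total / trials

def greedy_im(adj, k, p=0.1):
    """Pick k seeds by greedily maximizing the estimated marginal spread."""
    seeds = []
    for _ in range(k):
        best = max((v for v in adj if v not in seeds),
                   key=lambda v: simulate_spread(adj, seeds + [v], p))
        seeds.append(best)
    return seeds

# Toy usage: a small directed graph given as an adjacency dict.
graph = {0: [1, 2], 1: [2, 3], 2: [3], 3: [4], 4: []}
print(greedy_im(graph, k=2))
```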
- NSF-PAR ID: 10489410
- Publisher / Repository: VLDB Endowment
- Date Published:
- Journal Name: Proceedings of the VLDB Endowment
- Volume: 17
- Issue: 3
- ISSN: 2150-8097
- Page Range / eLocation ID: 400 to 413
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Influence maximization aims to select the k most-influential vertices, or seeds, in a network, where influence is defined by a given diffusion process. Although computing an optimal seed set is NP-hard, efficient approximation algorithms exist. However, even state-of-the-art parallel implementations are limited by a sampling step that incurs large memory footprints. This in turn limits the reachable problem sizes and the approximation quality. In this work, we study the memory footprint of the sampling process that collects reverse reachability information in the IMM (Influence Maximization via Martingales) algorithm over large real-world social networks. We present a memory-efficient optimization approach (called HBMax) based on Ripples, a state-of-the-art multi-threaded parallel influence maximization solution. HBMax uses a portion of the reverse reachable (RR) sets collected by the algorithm to learn the characteristics of the graph. It then compresses the intermediate reverse reachability information with Huffman coding or bitmap coding, and answers queries on the partially decoded data, or directly on the compressed data, to preserve the memory savings obtained through compression. On a NUMA architecture, we scale our solution to 64 CPU cores and reduce the memory footprint by up to 82.1% with an average 6.3% speedup (the encoding overhead is offset by the performance gain from memory reduction), without loss of accuracy. For the largest tested graph, Twitter7 (with 1.4 billion edges), HBMax achieves a 5.9× compression ratio and a 2.2× speedup.
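As a rough illustration of the bitmap-coding idea mentioned above (a toy sketch, not the HBMax/Ripples implementation; the fixed-width bitmap layout and all names are assumptions), the snippet below packs each reverse-reachable set into a bitmap over vertex IDs and answers coverage queries directly on the packed bytes.

```python
# Illustrative sketch: store reverse-reachable (RR) sets as packed bitmaps and
# count how many RR sets a candidate seed covers without decoding to Python sets.

def encode_rr_set(vertices, num_vertices):
    """Pack a set of vertex IDs into a bitmap (one bit per vertex)."""
    bits = bytearray((num_vertices + 7) // 8)
    for v in vertices:
        bits[v >> 3] |= 1 << (v & 7)
    return bytes(bits)

def covers(bitmap, v):
    """Check membership of vertex v directly on the compressed bitmap."""
    return bool(bitmap[v >> 3] & (1 << (v & 7)))

def coverage(bitmaps, v):
    """Number of RR sets that contain v (the greedy seed-selection score)."""
    return sum(covers(b, v) for b in bitmaps)

# Toy usage with 10 vertices and three sampled RR sets.
rr_sets = [encode_rr_set(s, 10) for s in ({0, 3, 7}, {3, 4}, {1, 3, 9})]
print(coverage(rr_sets, 3))   # -> 3
```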
-
Motivation: The size of a genome graph—the space required to store the nodes, node labels and edges—affects the efficiency of operations performed on it. For example, the time complexity to align a sequence to a graph without a graph index depends on the total number of characters in the node labels and the number of edges in the graph. This raises the need for approaches to construct space-efficient genome graphs.
Results: We point out similarities in the string encoding mechanisms of genome graphs and the external pointer macro (EPM) compression model. We present a pair of linear-time algorithms that transform between genome graphs and EPM-compressed forms. The algorithms yield an upper bound on the size of the constructed genome graph in terms of an optimal EPM compression. To further reduce the size of the genome graph, we propose the source assignment problem, which optimizes over the equivalent choices made during compression, and introduce an ILP formulation that solves it optimally. As a proof of concept, we introduce RLZ-Graph, a genome graph constructed based on the relative Lempel–Ziv algorithm. Using RLZ-Graph, across all human chromosomes, we are able to reduce the disk space needed to store a genome graph by 40.7% on average compared to colored compacted de Bruijn graphs constructed by Bifrost under the default settings. RLZ-Graph scales well in running time and graph size with an increasing number of human genome sequences, compared to Bifrost and the variation graphs produced by VGtoolkit.
Availability: The RLZ-Graph software is available at: https://github.com/Kingsford-Group/rlzgraph
Supplementary information: Supplementary data are available at Bioinformatics online.
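To make the relative Lempel–Ziv / external-pointer-macro connection concrete, here is a minimal sketch (not the RLZ-Graph code; the greedy longest-match strategy and the function names are assumptions) that factorizes a target string into copies from a reference plus literals, and decodes them back.

```python
# Toy relative Lempel-Ziv (external-pointer-macro style) factorization:
# each factor is either a (ref_pos, length) copy from the reference or a literal.
# Greedy longest match via brute-force search; real tools use suffix structures.

def rlz_factorize(reference, target):
    factors = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        for j in range(len(reference)):
            l = 0
            while (i + l < len(target) and j + l < len(reference)
                   and reference[j + l] == target[i + l]):
                l += 1
            if l > best_len:
                best_pos, best_len = j, l
        if best_len >= 2:                      # copy a phrase from the reference
            factors.append(("copy", best_pos, best_len))
            i += best_len
        else:                                  # fall back to a literal character
            factors.append(("lit", target[i]))
            i += 1
    return factors

def rlz_decode(reference, factors):
    out = []
    for f in factors:
        out.append(reference[f[1]:f[1] + f[2]] if f[0] == "copy" else f[1])
    return "".join(out)

ref, tgt = "ACGTACGTGG", "ACGTGGTACG"
fs = rlz_factorize(ref, tgt)
assert rlz_decode(ref, fs) == tgt
print(fs)
```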
-
We present the first GPU-based parallel algorithm to efficiently update vertex coloring on large dynamic networks. For a single GPU, we introduce the concept of a loosely maintained vertex color update that reduces computation and memory requirements. For multiple GPUs in distributed environments, we propose a priority-based ordering of vertices to reduce communication time. We prove the correctness of our algorithms and experimentally demonstrate that, for graphs of over 16 million vertices and over 134 million edges on a single GPU, our dynamic algorithm is as much as 20x faster than the state-of-the-art algorithm on static graphs. For larger graphs with over 130 million vertices and over 260 million edges, our distributed implementation with 8 GPUs produces updated color assignments within 160 milliseconds. In all cases, the proposed parallel algorithms produce comparable or fewer colors than state-of-the-art algorithms.
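The dynamic-update idea can be illustrated sequentially; the sketch below (an assumption-laden Python toy, not the paper's GPU kernels or its loosely maintained color update) inserts a batch of edges and then recolors only the endpoints whose colors now conflict, using the smallest color absent from their neighborhoods.

```python
# Toy dynamic recoloring: after a batch of edge insertions, repair only the
# vertices whose colors now conflict, rather than recoloring the whole graph.
# Sequential illustration; the referenced work performs this in parallel on GPUs.

def smallest_free_color(adj, color, v):
    """Smallest non-negative color not used by any neighbor of v."""
    used = {color[u] for u in adj[v]}
    c = 0
    while c in used:
        c += 1
    return c

def apply_insertions(adj, color, new_edges):
    """Insert edges, then greedily recolor only the conflicting endpoints."""
    conflicted = set()
    for u, v in new_edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
        if color[u] == color[v]:
            conflicted.add(v)
    for w in conflicted:
        color[w] = smallest_free_color(adj, color, w)

# Toy usage: path 0-1-2 colored 0,1,0; inserting (0, 2) forces a local repair.
adj = {0: {1}, 1: {0, 2}, 2: {1}}
color = {0: 0, 1: 1, 2: 0}
apply_insertions(adj, color, [(0, 2)])
print(color)  # e.g. {0: 0, 1: 1, 2: 2} -- still a proper coloring
```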
-
Ranjit Jhala and Isil Dillig (Eds.). Many modern programming languages are shifting toward a functional style for collection interfaces such as sets, maps, and sequences. Functional interfaces offer many advantages, including being safe for parallelism and providing simple and lightweight snapshots. However, existing high-performance functional interfaces such as PAM, which are based on balanced purely-functional trees, incur large space overheads for large-scale data analysis due to storing every element in a separate node in a tree. This paper presents PaC-trees, a purely-functional data structure supporting functional interfaces for sets, maps, and sequences that provides a significant reduction in space over existing approaches. A PaC-tree is a balanced binary search tree which blocks the leaves and compresses the blocks using arrays. We provide novel techniques for compressing and uncompressing the blocks which yield practical parallel functional algorithms for a broad set of operations on PaC-trees such as union, intersection, filter, reduction, and range queries which are both theoretically and practically efficient. Using PaC-trees we designed CPAM, a C++ library that implements the full functionality of PAM, while offering significant extra functionality for compression. CPAM consistently matches or outperforms PAM on a set of microbenchmarks on sets, maps, and sequences while using about a quarter of the space. On applications including inverted indices, 2D range queries, and 1D interval queries, CPAM is competitive with or faster than PAM, while using 2.1–7.8x less space. For static and streaming graph processing, CPAM offers 1.6x faster batch updates while using 1.3–2.6x less space than the state-of-the-art graph processing system Aspen.
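A rough analogue of the leaf-blocking idea in PaC-trees (a Python toy, not CPAM's C++ implementation; the block threshold and the tuple-based node encoding are assumptions): subtrees with at most a few elements are stored as plain sorted arrays instead of one node per element.

```python
# Toy "blocked" search tree: subtrees with at most BLOCK elements are stored as
# sorted arrays (one contiguous block) instead of one tree node per element,
# trading pointer overhead for cache-friendly, compressible leaves.

BLOCK = 4

def build(sorted_keys):
    if len(sorted_keys) <= BLOCK:
        return ("leaf", list(sorted_keys))
    mid = len(sorted_keys) // 2
    return ("node", sorted_keys[mid],
            build(sorted_keys[:mid]), build(sorted_keys[mid + 1:]))

def contains(tree, key):
    if tree[0] == "leaf":
        return key in tree[1]          # linear scan inside a small block
    _, pivot, left, right = tree
    if key == pivot:
        return True
    return contains(left if key < pivot else right, key)

t = build(list(range(0, 40, 2)))         # keys 0, 2, ..., 38
print(contains(t, 22), contains(t, 23))  # True False
```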
-
We study a noisy graph isomorphism problem, where the goal is to perfectly recover the vertex correspondence between two edge-correlated graphs, with an initial seed set of correctly matched vertex pairs revealed as side information. We show that it is possible to achieve the information-theoretic limit of graph sparsity in time polynomial in the number of vertices n. Moreover, we show that the number of seeds needed for perfect recovery in polynomial time can be as low as [...] in the sparse graph regime (with the average degree smaller than [...]) and [...] in the dense graph regime, for a small positive constant. Unlike previous work on graph matching, which used small neighborhoods or small subgraphs with a logarithmic number of vertices in order to match vertices, our algorithms match vertices if their large neighborhoods have a significant overlap in the number of seeds.
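To illustrate the seed-overlap idea in the last sentence (a toy sketch; the actual algorithm's neighborhood statistics are not reproduced, and all names here are assumptions), the snippet below matches each unmatched vertex to the candidate whose neighborhood shares the most seed pairs with it.

```python
# Toy seeded matching: match vertices across two graphs by counting how many
# known seed pairs appear in their respective neighborhoods.
# Real seeded graph-matching algorithms iterate/percolate this idea carefully.

def seed_overlap_match(adj1, adj2, seeds):
    """seeds: dict mapping vertices of graph 1 to their known matches in graph 2."""
    matching = dict(seeds)
    for u in adj1:
        if u in matching:
            continue
        best_v, best_score = None, 0
        for v in adj2:
            if v in matching.values():
                continue
            # Count seed pairs (s1 -> s2) with s1 adjacent to u and s2 adjacent to v.
            score = sum(1 for s1, s2 in seeds.items()
                        if s1 in adj1[u] and s2 in adj2[v])
            if score > best_score:
                best_v, best_score = v, score
        if best_v is not None:
            matching[u] = best_v
    return matching

# Toy usage: graph 2 relabels graph 1 by v -> v + 10; seeds reveal two pairs.
adj1 = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
adj2 = {10: {11, 12}, 11: {10, 12, 13}, 12: {10, 11}, 13: {11}}
print(seed_overlap_match(adj1, adj2, {0: 10, 1: 11}))  # -> {0: 10, 1: 11, 2: 12, 3: 13}
```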