We present shared-memory parallel methods for Maximal Clique Enumeration (MCE) from a graph. MCE is a fundamental and well-studied graph analytics task, and is a widely used primitive for identifying dense structures in a graph. Due to its computationally intensive nature, parallel methods are imperative for dealing with large graphs. However, surprisingly, there do not yet exist scalable parallel methods for MCE on a shared-memory parallel machine. In this work, we present efficient shared-memory parallel algorithms for MCE with the following properties: (1) the parallel algorithms are provably work-efficient relative to a state-of-the-art sequential algorithm; (2) the algorithms have a provably small parallel depth, showing that they can scale to a large number of processors; and (3) our implementations on a multicore machine show good speedup and scaling behavior with an increasing number of cores, and are substantially faster than prior shared-memory parallel algorithms for MCE.
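For context, the classic sequential baseline for MCE is Bron–Kerbosch with pivoting; the sketch below (an illustrative implementation, not the paper's parallel algorithm) enumerates maximal cliques of an undirected graph given as an edge list:

```python
from itertools import chain

def bron_kerbosch_pivot(adj, R, P, X, out):
    """Enumerate maximal cliques extending the partial clique R, where P
    holds candidate vertices and X holds already-explored vertices."""
    if not P and not X:
        out.append(sorted(R))  # R is maximal: nothing can extend it
        return
    # Pivot on the vertex covering the most candidates, pruning branches.
    pivot = max(chain(P, X), key=lambda u: len(adj[u] & P))
    for v in list(P - adj[pivot]):
        bron_kerbosch_pivot(adj, R | {v}, P & adj[v], X & adj[v], out)
        P.remove(v)
        X.add(v)

def maximal_cliques(edges):
    """Return all maximal cliques, each as a sorted vertex list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    out = []
    bron_kerbosch_pivot(adj, set(), set(adj), set(), out)
    return sorted(out)
```

On a triangle with a pendant edge, `maximal_cliques([(0, 1), (1, 2), (0, 2), (2, 3)])` yields the two maximal cliques `[[0, 1, 2], [2, 3]]`.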
SharedMemory Parallel Maximal Biclique Enumeration
We present shared-memory parallel algorithms for maximal biclique enumeration (MBE), the task of enumerating all maximal bicliques (maximal complete bipartite subgraphs) of a bipartite graph, which is widely used in the analysis of social, biological, and transactional networks. Since MBE is computationally expensive, parallel computing is necessary to scale to large graphs. Our parallel algorithm ParMBE efficiently uses the power of multiple cores that share memory. From a theoretical viewpoint, ParMBE is work-efficient with respect to a state-of-the-art sequential algorithm. Our experimental evaluation shows that ParMBE scales well up to 64 cores, and is significantly faster than current parallel algorithms. Since ParMBE yielded a super-linear speedup over the sequential algorithm on which it was based (MineLMBC), we develop an improved sequential algorithm, FMBE, by "sequentializing" ParMBE.
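For reference, a maximal biclique pairs a left vertex set with the full common neighborhood on the right, closed on both sides. The brute-force sketch below illustrates the definition only; it is not ParMBE, FMBE, or MineLMBC, all of which avoid this exponential subset scan:

```python
from itertools import combinations

def maximal_bicliques(edges):
    """Brute-force maximal biclique enumeration for a small bipartite
    graph given as (left, right) edge pairs. Illustrative only:
    exponential in the size of the left side."""
    adj = {}
    left = set()
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        left.add(u)
    found = set()
    for k in range(1, len(left) + 1):
        for L in combinations(sorted(left), k):
            R = set.intersection(*(adj[u] for u in L))
            if not R:
                continue
            # Close the left side: every left vertex adjacent to all of R.
            Lc = frozenset(u for u in left if R <= adj[u])
            found.add((Lc, frozenset(R)))
    return sorted((sorted(L), sorted(R)) for L, R in found)
```

For the graph a-{1,2}, b-{2,3}, the maximal bicliques are ({a},{1,2}), ({a,b},{2}), and ({b},{2,3}).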
 Award ID(s):
 1725702
 Publication Date:
 NSFPAR ID:
 10179794
 Journal Name:
 26th IEEE International Conference on High Performance Computing
 Page Range or eLocation-ID:
 34 to 43
 Sponsoring Org:
 National Science Foundation
More Like this


Given a user-specified minimum degree threshold γ, a γ-quasi-clique is a subgraph in which each vertex connects to at least a γ fraction of the other vertices. The quasi-clique is a natural definition for dense structures, so finding large and hence statistically significant quasi-cliques is useful in applications such as community detection in social networks and discovering significant biomolecule structures and pathways. However, mining maximal quasi-cliques is notoriously expensive, and even a recent algorithm for mining large maximal quasi-cliques is flawed and can lead to a lot of repeated searches. This paper proposes a parallel solution for mining maximal quasi-cliques that is able to fully utilize CPU cores. Our solution uses divide and conquer to decompose the workload into independent tasks for parallel mining, and we address the problems of (i) drastic load imbalance among different tasks and (ii) difficulty in predicting a task's running time and its growth with task-subgraph size, (a) by using a timeout-based task decomposition strategy, and (b) by utilizing a priority task queue that schedules long-running tasks earlier for mining and decomposition to avoid stragglers. Unlike our conference version in PVLDB 2020, where the solution was built on a distributed graph mining framework called G-thinker, this…
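The scheduling ideas in (a) and (b) can be sketched with a toy task pool. Here a size budget stands in for the wall-clock timeout, and `split`/`process` are hypothetical callbacks, not the paper's API:

```python
import heapq

def run_tasks(initial_tasks, budget, split, process):
    """Priority-driven task pool: the biggest task is scheduled first, and
    a task exceeding its budget is decomposed and re-enqueued instead of
    being run to completion (a sketch of timeout-based decomposition;
    'budget' stands in for a wall-clock timeout)."""
    heap = [(-t["size"], i, t) for i, t in enumerate(initial_tasks)]
    heapq.heapify(heap)
    counter = len(initial_tasks)   # tie-breaker for equal-size tasks
    results = []
    while heap:
        _, _, task = heapq.heappop(heap)
        if task["size"] > budget:          # "timeout": task too big to finish
            for sub in split(task):        # decompose into independent subtasks
                heapq.heappush(heap, (-sub["size"], counter, sub))
                counter += 1
        else:
            results.append(process(task))
    return results
```

With `split` halving a task's size and `process` returning it, a single task of size 8 under budget 2 is decomposed twice and yields four leaf tasks of size 2.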

The Lovász Local Lemma (LLL) is a cornerstone principle in the probabilistic method of combinatorics, and a seminal algorithm of Moser & Tardos (2010) provides an efficient randomized algorithm to implement it. This algorithm can be parallelized to give an algorithm that uses polynomially many processors and runs in O(log^3 n) time, stemming from O(log n) adaptive computations of a maximal independent set (MIS). Chung et al. (2014) developed faster local and parallel algorithms, potentially running in time O(log^2 n), but these algorithms work under significantly more stringent conditions than the LLL. We give a new parallel algorithm that works under essentially the same conditions as the original algorithm of Moser & Tardos but uses only a single MIS computation, thus running in O(log^2 n) time. This conceptually new algorithm also gives a clean combinatorial description of a satisfying assignment, which might be of independent interest. Our techniques extend to the deterministic LLL algorithm given by Chandrasekaran et al. (2013), leading to an NC algorithm running in time O(log^2 n) as well. We also provide improved bounds on the runtimes of the sequential and parallel resampling-based algorithms originally developed by Moser & Tardos. Our bounds extend to any problem instance…
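The core resampling loop of Moser & Tardos can be illustrated on CNF-SAT, where a bad event is a violated clause. This is a toy sequential sketch only: it performs no LLL-condition checking and is not the parallel algorithm described above:

```python
import random

def moser_tardos(clauses, n_vars, seed=0):
    """Toy Moser-Tardos resampling for CNF-SAT. Each clause is a tuple of
    signed literals (positive literal = variable must be True). While some
    clause is violated, resample only that clause's variables. Terminates
    with probability 1; fast whenever the LLL condition holds."""
    rng = random.Random(seed)
    assign = [rng.random() < 0.5 for _ in range(n_vars)]

    def violated(clause):
        # A clause is violated when every literal in it is falsified.
        return all(assign[abs(l) - 1] != (l > 0) for l in clause)

    while True:
        bad = next((c for c in clauses if violated(c)), None)
        if bad is None:
            return assign                  # no bad event left: done
        for l in bad:                      # resample just this event's variables
            assign[abs(l) - 1] = rng.random() < 0.5
```

The returned assignment satisfies every clause; the key point of the method is that each resampling step touches only the variables of one bad event.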

Aligning DNA sequences to an annotated reference is a key step for genotyping in biology. Recent scientific studies have demonstrated improved inference by aligning reads to a variation graph, i.e., a reference sequence augmented with known genetic variations. Given a variation graph in the form of a directed acyclic string graph, the sequence-to-graph alignment problem seeks the best-matching path in the graph for an input query sequence. Solving this problem exactly with a sequential dynamic programming algorithm takes time quadratic in the graph size and query length, making it difficult to scale to high-throughput DNA sequencing data. In this work, we propose the first parallel algorithm for computing sequence-to-graph alignments that leverages multiple cores and single-instruction multiple-data (SIMD) operations. We take advantage of the available inter-task parallelism, and provide a novel blocked approach to compute the score matrix while ensuring high memory locality. Using a 48-core Intel Xeon Skylake processor, the proposed algorithm achieves a peak performance of 317 GCUPS (billion cell updates per second), and demonstrates near-linear weak and strong scaling on up to 48 cores. It delivers significant performance gains compared to existing algorithms, and results in runtime…
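The underlying recurrence, which the blocked SIMD approach evaluates in parallel, can be sketched as a sequential edit-distance DP over a character-labeled DAG. This is an illustrative simplification under assumed conventions (nodes pre-sorted topologically, unit edit costs), not the paper's implementation:

```python
def graph_align(nodes, edges, query):
    """Minimum edit distance between `query` and any source-to-anywhere
    path in a character-labeled DAG. `nodes` is a string of node labels in
    topological order; `edges` are (pred, succ) index pairs.
    dp[v][j] = best cost of aligning query[:j] to a path ending at v."""
    preds = {v: [] for v in range(len(nodes))}
    for u, v in edges:
        preds[v].append(u)
    INF = float("inf")
    m = len(query)
    dp = []
    for v, ch in enumerate(nodes):         # topological order assumed
        row = [INF] * (m + 1)
        for j in range(m + 1):
            cand = []
            if preds[v]:
                for u in preds[v]:
                    cand.append(dp[u][j] + 1)      # delete graph char at v
                    if j > 0:                      # (mis)match v vs query char
                        cand.append(dp[u][j - 1] + (ch != query[j - 1]))
            else:                                  # source node: path is just v
                cand.append(j + 1)                 # delete v, insert j chars
                if j > 0:
                    cand.append(j - 1 + (ch != query[j - 1]))
            if j > 0:
                cand.append(row[j - 1] + 1)        # insert extra query char
            row[j] = min(cand)
        dp.append(row)
    return min(row[m] for row in dp)               # best path endpoint
```

For the linear graph G→A→T, aligning "GAT" costs 0 and aligning "GT" costs 1 (one deletion).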

On shared-memory multicore machines, classic two-way recursive divide-and-conquer algorithms are implemented using common fork-join based parallel programming paradigms such as Intel Cilk+ or OpenMP. However, in such parallel paradigms, the use of joins for synchronization may lead to artificial dependencies among function calls that are not implied by the underlying dynamic programming (DP) recurrence. These artificial dependencies can increase the span asymptotically and thus reduce parallelism. From a practical perspective, they can lead to resource underutilization, i.e., threads becoming idle. To eliminate such artificial dependencies, task-based runtime systems and dataflow parallel paradigms, such as Concurrent Collections (CnC), PaRSEC, and Legion, have been introduced. Such parallel paradigms and runtime systems overcome the limitations of fork-join parallelism by specifying data dependencies at a finer granularity and allowing tasks to execute as soon as their dependencies are satisfied. In this paper, we investigate how the performance of dataflow implementations of recursive divide-and-conquer based DP algorithms compares with fork-join implementations. We have designed and implemented dataflow versions of DP algorithms in Intel CnC and compared their performance with fork-join based implementations in OpenMP. Considering different execution parameters (e.g., algorithmic properties such as recursive base size, as well as machine configuration such as the number of physical cores, etc.),…
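The dataflow execution model described above, where each task runs as soon as its declared predecessors finish rather than waiting at a coarse join, can be sketched with a hypothetical dependence-counting scheduler (an illustration of the model, not CnC, PaRSEC, or Legion):

```python
from concurrent.futures import ThreadPoolExecutor
import threading

def dataflow_run(tasks, deps, workers=4):
    """Minimal dataflow scheduler. `tasks` maps name -> callable taking
    the shared results dict; `deps` maps name -> list of predecessor
    names. A task is submitted the moment its last predecessor finishes.
    Sketch only: no error handling for tasks that raise."""
    if not tasks:
        return {}
    remaining = {t: len(deps.get(t, ())) for t in tasks}
    children = {t: [] for t in tasks}
    for t, ps in deps.items():
        for p in ps:
            children[p].append(t)
    results = {}
    lock = threading.Lock()
    all_done = threading.Event()
    pending = len(tasks)
    pool = ThreadPoolExecutor(max_workers=workers)

    def run(name):
        nonlocal pending
        results[name] = tasks[name](results)   # predecessors already finished
        with lock:
            pending -= 1
            ready = [c for c in children[name]
                     if not (remaining.__setitem__(c, remaining[c] - 1)
                             or remaining[c])]
            if pending == 0:
                all_done.set()
        for c in ready:                         # fire dependents immediately
            pool.submit(run, c)

    for t, r in remaining.items():
        if r == 0:                              # roots have no dependencies
            pool.submit(run, t)
    all_done.wait()
    pool.shutdown()
    return results
```

In a diamond dependency a→{b, c}→d, tasks b and c run concurrently as soon as a finishes, and d runs once both complete, with no artificial barrier between b and c.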