
Title: Configuring Concurrent Computation of Phylogenetic Partial Likelihoods: Accelerating Analyses Using the BEAGLE Library
We describe our approach to augmenting the BEAGLE library for high-performance statistical phylogenetic inference to support concurrent computation of independent partial-likelihood arrays. Our solution involves identifying independent likelihood estimates in analyses of partitioned datasets and in proposed tree topologies, and configuring concurrent computation of these likelihoods via the CUDA and OpenCL frameworks. We evaluate the effect of each increase in concurrency on the throughput of our partial-likelihoods kernel for a four-state nucleotide substitution model on a variety of parallel computing hardware, including NVIDIA and AMD GPUs and Intel multicore CPUs, observing up to 16-fold speedups over our previous implementation. Finally, we evaluate the effect of these gains on a domain application program, MrBayes. For a partitioned nucleotide-model analysis we observe an average overall-runtime speedup of 2.1-fold over our previous parallel implementation and 10-fold over native MrBayes with SSE.
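
To make the idea concrete, the following C++ sketch (a simplified illustration, not the BEAGLE API or its CUDA/OpenCL kernels) computes the four-state partial-likelihood array at an internal node with the standard pruning recursion and evaluates independent partitions concurrently with std::async; all type names and data layouts are assumptions introduced for the example.

```cpp
// Simplified illustration (not the BEAGLE API): concurrent evaluation of
// independent partial-likelihood arrays under a four-state nucleotide model.
#include <array>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

constexpr int kStates = 4;
using Partials = std::vector<std::array<double, kStates>>;  // one entry per site
using PMatrix  = std::array<std::array<double, kStates>, kStates>;

// One pruning step: combine the two child partials, each propagated through
// its branch's transition-probability matrix.
Partials updatePartials(const Partials& left, const PMatrix& pLeft,
                        const Partials& right, const PMatrix& pRight) {
    Partials out(left.size());
    for (std::size_t s = 0; s < left.size(); ++s) {
        for (int i = 0; i < kStates; ++i) {
            double sumL = 0.0, sumR = 0.0;
            for (int j = 0; j < kStates; ++j) {
                sumL += pLeft[i][j]  * left[s][j];
                sumR += pRight[i][j] * right[s][j];
            }
            out[s][i] = sumL * sumR;
        }
    }
    return out;
}

// Partitions share no data, so their updates can run concurrently; a GPU
// backend can exploit the same independence with concurrently executing kernels.
std::vector<Partials> updateAllPartitions(const std::vector<Partials>& left,
                                          const std::vector<PMatrix>& pLeft,
                                          const std::vector<Partials>& right,
                                          const std::vector<PMatrix>& pRight) {
    std::vector<std::future<Partials>> tasks;
    for (std::size_t p = 0; p < left.size(); ++p)
        tasks.push_back(std::async(std::launch::async, updatePartials,
                                   std::cref(left[p]), std::cref(pLeft[p]),
                                   std::cref(right[p]), std::cref(pRight[p])));
    std::vector<Partials> results;
    for (auto& t : tasks) results.push_back(t.get());
    return results;
}
```

The same independence argument applies to likelihoods needed for different proposed tree topologies; on a GPU the partition-level loop maps onto concurrently scheduled kernel work rather than host threads.
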
Award ID(s):
1661443
PAR ID:
10057727
Author(s) / Creator(s):
Date Published:
Journal Name:
Algorithms and Architectures for Parallel Processing. ICA3PP 2017. Lecture Notes in Computer Science
Volume:
10393
Page Range / eLocation ID:
533–547
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The Multi-Level Fast Multipole Algorithm (MLFMA), a variant of the fast multipole method (FMM) for problems with oscillatory potentials, significantly accelerates the solution of problems based on wave physics, such as those in electromagnetics and acoustics. Existing shared-memory parallel approaches for MLFMA have adopted the bulk synchronous parallel (BSP) model. While the BSP approach has served well so far, it is prone to significant thread-synchronization overheads and, more importantly, fails to leverage communication/computation overlap opportunities because of the complicated data dependencies in MLFMA. In this paper, we develop a task-parallel MLFMA implementation for shared-memory architectures and discuss optimizations to improve its performance. We then evaluate the new task-parallel MLFMA implementation against a BSP implementation for a number of geometries. Our findings suggest that task parallelism is generally superior to the BSP model, and considering its potential advantages over BSP in a hybrid parallel setting, we see it as a promising approach for addressing the scalability issues of MLFMA in large-scale computations.
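
As a rough illustration of the contrast described here (not the paper's MLFMA code), the C++ sketch below runs an upward tree pass as recursively spawned tasks, so each node waits only on its own children rather than on a per-level BSP barrier; the Node structure and the use of std::async are assumptions for the example.

```cpp
// Generic illustration of the task-parallel idea (not the paper's MLFMA code):
// an upward tree pass in which each node's aggregation is a task that depends
// only on its children, so independent subtrees proceed without a level barrier.
#include <future>
#include <memory>
#include <vector>

struct Node {
    double local = 0.0;                       // work contributed by this node
    std::vector<std::unique_ptr<Node>> kids;  // children in the tree
};

// A BSP version would sweep the tree level by level with a barrier between
// levels; here, each task waits only on the futures of its own children.
double upwardPass(const Node& n) {
    std::vector<std::future<double>> children;
    for (const auto& c : n.kids) {
        const Node* child = c.get();
        children.push_back(std::async(std::launch::async,
                                      [child] { return upwardPass(*child); }));
    }
    double acc = n.local;
    for (auto& f : children) acc += f.get();  // no global synchronization point
    return acc;
}
```

A production implementation would use a tasking runtime (for example OpenMP tasks or TBB) rather than raw std::async, but the dependency structure is the same.
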
  2. In this work, we design, analyze, and optimize sequential and shared-memory parallel algorithms for partitioned local depths (PaLD). Given a set of data points and pairwise distances, PaLD is a method for identifying the strength of pairwise relationships based on relative distances, enabling the identification of strong ties within dense and sparse communities even if their sizes and within-community absolute distances vary greatly. We design two algorithmic variants that perform community structure analysis through triplet comparisons of pairwise distances. We present theoretical analyses of computation and communication costs and prove that the sequential algorithms are communication optimal, up to constant factors. We introduce performance optimization strategies that yield sequential speedups of up to 29x over a baseline sequential implementation and parallel speedups of up to 26.2x over optimized sequential implementations using up to 32 threads on an Intel multicore CPU.
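
The triplet-comparison pattern can be pictured with the sketch below; this is a simplified stand-in with hypothetical names, not the exact PaLD cohesion computation or the paper's optimized kernels.

```cpp
// Simplified stand-in for the triplet-comparison pattern (not the exact PaLD
// cohesion formula): for every pair (x, y), each third point z that is locally
// relevant to the pair credits whichever of x, y it sits closer to.
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

Matrix tripletSupport(const Matrix& d) {           // d: n x n pairwise distances
    const std::size_t n = d.size();
    Matrix support(n, std::vector<double>(n, 0.0));
    for (std::size_t x = 0; x < n; ++x) {
        for (std::size_t y = x + 1; y < n; ++y) {
            for (std::size_t z = 0; z < n; ++z) {
                if (z == x || z == y) continue;
                // Only points near the pair take part in the comparison.
                if (d[z][x] > d[x][y] && d[z][y] > d[x][y]) continue;
                if (d[z][x] < d[z][y])      support[x][z] += 1.0;  // z leans toward x
                else if (d[z][y] < d[z][x]) support[y][z] += 1.0;  // z leans toward y
                else { support[x][z] += 0.5; support[y][z] += 0.5; }
            }
        }
    }
    return support;                                // raw, unnormalized support scores
}
```
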
  3. Breadth-First Search (BFS) is a fundamental graph traversal algorithm that visits vertices in a level-by-level pattern. It has been widely used in real-world applications such as social network analysis, scientific computing, and web crawling. However, achieving high performance for BFS on large-scale graphs remains challenging due to irregular memory access patterns, diverse graph structures, and the need for efficient parallelization. This paper addresses these challenges with a highly optimized parallel BFS implementation based on the top-down and bottom-up traversal strategies. It further integrates several key innovations, including graph-type-aware computation strategy selection, graph pruning, a two-level bottom-up pass, and an efficient parallel implementation. We evaluate our method on 11 graphs of diverse size, diameter, and density. On a CPU server with 48 threads, our method achieves an average speedup of 9.5x over a serial BFS implementation. On a synthetic dense graph, our method processes 9.3 billion edges per second, showing its efficiency for dense graph traversal.
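
A minimal serial sketch of the top-down/bottom-up switching strategy follows; the switching threshold, the assumption of an undirected graph, and all identifiers are illustrative, and the paper's contribution lies in the pruning, type-aware selection, and parallelization layered on top.

```cpp
// Minimal sketch of top-down / bottom-up switching (serial, undirected graph,
// hypothetical switching threshold); not the paper's optimized parallel code.
#include <cstdint>
#include <vector>

struct Graph { std::vector<std::vector<int>> adj; };   // symmetric adjacency lists

std::vector<int> bfs(const Graph& g, int source) {
    const int n = static_cast<int>(g.adj.size());
    std::vector<int> level(n, -1);
    std::vector<int> frontier{source};
    level[source] = 0;
    std::int64_t edges = 0;
    for (const auto& nbrs : g.adj) edges += static_cast<std::int64_t>(nbrs.size());

    for (int depth = 1; !frontier.empty(); ++depth) {
        // Estimate work: frontier out-degree sum vs. total edge endpoints.
        std::int64_t frontierEdges = 0;
        for (int u : frontier) frontierEdges += static_cast<std::int64_t>(g.adj[u].size());
        std::vector<int> next;
        if (frontierEdges * 10 < edges) {              // illustrative heuristic
            // Top-down: expand each frontier vertex's neighbors.
            for (int u : frontier)
                for (int v : g.adj[u])
                    if (level[v] == -1) { level[v] = depth; next.push_back(v); }
        } else {
            // Bottom-up: each unvisited vertex searches for a parent in the frontier.
            for (int v = 0; v < n; ++v) {
                if (level[v] != -1) continue;
                for (int u : g.adj[v])
                    if (level[u] == depth - 1) { level[v] = depth; next.push_back(v); break; }
            }
        }
        frontier.swap(next);
    }
    return level;    // BFS level per vertex, -1 if unreachable
}
```
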
  4. A considerable amount of research on parallel discrete event simulation has been conducted over the past few decades. However, most of this research has targeted the parallel simulation infrastructure, focusing on data structures, algorithms, and synchronization methods for parallel simulation kernels. Unfortunately, distributed environments often have high communication latencies that can reduce the potential performance of parallel simulations. Effective partitioning of the concurrent simulation objects of real-world models can have a large impact on the amount of network traffic the simulation requires, and consequently on overall performance. This paper presents our studies on profiling the characteristics of simulation models and using the collected data to partition the models for concurrent execution. Our benchmarks show that Profile Guided Partitioning can yield dramatic performance gains in parallel simulations; in some of the models, we observed 5-fold improvements in the run time of the concurrently executed simulations.
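
One simple way to picture profile-guided partitioning (an illustrative greedy heuristic, not the paper's algorithm) is to place each simulation object in the partition with which it exchanges the most profiled traffic, subject to a capacity limit:

```cpp
// Illustrative greedy scheme (not the paper's algorithm): use profiled message
// counts between simulation objects to co-locate heavily communicating objects,
// reducing cross-partition (network) traffic.
#include <cstdint>
#include <vector>

// messages[i][j]: profiled count of events sent from object i to object j.
std::vector<int> partitionObjects(const std::vector<std::vector<std::int64_t>>& messages,
                                  int numPartitions, int capacity) {
    const int n = static_cast<int>(messages.size());
    std::vector<int> part(n, -1);
    std::vector<int> load(numPartitions, 0);
    for (int i = 0; i < n; ++i) {
        // Score each partition by the traffic object i exchanges with the
        // objects already placed there; pick the best one with spare capacity.
        int best = -1;
        std::int64_t bestScore = -1;
        for (int p = 0; p < numPartitions; ++p) {
            if (load[p] >= capacity) continue;
            std::int64_t score = 0;
            for (int j = 0; j < n; ++j)
                if (part[j] == p) score += messages[i][j] + messages[j][i];
            if (score > bestScore) { bestScore = score; best = p; }
        }
        if (best == -1) {  // all partitions at capacity: fall back to least loaded
            best = 0;
            for (int p = 1; p < numPartitions; ++p) if (load[p] < load[best]) best = p;
        }
        part[i] = best;
        ++load[best];
    }
    return part;   // partition id per simulation object
}
```
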
  5. We design and implement parallel graph coloring algorithms on the GPU using two different abstractions: one data-centric (Gunrock), the other linear-algebra-based (GraphBLAS). We analyze the impact of variations of a baseline independent-set algorithm on quality and runtime, and study how optimizations such as hashing, avoiding atomics, and a max-min heuristic affect performance. Our Gunrock graph coloring implementation achieves a peak speedup of 2x and a geomean speedup of 1.3x, while producing 1.6x more colors, compared with previous hardwired state-of-the-art implementations on real-world datasets. Our GraphBLAS implementation of Luby's algorithm produces 1.9x fewer colors than the previous state-of-the-art parallel implementation at the cost of 3x extra runtime, and 1.014x fewer colors than a greedy sequential algorithm with a geomean speedup of 2.6x.
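
The shared baseline can be pictured with a serial sketch of independent-set (Luby / Jones-Plassmann style) coloring; the random priorities make each round's winners an independent set that a GPU can color in parallel. The code below is an illustrative reconstruction, not the Gunrock or GraphBLAS implementation.

```cpp
// Serial sketch of the independent-set coloring pattern: each round, vertices
// whose random priority beats all uncolored neighbors form an independent set
// and take the smallest color unused by their already-colored neighbors.
#include <random>
#include <vector>

std::vector<int> isColoring(const std::vector<std::vector<int>>& adj, unsigned seed = 1) {
    const int n = static_cast<int>(adj.size());
    std::vector<int> color(n, -1);
    std::vector<double> priority(n);
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    for (double& p : priority) p = dist(gen);

    int remaining = n;
    while (remaining > 0) {
        for (int v = 0; v < n; ++v) {
            if (color[v] != -1) continue;
            // v wins this round if it out-prioritizes every uncolored neighbor
            // (ties broken by vertex id).
            bool isMax = true;
            for (int u : adj[v])
                if (color[u] == -1 &&
                    (priority[u] > priority[v] ||
                     (priority[u] == priority[v] && u > v))) { isMax = false; break; }
            if (!isMax) continue;
            // Assign the smallest color not used by colored neighbors.
            std::vector<bool> used(adj[v].size() + 1, false);
            for (int u : adj[v])
                if (color[u] >= 0 && color[u] < static_cast<int>(used.size())) used[color[u]] = true;
            int c = 0;
            while (used[c]) ++c;
            color[v] = c;
            --remaining;
        }
    }
    return color;   // one color index per vertex
}
```
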