skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Sequential and Shared-Memory Parallel Algorithms for Partitioned Local Depths
In this work, we design, analyze, and optimize sequential and shared-memory parallel algorithms for partitioned local depths (PaLD). Given a set of data points and pairwise distances, PaLD is a method for identifying strength of pairwise relationships based on relative distances, enabling the identification of strong ties within dense and sparse communities even if their sizes and within-community absolute distances vary greatly. We design two algorithmic variants that perform community structure analysis through triplet comparisons of pairwise distances. We present theoretical analyses of computation and communication costs and prove that the sequential algorithms are communication optimal, up to constant factors. We introduce performance optimization strategies that yield sequential speedups of up to 29x over a baseline sequential implementation and parallel speedups of up to 26.2x over optimized sequential implementations using up to 32 threads on an Intel multicore CPU.  more » « less
Award ID(s):
2106920
PAR ID:
10510807
Author(s) / Creator(s):
;
Publisher / Repository:
SIAM
Date Published:
Journal Name:
Proceedings of the 2024 SIAM Conference on Parallel Processing for Scientific Computing
Page Range / eLocation ID:
53--64
Format(s):
Medium: X
Location:
Baltimore, MD
Sponsoring Org:
National Science Foundation
More Like this
  1. Butterflies are the smallest non-trivial subgraph in bipartite graphs, and therefore having efficient computations for analyzing them is crucial to improving the quality of certain applications on bipartite graphs. In this paper, we design a framework called ParButterfly that contains new parallel algorithms for the following problems on processing butterflies: global counting, per-vertex counting, per-edge counting, tip decomposition (vertex peeling), and wing decomposition (edge peeling). The main component of these algorithms is aggregating wedges incident on subsets of vertices, and our framework supports different methods for wedge aggregation, including sorting, hashing, histogramming, and batching. In addition, ParButterfly supports different ways of ranking the vertices to speed up counting, including side ordering, approximate and exact degree ordering, and approximate and exact complement coreness ordering. For counting, ParButterfly also supports both exact computation as well as approximate computation via graph sparsification. We prove strong theoretical guarantees on the work and span of the algorithms in ParButterfly. We perform a comprehensive evaluation of all of the algorithms in ParButterfly on a collection of real-world bipartite graphs using a 48-core machine. Our counting algorithms obtain significant parallel speedup, outperforming the fastest sequential algorithms by up to 13.6× with a self-relative speedup of up to 38.5×. Compared to general subgraph counting solutions, we are orders of magnitude faster. Our peeling algorithms achieve self-relative speedups of up to 10.7× and outperform the fastest sequential baseline by up to several orders of magnitude. 
    more » « less
  2. Stochastic block partitioning (SBP) is a community detection algorithm that is highly accurate even on graphs with a complex community structure, but its inherently serial nature hinders its widespread adoption by the wider scientific community. To make it practical to analyze large real-world graphs with SBP, there is a growing need to parallelize and distribute the algorithm. The current state-of-the-art distributed SBP algorithm is a divide-and-conquer approach that limits communication between compute nodes until the end of inference. This leads to the breaking of computational dependencies, which causes convergence issues as the number of compute nodes increases and when the graph is sufficiently sparse. To address this shortcoming, we introduce EDiSt — an exact distributed stochastic block partitioning algorithm. Under EDiSt, compute nodes periodically share community assignments during inference. Due to this additional communication, EDiSt improves upon the divide-and-conquer algorithm by allowing it to scale out to a larger number of compute nodes without suffering from convergence issues, even on sparse graphs. We show that EDiSt provides speedups of up to 26.9x over the divide-and-conquer approach and speedups up to 44.0x over shared memory parallel SBP when scaled out to 64 compute nodes. 
    more » « less
  3. Matrix multiplication is one of the bottleneck computations for training the weights within deep neural networks. To speed up the training phase, we propose to use faster algorithms for matrix multiplication known as Arbitrary Precision Approximating (APA) algorithms. APA algorithms perform asymptotically fewer arithmetic operations than the classical algorithm, but they compute an approximate result with an error that can be made arbitrarily small in exact arithmetic. Practical APA algorithms provide significant reduction in computation time and still provide enough accuracy for many applications like neural network training. We demonstrate that APA algorithms can be efficiently implemented and parallelized for multicore CPUs to obtain up to 28% and 21% speedups over the fastest implementation of the classical algorithm using one core and 12 cores, respectively. Furthermore, using these algorithms to train a Multi-Layer Perceptron (MLP) network yields no significant reduction in the training or testing error. Our performance results on a large MLP network show overall sequential and multithreaded performance improvements of up to 25% and 13%, respectively. We also demonstrate up to 15% improvement when training the fully connected layers of the VGG-19 image classification network. 
    more » « less
  4. The matricized-tensor times Khatri-Rao product (MTTKRP) computation is the typical bottleneck in algorithms for computing a CP decomposition of a tensor. In order to develop high performance sequential and parallel algorithms, we establish communication lower bounds that identify how much data movement is required for this computation in the case of dense tensors. We also present sequential and parallel algorithms that attain the lower bounds and are therefore communication optimal. In particular, we show that the structure of the computation allows for less communication than the straightforward approach of casting the computation as a matrix multiplication operation. 
    more » « less
  5. Si, Hang; Shepherd, Kendrick M; Zhang, Yongjie Jessica (Ed.)
    Warping large volume meshes has applications in biomechanics, aerodynamics, image processing, and cardiology. However, warping large, real-world meshes is computationally expensive. Existing parallel implementations of mesh warping algorithms do not take advantage of shared-memory and one-sided communication features available in the MPI-3 standard. We describe our parallelization of the finite element-based mesh warping algorithm for tetrahedral meshes. Our implementation is portable across shared and distributed memory architectures, as it takes advantage of shared memory and one-sided communication to precompute neighbor lists in parallel. We then deform a mesh by solving a Poisson boundary value problem and the resulting linear system, which has multiple right-hand sides, in parallel. Our results demonstrate excellent efficiency and strong scalability on up to 32 cores on a single node. Furthermore, we show a 33.9% increase in speedup with 256 cores distributed uniformly across 64 nodes versus our largest single node speedup while observing sublinear speedups overall. 
    more » « less