skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Communication-efficient algorithms for solving pressure Poisson equation for multiphase flows using parallel computers
Numerical solution of partial differential equations on parallel computers using domain decomposition usually requires synchronization and communication among the processors. These operations often have a significant overhead in terms of time and energy. In this paper, we propose communication-efficient parallel algorithms for solving partial differential equations that alleviate this overhead. First, we describe an asynchronous algorithm that removes the requirement of synchronization and checks for termination in a distributed fashion while maintaining the provision to restart iterations if necessary. Then, we build on the asynchronous algorithm to propose an event-triggered communication algorithm that communicates the boundary values to neighboring processors only at certain iterations, thereby reducing the number of messages while maintaining similar accuracy of solution. We demonstrate our algorithms on a successive over-relaxation solver for the pressure Poisson equation arising from variable density incompressible multiphase flows in 3-D and show that our algorithms improve time and energy efficiency.  more » « less
Award ID(s):
1953082 2225978
PAR ID:
10416444
Author(s) / Creator(s):
; ; ;
Editor(s):
Riahi, Mohamed Kamel
Date Published:
Journal Name:
PLOS ONE
Volume:
17
Issue:
11
ISSN:
1932-6203
Page Range / eLocation ID:
e0277940
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Quantum computers can produce a quantum encoding of the solution of a system of differential equations exponentially faster than a classical algorithm can produce an explicit description. However, while high-precision quantum algorithms for linear ordinary differential equations are well established, the best previous quantum algorithms for linear partial differential equations (PDEs) have complexity p o l y ( 1 / ϵ ) , where ϵ is the error tolerance. By developing quantum algorithms based on adaptive-order finite difference methods and spectral methods, we improve the complexity of quantum algorithms for linear PDEs to be p o l y ( d , log ⁡ ( 1 / ϵ ) ) , where d is the spatial dimension. Our algorithms apply high-precision quantum linear system algorithms to systems whose condition numbers and approximation errors we bound. We develop a finite difference algorithm for the Poisson equation and a spectral algorithm for more general second-order elliptic equations. 
    more » « less
  2. We consider the problem of orchestrating the execution of workflow applications structured as Directed Acyclic Graphs (DAGs) on parallel computing platforms that are subject to fail-stop failures. The objective is to minimize expected overall execution time, or makespan. A solution to this problem consists of a schedule of the workflow tasks on the available processors and of a decision of which application data to checkpoint to stable storage, so as to mitigate the impact of processor failures. For general DAGs this problem is hopelessly intractable. In fact, given a solution, computing its expected makespan is still a difficult problem. To address this challenge, we consider a restricted class of graphs, Minimal Series-Parallel Graphs (M-SPGS). It turns out that many real-world workflow applications are naturally structured as M-SPGS. For this class of graphs, we propose a recursive list-scheduling algorithm that exploits the M-SPG structure to assign sub-graphs to individual processors, and uses dynamic programming to decide which tasks in these sub-gaphs should be checkpointed. Furthermore, it is possible to efficiently compute the expected makespan for the solution produced by this algorithm, using a first-order approximation of task weights and existing evaluation algorithms for 2-state probabilistic DAGs. We assess the performance of our algorithm for production workflow configurations, comparing it to (i) an approach in which all application data is checkpointed, which corresponds to the standard way in which most production workflows are executed today; and (ii) an approach in which no application data is checkpointed. Our results demonstrate that our algorithm strikes a good compromise between these two approaches, leading to lower checkpointing overhead than the former and to better resilience to failure than the latter. 
    more » « less
  3. A b-matching is a subset of edges M such that at most b(v) edges in M are incident on each vertex v, where b(v) is specified. We present a distributed-memory parallel algorithm, \bsuitor, that computes a b-matching with more than half the maximum weight in a graph with weights on the edges. The approximation algorithm is designed to have high concurrency and low time complexity. We organize the implementation of the algorithm in terms of asynchronous super-steps that combine computation and communication, and balance the computational work and frequency of communication to obtain high performance. Since the performance of the b-suitor algorithm is strongly influenced by communication, we present several strategies to reduce the communication volume. We implement the algorithm using a hybrid strategy where inter-node communication uses MPI and intra-node computation is done with OpenMP threads. We demonstrate strong and weak scaling of b-suitor up to 16,000 processors on two supercomputers at NERSC. We compute a b-matching in a graph with 2 billion edges in under 4 seconds using 16,000 processors. 
    more » « less
  4. Mutzel, Petra; Prezza, Nicola (Ed.)
    We describe a parallel approximation algorithm for maximizing monotone submodular functions subject to hereditary constraints on distributed memory multiprocessors. Our work is motivated by the need to solve submodular optimization problems on massive data sets, for practical contexts such as data summarization, machine learning, and graph sparsification. Our work builds on the randomized distributed RandGreeDI algorithm, proposed by Barbosa, Ene, Nguyen, and Ward (2015). This algorithm computes a distributed solution by randomly partitioning the data among all the processors and then employing a single accumulation step in which all processors send their partial solutions to one processor. However, for large problems, the accumulation step exceeds the memory available on a processor, and the processor which performs the accumulation becomes a computational bottleneck. Hence we propose a generalization of the RandGreeDI algorithm that employs multiple accumulation steps to reduce the memory required. We analyze the approximation ratio and the time complexity of the algorithm (in the BSP model). We evaluate the new GreedyML algorithm on three classes of problems, and report results from large-scale data sets with millions of elements. The results show that the GreedyML algorithm can solve problems where the sequential Greedy and distributed RandGreeDI algorithms fail due to memory constraints. For certain computationally intensive problems, the GreedyML algorithm is faster than the RandGreeDI algorithm. The observed approximation quality of the solutions computed by the GreedyML algorithm closely matches those obtained by the RandGreeDI algorithm on these problems. 
    more » « less
  5. Computing strongly connected components (SCC) is among the most fundamental problems in graph analytics. Given the large size of today's real-world graphs, parallel SCC implementation is increasingly important. SCC is challenging in the parallel setting and is particularly hard on large-diameter graphs. Many existing parallel SCC implementations can be even slower than Tarjan's sequential algorithm on large-diameter graphs. To tackle this challenge, we propose an efficient parallel SCC implementation using a new parallel reachability approach. Our solution is based on a novel idea referred to as vertical granularity control (VGC). It breaks the synchronization barriers to increase parallelism and hide scheduling overhead. To use VGC in our SCC algorithm, we also design an efficient data structure called the parallel hash bag. It uses parallel dynamic resizing to avoid redundant work in maintaining frontiers (vertices processed in a round). We implement the parallel SCC algorithm by Blelloch et al. (J. ACM, 2020) using our new parallel reachability approach. We compare our implementation to the state-of-the-art systems, including GBBS, iSpan, Multi-step, and our highly optimized Tarjan's (sequential) algorithm, on 18 graphs, including social, web, k-NN, and lattice graphs. On a machine with 96 cores, our implementation is the fastest on 16 out of 18 graphs. On average (geometric means) over all graphs, our SCC is 6.0× faster than the best previous parallel code (GBBS), 12.8× faster than Tarjan's sequential algorithms, and 2.7× faster than the best existing implementation on each graph. We believe that our techniques are of independent interest. We also apply our parallel hash bag and VGC scheme to other graph problems, including connectivity and least-element lists (LE-lists). Our implementations improve the performance of the state-of-the-art parallel implementations for these two problems. 
    more » « less