skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Low-Depth Parallel Algorithms for the Binary-Forking Model without Atomics
The binary-forking model is a parallel computation model, formally defined by Blelloch et al., in which a thread can fork a concurrent child thread, recursively and asynchronously. The model incurs a cost of Theta(log n) to spawn or synchronize n tasks or threads. The binary-forking model realistically captures the performance of parallel algorithms implemented using modern multithreaded programming languages on multicore shared-memory machines. In contrast, the widely studied theoretical PRAM model does not consider the cost of spawning and synchronizing threads, and as a result, algorithms achieving optimal performance bounds in the PRAM model may not be optimal in the binary-forking model. Often, algorithms need to be redesigned to achieve optimal performance bounds in the binary-forking model and the non-constant synchronization cost makes the task challenging. In this paper, we show that in the binary-forking model we can achieve optimal or near-optimal span with negligible or no asymptotic blowup in work for comparison-based sorting, Strassen's matrix multiplication (MM), and the Fast Fourier Transform (FFT). Our major results are as follows: (1) A randomized comparison-based sorting algorithm with optimal O(log n) span and O(nlog n) work, both w.h.p. in n. (2) An optimal O(log n) span algorithm for Strassen's matrix multiplication (MM) with only a loglog n - factor blow-up in work as well as a near-optimal O(log n loglog log n) span algorithm with no asymptotic blow-up in work. (3) A near-optimal O(log n logloglog n) span Fast Fourier Transform (FFT) algorithm with less than a log n-factor blow-up in work for all practical values of n (i.e., n le 10 ^10,000)  more » « less
Award ID(s):
1938180 2106999 2118620
PAR ID:
10298564
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
Annual ACM Symposium on Parallelism in Algorithms and Architectures
ISSN:
1548-6109
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    The binary-forking model is a parallel computation model, formally defined by Blelloch et al., in which a thread can fork a concurrent child thread, recursively and asynchronously. The model incurs a cost of Theta(log n) to spawn or synchronize n tasks or threads. The binary-forking model realistically captures the performance of parallel algorithms implemented using modern multithreaded programming languages on multicore shared-memory machines. In contrast, the widely studied theoretical PRAM model does not consider the cost of spawning and synchronizing threads, and as a result, algorithms achieving optimal performance bounds in the PRAM model may not be optimal in the binary-forking model. Often, algorithms need to be redesigned to achieve optimal performance bounds in the binary-forking model and the non-constant synchronization cost makes the task challenging. In this paper, we show that in the binary-forking model we can achieve optimal or near-optimal span with negligible or no asymptotic blowup in work for comparison-based sorting, Strassen's matrix multiplication (MM), and the Fast Fourier Transform (FFT). Our major results are as follows: (1) A randomized comparison-based sorting algorithm with optimal O(log n) span and O(nlog n) work, both w.h.p. in n. (2) An optimal O(log n) span algorithm for Strassen's matrix multiplication (MM) with only a loglog n - factor blow-up in work as well as a near-optimal O(log n loglog log n) span algorithm with no asymptotic blow-up in work. (3) A near-optimal O(log n logloglog n) span Fast Fourier Transform (FFT) algorithm with less than a log n-factor blow-up in work for all practical values of n (i.e., n le 10 ^10,000 ). 
    more » « less
  2. Azar, Yossi (Ed.)
    A data-oblivious algorithm is an algorithm whose memory access pattern is independent of the input values. We initiate the study of parallel data oblivious algorithms on realistic multicores, best captured by the binary fork-join model of computation. We present a data-oblivious CREW binary fork-join sorting algorithm with optimal total work and optimal (cache-oblivious) cache complexity, and in O(łog n łog łog n) span (i.e., parallel time); these bounds match the best-known bounds for binary fork-join cache-efficient insecure algorithms. Using our sorting algorithm as a core primitive, we show how to data-obliviously simulate general PRAM algorithms in the binary fork-join model with non-trivial efficiency, and we present data-oblivious algorithms for several applications including list ranking, Euler tour, tree contraction, connected components, and minimum spanning forest. All of our data oblivious algorithms have bounds that either match or improve over the best known bounds for insecure algorithms. Complementing these asymptotically efficient results, we present a practical variant of our sorting algorithm that is self-contained and potentially implementable. It has optimal caching cost, and it is only a łog łog n factor off from optimal work and about a łog n factor off in terms of span. We also present an EREW variant with optimal work and caching cost, and with the same asymptotic span. 
    more » « less
  3. Dynamic trees are a well-studied and fundamental building block of dynamic graph algorithms dating back to the seminal work of Sleator and Tarjan [STOC'81, (1981), pp. 114-122]. The problem is to maintain a tree subject to online edge insertions and deletions while answering queries about the tree, such as the heaviest weight on a path, etc. In the parallel batch-dynamic setting, the goal is to process batches of edge updates work efficiently in low (polylog n) span. Two work-efficient algorithms are known: batch-parallel Euler Tour Trees by Tseng et al. [ALENEX'19, (2019), pp. 92--106] and parallel Rake-Compress (RC) Trees by Acar et al. [ESA'20, (2020), pp. 2:1--2:23]. Both however are randomized and work efficient in expectation. Several downstream results that use these data structures (and indeed to the best of our knowledge, all known work-efficient parallel batch-dynamic graph algorithms) are therefore also randomized. In this work, we give the first deterministic work-efficient solution to the problem. Our algorithm maintains a parallel RC-Tree on n vertices subject to batches of k edge updates deterministically in worst-case O(k log(1 + n/k)) work and O(log n loglog k) span on the Common-CRCW PRAM. We also show how to improve the span of the randomized algorithm from O(log n log* n) to O(log n). Lastly, as a result of our new deterministic algorithm, we also derandomize several downstream results that make use of parallel batch-dynamic dynamic trees, previously for which the only efficient solutions were randomized. 
    more » « less
  4. We study the connections between sorting and the binary search tree (BST) model, with an aim towards showing that the fields are connected more deeply than is currently appreciated. While any BST can be used to sort by inserting the keys one-by-one, this is a very limited relationship and importantly says nothing about parallel sorting. We show what we believe to be the first formal relationship between the BST model and sorting. Namely, we show that a large class of sorting algorithms, which includes mergesort, quicksort, insertion sort, and almost every instance-optimal sorting algorithm, are equivalent in cost to offline BST algorithms. Our main theoretical tool is the geometric interpretation of the BST model introduced by Demaine et al. [18], which finds an equivalence between searches on a BST and point sets in the plane satisfying a certain property. To give an example of the utility of our approach, we introduce the log-interleave bound, a measure of the information-theoretic complexity of a permutation π, which is within a lg lg n multiplicative factor of a known lower bound in the BST model; we also devise a parallel sorting algorithm with polylogarithmic span that sorts a permutation π using comparisons proportional to its log-interleave bound. Our aforementioned result on sorting and offline BST algorithms can be used to show existence of an offline BST algorithm whose cost is within a constant factor of the log-interleave bound of any permutation π. 
    more » « less
  5. Parallel radix sorting has been extensively studied in the literature for decades. However, the most efficient implementations require auxiliary memory proportional to the input size, which can be prohibitive for large inputs. The standard serial in-place radix sorting algorithm is based on swapping each key to its correct place in the array on each level of recursion. However, it is not straightforward to parallelize this algorithm due to dependencies among the swaps. This paper presents Regions Sort, a new parallel in-place radix sorting algorithm that is efficient both in theory and in practice. Our algorithm uses a graph structure to model dependencies across elements that need to be swapped, and generates independent tasks from this graph that can be executed in parallel. For sorting $$n$$ integers from a range $$r$$, and a parameter $$K$$, Regions Sort requires only $$O(K\log r\log n)$$ auxiliary memory. Our algorithm requires $$O(n\log r)$$ work and $$O((n/K+\log K)\log r)$$ span, making it work-efficient and highly parallel. In addition, we present several optimizations that significantly improve the empirical performance of our algorithm. We compare the performance of Regions Sort to existing parallel in-place and out-of-place sorting algorithms on a variety of input distributions and show that Regions Sort is faster than optimized out-of-place radix sorting and comparison sorting algorithms, and is almost always faster than the fastest publicly-available in-place sorting algorithm. 
    more » « less