The binaryforking model is a parallel computation model, formally defined by Blelloch et al., in which a thread can fork a concurrent child thread, recursively and asynchronously. The model incurs a cost of Theta(log n) to spawn or synchronize n tasks or threads. The binaryforking model realistically captures the performance of parallel algorithms implemented using modern multithreaded programming languages on multicore sharedmemory machines. In contrast, the widely studied theoretical PRAM model does not consider the cost of spawning and synchronizing threads, and as a result, algorithms achieving optimal performance bounds in the PRAM model may not be optimal in the binaryforking model. Often, algorithms need to be redesigned to achieve optimal performance bounds in the binaryforking model and the nonconstant synchronization cost makes the task challenging.
In this paper, we show that in the binaryforking model we can achieve optimal or nearoptimal span with negligible or no asymptotic blowup in work for comparisonbased sorting, Strassen's matrix multiplication (MM), and the Fast Fourier Transform (FFT).
Our major results are as follows: (1) A randomized comparisonbased sorting algorithm with optimal O(log n) span and O(nlog n) work, both w.h.p. in n. (2) An optimal O(log n) span algorithm for Strassen's matrix multiplication (MM) with only a loglog n  factor blowup in work as well as a nearoptimal O(log n loglog log n) span algorithm with no asymptotic blowup in work. (3) A nearoptimal O(log n logloglog n) span Fast Fourier Transform (FFT) algorithm with less than a log nfactor blowup in work for all practical values of n (i.e., n le 10 ^10,000)
more »
« less
LowSpan Parallel Algorithms for the BinaryForking Model
The binaryforking model is a parallel computation model, formally defined by Blelloch et al., in which a thread can fork a concurrent child thread, recursively and asynchronously. The model incurs a cost of Theta(log n) to spawn or synchronize n tasks or threads. The binaryforking model realistically captures the performance of parallel algorithms implemented using modern multithreaded programming languages on multicore sharedmemory machines. In contrast, the widely studied theoretical PRAM model does not consider the cost of spawning and synchronizing threads, and as a result, algorithms achieving optimal performance bounds in the PRAM model may not be optimal in the binaryforking model. Often, algorithms need to be redesigned to achieve optimal performance bounds in the binaryforking model and the nonconstant synchronization cost makes the task challenging.
In this paper, we show that in the binaryforking model we can achieve optimal or nearoptimal span with negligible or no asymptotic blowup in work for comparisonbased sorting, Strassen's matrix multiplication (MM), and the Fast Fourier Transform (FFT).
Our major results are as follows: (1) A randomized comparisonbased sorting algorithm with optimal O(log n) span and O(nlog n) work, both w.h.p. in n. (2) An optimal O(log n) span algorithm for Strassen's matrix multiplication (MM) with only a loglog n  factor blowup in work as well as a nearoptimal O(log n loglog log n) span algorithm with no asymptotic blowup in work. (3) A nearoptimal O(log n logloglog n) span Fast Fourier Transform (FFT) algorithm with less than a log nfactor blowup in work for all practical values of n (i.e., n le 10 ^10,000 ).
more »
« less
 NSFPAR ID:
 10298539
 Date Published:
 Journal Name:
 Proc. 33st ACM on Symposium on Parallelism in Algorithms and Architectures (SPAA)
 Page Range / eLocation ID:
 22 to 34
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


Azar, Yossi (Ed.)A dataoblivious algorithm is an algorithm whose memory access pattern is independent of the input values. We initiate the study of parallel data oblivious algorithms on realistic multicores, best captured by the binary forkjoin model of computation. We present a dataoblivious CREW binary forkjoin sorting algorithm with optimal total work and optimal (cacheoblivious) cache complexity, and in O(łog n łog łog n) span (i.e., parallel time); these bounds match the bestknown bounds for binary forkjoin cacheefficient insecure algorithms. Using our sorting algorithm as a core primitive, we show how to dataobliviously simulate general PRAM algorithms in the binary forkjoin model with nontrivial efficiency, and we present dataoblivious algorithms for several applications including list ranking, Euler tour, tree contraction, connected components, and minimum spanning forest. All of our data oblivious algorithms have bounds that either match or improve over the best known bounds for insecure algorithms. Complementing these asymptotically efficient results, we present a practical variant of our sorting algorithm that is selfcontained and potentially implementable. It has optimal caching cost, and it is only a łog łog n factor off from optimal work and about a łog n factor off in terms of span. We also present an EREW variant with optimal work and caching cost, and with the same asymptotic span.more » « less

In several emerging technologies for computer memory (main memory), the cost of reading is significantly cheaper than the cost of writing. Such asymmetry in memory costs poses a fundamentally different model from the RAM for algorithm design. In this paper we study lower and upper bounds for various problems under such asymmetric read and write costs. We consider both the case in which all but O(1) memory has asymmetric cost, and the case of a small cache of symmetric memory. We model both cases using the (M,w)ARAM, in which there is a small (symmetric) memory of size M and a large unbounded (asymmetric) memory, both random access, and where reading from the large memory has unit cost, but writing has cost w >> 1. For FFT and sorting networks we show a lower bound cost of Omega(w*n*log_{w*M}(n)), which indicates that it is not possible to achieve asymptotic improvements with cheaper reads when w is bounded by a polynomial in M. Moreover, there is an asymptotic gap (of min(w,log(n)/log(w*M)) between the cost of sorting networks and comparison sorting in the model. This contrasts with the RAM, and most other models, in which the asymptotic costs are the same. We also show a lower bound for computations on an n*n diamond DAG of Omega(w*n^2/M) cost, which indicates no asymptotic improvement is achievable with fast reads. However, we show that for the minimum edit distance problem (and related problems), which would seem to be a diamond DAG, we can beat this lower bound with an algorithm with only O(w*n^2/(M*min(w^{1/3},M^{1/2}))) cost. To achieve this we make use of a "path sketch" technique that is forbidden in a strict DAG computation. Finally, we show several interesting upper bounds for shortest path problems, minimum spanning trees, and other problems. A common theme in many of the upper bounds is that they require redundant computation and a tradeoff between reads and writes.more » « less

We study the connections between sorting and the binary search tree (BST) model, with an aim towards showing that the fields are connected more deeply than is currently appreciated. While any BST can be used to sort by inserting the keys onebyone, this is a very limited relationship and importantly says nothing about parallel sorting. We show what we believe to be the first formal relationship between the BST model and sorting. Namely, we show that a large class of sorting algorithms, which includes mergesort, quicksort, insertion sort, and almost every instanceoptimal sorting algorithm, are equivalent in cost to offline BST algorithms. Our main theoretical tool is the geometric interpretation of the BST model introduced by Demaine et al. [18], which finds an equivalence between searches on a BST and point sets in the plane satisfying a certain property. To give an example of the utility of our approach, we introduce the loginterleave bound, a measure of the informationtheoretic complexity of a permutation π, which is within a lg lg n multiplicative factor of a known lower bound in the BST model; we also devise a parallel sorting algorithm with polylogarithmic span that sorts a permutation π using comparisons proportional to its loginterleave bound. Our aforementioned result on sorting and offline BST algorithms can be used to show existence of an offline BST algorithm whose cost is within a constant factor of the loginterleave bound of any permutation π.more » « less

null (Ed.)In this paper we consider the following sparse recovery problem. We have query access to a vector 𝐱 ∈ ℝ^N such that x̂ = 𝐅 𝐱 is ksparse (or nearly ksparse) for some orthogonal transform 𝐅. The goal is to output an approximation (in an 𝓁₂ sense) to x̂ in sublinear time. This problem has been wellstudied in the special case that 𝐅 is the Discrete Fourier Transform (DFT), and a long line of work has resulted in sparse Fast Fourier Transforms that run in time O(k ⋅ polylog N). However, for transforms 𝐅 other than the DFT (or closely related transforms like the Discrete Cosine Transform), the question is much less settled. In this paper we give sublineartime algorithms  running in time poly(k log(N))  for solving the sparse recovery problem for orthogonal transforms 𝐅 that arise from orthogonal polynomials. More precisely, our algorithm works for any 𝐅 that is an orthogonal polynomial transform derived from Jacobi polynomials. The Jacobi polynomials are a large class of classical orthogonal polynomials (and include Chebyshev and Legendre polynomials as special cases), and show up extensively in applications like numerical analysis and signal processing. One caveat of our work is that we require an assumption on the sparsity structure of the sparse vector, although we note that vectors with random support have this property with high probability. Our approach is to give a very general reduction from the ksparse sparse recovery problem to the 1sparse sparse recovery problem that holds for any flat orthogonal polynomial transform; then we solve this onesparse recovery problem for transforms derived from Jacobi polynomials. Frequently, sparse FFT algorithms are described as implementing such a reduction; however, the technical details of such works are quite specific to the Fourier transform and moreover the actual implementations of these algorithms do not use the 1sparse algorithm as a black box. In this work we give a reduction that works for a broad class of orthogonal polynomial families, and which uses any 1sparse recovery algorithm as a black box.more » « less