The binaryforking model is a parallel computation model, formally defined by Blelloch et al., in which a thread can fork a concurrent child thread, recursively and asynchronously. The model incurs a cost of Theta(log n) to spawn or synchronize n tasks or threads. The binaryforking model realistically captures the performance of parallel algorithms implemented using modern multithreaded programming languages on multicore sharedmemory machines. In contrast, the widely studied theoretical PRAM model does not consider the cost of spawning and synchronizing threads, and as a result, algorithms achieving optimal performance bounds in the PRAM model may not be optimal in the binaryforking model. Often, algorithms need to be redesigned to achieve optimal performance bounds in the binaryforking model and the nonconstant synchronization cost makes the task challenging. In this paper, we show that in the binaryforking model we can achieve optimal or nearoptimal span with negligible or no asymptotic blowup in work for comparisonbased sorting, Strassen's matrix multiplication (MM), and the Fast Fourier Transform (FFT). Our major results are as follows: (1) A randomized comparisonbased sorting algorithm with optimal O(log n) span and O(nlog n) work, both w.h.p. in n. (2) An optimal O(log n) span algorithm for Strassen's matrix multiplicationmore »
Data Oblivious Algorithms for Multicores
A dataoblivious algorithm is an algorithm whose memory access pattern is independent of the input values. We initiate the study of parallel data oblivious algorithms on realistic multicores, best captured by the binary forkjoin model of computation. We present a dataoblivious CREW binary forkjoin sorting algorithm with optimal total work and optimal (cacheoblivious) cache complexity, and in O(łog n łog łog n) span (i.e., parallel time); these bounds match the bestknown bounds for binary forkjoin cacheefficient insecure algorithms. Using our sorting algorithm as a core primitive, we show how to dataobliviously simulate general PRAM algorithms in the binary forkjoin model with nontrivial efficiency, and we present dataoblivious algorithms for several applications including list ranking, Euler tour, tree contraction, connected components, and minimum spanning forest. All of our data oblivious algorithms have bounds that either match or improve over the best known bounds for insecure algorithms.
Complementing these asymptotically efficient results, we present a practical variant of our sorting algorithm that is selfcontained and potentially implementable. It has optimal caching cost, and it is only a łog łog n factor off from optimal work and about a łog n factor off in terms of span. We also present an EREW variant with more »
 Editors:
 Azar, Yossi
 Award ID(s):
 2008241
 Publication Date:
 NSFPAR ID:
 10278935
 Journal Name:
 SPAA '21
 Page Range or eLocationID:
 373 to 384
 Sponsoring Org:
 National Science Foundation
More Like this


The binaryforking model is a parallel computation model, formally defined by Blelloch et al., in which a thread can fork a concurrent child thread, recursively and asynchronously. The model incurs a cost of Theta(log n) to spawn or synchronize n tasks or threads. The binaryforking model realistically captures the performance of parallel algorithms implemented using modern multithreaded programming languages on multicore sharedmemory machines. In contrast, the widely studied theoretical PRAM model does not consider the cost of spawning and synchronizing threads, and as a result, algorithms achieving optimal performance bounds in the PRAM model may not be optimal in the binaryforking model. Often, algorithms need to be redesigned to achieve optimal performance bounds in the binaryforking model and the nonconstant synchronization cost makes the task challenging. In this paper, we show that in the binaryforking model we can achieve optimal or nearoptimal span with negligible or no asymptotic blowup in work for comparisonbased sorting, Strassen's matrix multiplication (MM), and the Fast Fourier Transform (FFT). Our major results are as follows: (1) A randomized comparisonbased sorting algorithm with optimal O(log n) span and O(nlog n) work, both w.h.p. in n. (2) An optimal O(log n) span algorithm for Strassen's matrix multiplicationmore »

We consider a parallel computational model, the Parallel Persistent Memory model, comprised of P processors, each with a fast local ephemeral memory of limited size, and sharing a large persistent memory. The model allows for each processor to fault at any time (with bounded probability), and possibly restart. When a processor faults, all of its state and local ephemeral memory is lost, but the persistent memory remains. This model is motivated by upcoming nonvolatile memories that are nearly as fast as existing random access memory, are accessible at the granularity of cache lines, and have the capability of surviving power outages. It is further motivated by the observation that in large parallel systems, failure of processors and their caches is not unusual. We present several results for the model, using an approach that breaks a computation into capsules, each of which can be safely run multiple times. For the singleprocessor version we describe how to simulate any program in the RAM, the external memory model, or the ideal cache model with an expected constant factor overhead. For the multiprocessor version we describe how to efficiently implement a workstealing scheduler within the model such that it handles both soft faults, withmore »

Stencil computations are widely used to simulate the change of state of physical systems across a multidimensional grid over multiple timesteps. The stateoftheart techniques in this area fall into three groups: cacheaware tiled looping algorithms, cacheoblivious divideandconquer trapezoidal algorithms, and Krylov subspace methods. In this paper, we present two efficient parallel algorithms for performing linear stencil computations. Current direct solvers in this domain are computationally inefficient, and Krylov methods require manual labor and mathematical training. We solve these problems for linear stencils by using DFT preconditioning on a Krylov method to achieve a direct solver which is both fast and general. Indeed, while all currently available algorithms for solving general linear stencils perform Θ(NT) work, where N is the size of the spatial grid and T is the number of timesteps, our algorithms perform o(NT) work. To the best of our knowledge, we give the first algorithms that use fast Fourier transforms to compute final grid data by evolving the initial data for many timesteps at once. Our algorithms handle both periodic and aperiodic boundary conditions, and achieve polynomially better performance bounds (i.e., computational complexity and parallel runtime) than all other existing solutions. Initial experimental results show that implementations ofmore »

If a parallel program has determinacy race(s), different schedules can result in memory accesses that observe different values  various racedetection tools have been designed to find such bugs. A key component of race detectors is an algorithm for seriesparallel (SP) maintenance, which identifies whether two accesses are logically parallel. This paper describes an asymptotically optimal algorithm, called WSPOrder, for performing SP maintenance in programs with forkjoin (or nested) parallelism. Given a forkjoin program with T1 work and T∞ span, WSPOrder executes it while also maintaining SP relationships in O(T1/P + T∞) time on P processors, which is asymptotically optimal. At the heart of WSPOrder is a workstealing scheduler designed specifically for SP maintenance. We also implemented CRACER, a racedetector based on WSPOrder within the Cilk Plus runtime system, and evaluated its performance on five benchmarks. Empirical results demonstrate that when run sequentially, it performs almost as well as previous best sequential race detectors. More importantly, when run in parallel, it achieves almost as much speedup as the original program without racedetection.