In the parallel paging problem, there are p processors that share a cache of size k. The goal is to partition the cache among the processors over time in order to minimize their average completion time. For this longstanding open problem, we give tight upper and lower bounds of Θ(log p) on the competitive ratio with O(1) resource augmentation.
A key idea in both our algorithms and lower bounds is to relate the problem of parallel paging to the seemingly unrelated problem of green paging. In green paging, there is an energy-optimized processor that can temporarily turn off one or more of its cache banks (thereby reducing power consumption), so that the cache size varies between a maximum size k and a minimum size k/p. The goal is to minimize the total energy consumed by the computation, which is proportional to the integral of the cache size over time.
We show that any efficient solution to green paging can be converted into an efficient solution to parallel paging, and that any lower bound for green paging can be converted into a lower bound for parallel paging, in both cases in a black-box fashion. We then show that, with O(1) resource augmentation, the optimal competitive ratio for deterministic online green paging is Θ(log p), which, in turn, implies the same bounds for deterministic online parallel paging.
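The green paging cost model (energy proportional to the integral of the cache size over time) can be illustrated with a minimal sketch; the LRU policy and the fixed-size cache schedules below are illustrative assumptions, not the paper's algorithm:

```python
# Minimal sketch of the green paging cost model: the cache size may vary
# between k/p and k at each time step, and the energy cost is the discrete
# integral (sum) of the cache size over time. The LRU policy and the two
# fixed-size schedules are illustrative only, not the paper's algorithm.

def green_paging_cost(cache_sizes):
    """Energy cost = discrete integral of the cache size over time."""
    return sum(cache_sizes)

def run_lru(requests, sizes):
    """Serve requests with LRU under a time-varying cache size;
    return (faults, energy)."""
    cache = []  # most-recently-used page at the end
    faults = 0
    for req, size in zip(requests, sizes):
        if req in cache:
            cache.remove(req)  # hit: refresh recency
        else:
            faults += 1        # miss: page must be fetched
        cache.append(req)
        # Shrink to the current cache size, evicting least-recently-used.
        while len(cache) > size:
            cache.pop(0)
    return faults, green_paging_cost(sizes)

k, p = 8, 4
requests = [1, 2, 3, 1, 2, 4, 1, 2]
full = [k] * len(requests)        # always-on cache: few faults, high energy
small = [k // p] * len(requests)  # minimum cache: more faults, low energy
```

Running both schedules on the same trace shows the trade-off green paging must navigate: the full cache incurs 4 faults at energy 64, while the minimum cache incurs 8 faults at energy 16.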
The Parallel Persistent Memory Model
We consider a parallel computational model, the Parallel Persistent Memory model, comprising P processors, each with a fast local ephemeral memory of limited size, and sharing a large persistent memory. The model allows each processor to fault at any time (with bounded probability), and possibly restart. When a processor faults, all of its state and local ephemeral memory is lost, but the persistent memory remains. This model is motivated by upcoming non-volatile memories that are nearly as fast as existing random access memory, are accessible at the granularity of cache lines, and have the capability of surviving power outages. It is further motivated by the observation that in large parallel systems, failure of processors and their caches is not unusual.
We present several results for the model, using an approach that breaks a computation into capsules, each of which can be safely run multiple times. For the single-processor version we describe how to simulate any program in the RAM, the external memory model, or the ideal cache model with an expected constant factor overhead. For the multiprocessor version we describe how to efficiently implement a work-stealing scheduler within the model such that it handles both soft faults, with a processor restarting, and hard faults, with a processor permanently failing. For any multithreaded fork-join computation that is race free, write-after-read conflict free, and has W work, D depth, and C maximum capsule work in the absence of faults, the scheduler guarantees a time bound on the model of O(W/P_A + (D P/P_A) log_{1/(Cf)} W) in expectation, where P is the maximum number of processors, P_A is the average number, and f ≤ 1/(2C) is the probability that a processor faults between successive persistent memory accesses. Within the model, and using the proposed methods, we develop efficient algorithms for parallel prefix sums, merging, sorting, and matrix multiply.
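The capsule idea, breaking the computation into pieces that are safe to re-run after a fault, can be sketched as follows; the commit mechanism and fault model here are simplified assumptions for illustration, not the paper's construction:

```python
import random

# Simplified sketch of fault-tolerant capsule execution: a capsule reads its
# inputs from persistent memory into ephemeral memory, computes, and commits
# its outputs to persistent memory only at the end, so re-running it after a
# fault is safe (idempotent). The fault model (a coin flip before commit) and
# the commit mechanism are illustrative assumptions, not the paper's scheme.

persistent = {"x": 3, "done": set()}  # survives faults

def run_capsule(name, fn, fault_prob, rng):
    """Run a capsule until it completes; a fault loses ephemeral state only."""
    attempts = 0
    while True:
        attempts += 1
        ephemeral = dict(persistent)  # load working state into local memory
        result = fn(ephemeral)        # all work happens on ephemeral state
        if rng.random() < fault_prob:
            continue  # fault before commit: ephemeral state lost, retry
        # Commit to persistent memory exactly once per capsule.
        if name not in persistent["done"]:
            persistent.update(result)
            persistent["done"].add(name)
        return attempts

rng = random.Random(0)
attempts = run_capsule("square", lambda s: {"x": s["x"] ** 2}, 0.5, rng)
# persistent["x"] ends up squared regardless of how many times the capsule
# faulted and restarted along the way.
```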
 Award ID(s):
 1533858
 NSF-PAR ID:
 10080493
 Date Published:
 Journal Name:
 ACM Symposium on Parallelism in Algorithms and Architectures (SPAA)
 Volume:
 30
 Page Range / eLocation ID:
 247 to 258
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


The noisy broadcast model was first studied by [Gallager, 1988], where an n-character input is distributed among n processors, so that each processor receives one input bit. Computation proceeds in rounds, where in each round each processor broadcasts a single character, and each reception is corrupted independently at random with some probability p. [Gallager, 1988] gave an algorithm for all processors to learn the input in O(log log n) rounds with high probability. Later, a matching lower bound of Omega(log log n) was given by [Goyal et al., 2008]. We study a relaxed version of this model where each reception is erased and replaced with a '?' independently with probability p, so the processors have knowledge of whether a bit has been corrupted. In this relaxed model, we break past the lower bound of [Goyal et al., 2008] and obtain an O(log^* n)-round algorithm for all processors to learn the input with high probability. We also show an O(1)-round algorithm for the same problem when the alphabet size is Omega(poly(n)).
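A single round of the erasure variant can be simulated directly; this hypothetical sketch models only the channel (independent erasure with probability p, detectable by receivers), not the O(log^* n) protocol itself:

```python
import random

# Sketch of the erasure variant of the noisy broadcast model: each broadcast
# symbol is independently replaced by '?' with probability p at each receiver,
# so, unlike the bit-flip model, receivers know exactly which receptions were
# lost. This simulates one round of broadcasts; it is not the paper's protocol.

def broadcast_round(bits, p, rng):
    """Each of n processors broadcasts its bit; every reception is erased
    independently with probability p. Returns the n x n reception matrix,
    where row i holds what processor i received from each sender j."""
    n = len(bits)
    return [['?' if rng.random() < p else bits[j] for j in range(n)]
            for _ in range(n)]

rng = random.Random(1)
bits = [0, 1, 1, 0]
received = broadcast_round(bits, p=0.25, rng=rng)
# Every non-erased reception equals the true bit: erasures are detectable,
# which is exactly the extra power this relaxed model grants.
```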


Because of its many desirable properties, such as its ability to control effects and thus potentially disastrous race conditions, functional programming offers a viable approach to programming modern multicore computers. Over the past decade several parallel functional languages, typically based on dialects of ML and Haskell, have been developed. These languages, however, have traditionally underperformed procedural languages (such as C and Java). The primary reason for this is their hunger for memory, which only grows with parallelism, causing traditional memory management techniques to buckle under increased demand for memory. Recent work opened a new angle of attack on this problem by identifying a memory property of determinacy-race-free parallel programs, called disentanglement, which limits the knowledge of concurrent computations about each other's memory allocations. The work has shown some promise in delivering good time scalability. In this paper, we present provably space-efficient automatic memory management techniques for determinacy-race-free functional parallel programs, allowing both pure and imperative programs where memory may be destructively updated. We prove that for a program with sequential live memory of R*, any P-processor garbage-collected parallel run requires at most O(R* · P) memory. We also prove a work bound of O(W + R* · P) for P-processor executions, accounting also for the cost of garbage collection. To achieve these results, we integrate thread scheduling with memory management. The idea is to coordinate memory allocation and garbage collection with thread scheduling decisions so that each processor can allocate memory without synchronization and independently collect a portion of memory by consulting a collection policy, which we formulate. The collection policy is fully distributed and does not require communicating with other processors.
We show that the approach is practical by implementing it as an extension to the MPL compiler for Parallel ML. Our experimental results confirm our theoretical bounds and show that the techniques perform and scale well.
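The idea of synchronization-free local allocation with independent per-processor collection can be sketched with a toy heap; the structures below are hypothetical illustrations of the scheme's shape, not MPL's collector:

```python
# Toy sketch of processor-private heaps: each processor allocates into its
# own local heap with no locking, and each heap is collected independently
# against a live set. The classes and live-set policy here are illustrative
# assumptions, not the actual MPL runtime.

class LocalHeap:
    def __init__(self):
        self.objects = []           # allocated cells, processor-private

    def alloc(self, obj):
        self.objects.append(obj)    # no synchronization: only one owner
        return len(self.objects) - 1

    def collect(self, live):
        # Independently discard everything not in the live set; no
        # communication with other processors is needed.
        self.objects = [o for o in self.objects if o in live]

P = 4
heaps = [LocalHeap() for _ in range(P)]
for p_id, heap in enumerate(heaps):
    for i in range(10):
        heap.alloc((p_id, i))       # 10 allocations per processor

# Suppose only the first 3 objects per processor remain live.
live = {(p_id, i) for p_id in range(P) for i in range(3)}
for heap in heaps:
    heap.collect(live)
total = sum(len(h.objects) for h in heaps)  # 3 live objects per processor
```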

Based on improvements to an existing three-step model for cache timing-based attacks, this work presents 88 Strong types of theoretical timing-based vulnerabilities in processor caches. It also presents and implements a new benchmark suite that can be used to test whether a processor cache is vulnerable to one of the attacks. In total, there are 1094 automatically generated test programs which cover the 88 Strong theoretical vulnerabilities. The benchmark suite generates the Cache Timing Vulnerability Score (CTVS), which can be used to evaluate how vulnerable a specific cache implementation is to different attacks. A smaller CTVS means the design is more secure. Evaluation is conducted on commodity Intel and AMD processors and shows how differences in processor implementations can result in different types of attacks that they are vulnerable to. Further, the benchmarks and the CTVS can be used in simulation to help designers of new secure processors and caches evaluate their designs' susceptibility to cache timing-based attacks.
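One plausible way a score like the CTVS could aggregate benchmark results is as the fraction of test programs on which the cache under test shows an observable timing difference; the scoring function and the made-up test outcomes below are assumptions for illustration, not the paper's exact definition:

```python
# Hypothetical sketch of aggregating cache timing benchmark outcomes into a
# single score: here the score is simply the fraction of generated test
# programs that detect a vulnerability, so a smaller score means a more
# secure design. The counts below are invented for illustration.

def ctvs(results):
    """results: one boolean per test program, True if that test detected
    an exploitable timing difference on the cache under evaluation."""
    return sum(results) / len(results)

# Two imaginary designs evaluated on the suite's 1094 test programs.
design_a = [True] * 200 + [False] * 894   # 200 of 1094 tests vulnerable
design_b = [True] * 50 + [False] * 1044   # 50 of 1094 tests vulnerable
# Design B scores lower than design A, i.e. it is the more secure design
# under this (hypothetical) scoring scheme.
```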