For any forest G = (V, E) it is possible to orient the edges E so that no vertex in V has outdegree greater than 1. This paper considers the incremental edge-orientation problem, in which the edges E arrive over time and the algorithm must maintain a low-outdegree edge orientation at all times. We give an algorithm that maintains a maximum outdegree of 3 while flipping at most O(log log n) edge orientations per edge insertion, with high probability in n. The algorithm requires worst-case time O(log n log log n) per insertion and takes amortized time O(1). The previous state of the art required up to O(log n / log log n) edge flips per insertion. We then apply our edge-orientation results to the problem of dynamic Cuckoo hashing. The problem of designing simple families H of hash functions that are compatible with Cuckoo hashing has received extensive attention. These families H are known to satisfy static guarantees, but do not typically come with dynamic guarantees for the running time of inserts and deletes. We show how to transform static guarantees (for 1-associativity) into near-state-of-the-art dynamic guarantees (for O(1)-associativity) in a black-box fashion. Rather than relying on the family H …
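The opening claim about forests is easy to make concrete: root each tree and orient every edge from child to parent, so every vertex is the source of at most one edge. The sketch below is an illustrative static construction (function names are my own), not the paper's incremental algorithm.

```python
from collections import defaultdict

def orient_forest(n, edges):
    """Orient the edges of a forest on vertices 0..n-1 so that every
    vertex has outdegree at most 1 (each edge points child -> parent).

    Illustrative sketch only; the paper's incremental algorithm, which
    bounds the number of flips per insertion, is more involved.
    """
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    oriented = []          # directed edges (u, v) meaning u -> v
    visited = [False] * n
    for root in range(n):  # handle every tree of the forest
        if visited[root]:
            continue
        visited[root] = True
        stack = [root]
        while stack:
            parent = stack.pop()
            for child in adj[parent]:
                if not visited[child]:
                    visited[child] = True
                    oriented.append((child, parent))  # child points to parent
                    stack.append(child)
    return oriented

# Example: a path 0-1-2 plus an isolated edge 3-4.
print(orient_forest(5, [(0, 1), (1, 2), (3, 4)]))
# Every vertex appears as the source of at most one directed edge.
```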

The binary-forking model is a parallel computation model, formally defined by Blelloch et al., in which a thread can fork a concurrent child thread, recursively and asynchronously. The model incurs a cost of Θ(log n) to spawn or synchronize n tasks or threads. The binary-forking model realistically captures the performance of parallel algorithms implemented using modern multithreaded programming languages on multicore shared-memory machines. In contrast, the widely studied theoretical PRAM model does not consider the cost of spawning and synchronizing threads, and as a result, algorithms achieving optimal performance bounds in the PRAM model may not be optimal in the binary-forking model. Often, algorithms need to be redesigned to achieve optimal performance bounds in the binary-forking model, and the non-constant synchronization cost makes the task challenging. In this paper, we show that in the binary-forking model we can achieve optimal or near-optimal span with negligible or no asymptotic blowup in work for comparison-based sorting, Strassen's matrix multiplication (MM), and the Fast Fourier Transform (FFT). Our major results are as follows: (1) A randomized comparison-based sorting algorithm with optimal O(log n) span and O(n log n) work, both w.h.p. in n. (2) An optimal O(log n) span algorithm for Strassen's matrix multiplication …
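To picture why spawning n tasks costs Θ(log n) in this model, consider a recursive binary fork: each level of recursion is one fork, so n leaves sit at depth ⌈log₂ n⌉. The sketch below is my own sequential mock-up of that spawn tree, not code from the paper.

```python
def binary_fork(lo, hi, task, depth=0):
    """Spawn task(i) for i in [lo, hi) by recursive binary splitting.

    In the binary-forking model each level of this recursion is a fork,
    so reaching n leaves costs Theta(log n) span.  This sequential sketch
    only mimics the shape of the spawn tree and returns its depth.
    """
    if hi - lo == 1:
        task(lo)
        return depth
    mid = (lo + hi) // 2
    left = binary_fork(lo, mid, task, depth + 1)   # "forked" child
    right = binary_fork(mid, hi, task, depth + 1)  # continuing thread
    return max(left, right)                        # implicit join

results = []
depth = binary_fork(0, 8, results.append)
print(results, "spawn-tree depth:", depth)  # depth == 3 == log2(8)
```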

Storage devices have complex performance profiles, including costs to initiate IOs (e.g., seek times in hard drives), parallelism and bank conflicts (in SSDs), costs to transfer data, and firmware-internal operations. The Disk-access Machine (DAM) model simplifies reality by assuming that storage devices transfer data in blocks of size B and that all transfers have unit cost. Despite its simplifications, the DAM model is reasonably accurate. In fact, if B is set to the half-bandwidth point, where the latency and bandwidth of the hardware are equal, then the DAM approximates the IO cost on any hardware to within a factor of 2. Furthermore, the DAM model explains the popularity of B-trees in the 1970s and the current popularity of Bε-trees and log-structured merge trees. But it fails to explain why some B-trees use small nodes, whereas all Bε-trees use large nodes. In a DAM, all IOs, and hence all nodes, are the same size. In this article, we show that the affine and PDAM models, which are small refinements of the DAM model, yield a surprisingly large improvement in predictability without sacrificing ease of use. We present benchmarks on a large collection of storage devices showing that the …
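The half-bandwidth claim is just arithmetic in an affine cost model, where an I/O of x bytes costs a fixed latency plus x divided by bandwidth. A small sketch with invented device numbers:

```python
def affine_io_cost(x_bytes, latency_s, bandwidth_bytes_per_s):
    """Affine model: every I/O pays a fixed setup latency plus transfer time."""
    return latency_s + x_bytes / bandwidth_bytes_per_s

# Hypothetical disk: 5 ms seek, 100 MB/s bandwidth (made-up numbers).
latency = 5e-3
bandwidth = 100e6

# Half-bandwidth point: the transfer time of B bytes equals the latency.
B = latency * bandwidth          # 500,000 bytes here
print("half-bandwidth block:", int(B), "bytes")

# Any I/O of at most B bytes costs between 1x and 2x the latency, so
# charging one unit per block of size B (the DAM model) is accurate
# to within a factor of 2.
for x in [1, B / 4, B / 2, B]:
    cost = affine_io_cost(x, latency, bandwidth)
    print(f"{int(x):>8} bytes -> {cost / latency:.2f} latencies")
```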

The classical paging problem, introduced by Sleator and Tarjan in 1985, formalizes the problem of caching pages in RAM in order to minimize IOs. Their online formulation ignores the cost of address translation: programs refer to data via virtual addresses, and these must be translated into physical locations in RAM. Although the cost of an individual address translation is much smaller than that of an IO, every memory access involves an address translation, whereas IOs can be infrequent. In practice, one can spend money to avoid paging by overprovisioning RAM; in contrast, address translation is effectively unavoidable. Thus address-translation costs can sometimes dominate paging costs, and systems must simultaneously optimize both. To mitigate the cost of address translation, all modern CPUs have translation lookaside buffers (TLBs), which are hardware caches of common address translations. What makes TLBs interesting is that a single TLB entry can potentially encode the address translation for many addresses. This is typically achieved via the use of huge pages, which translate runs of contiguous virtual addresses to runs of contiguous physical addresses. Huge pages reduce TLB misses at the cost of increasing the IOs needed to maintain contiguity in RAM. This tradeoff between TLB misses …
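The leverage of huge pages comes down to TLB reach: one TLB entry covers one page, so larger pages multiply the memory a fixed-size TLB can translate without missing. The figures below use common x86-64 page sizes and an assumed TLB size; they are illustrative, not measurements from the article.

```python
def tlb_reach(entries, page_bytes):
    """Bytes of address space covered by a TLB with the given number of entries."""
    return entries * page_bytes

TLB_ENTRIES = 1536  # assumed, roughly the order of magnitude of an L2 TLB
for name, page in [("4 KiB base pages", 4 * 1024),
                   ("2 MiB huge pages", 2 * 1024 * 1024),
                   ("1 GiB huge pages", 1024 ** 3)]:
    reach = tlb_reach(TLB_ENTRIES, page)
    print(f"{name:>18}: reach = {reach / 2**20:12,.0f} MiB")
# One 2 MiB entry replaces 512 base-page entries, which is why huge pages
# cut TLB misses -- at the cost of keeping 2 MiB regions physically
# contiguous in RAM.
```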

Today’s filters, such as quotient, cuckoo, and Morton, have a tradeoff between space and speed; even when moderately full (e.g., 50%–75% full), their performance degrades nontrivially. The result is that today’s systems designers are forced to choose between speed and space usage. In this paper, we present the vector quotient filter (VQF). Locally, the VQF is based on Robin Hood hashing, like the quotient filter, but uses power-of-two-choices hashing to reduce the variance of runs, and thus offers consistent, high throughput across load factors. Power-of-two-choices hashing also makes it more amenable to concurrent updates, compared to the cuckoo filter and variants. Finally, the vector quotient filter is designed to exploit SIMD instructions so that all operations have O(1) cost, independent of the size of the filter or its load factor. We show that the vector quotient filter is 2x faster for inserts compared to the Morton filter (a cuckoo filter variant and the state of the art for inserts) and has similar lookup and deletion performance as the cuckoo filter (which is fastest for queries and deletes), despite having a simpler design and implementation. The vector quotient filter has minimal performance decline at high load factors, a problem that has plagued modern …
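The power-of-two-choices idea the VQF borrows can be sketched generically: hash each key to two candidate blocks and insert into the emptier one, which keeps block loads tightly concentrated. The following toy (hypothetical parameters, no SIMD, no fingerprints) illustrates only that placement rule, not the VQF itself.

```python
import hashlib

NUM_BLOCKS = 64
blocks = [[] for _ in range(NUM_BLOCKS)]

def _block_indices(key):
    """Two candidate blocks derived from independent halves of one hash."""
    digest = hashlib.blake2b(key.encode(), digest_size=8).digest()
    h1 = int.from_bytes(digest[:4], "little") % NUM_BLOCKS
    h2 = int.from_bytes(digest[4:], "little") % NUM_BLOCKS
    return h1, h2

def insert(key):
    """Place the key in the less-loaded of its two candidate blocks."""
    b1, b2 = _block_indices(key)
    target = b1 if len(blocks[b1]) <= len(blocks[b2]) else b2
    blocks[target].append(key)

for i in range(512):
    insert(f"item-{i}")
print("max block load:", max(map(len, blocks)),
      "avg:", sum(map(len, blocks)) / NUM_BLOCKS)
```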

A filter is adaptive if it achieves a false-positive rate of ε on each query independently of the answers to previous queries. Many popular filters such as Bloom filters are not adaptive: an adversary could repeat a false-positive query many times to drive the false-positive rate to 1. Bender et al. [4] formalized the definition of adaptivity and gave a provably adaptive filter, the broom filter. Mitzenmacher et al. [20] gave a filter that achieves a lower empirical false-positive rate by exploiting repetitions. We prove that an adaptive filter has a lower false-positive rate when the adversary is stochastic. Specifically, we analyze the broom filter against queries drawn from a Zipfian distribution. We validate our analysis empirically by showing that the broom filter achieves a low false-positive rate on both network traces and synthetic datasets, even when compared to a regular filter augmented with a cache for storing frequently queried items.
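A plain Bloom filter's non-adaptivity is easy to demonstrate: once a query happens to collide with set bits, replaying that same query is a false positive every time. The toy filter below is my own minimal sketch, not the broom filter.

```python
import hashlib

class TinyBloom:
    """Deliberately small Bloom filter so false positives are easy to find."""
    def __init__(self, m=64, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def query(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))

bf = TinyBloom()
for i in range(20):
    bf.add(f"stored-{i}")

# Find one false positive, then replay it: a non-adaptive filter keeps
# answering "present", so an adversary can drive its error rate toward 1.
fp = next(k for k in (f"absent-{j}" for j in range(10_000)) if bf.query(k))
print("replaying", fp, "->", [bf.query(fp) for _ in range(5)])
```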

In the parallel paging problem, there are p processors that share a cache of size k. The goal is to partition the cache among the processors over time in order to minimize their average completion time. For this longstanding open problem, we give tight upper and lower bounds of Θ(log p) on the competitive ratio with O(1) resource augmentation. A key idea in both our algorithms and lower bounds is to relate the problem of parallel paging to the seemingly unrelated problem of green paging. In green paging, there is an energy-optimized processor that can temporarily turn off one or more of its cache banks (thereby reducing power consumption), so that the cache size varies between a maximum size k and a minimum size k/p. The goal is to minimize the total energy consumed by the computation, which is proportional to the integral of the cache size over time. We show that any efficient solution to green paging can be converted into an efficient solution to parallel paging, and that any lower bound for green paging can be converted into a lower bound for parallel paging, in both cases in a black-box fashion. We then show that, with O(1) resource …
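Green paging's objective is simple to state: if the cache size at timestep t is k(t), the energy is proportional to the sum of k(t) over time, the discrete integral of cache size. A tiny sketch with an invented size schedule makes the objective concrete:

```python
def green_paging_energy(cache_sizes):
    """Energy proxy: sum of the cache size over unit timesteps
    (the discrete analogue of integrating cache size over time)."""
    return sum(cache_sizes)

k, p = 1024, 8
# Invented schedule: shrink the cache toward k/p when banks are turned off.
schedule = [k, k, k // 2, k // p, k // p, k // 2, k]
print("energy:", green_paging_energy(schedule))  # 4352
print("always-on baseline:", k * len(schedule))  # 7168
```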

In the unit-cost comparison model, a black box takes as input two items and outputs the result of the comparison. Problems like sorting and searching have been studied in this model, and it has been generalized to include the concept of priced information, where different pairs of items (say database records) have different comparison costs. These comparison costs can be arbitrary (in which case no algorithm can be close to optimal (Charikar et al. STOC 2000)), structured (for example, the comparison cost may depend on the length of the databases (Gupta et al. FOCS 2001)), or stochastic (Angelov et al. LATIN 2008). Motivated by the database setting where the cost depends on the sizes of the items, we consider the problems of sorting and batched predecessor where two nonuniform sets of items A and B are given as input. (1) In the RAM setting, we consider the scenario where both sets have n keys each. The cost to compare two items in A is a, to compare an item of A to an item of B is b, and to compare two items in B is c. We give upper and lower bounds for the case a ≤ …
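The cost model in (1) is easy to express in code: each comparison is charged a, b, or c according to which sets its operands come from. The sketch below instruments an ordinary two-pointer merge under that accounting (my own illustration, not the paper's algorithm); note that a plain merge pays the cross-set price b for every comparison, the kind of cost an algorithm might want to avoid when b is large.

```python
def charged_merge(A, B, a=1, b=3, c=2):
    """Merge sorted lists A and B, charging a for A-A comparisons,
    b for A-B comparisons, and c for B-B comparisons.

    A plain two-pointer merge only ever compares an A item with a B item,
    so every comparison costs b; this just makes the accounting visible.
    """
    out, cost = [], 0
    i = j = 0
    while i < len(A) and j < len(B):
        cost += b                 # cross comparison: one item from each set
        if A[i] <= B[j]:
            out.append(A[i])
            i += 1
        else:
            out.append(B[j])
            j += 1
    out.extend(A[i:])
    out.extend(B[j:])
    return out, cost

merged, cost = charged_merge([1, 4, 9], [2, 3, 5, 8])
print(merged, "total comparison cost:", cost)  # cost = 6 comparisons * b
```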

Stencil computations are widely used to simulate the change of state of physical systems across a multidimensional grid over multiple timesteps. The state-of-the-art techniques in this area fall into three groups: cache-aware tiled looping algorithms, cache-oblivious divide-and-conquer trapezoidal algorithms, and Krylov subspace methods. In this paper, we present two efficient parallel algorithms for performing linear stencil computations. Current direct solvers in this domain are computationally inefficient, and Krylov methods require manual labor and mathematical training. We solve these problems for linear stencils by using DFT preconditioning on a Krylov method to achieve a direct solver which is both fast and general. Indeed, while all currently available algorithms for solving general linear stencils perform Θ(NT) work, where N is the size of the spatial grid and T is the number of timesteps, our algorithms perform o(NT) work. To the best of our knowledge, we give the first algorithms that use fast Fourier transforms to compute final grid data by evolving the initial data for many timesteps at once. Our algorithms handle both periodic and aperiodic boundary conditions, and achieve polynomially better performance bounds (i.e., computational complexity and parallel runtime) than all other existing solutions. Initial experimental results show that implementations of …
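For periodic boundaries, the FFT idea can be seen directly: one linear stencil step is a circular convolution, so in Fourier space a step is a pointwise multiply and T steps raise the stencil's eigenvalues to the T-th power. The NumPy sketch below (my own 1-D heat-style example with invented coefficients, not the paper's algorithm) checks this against step-by-step iteration.

```python
import numpy as np

def evolve_periodic_stencil(u0, kernel, T):
    """Apply a linear periodic stencil T times with one FFT round-trip.

    One step is a circular convolution with `kernel`, so in Fourier space
    T steps are just the kernel's eigenvalues raised to the T-th power.
    """
    eig = np.fft.fft(kernel)
    return np.real(np.fft.ifft(np.fft.fft(u0) * eig ** T))

def evolve_directly(u0, kernel, T):
    """Reference: apply the stencil one timestep at a time."""
    u = u0.copy()
    for _ in range(T):
        u = sum(w * np.roll(u, shift) for shift, w in enumerate(kernel) if w != 0.0)
    return u

N, T = 64, 50
rng = np.random.default_rng(0)
u0 = rng.standard_normal(N)

# 3-point heat-style stencil: u[i] <- 0.5*u[i] + 0.25*u[i-1] + 0.25*u[i+1]
kernel = np.zeros(N)
kernel[0], kernel[1], kernel[-1] = 0.5, 0.25, 0.25

print(np.allclose(evolve_periodic_stencil(u0, kernel, T),
                  evolve_directly(u0, kernel, T)))   # True
```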

Making logical copies, or clones, of files and directories is critical to many real-world applications and workflows, including backups, virtual machines, and containers. An ideal clone implementation meets the following performance goals: (1) creating the clone has low latency; (2) reads are fast in all versions (i.e., spatial locality is always maintained, even after modifications); (3) writes are fast in all versions; (4) the overall system is space efficient. Implementing a clone operation that realizes all four properties, which we call a nimble clone, is a longstanding open problem. This article describes nimble clones in the Bε-tree File System (BetrFS), an open-source, full-path-indexed, and write-optimized file system. The key observation behind our work is that standard copy-on-write heuristics can be too coarse to be space efficient, or too fine-grained to preserve locality. On the other hand, a write-optimized key-value store, such as a Bε-tree or a log-structured merge-tree (LSM-tree), can decouple the logical application of updates from the granularity at which data is physically copied. In our write-optimized clone implementation, data sharing among clones is only broken when a clone has changed enough to warrant making a copy, a policy we call copy-on-abundant-write. We demonstrate that …
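Copy-on-abundant-write can be sketched at the key-value level: a clone starts out sharing its parent's data and buffers only a small delta, and it pays for one bulk copy only once that delta grows past a threshold. The toy below is a deliberately simplified in-memory illustration of the policy, not BetrFS's Bε-tree implementation.

```python
class CowAbundantStore:
    """Toy key-value namespace cloned with copy-on-abundant-write.

    A clone shares its parent's dict and buffers its own writes; only when
    the buffered delta exceeds `threshold` keys is the shared data copied.
    """
    def __init__(self, data=None, threshold=4):
        self.shared = data if data is not None else {}
        self.delta = {}
        self.threshold = threshold

    def clone(self):
        # Snapshot semantics: fold any pending delta into the shared map
        # (cheap while the delta is small), then share it with the clone.
        if self.delta:
            self.shared = {**self.shared, **self.delta}
            self.delta = {}
        return CowAbundantStore(self.shared, self.threshold)  # O(1) clone

    def get(self, key):
        return self.delta.get(key, self.shared.get(key))

    def put(self, key, value):
        self.delta[key] = value
        if len(self.delta) > self.threshold:              # "abundant" writes:
            self.shared = {**self.shared, **self.delta}   # pay for one copy
            self.delta = {}

base = CowAbundantStore()
for i in range(8):
    base.put(f"k{i}", i)

snap = base.clone()               # cheap: no data copied yet
snap.put("k0", "patched")
print(base.get("k0"), snap.get("k0"))   # 0 patched -- versions diverge lazily
```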