Have feedback or suggestions for a way to improve these results?
!
Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to non-federal websites. Their policies may differ from this site.
In this article, we show how a single function,join, can be used to implement parallelbalanced binary search trees(BSTs) simply and efficiently. Based onjoin, our approach applies to multiple balanced tree data structures, and a variety of functions for ordered sets and maps. We describe our technique as an algorithmic framework calledjoin-based algorithms. We show that thejoinfunction fully captures what is needed for rebalancing trees for a variety of tree algorithms, as long as the balancing scheme satisfies certain properties, which we refer to asjoinabletrees. We discuss four balancing schemes that are joinable: AVL trees, red-black trees, weight-balanced trees, and treaps. We present a variety of tree algorithms that apply to joinable trees, includinginsert,delete,union,intersection,difference,split,range,filter, and so on, most of them also parallel. These algorithms are generic across balancing schemes. Many algorithms are optimal in the comparison model, and we provide a general proof to show the efficiency in work for joinable trees. The algorithms are highly parallel, all with polylogarithmic span (parallel dependence). Specifically, the set-set operationsunion,intersection, anddifferencehave work\( O(m\log (\frac{n}{m}+1)) \)and polylogarithmic span for input set sizes\( n \)and\( m\le n \).
We implemented and tested our algorithms on the four balancing schemes. In general, all four schemes have quite similar performance, but the weight-balanced tree slightly outperforms the others. They have the same speedup characteristics, getting around 73\( \times \)speedup on 72 cores (144 hyperthreads). Experimental results also show that our implementation outperforms existing parallel implementations, and our sequential version achieves close or much better performance than the sequential merging algorithm in C++ Standard Template Library (STL) on various input sizes.
Anderson, Daniel; Blelloch, Guy E.; Wei, Yuanhao(
, Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation)
Safe memory reclamation (SMR) schemes are an essential
tool for lock-free data structures and concurrent programming.
However, manual SMR schemes are notoriously difficult
to apply correctly, and automatic schemes, such as
reference counting, have been argued for over a decade to
be too slow for practical purposes. A recent wave of work
has disproved this long-held notion and shown that reference
counting can be as scalable as hazard pointers, one of
the most common manual techniques. Despite these tremendous
improvements, there remains a gap of up to 2x or more
in performance between these schemes and faster manual
techniques such as epoch-based reclamation (EBR).
In this work, we first advance these ideas and show that in
many cases, automatic reference counting can in fact be as
fast as the fastest manual SMR techniques.We generalize our
previous algorithm called Concurrent Deferred Reference
Counting (CDRC) to obtain a method for converting any
standard manual SMR technique into an automatic reference
counting technique with a similar performance profile. Our
second contribution is extending this framework to support
weak pointers, which are reference-counted pointers that
automatically break pointer cycles by not contributing to
the reference count, thus addressing a common weakness in
reference-counted garbage collection.
Our experiments with a C++-library implementation show
that our automatic techniques perform in line with their manual
counterparts, and that our weak pointer implementation
outperforms the best known atomic weak pointer library
by up to an order of magnitude on high thread counts. All
together, we show that the ease of use of automatic memory
management can be achieved without significant cost to
practical performance or general applicability.
Dhulipala, Laxman; Blelloch, Guy E.; Gu, Yan; Sun, Yihan(
, ACM SIGPLAN International Conference on Programming Language Design and Implementation)
Ranjit Jhala and Isil Dillig
(Ed.)
Many modern programming languages are shifting toward
a functional style for collection interfaces such as sets, maps,
and sequences. Functional interfaces offer many advantages,
including being safe for parallelism and providing simple and
lightweight snapshots. However, existing high-performance
functional interfaces such as PAM, which are based on bal-
anced purely-functional trees, incur large space overheads
for large-scale data analysis due to storing every element in
a separate node in a tree.
This paper presents PaC-trees, a purely-functional data
structure supporting functional interfaces for sets, maps, and
sequences that provides a significant reduction in space over
existing approaches. A PaC-tree is a balanced binary search
tree which blocks the leaves and compresses the blocks us-
ing arrays. We provide novel techniques for compressing
and uncompressing the blocks which yield practical parallel
functional algorithms for a broad set of operations on PaC-
trees such as union, intersection, filter, reduction, and range
queries which are both theoretically and practically efficient.
Using PaC-trees we designed CPAM, a C++ library that im-
plements the full functionality of PAM, while offering signifi-
cant extra functionality for compression. CPAM consistently
matches or outperforms PAM on a set of microbenchmarks
on sets, maps, and sequences while using about a quarter
of the space. On applications including inverted indices, 2D
range queries, and 1D interval queries, CPAM is competitive
with or faster than PAM, while using 2.1ś7.8x less space.
For static and streaming graph processing, CPAM offers 1.6x
faster batch updates while using 1.3ś2.6x less space than the
state-of-the-art graph processing system Aspen.
Anderson, Daniel; Blelloch, Guy E.; Dhulipala, Laxman; Dobson, Magdalen; Sun, Yihan(
, Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming)
The Problem-Based Benchmark Suite (PBBS) is a set of benchmark
problems designed for comparing algorithms, implementations
and platforms. For each problem, the suite defines
the problem in terms of the input-output relationship, and
supplies a set of input instances along with input generators,
a default implementation, code for checking correctness or
accuracy, and a timing harness. The suite makes it possible
to compare different algorithms, platforms (e.g. GPU vs
CPU), and implementations using different programming
languages or libraries. The purpose is to better understand
how well a wide variety of problems parallelize, and what
techniques/algorithms are most effective.
The suite was first announced in 2012 with 14 benchmark
problems. Here we describe some significant updates. In particular,
we have added nine new benchmarks from a mix of
problems in text processing, computational geometry and
machine learning. We have further optimized the default
implementations; several are the fastest available for multicore
CPUs, often achieving near perfect speedup on the 72
core machine we test them on. The suite now also supplies
significantly larger default test instances, as well as a broader
variety, with many derived from real-world data.
Ben-David, Naama; Blelloch, Guy E.; Wei, Yuanhao(
, Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming)
This paper presents a new and practical approach to lock-free
locks based on helping, which allows the user to write
code using fine-grained locks, but run it in a lock-free manner.
Although lock-free locks have been suggested in the
past, they are widely viewed as impractical, have some key
limitations, and, as far as we know, have never been implemented.
The paper presents some key techniques that make
lock-free locks practical and more general. The most important
technique is an approach to idempotence—i.e. making
code that runs multiple times appear as if it ran once. The
idea is based on using a shared log among processes running
the same protected code. Importantly, the approach can
be library based, requiring very little if any change to standard
code—code just needs to use the idempotent versions
of memory operations (load, store, LL/SC, allocation, free).
We have implemented a C++ library called Flock based
on the ideas. Flock allows lock-based data structures to run
in either lock-free or blocking (traditional locks) mode. We
implemented a variety of tree and list-based data structures
with Flock and compare the performance of the lock-free
and blocking modes under a variety of workloads. The lock-free
mode is almost as fast as blocking mode under almost
all workloads, and significantly faster when threads are oversubscribed
(more threads than processors). We also compare
with several existing lock-based and lock-free alternatives.
Wei, Yuanhao; Ben-David, Naama; Friedman, Michal; Blelloch, Guy E.; Petrank, Erez(
, Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming)
Non-volatile random access memory (NVRAM) offers byte-addressable persistence at speeds comparable to DRAM. However, with caches remaining volatile, automatic cache evictions can reorder updates to memory, potentially leaving persistent memory in an inconsistent state upon a system crash. Flush and fence instructions can be used to force ordering among updates, but are expensive. This has motivated significant work studying how to write correct and efficient persistent programs for NVRAM.
In this paper, we present FliT, a C++ library that facilitates writing efficient persistent code. Using the library's default mode makes any linearizable data structure durable with minimal changes to the code. FliT avoids many redundant flush instructions by using a novel algorithm to track dirty cache lines. It also allows for extra optimizations, but achieves good performance even in its default setting.
To describe the FliT library's capabilities and guarantees, we define a persistent programming interface, called the P-V Interface, which FliT implements. The P-V Interface captures the expected behavior of code in which some instructions' effects are persisted and some are not. We show that the interface captures the desired semantics of many practical algorithms in the literature.
We apply the FliT library to four different persistent data structures, and show that across several workloads, persistence implementations, and data structure sizes, the FliT library always improves operation throughput, by at least 2.1X over a naive implementation in all but one workload.
Westrick, Sam; Rainey, Mike; Anderson, Daniel; Blelloch, Guy E.(
, ACM Symposium on Principles and Practice of Parallel Programming)
Jaejin Lee, Kunal Agrawal
(Ed.)
Programming languages using functions on collections of values, such as map, reduce, scan and filter, have been used for over fifty years. Such collections have proven to be particularly useful in the context of parallelism because such functions are naturally parallel. However, if implemented naively they lead to the generation of temporary intermediate collections that can significantly increase memory usage and runtime. To avoid this pitfall, many approaches use "fusion" to combine operations and avoid temporary results. However, most of these approaches involve significant changes to a compiler and are limited to a small set of functions, such as maps and reduces.
In this paper we present a library-based approach that fuses widely used operations such as scans, filters, and flattens. In conjunction with existing techniques, this covers most of the common operations on collections. Our approach is based on a novel technique which parallelizes over blocks, with streams within each block. We demonstrate the approach by implementing libraries targeting multicore parallelism in two languages: Parallel ML and C++, which have very different semantics and compilers. To help users understand when to use the approach, we define a cost semantics that indicates when fusion occurs and how it reduces memory allocations. We present experimental results for a dozen benchmarks that demonstrate significant reductions in both time and space. In most cases the approach generates code that is near optimal for the machines it is running on.
Blelloch, Guy E.; Dobson, Magdalen(
, 2022 Proceedings of the Symposium on Algorithm Engineering and Experiments (ALENEX))
We present a set of parallel algorithms for computing exact k-nearest neighbors in low dimensions. Many k-nearest neighbor algorithms use either a kd-tree or the Morton ordering of the point set; our algorithms combine these approaches using a data structure we call the zd-tree. We show that this combination is both theoretically efficient under common assumptions, and fast in practice. For point sets of size n with bounded expansion constant and bounded ratio, the zd-tree can be built in O(n) work with O(n^ε) span for constant ε < 1, and searching for the k-nearest neighbors of a point takes expected O(k log k) time. We benchmark our k-nearest neighbor algorithms against existing parallel k-nearest neighbor algorithms, showing that our implementations are generally faster than the state of the art as well as achieving 75x speedup on 144 hyperthreads. Furthermore, the zd-tree supports parallel batch-dynamic insertions and deletions; to our knowledge, it is the first k-nearest neighbor data structure to support such updates. On point sets with bounded expansion constant and bounded ratio, a batch-dynamic update of size k requires O(k log n/k) work with O(k^ε + polylog(n)) span.
Warning: Leaving National Science Foundation Website
You are now leaving the National Science Foundation website to go to a non-government website.
Website:
NSF takes no responsibility for and exercises no control over the views expressed or the accuracy of
the information contained on this site. Also be aware that NSF's privacy policy does not apply to this site.