We propose Clamor, a functional cluster computing framework that adds support for fine-grained, transparent access to global variables for distributed, data-parallel tasks. Clamor targets workloads that perform sparse accesses and updates within the bulk synchronous parallel execution model, a setting where the standard technique of broadcasting global variables is highly inefficient. Clamor implements a novel dynamic replication mechanism to enable efficient access to popular data regions on the fly, and tracks fine-grained dependencies to retain the lineage-based fault tolerance model of systems like Spark. Clamor can integrate with existing Rust and C++ libraries to transparently distribute programs on the cluster. We show that Clamor is competitive with Spark in simple functional workloads and can improve performance significantly compared to custom systems on workloads that sparsely access large global variables: from 5x for sparse logistic regression to over 100x on distributed geospatial queries.
Terminus: A Programmable Accelerator for Read and Update Operations on Sparse Data Structures
Sparse data structures like hash tables, trees, or compressed tensors are ubiquitous, but operations on these structures are expensive and inefficient on current systems. Prior work has proposed hardware acceleration for these operations, but these techniques have two key shortcomings: they limit the types of data structures they support, and they focus on reads but do not support fine-grained updates to these structures. We present Terminus, a programmable accelerator for read and update operations on sparse data structures. Terminus extends each general-purpose core with a programmable dataflow engine capable of accelerating a wide range of structures and operations. Terminus engines are flexible yet simple, as they focus on common operations and defer rare, complex ones to cores. Terminus features a simple concurrency control mechanism based on address ranges that enables safe updates while preserving parallelism. We evaluate Terminus on serial and parallel benchmarks on a wide range of sparse data structures. Terminus improves performance by a geometric mean of 7.4x over a CPU baseline, showing that Terminus can accelerate fine-grained reads and writes that prior accelerators for sparse structures could not support.
- Award ID(s):
- 2217099
- PAR ID:
- 10638827
- Publisher / Repository:
- IEEE
- Date Published:
- Page Range / eLocation ID:
- 1233 to 1246
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
- Many applications are designed to perform traversals on tree-like data structures. Fusing and parallelizing these traversals enhance the performance of applications. Fusing multiple traversals improves the locality of the application. The runtime of an application can be significantly reduced by extracting parallelism and utilizing multi-threading. Prior frameworks have tried to fuse and parallelize tree traversals using coarse-grained approaches, leading to missed fine-grained opportunities for improving performance. Other frameworks have successfully supported fine-grained fusion on heterogeneous tree types but fall short regarding parallelization. We introduce a new framework, Orchard, built on top of Grafter. Orchard's novelty lies in allowing the programmer to transform tree traversal applications by automatically applying fine-grained fusion and extracting heterogeneous parallelism. Orchard allows the programmer to write general tree traversal applications in a simple and elegant embedded Domain-Specific Language (eDSL). We show that the combination of fine-grained fusion and heterogeneous parallelism performs better than each alone when the conditions are met.
- Emerging sensing applications create an unprecedented need for energy efficiency in programmable processors. To achieve useful multi-year deployments on a small battery or energy harvester, these applications must avoid off-device communication and instead process most data locally. Recent work has shown coarse-grained reconfigurable arrays (CGRAs) to be a promising architecture for this domain. Unfortunately, nearly all prior CGRAs support only computations with simple control flow and no memory aliasing (e.g., affine inner loops), causing an Amdahl efficiency bottleneck as non-trivial fractions of programs must run on an inefficient von Neumann core. RipTide is a co-designed compiler and CGRA architecture that achieves both high programmability and extreme energy efficiency, eliminating this bottleneck. RipTide provides a rich set of control-flow operators that support arbitrary control flow and memory access on the CGRA fabric. RipTide implements these primitives without tagged tokens to save energy; this requires careful ordering analysis in the compiler to guarantee correctness. RipTide further saves energy and area by offloading most control operations into its programmable on-chip network, where they can re-use existing network switches. RipTide's compiler is implemented in LLVM, and its hardware is synthesized in Intel 22FFL. RipTide compiles applications written in C while saving 25% energy vs. the state-of-the-art energy-minimal CGRA and 6.6× energy vs. a von Neumann core.
- In this work, we investigate stochastic approximation (SA) with Markovian data and nonlinear updates under constant stepsize. Existing work has primarily focused on either i.i.d. data or linear update rules. We take a new perspective and carefully examine the simultaneous presence of Markovian dependency of data and nonlinear update rules, delineating how the interplay between these two structures leads to complications that are not captured by prior techniques. By leveraging the smoothness and recurrence properties of the SA updates, we develop a fine-grained analysis of the correlation between the SA iterates and Markovian data. This enables us to overcome the obstacles in existing analysis and establish for the first time the weak convergence of the joint process. Furthermore, we present a precise characterization of the asymptotic bias of the SA iterates. As a by-product of our analysis, we derive finite-time bounds on higher moments and present non-asymptotic geometric convergence rates for the iterates, along with a Central Limit Theorem.
- We develop a distributed-memory parallel algorithm for performing batch updates on streaming graphs, where vertices and edges are continuously added or removed. Our algorithm leverages distributed sparse matrices as the core data structures, utilizing equivalent sparse matrix operations to execute graph updates. By reducing unnecessary communication among processes and employing shared-memory parallelism, we accelerate updates of distributed graphs. Additionally, we maintain a balanced load in the output matrix by permuting the resultant matrix during the update process. We demonstrate that our streaming update algorithm is at least 25 times faster than alternative linear-algebraic methods and scales linearly up to 4,096 cores (32 nodes) on a Cray EX supercomputer.

