Search for: All records

Creators/Authors contains: "Dadu, Vidushi"

  1. Because of the importance of graph workloads and the limitations of CPUs/GPUs, many graph processing accelerators have been proposed. The basic approach of prior accelerators is to focus on a single graph algorithm variant (e.g., bulk-synchronous + slicing). While helpful for specialization, this leaves the performance potential of flexibility on the table and also complicates understanding of the relationships among graph types, workloads, algorithms, and specialization. In this work, we explore the value of flexibility in graph processing accelerators. First, we identify a taxonomy of key algorithm variants. Then we develop a template architecture (PolyGraph) that is flexible across these variants while being able to modularly integrate specialization features for each. Overall, we find that flexibility in graph acceleration is critical. If only one variant can be supported, asynchronous-updates/priority-vertex-scheduling/graph-slicing is the best design, achieving a 1.93× speedup over the best-performing prior accelerator, GraphPulse. However, static per-workload flexibility improves performance by a further 2.71×, and dynamic per-phase flexibility improves it by up to another 50%.
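    The update-visibility and scheduling axes of the taxonomy above can be made concrete in software. Below is a minimal C++ sketch contrasting a bulk-synchronous relaxation (updates become visible only at sweep boundaries) with an asynchronous, priority-scheduled one (updates are visible immediately and low-distance vertices go first), on a shortest-path-style workload. All names are illustrative; this is a sketch of the algorithm variants, not PolyGraph's hardware or interface.

    // Minimal sketch (C++17): two of the algorithm variants from the taxonomy,
    // shown on a shortest-path-style relaxation. Illustrative only.
    #include <cstdint>
    #include <functional>
    #include <limits>
    #include <queue>
    #include <utility>
    #include <vector>

    struct Edge { uint32_t dst; uint32_t weight; };
    using Graph = std::vector<std::vector<Edge>>;
    constexpr uint32_t INF = std::numeric_limits<uint32_t>::max();

    // Bulk-synchronous variant: updates from one sweep become visible only in
    // the next sweep (double-buffered distances). Caller seeds dist[src] = 0.
    void sssp_bulk_synchronous(const Graph& g, std::vector<uint32_t>& dist) {
        for (bool changed = true; changed; ) {
            changed = false;
            std::vector<uint32_t> next = dist;
            for (uint32_t v = 0; v < g.size(); ++v) {
                if (dist[v] == INF) continue;
                for (const Edge& e : g[v]) {
                    if (dist[v] + e.weight < next[e.dst]) {
                        next[e.dst] = dist[v] + e.weight;
                        changed = true;
                    }
                }
            }
            dist.swap(next);
        }
    }

    // Asynchronous variant with priority vertex scheduling: updates are
    // visible immediately, and the lowest-tentative-distance vertex is
    // processed first.
    void sssp_async_priority(const Graph& g, std::vector<uint32_t>& dist,
                             uint32_t src) {
        using Item = std::pair<uint32_t, uint32_t>;  // (distance, vertex)
        std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
        dist.assign(g.size(), INF);
        dist[src] = 0;
        pq.push({0, src});
        while (!pq.empty()) {
            auto [d, v] = pq.top();
            pq.pop();
            if (d != dist[v]) continue;  // skip stale queue entries
            for (const Edge& e : g[v]) {
                if (d + e.weight < dist[e.dst]) {
                    dist[e.dst] = d + e.weight;  // visible immediately
                    pq.push({dist[e.dst], e.dst});
                }
            }
        }
    }

    Graph slicing, the third axis in the variant name above, is orthogonal to the scheduling choice shown here: it partitions the graph so that each slice's working set fits in on-chip memory.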
  2. Dense linear algebra kernels are critical for wireless workloads, and the coming proliferation of 5G only amplifies their importance. Due to the inductive nature of many such algorithms, parallelism is difficult to exploit: parallel regions have fine-grain producer/consumer interactions with iteratively changing dependence distance, reuse rate, and memory access pattern. This causes high overhead both for multithreading, due to fine-grain synchronization, and for vectorization, due to non-rectangular iteration domains. CPUs, DSPs, and GPUs perform an order of magnitude below peak. Our insight is that if the nature of inductive dependences and memory accesses were explicit in the hardware/software interface, a spatial architecture could efficiently execute such parallel code regions. To this end, we first extend the traditional dataflow model with first-class primitives for inductive dependences and memory access patterns (streams). Second, we develop a hybrid spatial architecture combining systolic and dataflow execution to attain high utilization at low energy and area cost. Finally, we create a scalable design through a novel vector-stream control model that amortizes control overhead both in time and spatially across architecture lanes. We evaluate our design, REVEL, with a full stack (compiler, ISA, simulator, RTL). Across a suite of linear algebra kernels, REVEL outperforms equally provisioned DSPs by 4.6× to 37×. Compared to state-of-the-art spatial architectures, it is 3.4× faster on average. Compared to a set of ASICs, it requires only 2× the power and half the area.
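    To make the "inductive" problem above concrete, here is a minimal C++ sketch of a forward triangular solve, a kernel of the kind the abstract targets. The inner trip count grows with the outer iteration, so the dependence distance, reuse rate, and access pattern change every iteration, which is exactly what defeats fixed-shape vectorization. This is illustrative only and does not reflect REVEL's ISA or stream primitives.

    // Minimal sketch (C++17): an inductive kernel, the forward triangular
    // solve L * x = b for lower-triangular L (row-major, n x n).
    #include <cstddef>
    #include <vector>

    std::vector<double> forward_solve(const std::vector<double>& L,
                                      const std::vector<double>& b,
                                      std::size_t n) {
        std::vector<double> x(n);
        for (std::size_t i = 0; i < n; ++i) {
            double acc = b[i];
            // Non-rectangular inner region: the trip count is i, so it grows
            // with each outer iteration, and every x[j] read is a fine-grain
            // producer/consumer dependence on an earlier outer iteration at an
            // iteration-varying distance. Fixed-width SIMD fits this poorly.
            for (std::size_t j = 0; j < i; ++j) {
                acc -= L[i * n + j] * x[j];
            }
            x[i] = acc / L[i * n + i];
        }
        return x;
    }

    In a stream-style interface of the kind the abstract proposes, the inner reads of row i of L and of x would each be described to the hardware as a length-i linear stream, making the inductive pattern explicit rather than leaving it hidden in scalar address arithmetic.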
  3. With slowing technology scaling, specialized accelerators are increasingly attractive solutions for continuing the expected generational scaling of performance. However, to accelerate more advanced algorithms, or those from challenging domains, supporting data-dependence becomes necessary. This manifests as either data-dependent control (e.g., joining two sparse lists) or data-dependent memory accesses (e.g., hash-table accesses). These forms of data-dependence inherently couple compute with memory and preclude efficient vectorization, defeating the traditional mechanisms of programmable accelerators (e.g., GPUs). Our goal is to develop an accelerator that is broadly applicable across algorithms with and without data-dependence. To this end, we first identify forms of data-dependence that are both common and possible to exploit with specialized hardware: specifically, stream-join and alias-free indirection. Then we create an accelerator with an interface to support these, called the Sparse Processing Unit (SPU). SPU supports alias-free indirection with a compute-enabled scratchpad and aggressive stream reordering, and stream-join with a novel dataflow control model for a reconfigurable systolic compute fabric. Finally, we add robustness across datatypes by making the compute and memory pipelines decomposable. SPU achieves speedups of 16.5×, 10.3×, and 14.2× over a 24-core Skylake CPU on ML, database, and graph algorithms, respectively, and performance similar to domain-specific accelerators. For ML, SPU achieves a 1.8–7× speedup over a similarly provisioned GPGPU, with much less area and power.
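    The two forms of data-dependence named above have simple software analogues. The C++ sketch below shows a stream-join (a sparse-sparse dot product whose control flow depends on the coordinate values just read) and an alias-free indirect update (a scatter whose index stream is guaranteed conflict-free, so the updates need no mutual ordering). Names are illustrative; this is not SPU's programming interface.

    // Minimal sketch (C++17): software analogues of stream-join and
    // alias-free indirection. Illustrative only.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct SparseVec {
        std::vector<uint32_t> idx;  // sorted, unique coordinates
        std::vector<double>   val;  // val[k] pairs with idx[k]
    };

    // Stream-join: which input stream advances depends on the coordinate
    // values just read, so control is data-dependent and compute is coupled
    // with memory (here, a sparse-sparse dot product).
    double sparse_dot(const SparseVec& a, const SparseVec& b) {
        double acc = 0.0;
        std::size_t i = 0, j = 0;
        while (i < a.idx.size() && j < b.idx.size()) {
            if (a.idx[i] < b.idx[j]) {
                ++i;
            } else if (a.idx[i] > b.idx[j]) {
                ++j;
            } else {
                acc += a.val[i] * b.val[j];
                ++i;
                ++j;
            }
        }
        return acc;
    }

    // Alias-free indirection: an indirect update whose index stream is known
    // to be conflict-free (indices pairwise distinct), so no synchronization
    // is needed between updates.
    void scale_selected(std::vector<double>& table,
                        const std::vector<uint32_t>& indices, double s) {
        for (uint32_t k : indices) {
            table[k] *= s;  // assumption: no two k values alias
        }
    }

    A hash-table probe is the same indirection idiom with a computed rather than sorted index stream, which is why the abstract groups it under data-dependent memory access.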