NSF PAR Search | NSF Public Access Repository

ReACT: Redundancy-Aware Code Generation for Tensor Expressions

https://doi.org/10.1145/3559009.3569685

Zhou, Tong; Tian, Ruiqin; Ashraf, Rizwan A.; Gioiosa, Roberto; Kestor, Gokcen; Sarkar, Vivek (October 2022, ACM)
GPU Subwarp Interleaving

https://doi.org/10.1109/HPCA53966.2022.00090

Damani, Sana; Stephenson, Mark; Rangan, Ram; Johnson, Daniel; Kulkarni, Rishkul; Keckler, Stephen W. (April 2022, IEEE)
Memory access scheduling to reduce thread migrations

https://doi.org/10.1145/3497776.3517768

Damani, Sana; Barua, Prithayan; Sarkar, Vivek (March 2022, ACM)
Task-graph scheduling extensions for efficient synchronization and communication

https://doi.org/10.1145/3447818.3461616

Bak, Seonmyeong; Hernandez, Oscar; Gates, Mark; Luszczek, Piotr; Sarkar, Vivek (June 2021, 35th ACM International Conference on Supercomputing (ICS))

Task graphs have been studied for decades as a foundation for scheduling irregular parallel applications and incorporated in many programming models including OpenMP. While many high-performance parallel libraries are based on task graphs, they also have additional scheduling requirements, such as synchronization within inner levels of data parallelism and internal blocking communications. In this paper, we extend task-graph scheduling to support efficient synchronization and communication within tasks. Compared to past work, our scheduler avoids deadlock and oversubscription of worker threads, and refines victim selection to increase the overlap of sibling tasks. To the best of our knowledge, our approach is the first to combine gang-scheduling and work-stealing in a single runtime. Our approach has been evaluated on the SLATE high-performance linear algebra library. Relative to the LLVM OMP runtime, our runtime demonstrates performance improvements of up to 13.82%, 15.2%, and 36.94% for LU, QR, and Cholesky, respectively, evaluated across different configurations related to matrix size, number of nodes, and use of CPUs vs GPUs.
Full Text Available
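
The work-stealing half of the scheduler described in the abstract above can be illustrated with a minimal, single-threaded sketch. It assumes the classic deque discipline (the owner pushes and pops at one end, a thief steals from the other); the class and method names are hypothetical, and the sketch does not model gang-scheduling, victim selection, or the paper's actual runtime.

```python
from collections import deque


class WorkerDeque:
    """Hypothetical per-worker task deque for classic work stealing."""

    def __init__(self):
        self._tasks = deque()

    def push(self, task):
        # The owner adds newly spawned tasks at the bottom of the deque.
        self._tasks.append(task)

    def pop(self):
        # The owner takes the most recently pushed task (LIFO order).
        return self._tasks.pop() if self._tasks else None

    def steal(self):
        # A thief takes the oldest task from the top (FIFO order).
        return self._tasks.popleft() if self._tasks else None


victim = WorkerDeque()
for task in ("t1", "t2", "t3"):
    victim.push(task)

own_task = victim.pop()       # owner runs the newest task
stolen_task = victim.steal()  # an idle worker steals the oldest task
```

Taking from opposite ends keeps the owner working on recently created (cache-warm) tasks while thieves remove older tasks, which in a task graph tend to represent larger subtrees of work.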
Vyasa: A High-Performance Vectorizing Compiler for Tensor Convolutions on the Xilinx AI Engine

Chatarasi, Prasanth; Neuendorffer, Stephen; Bayliss, Samuel; Vissers, Kees; Sarkar, Vivek (September 2020, 2020 IEEE High Performance Extreme Computing Virtual Conference)
Xilinx’s AI Engine is a recent industry example of energy-efficient vector processing that includes novel support for 2D SIMD datapaths and a shuffle interconnection network. The current approach to programming the AI Engine relies on a C/C++ API for vector intrinsics. While an advance over assembly-level programming, it requires the programmer to specify a number of low-level operations based on detailed knowledge of the hardware. To address these challenges, we introduce Vyasa, a new programming system that extends the Halide DSL compiler to automatically generate code for the AI Engine. We evaluated Vyasa on 36 CONV2D workloads, and achieved geometric means of 7.6 and 24.2 MACs/cycle for 32-bit and 16-bit operands (which represent 95.9% and 75.6% of the peak performance, respectively).
Full Text Available
OmpMemOpt: Optimized Memory Movement for Heterogeneous Computing

https://doi.org/10.1007/978-3-030-57675-2_13

Barua, Prithayan; Zhao, Jisheng; Sarkar, Vivek (August 2020, European Conference on Parallel Processing (Euro-Par 2020))
The fast development of acceleration architectures and applications has made heterogeneous computing the norm for high-performance computing. The cost of high volume data movement to the accelerators is an important bottleneck both in terms of application performance and developer productivity. Memory management is still a manual task performed tediously by expert programmers. In this paper, we develop a compiler analysis to automate memory management for heterogeneous computing. We propose an optimization framework that casts the problem of detection and removal of redundant data movements into a partial redundancy elimination (PRE) problem and applies the lazy code motion technique to optimize these data movements. We chose OpenMP as the underlying parallel programming model and implemented our optimization framework in the LLVM toolchain. We evaluated it with ten benchmarks and obtained a geometric speedup of 2.3×, and reduced on average 50% of the total bytes transferred between the host and GPU.
Full Text Available
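
The idea of detecting redundant host-to-device transfers, as described in the OmpMemOpt abstract above, can be sketched on a toy example. This is a deliberately simplified availability check over a straight-line list of operations, not the paper's PRE formulation or lazy code motion, and it does not touch LLVM or OpenMP. The operation names ("to_device", "host_write", "kernel") and the function name are assumptions made for illustration.

```python
def remove_redundant_transfers(ops):
    """Drop host-to-device copies whose data is already valid on the device.

    ops is a list of (kind, var) tuples with kind in
    {"to_device", "host_write", "kernel"} (hypothetical operation names).
    """
    valid_on_device = set()  # variables whose device copy is up to date
    kept = []
    for kind, var in ops:
        if kind == "to_device":
            if var in valid_on_device:
                continue  # the device copy is still valid: redundant transfer
            valid_on_device.add(var)
        elif kind == "host_write":
            # A host-side write makes the device copy stale.
            valid_on_device.discard(var)
        kept.append((kind, var))
    return kept


trace = [
    ("to_device", "a"),
    ("kernel", "a"),
    ("to_device", "a"),   # redundant: "a" was not modified on the host
    ("host_write", "a"),
    ("to_device", "a"),   # required: the host copy changed
]
optimized = remove_redundant_transfers(trace)
```

A real compiler analysis must also handle branches, loops, and device-side writes; those cases are what motivate a dataflow formulation such as PRE rather than a single forward scan.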
MAESTRO: A Data-Centric Approach to Understand Reuse, Performance, and Hardware Cost of DNN Mappings

https://doi.org/10.1109/MM.2020.2985963

Kwon, Hyoukjun; Chatarasi, Prasanth; Sarkar, Vivek; Krishna, Tushar; Pellauer, Michael; Parashar, Angshuman (May 2020, IEEE Micro)
Full Text Available
Experimental Insights from the Rogues Gallery

https://doi.org/10.1109/ICRC.2019.8914707

Young, Jeffrey S.; Riedy, Jason; Conte, Thomas M.; Sarkar, Vivek; Chatarasi, Prasanth; Srikanth, Sriseshan (November 2019, IEEE International Conference on Rebooting Computing (ICRC 2019))
The Rogues Gallery is a new deployment for understanding next-generation hardware with a focus on unorthodox and uncommon technologies. This testbed project was initiated in 2017 in response to Rebooting Computing efforts and initiatives. The Gallery's focus is to acquire new and unique hardware (the rogues) from vendors, research labs, and start-ups and to make this hardware widely available to students, faculty, and industry collaborators within a managed data center environment. By exposing students and researchers to this set of unique hardware, we hope to foster cross-cutting discussions about hardware designs that will drive future performance improvements in computing long after the Moore's Law era of cheap transistors ends. We have defined an initial vision of the infrastructure and driving engineering challenges for such a testbed in a separate document, so here we present highlights of the first one to two years of post-Moore era research with the Rogues Gallery and give an indication of where we see future growth for this testbed and related efforts.
Full Text Available
Understanding Reuse, Performance, and Hardware Cost of DNN Dataflow: A Data-Centric Approach

https://doi.org/10.1145/3352460.3358252

Kwon, Hyoukjun; Chatarasi, Prasanth; Pellauer, Michael; Parashar, Angshuman; Sarkar, Vivek; Krishna, Tushar (October 2019, MICRO '52: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture)
The data partitioning and scheduling strategies used by DNN accelerators to leverage reuse and perform staging are known as dataflow, which directly impacts the performance and energy efficiency of DNN accelerators. An accelerator microarchitecture dictates the dataflow(s) that can be employed to execute layers in a DNN. Selecting a dataflow for a layer can have a large impact on utilization and energy efficiency, but there is a lack of understanding of the choices and consequences of dataflow, and of tools and methodologies to help architects explore the co-optimization design space. In this work, we first introduce a set of data-centric directives to concisely specify the DNN dataflow space in a compiler-friendly form. We then show how these directives can be analyzed to infer various forms of reuse and to exploit them using hardware capabilities. We codify this analysis into an analytical cost model, MAESTRO (Modeling Accelerator Efficiency via Spatio-Temporal Reuse and Occupancy), that estimates various cost-benefit tradeoffs of a dataflow including execution time and energy efficiency for a DNN model and hardware configuration. We demonstrate the use of MAESTRO to drive a hardware design space exploration experiment, which searches across 480M designs to identify 2.5M valid designs at an average rate of 0.17M designs per second, including Pareto-optimal throughput- and energy-optimized design points.
Full Text Available
A Preliminary Study of Compiler Transformations for Graph Applications on the Emu System

Chatarasi, Prasanth; Sarkar, Vivek (January 2018, MCHPC'18: Proceedings of the Workshop on Memory Centric High Performance Computing)

Unlike dense linear algebra applications, graph applications typically suffer from poor performance because of 1) inefficient utilization of memory systems through random memory accesses to graph data, and 2) overhead of executing atomic operations. Hence, there has been rapid growth in efforts to improve both software and hardware platforms to address these challenges. One such improvement in the hardware platform is the realization of the Emu system, a thread-migratory, near-memory processor. In the Emu system, a thread responsible for computation on a datum is automatically migrated to the node where the data resides without any intervention from the programmer. The idea of thread migration is very well suited to graph applications because their memory accesses are irregular. However, thread migrations can hurt the performance of graph applications if the overhead of the migrations outweighs the benefits achieved through them. In this preliminary study, we explore two high-level compiler optimizations, i.e., loop fusion and edge flipping, and one low-level compiler transformation leveraging hardware support for remote atomic updates to address overheads arising from thread migration, creation, synchronization, and atomic operations. We performed a preliminary evaluation of these compiler transformations by manually applying them to three graph applications (Conductance, Bellman-Ford's algorithm for the single-source shortest path problem, and Triangle Counting) over a set of RMAT graphs from Graph500. Our evaluation targeted a single node of the Emu hardware prototype, and has shown an overall geometric mean reduction of 22.08% in thread migrations.
Full Text Available
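
Loop fusion, one of the two high-level optimizations named in the abstract above, can be illustrated on a toy edge list. The sketch below is generic loop fusion in plain Python, not the paper's transformation on Emu hardware; the function names, the per-vertex quantities, and the sample graph are all assumptions chosen for illustration. The intuition is that a single traversal touches each edge once, so a migrating thread would visit the data's location once instead of twice.

```python
def unfused(edges, weight, num_vertices):
    """Two separate traversals of the edge list (hypothetical example)."""
    degree = [0] * num_vertices
    for u, _v in edges:              # loop 1: out-degree per source vertex
        degree[u] += 1
    total = [0] * num_vertices
    for u, v in edges:               # loop 2: outgoing weight per source vertex
        total[u] += weight[(u, v)]
    return degree, total


def fused(edges, weight, num_vertices):
    """The same computation after fusing both loops into one traversal."""
    degree = [0] * num_vertices
    total = [0] * num_vertices
    for u, v in edges:               # each edge is visited exactly once
        degree[u] += 1
        total[u] += weight[(u, v)]
    return degree, total


edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
weight = {(0, 1): 5, (0, 2): 3, (1, 2): 2, (2, 0): 7}
result_unfused = unfused(edges, weight, 3)
result_fused = fused(edges, weight, 3)
```

Fusion is legal here because the two loop bodies write to different arrays and neither reads what the other writes; a compiler has to establish that kind of independence before applying the transformation.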
