-
Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores and programmable logic with AI Engine processors optimized for AI/ML. An array of 400 AI Engine processors executing at 1 GHz can provide up to 6.4 TFLOPS of performance for 32-bit floating-point (FP32) data. However, machine learning models often contain both large and small MM operations. While large MM operations can be parallelized efficiently across many cores, small MM operations typically cannot. We observe that executing some small MM layers from the BERT natural language processing model on a large, monolithic MM accelerator in Versal ACAP achieved less than 5% of the theoretical peak performance. Therefore, one key question arises: How can we design accelerators to fully use the abundant computation resources under limited communication bandwidth for end-to-end applications with multiple MM layers of diverse sizes? We identify the biggest system throughput bottleneck as the mismatch between the massive computation resources of one monolithic accelerator and the various small MM layers in the application. To resolve this problem, we propose the CHARM framework to compose multiple diverse MM accelerator architectures working concurrently on different layers within one application. CHARM includes analytical models which guide design space exploration to determine accelerator partitions and layer scheduling. To facilitate system designs, CHARM automatically generates code, enabling thorough onboard design verification. We deploy the CHARM framework on four different deep learning applications in FP32, INT16, and INT8 data types, including BERT, ViT, NCF, and MLP, on the AMD/Xilinx Versal ACAP VCK190 evaluation board. Our experiments show that we achieve 1.46 TFLOPS, 1.61 TFLOPS, 1.74 TFLOPS, and 2.94 TFLOPS inference throughput for BERT, ViT, NCF, and MLP in FP32 data type, respectively, which obtain 5.29×, 32.51×, 1.00×, and 1.00× throughput gains compared to one monolithic accelerator. CHARM achieves a maximum throughput of 1.91 TOPS, 1.18 TOPS, 4.06 TOPS, and 5.81 TOPS in the INT16 data type for the four applications. The maximum throughput achieved by CHARM in the INT8 data type is 3.65 TOPS, 1.28 TOPS, 10.19 TOPS, and 21.58 TOPS, respectively. We have open-sourced our tools, including detailed step-by-step guides to reproduce all the results presented in this paper and to enable other users to learn and leverage the CHARM framework and tools in their end-to-end systems: https://github.com/arc-research-lab/CHARM.
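The under-utilization argument above is easy to reproduce with a back-of-the-envelope model. The sketch below (ours, not CHARM's published analytical model) computes the 6.4 TFLOPS peak and the fraction of a monolithic accelerator's tile grid that a given MM can keep busy; the 8×10×5 AIE grid and 32×32×32 per-AIE tile are illustrative assumptions, not the paper's configuration.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  // Peak: 400 AIEs x 1 GHz x 16 FLOPs/cycle (8 fp32 MACs) = 6.4 TFLOPS.
  const double peak_tflops = 400 * 1.0 * 16 / 1e3;
  std::printf("theoretical peak: %.1f TFLOPS\n", peak_tflops);

  // Illustrative monolithic tiling: an 8 x 10 x 5 grid of AIEs (400 total),
  // each covering a 32 x 32 x 32 sub-MM. These tile sizes are assumptions.
  const uint64_t TM = 8 * 32, TK = 10 * 32, TN = 5 * 32;
  auto ceil_mul = [](uint64_t x, uint64_t t) { return (x + t - 1) / t * t; };
  auto util = [&](uint64_t M, uint64_t K, uint64_t N) {
    // Each dimension is padded up to the tile grid; the remainder is wasted.
    return double(M * K * N) /
           (double(ceil_mul(M, TM)) * ceil_mul(K, TK) * ceil_mul(N, TN));
  };
  std::printf("small MM   (64 x   64 x  512): %4.1f%% busy\n",
              100 * util(64, 64, 512));
  std::printf("large MM (3072 x 1024 x 4096): %4.1f%% busy\n",
              100 * util(3072, 1024, 4096));
  return 0;
}
```

With these assumed tiles, the small BERT-like layer lands around 4% utilization, matching the sub-5% regime the abstract describes, while the large layer keeps most of the array busy.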
-
In this article, we propose TAPA, an end-to-end framework that compiles a C++ task-parallel dataflow program into a high-frequency FPGA accelerator. Compared to existing solutions, TAPA has two major advantages. First, TAPA provides a set of convenient APIs that allows users to easily express flexible and complex inter-task communication structures. Second, TAPA adopts a coarse-grained floorplanning step during HLS compilation for accurate pipelining of potential critical paths. In addition, TAPA implements several optimization techniques specifically tailored for modern HBM-based FPGAs. In our experiments with a total of 43 designs, we improve the average frequency from 147 MHz to 297 MHz (a 102% improvement) with no loss of throughput and a negligible change in resource utilization. Notably, in 16 experiments, we make the originally unroutable designs achieve 274 MHz on average. The framework is available at https://github.com/UCLA-VAST/tapa and the core floorplan module is available at https://github.com/UCLA-VAST/AutoBridge
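As a flavor of those APIs, below is a minimal task-parallel vector-add in the style of TAPA's getting-started example: leaf tasks communicate through tapa::stream channels, and tapa::task().invoke(...) composes them into a dataflow graph. Treat it as a sketch against the API as we understand it from the project's documentation; check the repository for the current signatures.

```cpp
#include <cstdint>
#include <tapa.h>

// Leaf task: element-wise add over two input streams.
void Add(tapa::istream<float>& a, tapa::istream<float>& b,
         tapa::ostream<float>& c, uint64_t n) {
  for (uint64_t i = 0; i < n; ++i) c << (a.read() + b.read());
}

// Leaf tasks: move data between off-chip memory and on-chip streams.
void Mmap2Stream(tapa::mmap<const float> mem, uint64_t n,
                 tapa::ostream<float>& s) {
  for (uint64_t i = 0; i < n; ++i) s << mem[i];
}
void Stream2Mmap(tapa::istream<float>& s, tapa::mmap<float> mem, uint64_t n) {
  for (uint64_t i = 0; i < n; ++i) s >> mem[i];
}

// Upper-level task: instantiate channels and wire up the dataflow graph.
void VecAdd(tapa::mmap<const float> a, tapa::mmap<const float> b,
            tapa::mmap<float> c, uint64_t n) {
  tapa::stream<float> a_q("a"), b_q("b"), c_q("c");
  tapa::task()
      .invoke(Mmap2Stream, a, n, a_q)
      .invoke(Mmap2Stream, b, n, b_q)
      .invoke(Add, a_q, b_q, c_q, n)
      .invoke(Stream2Mmap, c_q, c, n);
}
```

The communication structure (who talks to whom, through which channel) is explicit in the upper-level task, which is what gives the compiler the coarse-grained view it needs for floorplanning.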
-
Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores and programmable logic (PL) with AI Engine (AIE) processors optimized for AI/ML. An array of 400 AI Engine processors executing at 1 GHz can theoretically provide up to 6.4 TFLOPS of performance for 32-bit floating-point (fp32) data. However, machine learning models often contain both large and small MM operations. While large MM operations can be parallelized efficiently across many cores, small MM operations typically cannot. In our investigation, we observe that executing some small MM layers from the BERT natural language processing model on a large, monolithic MM accelerator in Versal ACAP achieved less than 5% of the theoretical peak performance. Therefore, one key question arises: How can we design accelerators to fully use the abundant computation resources under limited communication bandwidth for end-to-end applications with multiple MM layers of diverse sizes? We identify the biggest system throughput bottleneck as the mismatch between the massive computation resources of one monolithic accelerator and the various small MM layers in the application. To resolve this problem, we propose the CHARM framework to compose multiple diverse MM accelerator architectures working concurrently on different layers within one application. CHARM includes analytical models which guide design space exploration to determine accelerator partitions and layer scheduling. To facilitate the system designs, CHARM automatically generates code, enabling thorough onboard design verification. We deploy the CHARM framework for four different deep learning applications, including BERT, ViT, NCF, and MLP, on the AMD/Xilinx Versal ACAP VCK190 evaluation board. Our experiments show that we achieve 1.46 TFLOPS, 1.61 TFLOPS, 1.74 TFLOPS, and 2.94 TFLOPS inference throughput for BERT, ViT, NCF, and MLP, respectively, which obtain 5.40×, 32.51×, 1.00×, and 1.00× throughput gains compared to one monolithic accelerator.
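To make the "compose and schedule" idea concrete, here is a toy greedy scheduler (our illustration only; CHARM's actual design space exploration is driven by its analytical models and is far more sophisticated) that assigns each MM layer to whichever of two concurrently running accelerator partitions would finish it earliest. The accelerator rates and per-layer FLOP counts are made-up placeholders, and inter-layer dependencies and data movement are ignored.

```cpp
#include <cstdio>
#include <vector>

struct Acc { double tflops; double busy_us = 0.0; };
struct Layer { const char* name; double gflop; };

int main() {
  // Two diverse accelerator partitions: one big, one small (made-up rates).
  Acc accs[2] = {{4.0}, {0.4}};
  // A BERT-like mix of large FFN MMs and tiny per-head attention MMs
  // (FLOP counts are illustrative only).
  std::vector<Layer> layers = {
      {"ffn0", 2.4}, {"attn_qk", 0.07}, {"attn_v", 0.07}, {"ffn1", 2.4}};
  for (const auto& l : layers) {
    int best = 0;
    double best_end = 1e300;
    for (int i = 0; i < 2; ++i) {
      // Finish time = when this accelerator frees up + the layer's runtime.
      double end = accs[i].busy_us + l.gflop / accs[i].tflops * 1e3;  // us
      if (end < best_end) { best_end = end; best = i; }
    }
    accs[best].busy_us = best_end;
    std::printf("%-8s -> acc%d (done at %7.1f us)\n", l.name, best, best_end);
  }
  return 0;
}
```

Even this crude policy sends the small attention MMs to the small partition, which would otherwise sit idle while the big monolithic design grinds through them at a few percent utilization.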
-
Designs generated by high-level synthesis (HLS) tools typically achieve a lower frequency compared to manual RTL designs. In this work, we study the timing issues in a diverse set of realistic and complex FPGA HLS designs. (1) We observe that in almost all cases the frequency degradation is caused by the broadcast structures generated by the HLS compiler. (2) We classify three major types of broadcasts in HLS-generated designs, including high-fanout data signals, pipeline flow-control signals, and synchronization signals for concurrent modules. (3) We reveal a number of limitations of the current HLS tools that result in those broadcast-related timing issues. (4) We propose a set of effective yet easy-to-implement approaches, including broadcast-aware scheduling, synchronization pruning, and skid-buffer-based flow control. Our experimental results show that our methods can improve the maximum frequency of a set of nine representative HLS benchmarks by 53% on average. In some cases, the frequency gain is more than 100 MHz.
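To illustrate the last of those techniques, here is one way the skid-buffer idea can be approximated at the HLS C++ level (our sketch, not the paper's implementation, which operates inside the HLS compiler itself): a small relay stage with a shallow FIFO re-registers the back-pressure handshake on a long link, instead of letting a single stall signal fan out combinationally across the whole path.

```cpp
#include <cstdint>
#include <hls_stream.h>

// Relay stage: blocking read/write re-register the handshake locally.
template <typename T>
void relay_stage(hls::stream<T>& in, hls::stream<T>& out, uint64_t n) {
  for (uint64_t i = 0; i < n; ++i) {
#pragma HLS pipeline II=1
    out.write(in.read());
  }
}

// A long producer->consumer link broken by a relay; the depth-2 stream
// lets a word or two "skid" into the buffer while downstream stalls, so
// the stall need not propagate combinationally across the whole link.
void link(hls::stream<int>& from_producer, hls::stream<int>& to_consumer,
          uint64_t n) {
#pragma HLS dataflow
  hls::stream<int> mid("mid");
#pragma HLS stream variable=mid depth=2
  relay_stage(from_producer, mid, n);
  relay_stage(mid, to_consumer, n);
}
```

The trade-off is one extra cycle of latency per relay in exchange for a registered, local ready/valid pair, which is exactly the kind of transformation that turns a frequency-limiting broadcast into a pipelineable path.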