NSF PAR Search | NSF Public Access Repository

AutoAI2C: An Automated Hardware Generator for DNN Acceleration on Both FPGA and ASIC

https://doi.org/10.1109/TCAD.2024.3393428

Zhang, Yongan; Zhang, Xiaofan; Xu, Pengfei; Zhao, Yang; Hao, Cong; Chen, Deming; Lin, Yingyan Celine (October 2024, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)

Full Text Available
CHARM 2.0: Composing Heterogeneous Accelerators for Deep Learning on Versal ACAP Architecture

https://doi.org/10.1145/3686163

Zhuang, Jinming; Lau, Jason; Ye, Hanchen; Yang, Zhuoping; Ji, Shixin; Lo, Jack; Denolf, Kristof; Neuendorffer, Stephen; Jones, Alex; Hu, Jingtong; et al (August 2024, ACM Transactions on Reconfigurable Technology and Systems)

Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores and programmable logic with AI Engine processors optimized for AI/ML. An array of 400 AI Engine processors executing at 1 GHz can provide up to 6.4 TFLOPS performance for 32-bit floating-point (FP32) data. However, machine learning models often contain both large and small MM operations. While large MM operations can be parallelized efficiently across many cores, small MM operations typically cannot. We observe that executing some small MM layers from the BERT natural language processing model on a large, monolithic MM accelerator in Versal ACAP achieved less than 5% of the theoretical peak performance. Therefore, one key question arises: How can we design accelerators to fully use the abundant computation resources under limited communication bandwidth for end-to-end applications with multiple MM layers of diverse sizes? We identify the biggest system throughput bottleneck resulting from the mismatch between massive computation resources of one monolithic accelerator and the various MM layers of small sizes in the application. To resolve this problem, we propose the CHARM framework to compose multiple diverse MM accelerator architectures working concurrently on different layers within one application. CHARM includes analytical models which guide design space exploration to determine accelerator partitions and layer scheduling. To facilitate system designs, CHARM automatically generates code, enabling thorough onboard design verification.
We deploy the CHARM framework on four different deep learning applications in FP32, INT16, and INT8 data types, including BERT, ViT, NCF, and MLP, on the AMD/Xilinx Versal ACAP VCK190 evaluation board. Our experiments show that we achieve 1.46 TFLOPS, 1.61 TFLOPS, 1.74 TFLOPS, and 2.94 TFLOPS inference throughput for BERT, ViT, NCF, and MLP in FP32 data type, respectively, which obtain 5.29×, 32.51×, 1.00×, and 1.00× throughput gains compared to one monolithic accelerator. CHARM achieves the maximum throughput of 1.91 TOPS, 1.18 TOPS, 4.06 TOPS, and 5.81 TOPS in the INT16 data type for the four applications. The maximum throughput achieved by CHARM in the INT8 data type is 3.65 TOPS, 1.28 TOPS, 10.19 TOPS, and 21.58 TOPS, respectively. We have open-sourced our tools, including detailed step-by-step guides to reproduce all the results presented in this paper and to enable other users to learn and leverage the CHARM framework and tools in their end-to-end systems: https://github.com/arc-research-lab/CHARM.
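As a rough check on the figures quoted above, the following Python sketch derives the per-engine throughput implied by the stated peak (400 AI Engines at 1 GHz giving 6.4 TFLOPS) and the fraction of that peak that each reported FP32 result represents. The function and variable names are illustrative and are not taken from the CHARM tools; only the numbers come from the abstract.

```python
# Figures quoted in the abstract (FP32 on the VCK190); names are illustrative.
NUM_AIE = 400            # AI Engine processors in the array
CLOCK_HZ = 1e9           # 1 GHz
PEAK_TFLOPS = 6.4        # stated theoretical FP32 peak

REPORTED_TFLOPS = {"BERT": 1.46, "ViT": 1.61, "NCF": 1.74, "MLP": 2.94}


def flops_per_engine_per_cycle(peak_tflops, num_engines, clock_hz):
    """FP32 operations each engine must complete per cycle to reach the peak."""
    return peak_tflops * 1e12 / (num_engines * clock_hz)


def fraction_of_peak(achieved_tflops, peak_tflops=PEAK_TFLOPS):
    """Share of the theoretical peak that a measured throughput represents."""
    return achieved_tflops / peak_tflops


if __name__ == "__main__":
    per_cycle = flops_per_engine_per_cycle(PEAK_TFLOPS, NUM_AIE, CLOCK_HZ)
    print(f"implied FP32 ops per engine per cycle: {per_cycle:.1f}")
    for model, tflops in REPORTED_TFLOPS.items():
        print(f"{model}: {fraction_of_peak(tflops):.1%} of peak")
```

Read this way, even the best FP32 result reported here stays below half of the theoretical peak, which is consistent with the abstract's emphasis on limited communication bandwidth.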
Full Text Available
AutoScaleDSE: A Scalable Design Space Exploration Engine for High-Level Synthesis

https://doi.org/10.1145/3572959

Jun, Hyegang; Ye, Hanchen; Jeong, Hyunmin; Chen, Deming (September 2023, ACM Transactions on Reconfigurable Technology and Systems)

High-Level Synthesis (HLS) has enabled users to rapidly develop designs targeted for FPGAs from the behavioral description of the design. However, to synthesize an optimal design capable of taking better advantage of the target FPGA, a considerable amount of effort is needed to transform the initial behavioral description into a form that can capture the desired level of parallelism. Thus, a design space exploration (DSE) engine capable of optimizing large complex designs is needed to achieve this goal. We present a new DSE engine capable of considering code transformation, compiler directives (pragmas), and the compatibility of these optimizations. To accomplish this, we initially express the structure of the input code as a graph to guide the exploration process. To appropriately transform the code, we take advantage of ScaleHLS based on the multi-level compiler infrastructure (MLIR). Finally, we identify a problem that limits the scalability of existing DSEs, which we name the “design space merging problem.” We address this issue by employing a Random Forest classifier that can successfully decrease the number of invalid design points without invoking the HLS compiler as a validation tool. We evaluated our DSE engine against the ScaleHLS DSE, outperforming it by a maximum of 59×. We additionally demonstrate the scalability of our design by applying our DSE to large-scale HLS designs, achieving a maximum speedup of 12× for the benchmarks in the MachSuite and Rodinia set.
Full Text Available
RackBlox: A Software-Defined Rack-Scale Storage System with Network-Storage Co-Design

https://doi.org/10.1145/3600006.3613170

Reidys, Benjamin; Xue, Yuqi; Li, Daixuan; Sukhwani, Bharat; Hwu, Wen-Mei; Chen, Deming; Asaad, Sameh; Huang, Jian (October 2023, ACM)

Full Text Available
Extensible and Efficient Proxy for Neural Architecture Search

https://doi.org/10.1109/ICCV51070.2023.00570

Li, Yuhong; Li, Jiajie; Hao, Cong; Li, Pan; Xiong, Jinjun; Chen, Deming (October 2023, IEEE)

Full Text Available
ScaleHLS: a scalable high-level synthesis framework with multi-level transformations and optimizations: invited

https://doi.org/10.1145/3489517.3530631

Ye, Hanchen; Jun, HyeGang; Jeong, Hyunmin; Neuendorffer, Stephen; Chen, Deming (July 2022, ACM)

This paper presents an enhanced version of a scalable HLS (High-Level Synthesis) framework named ScaleHLS, which can compile HLS C/C++ programs and PyTorch models to highly efficient and synthesizable C++ designs. The original version of ScaleHLS achieved significant speedup on both C/C++ kernels and PyTorch models [14]. In this paper, we first highlight the key features of ScaleHLS for tackling the challenges present in the representation, optimization, and exploration of large-scale HLS designs. To further improve the scalability of ScaleHLS, we then propose an enhanced HLS transform and analysis library supported in both C++ and Python, and a new design space exploration algorithm to handle HLS designs with hierarchical structures more effectively. Compared to the original ScaleHLS, our enhanced version improves the speedup by up to 60.9× on FPGAs. ScaleHLS is fully open-sourced at https://github.com/hanchenye/scalehls.
DML: Dynamic Partial Reconfiguration With Scalable Task Scheduling for Multi-Applications on FPGAs

https://doi.org/10.1109/TC.2021.3137785

Dhar, Ashutosh; Richter, Edward; Yu, Mang; Zuo, Wei; Wang, Xiaohao; Kim, Nam Sung; Chen, Deming (October 2022, IEEE Transactions on Computers)

Full Text Available
CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture

https://doi.org/10.1145/3543622.3573210

Zhuang, Jinming; Lau, Jason; Ye, Hanchen; Yang, Zhuoping; Du, Yubo; Lo, Jack; Denolf, Kristof; Neuendorffer, Stephen; Jones, Alex; Hu, Jingtong; et al (February 2023, Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays)

Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores and programmable logic (PL) with AI Engine processors (AIE) optimized for AI/ML. An array of 400 AI Engine processors executing at 1 GHz can theoretically provide up to 6.4 TFLOPs performance for 32-bit floating-point (fp32) data. However, machine learning models often contain both large and small MM operations. While large MM operations can be parallelized efficiently across many cores, small MM operations typically cannot. In our investigation, we observe that executing some small MM layers from the BERT natural language processing model on a large, monolithic MM accelerator in Versal ACAP achieved less than 5% of the theoretical peak performance. Therefore, one key question arises: How can we design accelerators to fully use the abundant computation resources under limited communication bandwidth for end-to-end applications with multiple MM layers of diverse sizes? We identify the biggest system throughput bottleneck resulting from the mismatch of massive computation resources of one monolithic accelerator and the various MM layers of small sizes in the application. To resolve this problem, we propose the CHARM framework to compose multiple diverse MM accelerator architectures working concurrently towards different layers within one application. CHARM includes analytical models which guide design space exploration to determine accelerator partitions and layer scheduling. To facilitate the system designs, CHARM automatically generates code, enabling thorough onboard design verification. 
We deploy the CHARM framework for four different deep learning applications, including BERT, ViT, NCF, and MLP, on the AMD/Xilinx Versal ACAP VCK190 evaluation board. Our experiments show that we achieve 1.46 TFLOPs, 1.61 TFLOPs, 1.74 TFLOPs, and 2.94 TFLOPs inference throughput for BERT, ViT, NCF, and MLP, respectively, which obtain 5.40×, 32.51×, 1.00×, and 1.00× throughput gains compared to one monolithic accelerator.
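One way to see why small MM layers can underuse a large monolithic accelerator is a toy padding model: if the accelerator processes fixed-size tiles, each dimension of a small layer is rounded up to the tile size and the padded work is wasted. The Python sketch below is only an illustration under that assumption. The tile sizes and layer shapes are hypothetical, and this is not CHARM's analytical model, which also accounts for communication bandwidth.

```python
import math


def padded_utilization(m, k, n, tile_m, tile_k, tile_n):
    """Fraction of multiply-accumulates that are useful when an (m x k) @ (k x n)
    multiply is padded up to whole tiles of size tile_m x tile_k x tile_n."""
    padded_m = math.ceil(m / tile_m) * tile_m
    padded_k = math.ceil(k / tile_k) * tile_k
    padded_n = math.ceil(n / tile_n) * tile_n
    return (m * k * n) / (padded_m * padded_k * padded_n)


if __name__ == "__main__":
    # Hypothetical tile sizes: one large accelerator vs. a smaller one.
    large_tile = (1024, 128, 1024)
    small_tile = (64, 64, 64)
    # Hypothetical layer shapes: one large MM and one small MM.
    layers = {"large": (2048, 1024, 2048), "small": (64, 512, 64)}
    for name, shape in layers.items():
        on_large = padded_utilization(*shape, *large_tile)
        on_small = padded_utilization(*shape, *small_tile)
        print(f"{name} layer: {on_large:.2%} on large tiles, "
              f"{on_small:.2%} on small tiles")
```

In this toy model the small layer wastes most of the large tile while fitting the smaller one exactly, which is the kind of mismatch that motivates composing accelerators of different sizes.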
Full Text Available
ScaleHLS: A New Scalable High-Level Synthesis Framework on Multi-Level Intermediate Representation

https://doi.org/10.1109/HPCA53966.2022.00060

Ye, Hanchen; Hao, Cong; Cheng, Jianyi; Jeong, Hyunmin; Huang, Jack; Neuendorffer, Stephen; Chen, Deming (April 2022, IEEE International Symposium on High-Performance Computer Architecture (HPCA))

Full Text Available
HELLO: improved neural network architectures and methodologies for small variant calling

https://doi.org/10.1186/s12859-021-04311-4

Ramachandran, Anand; Lumetta, Steven S.; Klee, Eric W.; Chen, Deming (December 2021, BMC Bioinformatics)
Background: Modern Next Generation- and Third Generation- Sequencing methods such as Illumina and PacBio Circular Consensus Sequencing platforms provide accurate sequencing data. Parallel developments in Deep Learning have enabled the application of Deep Neural Networks to variant calling, surpassing the accuracy of classical approaches in many settings. DeepVariant, arguably the most popular among such methods, transforms the problem of variant calling into one of image recognition where a Deep Neural Network analyzes sequencing data that is formatted as images, achieving high accuracy. In this paper, we explore an alternative approach to designing Deep Neural Networks for variant calling, where we use meticulously designed Deep Neural Network architectures and customized variant inference functions that account for the underlying nature of sequencing data instead of converting the problem to one of image recognition.
Results: Results from 27 whole-genome variant calling experiments spanning Illumina, PacBio and hybrid Illumina-PacBio settings suggest that our method allows vastly smaller Deep Neural Networks to outperform the Inception-v3 architecture used in DeepVariant for indel and substitution-type variant calls. For example, our method reduces the number of indel call errors by up to 18%, 55% and 65% for Illumina, PacBio and hybrid Illumina-PacBio variant calling respectively, compared to a similarly trained DeepVariant pipeline. In these cases, our models are between 7 and 14 times smaller.
Conclusions: We believe that the improved accuracy and problem-specific customization of our models will enable more accurate pipelines and further method development in the field. HELLO is available at https://github.com/anands-repo/hello
Full Text Available
