NSF PAR Search | NSF Public Access Repository

TAPA-CS: Enabling Scalable Accelerator Design on Distributed HBM-FPGAs

https://doi.org/10.1145/3620666.3651347

Prakriya, Neha; Chi, Yuze; Basalama, Suhail; Song, Linghao; Cong, Jason (April 2024, ACM)

Full Text Available
TAPA: A Scalable Task-parallel Dataflow Programming Framework for Modern FPGAs with Co-optimization of HLS and Physical Design

https://doi.org/10.1145/3609335

Guo, Licheng; Chi, Yuze; Lau, Jason; Song, Linghao; Tian, Xingyu; Khatti, Moazin; Qiao, Weikang; Wang, Jie; Ustun, Ecenur; Fang, Zhenman; et al. (December 2023, ACM Transactions on Reconfigurable Technology and Systems)

In this article, we propose TAPA, an end-to-end framework that compiles a C++ task-parallel dataflow program into a high-frequency FPGA accelerator. Compared to existing solutions, TAPA has two major advantages. First, TAPA provides a set of convenient APIs that allows users to easily express flexible and complex inter-task communication structures. Second, TAPA adopts a coarse-grained floorplanning step during HLS compilation for accurate pipelining of potential critical paths. In addition, TAPA implements several optimization techniques specifically tailored for modern HBM-based FPGAs. In our experiments with a total of 43 designs, we improve the average frequency from 147 MHz to 297 MHz (a 102% improvement) with no loss of throughput and a negligible change in resource utilization. Notably, in 16 experiments, we make the originally unroutable designs achieve 274 MHz, on average. The framework is available at https://github.com/UCLA-VAST/tapa and the core floorplan module is available at https://github.com/UCLA-VAST/AutoBridge.
Full Text Available
Callipepla: Stream Centric Instruction Set and Mixed Precision for Accelerating Conjugate Gradient Solver

https://doi.org/10.1145/3543622.3573182

Song, Linghao; Guo, Licheng; Basalama, Suhail; Chi, Yuze; Lucas, Robert F.; Cong, Jason (February 2023, Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA '23))

The continued growth in the processing power of FPGAs, coupled with high bandwidth memories (HBM), makes systems like the Xilinx U280 credible platforms for linear solvers, which often dominate the run time of scientific and engineering applications. In this paper, we present Callipepla, an accelerator for a preconditioned conjugate gradient linear solver (CG). FPGA acceleration of CG faces three challenges: (1) how to support an arbitrary problem and terminate acceleration processing on the fly, (2) how to coordinate long-vector data flow among processing modules, and (3) how to save off-chip memory bandwidth and maintain double (FP64) precision accuracy. To tackle the three challenges, we present (1) a stream-centric instruction set for efficient streaming processing and control, (2) vector streaming reuse (VSR) and decentralized vector flow scheduling to coordinate vector data flow among modules and further reduce off-chip memory access latency with a double memory channel design, and (3) a mixed precision scheme to save bandwidth yet still achieve effective double precision quality solutions. To the best of our knowledge, this is the first work to introduce the concept of VSR for data reuse between on-chip modules to reduce unnecessary off-chip accesses and enable modules to work in parallel for FPGA accelerators. We prototype the accelerator on a Xilinx U280 HBM FPGA. Our evaluation shows that compared to the Xilinx HPC product, the XcgSolver, Callipepla achieves a speedup of 3.94×, 3.36× higher throughput, and 2.94× better energy efficiency. Compared to an NVIDIA A100 GPU, which has 4× the memory bandwidth of Callipepla, we still achieve 77% of its throughput with 3.34× higher energy efficiency. The code is available at https://github.com/UCLA-VAST/Callipepla.
Full Text Available
PYXIS: An Open-Source Performance Dataset Of Sparse Accelerators

https://doi.org/10.1109/ICASSP43922.2022.9746473

Song, Linghao; Chi, Yuze; Cong, Jason (May 2022, 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP))

Full Text Available
StreamGCN: Accelerating Graph Convolutional Networks with Streaming Processing

https://doi.org/10.1109/CICC53496.2022.9772832

Sohrabizadeh, Atefeh; Chi, Yuze; Cong, Jason (April 2022, 2022 IEEE Custom Integrated Circuits Conference (CICC))

Full Text Available
Sextans: A Streaming Accelerator for General-Purpose Sparse-Matrix Dense-Matrix Multiplication

https://doi.org/10.1145/3490422.3502357

Song, Linghao; Chi, Yuze; Sohrabizadeh, Atefeh; Choi, Young-kyu; Lau, Jason; Cong, Jason (February 2022, FPGA '22: Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays)

Full Text Available
Extending High-Level Synthesis for Task-Parallel Programs

https://doi.org/10.1109/FCCM51124.2021.00032

Chi, Yuze; Guo, Licheng; Lau, Jason; Choi, Young-kyu; Wang, Jie; Cong, Jason (May 2021, Proceedings of the 29th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM’21))
Full Text Available
HBM Connect: High-Performance HLS Interconnect for FPGA HBM

https://doi.org/10.1145/3431920.3439301

Choi, Young-kyu; Chi, Yuze; Qiao, Weikang; Samardzic, Nikola; Cong, Jason (February 2021, Proceedings of the 2021 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA ’21))
Full Text Available
Exploiting Computation Reuse for Stencil Accelerators

Chi, Yuze; Cong, Jason (July 2020, Proceedings of the 57th Design Automation Conference (DAC 2020), San Francisco, CA, July 19-23, 2020.)

Stencil kernels are an important type of kernel used extensively in many application domains. Over the years, researchers have been studying optimizations of parallelization, communication reuse, and computation reuse for various target platforms. However, challenges still exist, especially on the computation reuse problem for accelerators, due to the lack of complete design-space exploration and effective design-space pruning. In this paper, we present solutions to the above challenges for a wide range of stencil kernels (i.e., stencils with reduction operations), where the computation reuse patterns are extremely flexible due to the commutative and associative properties. We formally define the complete design space, based on which we present a provably optimal dynamic programming algorithm and a heuristic beam search algorithm that provides near-optimal solutions under an architecture-aware model. Experimental results show that for synthesizing stencil kernels to FPGAs, compared with a state-of-the-art stencil compiler without computation reuse capability, our proposed algorithm can reduce the look-up table (LUT) and digital signal processor (DSP) usage by 58.1% and 54.6% on average, respectively, which leads to an average speedup of 2.3× for compute-intensive kernels, outperforming the latest CPU/GPU results.
Full Text Available
AutoBridge: Coupling Coarse-Grained Floorplanning and Pipelining for High-Frequency HLS Design on Multi-Die FPGAs

https://doi.org/10.1145/3431920.3439289

Guo, Licheng; Chi, Yuze; Wang, Jie; Lau, Jason; Qiao, Weikang; Ustun, Ecenur; Zhang, Zhiru; Cong, Jason (February 2021, Proceedings of the 2021 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA ’21), Best Paper Award)
Full Text Available
