skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: AIM: Accelerating Arbitrary-Precision Integer Multiplication on Heterogeneous Reconfigurable Computing Platform Versal ACAP
Arbitrary-precision integer multiplication is the core kernel of many applications including scientific computing, cryptographic algorithms, etc. Existing acceleration of arbitrary-precision integer multiplication includes CPUs, GPUs, FPGAs, and ASICs. To leverage the hardware intrinsics low-bit function units (32/64-bit), arbitrary-precision integer multiplication can be calculated using Karatsuba decomposition, and Schoolbook decomposition by decomposing the two large operands into several small operands, generating a set of low-bit multiplications that can be processed either in a spatial or sequential manner on the low-bit function units, e.g., CPU vector instructions, GPU CUDA cores, FPGA digital signal processing (DSP) blocks. Among these accelerators, reconfigurable computing, e.g., FPGA accelerators are promised to provide both good energy efficiency and flexibility. We implement the state-of-the-art (SOTA) FPGA accelerator and compare it with the SOTA libraries on CPUs and GPUs. Surprisingly, in terms of energy efficiency, we find that the FPGA has the lowest energy efficiency, i.e., 0.29x of the CPU and 0.17x of the GPU with the same generation fabrication. Therefore, key questions arise: Where do the energy efficiency gains of CPUs and GPUs come from? Can reconfigurable computing do better? If can, how to achieve that? We first identify that the biggest energy efficiency gains of the CPUs and GPUs come from the dedicated vector units, i.e., vector instruction units in CPUs and CUDA cores in GPUs. FPGA uses DSPs and lookup tables (LUTs) to compose the needed computation, which incurs overhead when compared to using vector units directly. New reconfigurable computing, e.g., “FPGA+vector units” is a novel and feasible solution to improve energy efficiency. In this paper, we propose to map arbitrary-precision integer multiplication onto such a “FPGA+vector units” platform, i.e., AMD/Xilinx Versal ACAP architecture, a heterogeneous reconfigurable computing platform that features 400 AI engine tensor cores (AIE) running at 1 GHz, FPGA programmable logic (PL), and a general-purpose CPU in the system fabricated with the TSMC 7nm technology. Designing on Versal ACAP incurs several challenges and we propose AIM: Arbitrary-precision Integer Multiplication on Versal ACAP to automate and optimize the design. AIM accelerator is composed of AIEs, PL, and CPU. AIM framework includes analytical models to guide design space exploration and AIM automatic code generation to facilitate the system design and on-board design verification. We deploy the AIM framework on three different applications, including large integer multiplication (LIM), RSA, and Mandelbrot, on the AMD/Xilinx Versal ACAP VCK190 evaluation board. Our experimental results show that compared to existing accelerators, AIM achieves up to 12.6x, and 2.1x energy efficiency gains over the Intel Xeon Ice Lake 6346 CPU, and NVidia A5000 GPU respectively, which brings reconfigurable computing the most energy-efficient platform among CPUs and GPUs.  more » « less
Award ID(s):
2324864 2328972 2213701 2217003
PAR ID:
10485749
Author(s) / Creator(s):
; ; ; ; ;
Publisher / Repository:
IEEE
Date Published:
ISBN:
979-8-3503-2225-5
Page Range / eLocation ID:
1 to 9
Subject(s) / Keyword(s):
Heterogeneous reconfigurable computing architecture, Versal ACAP, mapping framework, arbitrary-precision integer computing
Format(s):
Medium: X
Location:
San Francisco, CA, USA
Sponsoring Org:
National Science Foundation
More Like this
  1. While Vision Transformers (ViTs) have shown consistent progress in computer vision, deploying them for real-time decision-making scenarios (< 1 ms) is challenging. Current computing platforms like CPUs, GPUs, or FPGA-based solutions struggle to meet this deterministic low-latency real-time requirement, even with quantized ViT models. Some approaches use pruning or sparsity to reduce model size and latency, but this often results in accuracy loss. To address the aforementioned constraints, in this work, we propose EQ-ViT, an end-to-end acceleration framework with novel algorithm and architecture co-design features to enable real-time ViT acceleration on AMD Versal Adaptive Compute Acceleration Platform (ACAP). The contributions are four-fold. First, we perform in-depth kernel- level performance profiling & analysis and explain the bottlenecks for existing acceleration solutions on GPU, FPGA, and ACAP. Second, on the hardware level, we introduce a new spatial and heterogeneous accelerator architecture, EQ-ViT architec- ture. This architecture leverages the heterogeneous features of ACAP, where both FPGA and artificial intelligence engines (AIEs) coexist on the same system-on-chip (SoC). Third, On the algorithm level, we create a comprehensive quantization-aware training strategy, EQ-ViT algorithm. This strategy concurrently quantizes both weights and activations into 8-bit integers, aiming to improve accuracy rather than compromise it during quanti- zation. Notably, the method also quantizes nonlinear functions for efficient hardware implementation. Fourth, we design EQ- ViT automation framework to implement the EQ-ViT architec- ture for four different ViT applications on the AMD Versal ACAP VCK190 board, achieving accuracy improvement with 2.4%, and average speedups of 315.0x, 3.39x, 3.38x, 14.92x, 59.5x, 13.1x over computing solutions of Intel Xeon 8375C vCPU, Nvidia A10G, A100, Jetson AGX Orin GPUs, and AMD ZCU102, U250 FPGAs. The energy efficiency gains are 62.2x, 15.33x, 12.82x, 13.31x, 13.5x, 21.9x. 
    more » « less
  2. Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores and programmable logic (PL) with AI Engine processors (AIE) optimized for AI/ML. An array of 400 AI Engine processors executing at 1 GHz can theoretically provide up to 6.4 TFLOPs performance for 32-bit floating-point (fp32) data. However, machine learning models often contain both large and small MM operations. While large MM operations can be parallelized efficiently across many cores, small MM operations typically cannot. In our investigation, we observe that executing some small MM layers from the BERT natural language processing model on a large, monolithic MM accelerator in Versal ACAP achieved less than 5% of the theoretical peak performance. Therefore, one key question arises: How can we design accelerators to fully use the abundant computation resources under limited communication bandwidth for end-to-end applications with multiple MM layers of diverse sizes? We identify the biggest system throughput bottleneck resulting from the mismatch of massive computation resources of one monolithic accelerator and the various MM layers of small sizes in the application. To resolve this problem, we propose the CHARM framework to compose multiple diverse MM accelerator architectures working concurrently towards different layers within one application. CHARM includes analytical models which guide design space exploration to determine accelerator partitions and layer scheduling. To facilitate the system designs, CHARM automatically generates code, enabling thorough onboard design verification. We deploy the CHARM framework for four different deep learning applications, including BERT, ViT, NCF, MLP, on the AMD/Xilinx Versal ACAP VCK190 evaluation board. Our experiments show that we achieve 1.46 TFLOPs, 1.61 TFLOPs, 1.74 TFLOPs, and 2.94 TFLOPs inference throughput for BERT, ViT, NCF, MLP, respectively, which obtain 5.40x, 32.51x, 1.00x and 1.00x throughput gains compared to one monolithic accelerator. 
    more » « less
  3. Fully Homomorphic Encryption over the Torus (TFHE) allows arbitrary computations to happen directly on ciphertexts using homomorphic logic gates. However, each TFHE gate on state-of-the-art hardware platforms such as GPUs and FPGAs is extremely slow (> 0.2ms). Moreover, even the latest FPGA-based TFHE accelerator cannot achieve high energy efficiency, since it frequently invokes expensive double-precision floating point FFT and IFFT kernels. In this paper, we propose a fast and energy-efficient accelerator, MATCHA, to process TFHE gates. MATCHA supports aggressive bootstrapping key unrolling to accelerate TFHE gates without decryption errors by approximate multiplication-less integer FFTs and IFFTs, and a pipelined datapath. Compared to prior accelerators, MATCHA improves the TFHE gate processing throughput by 2.3x, and the throughput per Watt by 6.3x. 
    more » « less
  4. As the increasing complexity of Neural Network(NN) models leads to high demands for computation, AMD introduces a heterogeneous programmable system-on-chip (SoC), i.e., Versal ACAP architectures featured with programmable logic(PL), CPUs, and dedicated AI engines (AIE) ASICs which has a theoretical throughput up to 6.4 TFLOPs for FP32, 25.6 TOPs for INT16 and 102.4 TOPs for INT8. However, the higher level of complexity makes it non-trivial to achieve the theoretical performance even for well-studied applications like matrix-matrix multiply. In this paper, we provide AutoMM, an automatic white-box framework that can systematically generate the design for MM accelerators on Versal which achieves 3.7 TFLOPs, 7.5 TOPs, and 28.2 TOPs for FP32, INT16, and INT8 data type respectively. Our designs are tested on board and achieve gains of 7.20x (FP32), 3.26x (INT16), 6.23x (INT8) energy efficiency than AMD U250, 2.32x (FP32) than Nvidia Jetson TX2, 1.06x (FP32), 1.70x (INT8) than Nvidia A100. 
    more » « less
  5. Over the past few years, there has been an increased interest in including FPGAs in data centers and high-performance computing clusters along with GPUs and other accelerators. As a result, it has become increasingly important to have a unified, high-level programming interface for CPUs, GPUs and FPGAs. This has led to the development of compiler toolchains to deploy OpenCL code on FPGA. However, the fundamental architectural differences between GPUs and FPGAs have led to performance portability issues: it has been shown that OpenCL code optimized for GPU does not necessarily map well to FPGA, often requiring manual optimizations to improve performance. In this paper, we explore the use of thread coarsening - a compiler technique that consolidates the work of multiple threads into a single thread - on OpenCL code running on FPGA. While this optimization has been explored on CPU and GPU, the architectural features of FPGAs and the nature of the parallelism they offer lead to different performance considerations, making an analysis of thread coarsening on FPGA worthwhile. Our evaluation, performed on our microbenchmarks and on a set of applications from open-source benchmark suites, shows that thread coarsening can yield performance benefits (up to 3-4x speedups) to OpenCL code running on FPGA at a limited resource utilization cost. 
    more » « less