skip to main content

Title: Callipepla: Stream Centric Instruction Set and Mixed Precision for Accelerating Conjugate Gradient Solver
The continued growth in the processing power of FPGAs coupled with high bandwidth memories (HBM), makes systems like the Xilinx U280 credible platforms for linear solvers which often dominate the run time of scientific and engineering applications. In this paper, we present Callipepla, an accelerator for a preconditioned conjugate gradient linear solver (CG). FPGA acceleration of CG faces three challenges: (1) how to support an arbitrary problem and terminate acceleration processing on the fly, (2) how to coordinate long-vector data flow among processing modules, and (3) how to save off-chip memory bandwidth and maintain double (FP64) precision accuracy. To tackle the three challenges, we present (1) a stream-centric instruction set for efficient streaming processing and control, (2) vector streaming reuse (VSR) and decentralized vector flow scheduling to coordinate vector data flow among modules and further reduce off-chip memory access latency with a double memory channel design, and (3) a mixed precision scheme to save bandwidth yet still achieve effective double precision quality solutions. To the best of our knowledge, this is the first work to introduce the concept of VSR for data reusing between on-chip modules to reduce unnecessary off-chip accesses and enable modules working in parallel for FPGA accelerators. We prototype the accelerator on a Xilinx U280 HBM FPGA. Our evaluation shows that compared to the Xilinx HPC product, the XcgSolver, Callipepla achieves a speedup of 3.94×, 3.36× higher throughput, and 2.94× better energy efficiency. Compared to an NVIDIA A100 GPU which has 4× the memory bandwidth of Callipepla, we still achieve 77% of its throughput with 3.34× higher energy efficiency. The code is available at  more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA '23)
Page Range / eLocation ID:
247 to 258
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The two largest barriers to adoption of FPGA platforms for HPC applications are the difficulty of programming FPGAs and the performance gap when compared to GPUs. To address the first barrier, new ecosystems like Intel oneAPI, and Xilinx Vitis HLS aim to improve programmability for FPGA platforms. From a performance aspect, FPGAs trade off lower compute frequencies for more customized hardware acceleration and power efficiency when compared to GPUs. The performance for memory-bound applications on recent GPU platforms like NVIDIA’s H100 and AMD’s MI210 has also improved due to the inclusion of high-bandwidth memories (HBM), and newer FPGA platforms are also starting to include HBM in addition to traditional DRAM. To understand the current state-of-the-art and performance differences between FPGAs and GPUs, we consider realized memory bandwidth for recent FPGA and GPU platforms. We utilize a custom STREAM benchmark to evaluate two Intel FPGA platforms, the Stratix 10 SX PAC and Bittware 520N-MX, two AMD/Xilinx FPGA platforms, the Alveo U250 and Alveo U280, as well as GPU platforms from NVIDIA and AMD. We also extract power measurements and estimate memory bandwidth per Watt ((GB/s)/W) on these platforms to evaluate how FPGAs compare against GPU execution. While the GPUs far exceed the FPGAs in raw performance, the HBM equipped FPGAs demonstrate a competitive performance-power balance for larger data sizes that can be easily implemented with oneAPI and Vitis HLS kernels. These findings suggest a potential sweet spot for this emerging FPGA ecosystem to serve bandwidth limited applications in an energy-efficient fashion. 
    more » « less
  2. null (Ed.)
    Cloud deployments now increasingly provision FPGA accelerators as part of virtual instances. While FPGAs are still essentially single-tenant, the growing demand for hardware acceleration will inevitably lead to the need for methods and architectures supporting FPGA multi-tenancy. In this paper, we propose an architecture supporting space-sharing of FPGA devices among multiple tenants in the cloud. The proposed architecture implements a network-on-chip (NoC) designed for fast data movement and low hardware footprint. Prototyping the proposed architecture on a Xilinx Virtex Ultrascale + demonstrated near specification maximum frequency for on-chip data movement and high throughput in virtual instance access to hardware accelerators. We demonstrate similar performance compared to single-tenant deployment while increasing FPGA utilization (we achieved 6× higher FPGA utilization with our case study), which is one of the major goals of virtualization. Overall, our NoC interconnect achieved about 2× higher maximum frequency than the state-of-the-art and a bandwidth of 25.6 Gbps 
    more » « less
  3. Arbitrary-precision integer multiplication is the core kernel of many applications including scientific computing, cryptographic algorithms, etc. Existing acceleration of arbitrary-precision integer multiplication includes CPUs, GPUs, FPGAs, and ASICs. To leverage the hardware intrinsics low-bit function units (32/64-bit), arbitrary-precision integer multiplication can be calculated using Karatsuba decomposition, and Schoolbook decomposition by decomposing the two large operands into several small operands, generating a set of low-bit multiplications that can be processed either in a spatial or sequential manner on the low-bit function units, e.g., CPU vector instructions, GPU CUDA cores, FPGA digital signal processing (DSP) blocks. Among these accelerators, reconfigurable computing, e.g., FPGA accelerators are promised to provide both good energy efficiency and flexibility. We implement the state-of-the-art (SOTA) FPGA accelerator and compare it with the SOTA libraries on CPUs and GPUs. Surprisingly, in terms of energy efficiency, we find that the FPGA has the lowest energy efficiency, i.e., 0.29x of the CPU and 0.17x of the GPU with the same generation fabrication. Therefore, key questions arise: Where do the energy efficiency gains of CPUs and GPUs come from? Can reconfigurable computing do better? If can, how to achieve that? We first identify that the biggest energy efficiency gains of the CPUs and GPUs come from the dedicated vector units, i.e., vector instruction units in CPUs and CUDA cores in GPUs. FPGA uses DSPs and lookup tables (LUTs) to compose the needed computation, which incurs overhead when compared to using vector units directly. New reconfigurable computing, e.g., “FPGA+vector units” is a novel and feasible solution to improve energy efficiency. In this paper, we propose to map arbitrary-precision integer multiplication onto such a “FPGA+vector units” platform, i.e., AMD/Xilinx Versal ACAP architecture, a heterogeneous reconfigurable computing platform that features 400 AI engine tensor cores (AIE) running at 1 GHz, FPGA programmable logic (PL), and a general-purpose CPU in the system fabricated with the TSMC 7nm technology. Designing on Versal ACAP incurs several challenges and we propose AIM: Arbitrary-precision Integer Multiplication on Versal ACAP to automate and optimize the design. AIM accelerator is composed of AIEs, PL, and CPU. AIM framework includes analytical models to guide design space exploration and AIM automatic code generation to facilitate the system design and on-board design verification. We deploy the AIM framework on three different applications, including large integer multiplication (LIM), RSA, and Mandelbrot, on the AMD/Xilinx Versal ACAP VCK190 evaluation board. Our experimental results show that compared to existing accelerators, AIM achieves up to 12.6x, and 2.1x energy efficiency gains over the Intel Xeon Ice Lake 6346 CPU, and NVidia A5000 GPU respectively, which brings reconfigurable computing the most energy-efficient platform among CPUs and GPUs. 
    more » « less
  4. In this paper, we propose GraphiDe, a novel DRAM-based processing-in-memory (PIM) accelerator for graph processing. It transforms current DRAM architecture to massively parallel computational units exploiting the high internal bandwidth of the modern memory chips to accelerate various graph processing applications. GraphiDe can be leveraged to greatly reduce energy consumption and latency dealing with underlying adjacency matrix computations by eliminating unnecessary off-chip accesses. The extensive circuit-architecture simulations over three social network data-sets indicate that GraphiDe achieves on average 3.1x energy-efficiency improvement and 4.2x speed-up over the recent DRAM based PIM platform. It achieves ~59x higher energy-efficiency and 83x speed-up over GPU-based acceleration methods. 
    more » « less
  5. null (Ed.)
    Large Convolutional Neural Networks (CNNs) are often pruned and compressed to reduce the amount of parameters and memory requirement. However, the resulting irregularity in the sparse data makes it difficult for FPGA accelerators that contains systolic arrays of Multiply-and-Accumulate (MAC) units, such as Intel’s FPGA-based Deep Learning Accelerator (DLA), to achieve their maximum potential. Moreover, FPGAs with low-bandwidth off-chip memory could not satisfy the memory bandwidth requirement for sparse matrix computation. In this paper, we present 1) a sparse matrix packing technique that condenses sparse inputs and filters before feeding them into the systolic array of MAC units in the Intel DLA, and 2) a customization of the Intel DLA which allows the FPGA to efficiently utilize a high bandwidth memory (HBM2) integrated in the same package. For end-to-end inference with randomly pruned ResNet-50/MobileNet CNN models, our experiments demonstrate 2.7x/3x performance improvement compared to an FPGA with DDR4, 2.2x/2.1x speedup against a server-class Intel SkyLake CPU, and comparable performance with 1.7x/2x power efficiency gain as compared to an NVidia V100 GPU. 
    more » « less