Title: CoMeFa: Deploying Compute-in-Memory on FPGAs for Deep Learning Acceleration
Block random access memories (BRAMs) are the storage houses of FPGAs, providing extensive on-chip memory bandwidth to the compute units implemented using logic blocks and digital signal processing slices. We propose modifying BRAMs to convert them to CoMeFa (Compute-in-Memory Blocks for FPGAs) random access memories (RAMs). These RAMs provide highly parallel compute-in-memory by combining computation and storage capabilities in one block. CoMeFa RAMs utilize the true dual-port nature of FPGA BRAMs and contain multiple configurable single-bit bit-serial processing elements. CoMeFa RAMs can be used to compute with any precision, which is extremely important for applications like deep learning (DL). Adding CoMeFa RAMs to FPGAs significantly increases their compute density while also reducing data movement. We explore and propose two architectures of these RAMs: CoMeFa-D (optimized for delay) and CoMeFa-A (optimized for area). Unlike existing proposals, CoMeFa RAMs do not require changes to the underlying static RAM technology, such as simultaneously activating multiple wordlines on the same port, and are practical to implement. CoMeFa RAMs are especially suitable for parallel and compute-intensive applications like DL, but these versatile blocks are also useful in diverse domains such as signal processing and databases, among others. By augmenting an Intel Arria 10–like FPGA with CoMeFa-D (CoMeFa-A) RAMs at the cost of 3.8% (1.2%) area, and with algorithmic improvements and efficient mapping, we observe a geomean speedup of 2.55× (1.85×) across microbenchmarks from various applications and a geomean speedup of up to 2.5× across multiple deep neural networks. Replacing all or some BRAMs with CoMeFa RAMs in FPGAs can make them better accelerators of DL workloads.
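To illustrate what "single-bit bit-serial processing elements" computing "with any precision" can look like, here is a minimal software sketch of bit-serial addition. It is not the CoMeFa hardware design: the transposed data layout (one operand per column, one bit position per row) and all function names (`transpose_to_rows`, `rows_to_values`, `bit_serial_add`) are assumptions made for illustration. Each Python integer stands in for one memory row, and the bitwise operations act on every column at once, mimicking one 1-bit processing element per column.

```python
from typing import List


def transpose_to_rows(values: List[int], bits: int) -> List[int]:
    """Pack bit i of every value into row i; column c holds values[c].

    Assumed layout (hypothetical, for illustration only).
    """
    rows = []
    for i in range(bits):
        row = 0
        for col, v in enumerate(values):
            row |= ((v >> i) & 1) << col
        rows.append(row)
    return rows


def rows_to_values(rows: List[int], columns: int) -> List[int]:
    """Inverse of transpose_to_rows: rebuild one integer per column."""
    values = []
    for col in range(columns):
        v = 0
        for i, row in enumerate(rows):
            v |= ((row >> col) & 1) << i
        values.append(v)
    return values


def bit_serial_add(a_rows: List[int], b_rows: List[int]) -> List[int]:
    """Add two transposed operands one bit position (row) per step.

    Every column behaves like a 1-bit full adder with a carry latch,
    so all columns are processed in parallel at each step.
    """
    carry = 0  # one carry bit per column, all initially zero
    out = []
    for a, b in zip(a_rows, b_rows):  # least significant row first
        out.append(a ^ b ^ carry)                 # sum bit per column
        carry = (a & b) | (carry & (a ^ b))       # carry-out per column
    out.append(carry)  # final carry becomes the extra top row
    return out


if __name__ == "__main__":
    a = [3, 5, 7, 250]
    b = [1, 10, 9, 6]
    bits = 8  # precision is just the number of rows, so it is arbitrary
    sums = rows_to_values(
        bit_serial_add(transpose_to_rows(a, bits), transpose_to_rows(b, bits)),
        columns=len(a),
    )
    print(sums)
```

In this model the latency grows with the operand width (one step per bit), while throughput grows with the number of columns, which is why a bit-serial scheme can accommodate arbitrary precision without dedicated hardware for each width.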
Award ID(s):
1763848
NSF-PAR ID:
10488312
Publisher / Repository:
ACM
Date Published:
Journal Name:
ACM Transactions on Reconfigurable Technology and Systems
Volume:
16
Issue:
3
ISSN:
1936-7406
Page Range / eLocation ID:
50:1-50:34
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. FPGAs are well-suited for accelerating deep learning (DL) applications owing to the rapidly changing algorithms, network architectures and computation requirements in this field. However, the generic building blocks available on traditional FPGAs limit the acceleration that can be achieved. Many modifications to FPGA architecture have been proposed and deployed including adding specialized artificial intelligence (AI) processing engines, adding support for smaller precision math like 8-bit fixed point and IEEE half-precision (fp16) in DSP slices, adding shadow multipliers in logic blocks, etc. In this paper, we describe replacing a portion of the FPGA’s programmable logic area with Tensor Slices. These slices have a systolic array of processing elements at their heart that support multiple tensor operations, multiple dynamically-selectable precisions and can be dynamically fractured into individual multipliers and MACs (multiply-and-accumulate). These slices have a local crossbar at the inputs that helps with easing the routing pressure caused by a large block on the FPGA. Adding these DL-specific coarse-grained hard blocks to FPGAs increases their compute density and makes them even better hardware accelerators for DL applications, while still keeping the vast majority of the real estate on the FPGA programmable at fine-grain. 
  2. The continued growth in the processing power of FPGAs, coupled with high-bandwidth memories (HBM), makes systems like the Xilinx U280 credible platforms for linear solvers, which often dominate the run time of scientific and engineering applications. In this paper, we present Callipepla, an accelerator for a preconditioned conjugate gradient linear solver (CG). FPGA acceleration of CG faces three challenges: (1) how to support an arbitrary problem and terminate acceleration processing on the fly, (2) how to coordinate long-vector data flow among processing modules, and (3) how to save off-chip memory bandwidth and maintain double (FP64) precision accuracy. To tackle the three challenges, we present (1) a stream-centric instruction set for efficient streaming processing and control, (2) vector streaming reuse (VSR) and decentralized vector flow scheduling to coordinate vector data flow among modules and further reduce off-chip memory access latency with a double memory channel design, and (3) a mixed precision scheme to save bandwidth yet still achieve effective double precision quality solutions. To the best of our knowledge, this is the first work to introduce the concept of VSR for data reuse between on-chip modules to reduce unnecessary off-chip accesses and enable modules to work in parallel for FPGA accelerators. We prototype the accelerator on a Xilinx U280 HBM FPGA. Our evaluation shows that compared to the Xilinx HPC product, the XcgSolver, Callipepla achieves a speedup of 3.94×, 3.36× higher throughput, and 2.94× better energy efficiency. Compared to an NVIDIA A100 GPU, which has 4× the memory bandwidth of Callipepla, we still achieve 77% of its throughput with 3.34× higher energy efficiency. The code is available at https://github.com/UCLA-VAST/Callipepla.
  3. The two largest barriers to adoption of FPGA platforms for HPC applications are the difficulty of programming FPGAs and the performance gap when compared to GPUs. To address the first barrier, new ecosystems like Intel oneAPI and Xilinx Vitis HLS aim to improve programmability for FPGA platforms. From a performance aspect, FPGAs trade off lower compute frequencies for more customized hardware acceleration and power efficiency when compared to GPUs. The performance for memory-bound applications on recent GPU platforms like NVIDIA’s H100 and AMD’s MI210 has also improved due to the inclusion of high-bandwidth memories (HBM), and newer FPGA platforms are also starting to include HBM in addition to traditional DRAM. To understand the current state of the art and performance differences between FPGAs and GPUs, we consider realized memory bandwidth for recent FPGA and GPU platforms. We utilize a custom STREAM benchmark to evaluate two Intel FPGA platforms, the Stratix 10 SX PAC and Bittware 520N-MX, two AMD/Xilinx FPGA platforms, the Alveo U250 and Alveo U280, as well as GPU platforms from NVIDIA and AMD. We also extract power measurements and estimate memory bandwidth per watt ((GB/s)/W) on these platforms to evaluate how FPGAs compare against GPU execution. While the GPUs far exceed the FPGAs in raw performance, the HBM-equipped FPGAs demonstrate a competitive performance-power balance for larger data sizes that can be easily implemented with oneAPI and Vitis HLS kernels. These findings suggest a potential sweet spot for this emerging FPGA ecosystem to serve bandwidth-limited applications in an energy-efficient fashion.
  4. This work presents the first resistive random access memory (RRAM)-based compute-in-memory (CIM) macro design tailored for genome processing. We analyze and demonstrate two key types of genome processing applications using our developed CIM chip prototype: the state-of-the-art (SOTA) Burrows–Wheeler transform (BWT)-based DNA short-read alignment and alignment-free mRNA quantification. Our CIM macro is designed and optimized to support the major functions essential to these algorithms, e.g., parallel XNOR operations, count, addition, and parallel bit-wise AND operations. The proposed CIM macro prototype is fabricated with monolithic integration of HfO2 RRAM and 65-nm CMOS, achieving 2.07 TOPS/W (tera-operations per second per watt) and 2.12 G suffixes/J (suffixes per joule) at 1.0 V, which is the most energy-efficient solution to date for genome processing.
  5. Processing-in-memory (PIM) architectures attempt to overcome the von Neumann bottleneck by combining computation and storage logic into a single component. The content-addressable parallel processing paradigm (CAPP) from the seventies is an in-situ PIM architecture that leverages content-addressable memories to realize bit-serial arithmetic and logic operations, via sequences of search and update operations over multiple memory rows in parallel. In this paper, we set out to investigate whether the concepts behind classic CAPP can be used successfully to build an entirely CMOS-based, general-purpose microarchitecture that can deliver manyfold speedups while remaining highly programmable. We conduct a full-stack design of a Content-Addressable Processing Engine (CAPE), built out of dense push-rule 6T SRAM arrays. CAPE is programmable using the RISC-V ISA with standard vector extensions. Our experiments show that CAPE achieves an average speedup of 14 (up to 254) over an area-equivalent (slightly under 9 mm^2 at 7 nm) out-of-order processor core with three levels of caches.