CoMeFa: Deploying Compute-in-Memory on FPGAs for Deep Learning Acceleration

Arora, Aman; Bhamburkar, Atharva; Borda, Aatman; Anand, Tanmay; Sehgal, Rishabh; Hanindhito, Bagus; Gaillardon, Pierre-Emmanuel; Kulkarni, Jaydeep; John, Lizy K.

doi:10.1145/3603504

This content will become publicly available on July 27, 2024

Block random access memories (BRAMs) are the storage houses of FPGAs, providing extensive on-chip memory bandwidth to the compute units implemented using logic blocks and digital signal processing slices. We propose modifying BRAMs to convert them to CoMeFa (Compute-in-Memory Blocks for FPGAs) random access memories (RAMs). These RAMs provide highly parallel compute-in-memory by combining computation and storage capabilities in one block. CoMeFa RAMs utilize the true dual-port nature of FPGA BRAMs and contain multiple configurable single-bit bit-serial processing elements. CoMeFa RAMs can be used to compute with any precision, which is extremely important for applications like deep learning (DL). Adding CoMeFa RAMs to FPGAs significantly increases their compute density while also reducing data movement. We explore and propose two architectures of these RAMs: CoMeFa-D (optimized for delay) and CoMeFa-A (optimized for area). Unlike existing proposals, CoMeFa RAMs do not require changes to the underlying static RAM technology, such as simultaneously activating multiple wordlines on the same port, and are practical to implement. CoMeFa RAMs are especially suitable for parallel and compute-intensive applications like DL, but these versatile blocks are also useful in diverse domains such as signal processing and databases, among others. By augmenting an Intel Arria 10–like FPGA with CoMeFa-D (CoMeFa-A) RAMs at the cost of 3.8% (1.2%) area, and with algorithmic improvements and efficient mapping, we observe a geomean speedup of 2.55× (1.85×) across microbenchmarks from various applications and a geomean speedup of up to 2.5× across multiple deep neural networks. Replacing all or some BRAMs with CoMeFa RAMs in FPGAs can make them better accelerators of DL workloads.
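To illustrate why single-bit, bit-serial processing elements can compute at any precision, the following is a minimal software sketch of bit-serial addition. It is not the CoMeFa hardware design or the authors' code. It assumes a transposed data layout (each memory row holds one bit position of many operands, one operand per column), and all function names (`to_bit_planes`, `bit_serial_add`, `add_columns`) are hypothetical. Each column plays the role of one single-bit processing element with its own carry bit, and the number of loop iterations, not the adder width, determines the precision.

```python
def to_bit_planes(values, precision):
    """Transpose integers into bit planes: planes[i][c] is bit i of values[c]."""
    return [[(v >> i) & 1 for v in values] for i in range(precision)]


def from_bit_planes(planes):
    """Reassemble one integer per column from the bit planes (LSB plane first)."""
    columns = len(planes[0])
    return [
        sum(planes[i][c] << i for i in range(len(planes)))
        for c in range(columns)
    ]


def bit_serial_add(a_planes, b_planes):
    """Add two operand sets one bit position at a time.

    Every column behaves like an independent single-bit full adder
    (an assumed model of a bit-serial processing element) that keeps
    its carry between steps. All columns advance in lockstep.
    """
    columns = len(a_planes[0])
    carry = [0] * columns
    result = []
    for a_row, b_row in zip(a_planes, b_planes):
        # Sum bit for this position, computed in every column at once.
        sum_row = [a ^ b ^ c for a, b, c in zip(a_row, b_row, carry)]
        # Carry out of this position feeds the next, more significant one.
        carry = [(a & b) | (c & (a ^ b)) for a, b, c in zip(a_row, b_row, carry)]
        result.append(sum_row)
    result.append(carry)  # final carry becomes the extra most significant bit
    return result


def add_columns(xs, ys, precision):
    """Convenience wrapper: element-wise add of two integer lists."""
    return from_bit_planes(
        bit_serial_add(to_bit_planes(xs, precision), to_bit_planes(ys, precision))
    )
```

Changing `precision` only changes how many bit-plane steps are executed; the per-column logic stays a one-bit adder. That is the sense in which a bit-serial scheme trades latency per operation for flexibility in operand width and for parallelism across columns.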

Award ID(s): 1763848

NSF-PAR ID: 10488312

Author(s) / Creator(s): Arora, Aman; Bhamburkar, Atharva; Borda, Aatman; Anand, Tanmay; Sehgal, Rishabh; Hanindhito, Bagus; Gaillardon, Pierre-Emmanuel; Kulkarni, Jaydeep; John, Lizy K.

Publisher / Repository: ACM

Date Published: 2023-07-27

Journal Name: ACM Transactions on Reconfigurable Technology and Systems

Volume: 16

Issue: 3

ISSN: 1936-7406

Page Range / eLocation ID: 50:1-50:34

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Journal Article:
https://doi.org/10.1145/3603504
