Optimized FPGA-based Deep Learning Accelerator for Sparse CNN using High Bandwidth Memory

Jiang, Chao; Ojika, David; Patel, Bhavesh; Lam, Herman

doi:10.1109/FCCM51124.2021.00026

Large Convolutional Neural Networks (CNNs) are often pruned and compressed to reduce their parameter count and memory requirements. However, the resulting irregularity in the sparse data makes it difficult for FPGA accelerators that contain systolic arrays of Multiply-and-Accumulate (MAC) units, such as Intel’s FPGA-based Deep Learning Accelerator (DLA), to achieve their maximum potential. Moreover, FPGAs with low-bandwidth off-chip memory cannot satisfy the memory bandwidth requirement of sparse matrix computation. In this paper, we present 1) a sparse matrix packing technique that condenses sparse inputs and filters before feeding them into the systolic array of MAC units in the Intel DLA, and 2) a customization of the Intel DLA that allows the FPGA to efficiently utilize a high bandwidth memory (HBM2) integrated in the same package. For end-to-end inference with randomly pruned ResNet-50/MobileNet CNN models, our experiments demonstrate a 2.7x/3x performance improvement compared to an FPGA with DDR4, a 2.2x/2.1x speedup against a server-class Intel Skylake CPU, and comparable performance with a 1.7x/2x power efficiency gain compared to an NVIDIA V100 GPU.
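
The abstract does not spell out the packing format, so the following is only a generic illustration of the idea of condensing sparse operands before multiply-accumulate, not the authors' implementation. It assumes a simple value-plus-position encoding; the function names `pack` and `sparse_dot` and the example vectors are hypothetical.

```python
def pack(values):
    """Condense a 1-D vector into (nonzero values, their positions)."""
    nonzero = [(i, v) for i, v in enumerate(values) if v != 0]
    return [v for _, v in nonzero], [i for i, _ in nonzero]


def sparse_dot(packed_a, packed_b):
    """Dot product of two packed vectors.

    Walks both sorted position lists and multiplies only where the
    positions match. Returns (result, number of MACs actually issued).
    """
    a_vals, a_idx = packed_a
    b_vals, b_idx = packed_b
    i = j = 0
    acc = 0
    macs = 0
    while i < len(a_idx) and j < len(b_idx):
        if a_idx[i] == b_idx[j]:
            acc += a_vals[i] * b_vals[j]
            macs += 1
            i += 1
            j += 1
        elif a_idx[i] < b_idx[j]:
            i += 1
        else:
            j += 1
    return acc, macs


# Hypothetical pruned activations and filter weights (illustrative only).
activations = [0, 3, 0, 0, 2, 0, 1, 0]
weights = [4, 0, 0, 0, 5, 0, 2, 0]

result, macs_used = sparse_dot(pack(activations), pack(weights))
dense_result = sum(a * w for a, w in zip(activations, weights))
assert result == dense_result
```

In this sketch the packed product issues one MAC per position where both operands are nonzero, whereas a dense pass over the same vectors would issue one per element. How the actual design maps packed data onto the systolic array is described in the paper itself.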

Award ID(s): 1738420

PAR ID: 10289017

Author(s) / Creator(s): Jiang, Chao; Ojika, David; Patel, Bhavesh; Lam, Herman

Date Published: 2021-05-01

Journal Name: IEEE Annual International Symposium on Field-Programmable Custom Computing Machines

Page Range / eLocation ID: 157 to 164

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper:
https://doi.org/10.1109/FCCM51124.2021.00026