The two largest barriers to adoption of FPGA platforms for HPC applications are the difficulty of programming FPGAs and the performance gap when compared to GPUs. To address the first barrier, new ecosystems like Intel oneAPI and Xilinx Vitis HLS aim to improve programmability for FPGA platforms. In terms of performance, FPGAs accept lower compute frequencies in exchange for more customized hardware acceleration and better power efficiency than GPUs. The performance for memory-bound applications on recent GPU platforms like NVIDIA’s H100 and AMD’s MI210 has also improved due to the inclusion of high-bandwidth memory (HBM), and newer FPGA platforms are also starting to include HBM in addition to traditional DRAM. To understand the current state of the art and the performance differences between FPGAs and GPUs, we consider realized memory bandwidth for recent FPGA and GPU platforms. We utilize a custom STREAM benchmark to evaluate two Intel FPGA platforms, the Stratix 10 SX PAC and Bittware 520N-MX, two AMD/Xilinx FPGA platforms, the Alveo U250 and Alveo U280, as well as GPU platforms from NVIDIA and AMD. We also extract power measurements and estimate memory bandwidth per Watt ((GB/s)/W) on these platforms to evaluate how FPGAs compare against GPU execution. While the GPUs far exceed the FPGAs in raw performance, the HBM-equipped FPGAs demonstrate a competitive performance-power balance for larger data sizes, and the corresponding kernels can be easily implemented with oneAPI and Vitis HLS. These findings suggest a potential sweet spot for this emerging FPGA ecosystem to serve bandwidth-limited applications in an energy-efficient fashion.
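The bandwidth-per-Watt metric mentioned above can be illustrated with a small host-side sketch. This is not the custom STREAM benchmark from the abstract: the function names, the array size, and the power figure are hypothetical placeholders, and the byte count uses the common STREAM triad convention of two reads and one write per element.

```python
import time


def triad(a, b, c, scalar):
    """STREAM-style triad kernel: a[i] = b[i] + scalar * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]


def triad_bytes(n, elem_size=8):
    """Bytes moved by one triad pass: read b, read c, write a."""
    return 3 * n * elem_size


def bandwidth_gbs(bytes_moved, seconds):
    """Realized bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / seconds / 1e9


def bandwidth_per_watt(gbs, watts):
    """Bandwidth efficiency in (GB/s)/W."""
    return gbs / watts


if __name__ == "__main__":
    n = 100_000  # hypothetical array length
    a = [0.0] * n
    b = [1.0] * n
    c = [2.0] * n

    start = time.perf_counter()
    triad(a, b, c, 3.0)
    elapsed = time.perf_counter() - start

    gbs = bandwidth_gbs(triad_bytes(n), elapsed)
    assumed_watts = 100.0  # placeholder, NOT a measured power value
    print(f"realized bandwidth: {gbs:.3f} GB/s")
    print(f"efficiency: {bandwidth_per_watt(gbs, assumed_watts):.5f} (GB/s)/W")
```

In an actual comparison, the elapsed time would come from the device kernel and the power value from the platform's own power telemetry rather than from an assumed constant.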
Optimized FPGA-based Deep Learning Accelerator for Sparse CNN using High Bandwidth Memory
Large Convolutional Neural Networks (CNNs) are often pruned and compressed to reduce the number of parameters and the memory requirement. However, the resulting irregularity in the sparse data makes it difficult for FPGA accelerators that contain systolic arrays of Multiply-and-Accumulate (MAC) units, such as Intel’s FPGA-based Deep Learning Accelerator (DLA), to achieve their maximum potential. Moreover, FPGAs with low-bandwidth off-chip memory cannot satisfy the memory bandwidth requirement of sparse matrix computation. In this paper, we present 1) a sparse matrix packing technique that condenses sparse inputs and filters before feeding them into the systolic array of MAC units in the Intel DLA, and 2) a customization of the Intel DLA which allows the FPGA to efficiently utilize a high bandwidth memory (HBM2) integrated in the same package. For end-to-end inference with randomly pruned ResNet-50/MobileNet CNN models, our experiments demonstrate a 2.7x/3x performance improvement compared to an FPGA with DDR4, a 2.2x/2.1x speedup against a server-class Intel SkyLake CPU, and comparable performance with a 1.7x/2x power efficiency gain compared to an NVIDIA V100 GPU.
- Award ID(s):
- 1738420
- PAR ID:
- 10289017
- Date Published:
- Journal Name:
- IEEE Annual International Symposium on Field-Programmable Custom Computing Machines
- Page Range / eLocation ID:
- 157 to 164
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Emerging FPGA systems are providing higher external memory bandwidth to compete with GPU performance. However, because FPGAs often achieve parallelism through deep pipelines, traditional FPGA design strategies do not necessarily scale well to large numbers of replicated pipelines that can take advantage of higher bandwidth. We show that sliding-window applications, an important subset of digital signal processing, demonstrate this scalability problem. We introduce a window generator architecture that enables replication to over 330 GB/s, which is an 8.7x improvement over previous work. We evaluate the window generator on the Intel Broadwell+Arria10 system for 2D convolution and show that for traditional convolution (one filter per image), our approach outperforms a 12-core Xeon Broadwell E5 by 81x and a high-end Nvidia P6000 GPU by an order of magnitude for most input sizes, while improving energy by 15.7x. For convolutional neural nets (CNNs), we show that although the GPU and Xeon typically outperform existing FPGA systems, projected performances of the window generator running on FPGAs with sufficient bandwidth can outperform high-end GPUs for many common CNN parameters.
-
Point cloud is an important type of geometric data structure for many embedded applications such as autonomous driving and augmented reality. Current Point Cloud Networks (PCNs) have proven to achieve great success in using inference to perform point cloud analysis, including object part segmentation, shape classification, and so on. However, point cloud applications on the computing edge require more than just the inference step. They require an end-to-end (E2E) processing of the point cloud workloads: pre-processing of raw data, input preparation, and inference to perform point cloud analysis. Current PCN approaches to support end-to-end processing of point cloud workload cannot meet the real-time latency requirement on the edge, i.e., the ability of the AI service to keep up with the speed of raw data generation by 3D sensors. Latency for end-to-end processing of the point cloud workloads stems from two reasons: memory-intensive down-sampling in the pre-processing phase and the data structuring step for input preparation in the inference phase. In this paper, we present HgPCN, an end-to-end heterogeneous architecture for real-time embedded point cloud applications. In HgPCN, we introduce two novel methodologies based on spatial indexing to address the two identified bottlenecks. In the Pre-processing Engine of HgPCN, an Octree-Indexed-Sampling method is used to optimize the memory-intensive down-sampling bottleneck of the pre-processing phase. In the Inference Engine, HgPCN extends a commercial DLA with a customized Data Structuring Unit which is based on a Voxel-Expanded Gathering method to fundamentally reduce the workload of the data structuring step in the inference phase. The initial prototype of HgPCN has been implemented on an Intel PAC (Xeon+FPGA) platform. Four commonly available point cloud datasets were used for comparison, running on three baseline devices: Intel Xeon W-2255, Nvidia Xavier NX Jetson GPU, and Nvidia 4060ti GPU. 
These point cloud datasets were also run on two existing PCN accelerators for comparison: PointACC and Mesorasi. Our results show that for the inference phase, depending on the dataset size, HgPCN achieves speedups of 1.3× to 10.2× vs. PointACC, 2.2× to 16.5× vs. Mesorasi, and 6.4× to 21× vs. the Jetson NX GPU. Together with the optimization of the memory-intensive down-sampling bottleneck in the pre-processing phase, the overall latency shows that HgPCN can meet the real-time requirement, providing end-to-end service while keeping up with the raw data generation rate.
-
SpMV, the product of a sparse matrix and a dense vector, is emblematic of a new class of applications that are memory bandwidth and communication, not flop, driven. Sparsity and randomness in such computations play havoc with performance, especially when strong, instead of weak, scaling is attempted. In this study we develop and evaluate a hybrid implementation for strong scaling of the Compressed Vectorization-oriented sparse Row (CVR) approach to SpMV on a cluster of Intel Xeon Phi Knights Landing (KNL) processors. We show how our hybrid SpMV implementation achieves increased computational performance, yet does not address the dominant communication overhead factor at extreme scale. Issues with workload distribution, data placement, and remote reductions are assessed over a range of matrix characteristics. Our results indicate that as P → ∞ communication overhead is by far the dominant factor despite improved computational performance.
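To make the memory-bound character of SpMV concrete, here is a minimal single-node sketch. It uses the standard Compressed Sparse Row (CSR) layout as a stand-in, not the CVR format or the hybrid distributed implementation studied in the abstract, and the function and variable names are illustrative assumptions.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr[r] .. row_ptr[r + 1] delimits the nonzeros of row r;
    col_idx[k] and values[k] give the column and value of nonzero k.
    """
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # Indirect access x[col_idx[k]] is the irregular, cache-unfriendly
            # load that makes SpMV bandwidth-bound rather than flop-bound.
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


if __name__ == "__main__":
    # A = [[4, 0, 1],
    #      [0, 2, 0],
    #      [3, 0, 5]]
    row_ptr = [0, 2, 3, 5]
    col_idx = [0, 2, 1, 0, 2]
    values = [4.0, 1.0, 2.0, 3.0, 5.0]
    x = [1.0, 2.0, 3.0]
    print(spmv_csr(row_ptr, col_idx, values, x))
```

Each nonzero contributes only one multiply and one add, yet requires loading a value, a column index, and an element of `x`, which is why bandwidth and data placement dominate the cost.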

