NSF PAR Search | NSF Public Access Repository

FPGAs commonly have significantly lower clock frequencies than many microprocessors and GPUs, due largely to propagation delays incurred by the reconfigurable interconnect. The Stratix 10 HyperFlex architecture reduces this problem by embedding numerous registers throughout the routing resources. However, such Hyper-Registers do not support back-pressure (i.e., pipeline stalls) that is commonly used in FPGA pipelines. In this paper, we present and evaluate pipeline transformations using absorption FIFOs, which avoid back-pressure limitations to enable numerous pipelines to benefit from HyperFlex, while also eliminating potentially expensive stall penalties incurred by existing techniques. We demonstrate that these transformations not only enable significant clock improvements on Stratix 10, but also for devices without HyperFlex, potentially making absorption FIFOs a better high-frequency strategy for any FPGA. In addition, we introduce optimizations that yield additional performance improvements by reducing stall penalties that can increase linearly with pipeline depth when restarting after a stall.

Scalable Window Generation for the Intel Broadwell+Arria 10 and High-Bandwidth FPGA Systems

https://doi.org/10.1145/3174243.3174262

Stitt, Greg; Gupta, Abhay; Emas, Madison N.; Wilson, David; Baylis, Austin (February 2018, 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’18)

Emerging FPGA systems are providing higher external memory bandwidth to compete with GPU performance. However, because FPGAs often achieve parallelism through deep pipelines, traditional FPGA design strategies do not necessarily scale well to large amounts of replicated pipelines that can take advantage of higher bandwidth. We show that sliding-window applications, an important subset of digital signal processing, demonstrate this scalability problem. We introduce a window generator architecture that enables replication to over 330 GB/s, which is an 8.7x improvement over previous work. We evaluate the window generator on the Intel Broadwell+Arria10 system for 2D convolution and show that for traditional convolution (one filter per image), our approach outperforms a 12-core Xeon Broadwell E5 by 81x and a high-end Nvidia P6000 GPU by an order of magnitude for most input sizes, while improving energy by 15.7x. For convolutional neural nets (CNNs), we show that although the GPU and Xeon typically outperform existing FPGA systems, projected performances of the window generator running on FPGAs with sufficient bandwidth can outperform high-end GPUs for many common CNN parameters.

Full Text Available

Search for: All records