Deep-Learning has become a dominant computing paradigm across a broad range of application domains. Different architectures of Deep-Networks like CNN, MLP, and RNN have emerged as the prominent machine-learning approaches for today’s application domains. These architectures are heavily data-dependent, requiring frequent access to memory. As a result, these applications suffer the most from the memory bottleneck of the von Neumann architectures. There is an imminent need for memory-centric architectures for deep-learning and big-data analytic applications that are memory intensive. Modern Field Programmable Gate Arrays (FPGAs) are ideal programmable substrates for creating customized Processor in/near Memory (PIM) accelerators. Modern FPGAs contain 100s of Mbits of dual-ported SRAM in the form of disaggregated, configurable Block RAMs (BRAMs). These BRAMs contain TB/s of available internal bandwidth. Unfortunately, developing FPGA-based accelerators for deep learning is not a simple task and demands the utilization of specialized tools provided by the FPGA vendors. It requires expertise in low-level hardware microarchitecture design. These are often not available to most researchers in the field of deep learning. Even with the ongoing improvements in High-Level Synthesis (HLS) tools, the requirement for hardware-specific design knowledge cannot be completely eliminated. This research developed a new reconfigurable memory-centric architecture and design approach that opens the advantages of FPGAs and Processor-in-Memory architecture to memory-intensive applications. Due to its high-performance and scalable memory-centric design, this architecture can deliver the highest speed and the lowest latency achievable from an FPGA overcoming the memory bottleneck.
more »
« less
A Scalable Architecture for CNN Accelerators Leveraging High-Performance Memories
As FPGA-based accelerators become ubiquitous and more powerful, the demand for integration with High-Performance Memory (HPM) grows. Although HPMs offer a much greater bandwidth than standard DDR4 DRAM, they introduce new design challenges such as increased latency and higher bandwidth mismatch between memory and FPGA cores. This paper presents a scalable architecture for convolutional neural network accelerators conceived specifically to address these challenges and make full use of the memory's high bandwidth. The accelerator, which was designed using high-level synthesis, is highly configurable. The intrinsic parallelism of its architecture allows near-perfect scaling up to saturating the available memory bandwidth.
more »
« less
- Award ID(s):
- 1764000
- PAR ID:
- 10244217
- Date Published:
- Journal Name:
- IEEE High Performance Extreme Computing Conference (HPEC)
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
We present BurstZ, a bandwidth-efficient accelerator platform for scientific computing. While accelerators such as GPUs and FPGAs provide enormous computing capabilities, their effectiveness quickly deteriorates once the working set becomes larger than the on-board memory capacity, causing the performance to become bottlenecked either by the communication bandwidth between the host and the accelerator. Compression has not been very useful in solving this issue due to the difficulty of efficiently compressing floating point numbers, which scientific data often consists of. Most compression algorithms are either ineffective with floating point numbers, or has a high performance overhead. BurstZ is an FPGA-based accelerator platform which addresses the bandwidth issue via a novel hardware-optimized floating point compression algorithm, which we call sZFP. We demonstrate that BurstZ can completely remove the communication bottleneck for accelerators, using a 3D stencil-code accelerator implemented on a prototype BurstZ implementation. Evaluated against hand-optimized implementations of stencil code accelerators of the same architecture, our BurstZ prototype outperformed an accelerator without compression by almost 4X, and even an accelerator with enough memory for the entire dataset by over 2X. BurstZ improved communication efficiency so much, our prototype was even able to outperform the upper limit projected performance of an optimized stencil core with ideal memory access characteristics, by over 2X.more » « less
-
null (Ed.)Cloud deployments now increasingly provision FPGA accelerators as part of virtual instances. While FPGAs are still essentially single-tenant, the growing demand for hardware acceleration will inevitably lead to the need for methods and architectures supporting FPGA multi-tenancy. In this paper, we propose an architecture supporting space-sharing of FPGA devices among multiple tenants in the cloud. The proposed architecture implements a network-on-chip (NoC) designed for fast data movement and low hardware footprint. Prototyping the proposed architecture on a Xilinx Virtex Ultrascale + demonstrated near specification maximum frequency for on-chip data movement and high throughput in virtual instance access to hardware accelerators. We demonstrate similar performance compared to single-tenant deployment while increasing FPGA utilization (we achieved 6× higher FPGA utilization with our case study), which is one of the major goals of virtualization. Overall, our NoC interconnect achieved about 2× higher maximum frequency than the state-of-the-art and a bandwidth of 25.6 Gbpsmore » « less
-
null (Ed.)Large Convolutional Neural Networks (CNNs) are often pruned and compressed to reduce the amount of parameters and memory requirement. However, the resulting irregularity in the sparse data makes it difficult for FPGA accelerators that contains systolic arrays of Multiply-and-Accumulate (MAC) units, such as Intel’s FPGA-based Deep Learning Accelerator (DLA), to achieve their maximum potential. Moreover, FPGAs with low-bandwidth off-chip memory could not satisfy the memory bandwidth requirement for sparse matrix computation. In this paper, we present 1) a sparse matrix packing technique that condenses sparse inputs and filters before feeding them into the systolic array of MAC units in the Intel DLA, and 2) a customization of the Intel DLA which allows the FPGA to efficiently utilize a high bandwidth memory (HBM2) integrated in the same package. For end-to-end inference with randomly pruned ResNet-50/MobileNet CNN models, our experiments demonstrate 2.7x/3x performance improvement compared to an FPGA with DDR4, 2.2x/2.1x speedup against a server-class Intel SkyLake CPU, and comparable performance with 1.7x/2x power efficiency gain as compared to an NVidia V100 GPU.more » « less
-
Cloud deployments now increasingly exploit Field-Programmable Gate Array (FPGA) accelerators as part of virtual instances. While cloud FPGAs are still essentially single-tenant, the growing demand for efficient hardware acceleration paves the way to FPGA multi-tenancy. It then becomes necessary to explore architectures, design flows, and resource management features that aim at exposing multi-tenant FPGAs to the cloud users. In this article, we discuss a hardware/software architecture that supports provisioning space-shared FPGAs in Kernel-based Virtual Machine (KVM) clouds. The proposed hardware/software architecture introduces an FPGA organization that improves hardware consolidation and support hardware elasticity with minimal data movement overhead. It also relies on VirtIO to decrease communication latency between hardware and software domains. Prototyping the proposed architecture with a Virtex UltraScale+ FPGA demonstrated near specification maximum frequency for on-chip data movement and high throughput in virtual instance access to hardware accelerators. We demonstrate similar performance compared to single-tenant deployment while increasing FPGA utilization, which is one of the goals of virtualization. Overall, our FPGA design achieved about 2× higher maximum frequency than the state of the art and a bandwidth reaching up to 28 Gbps on 32-bit data width.more » « less
An official website of the United States government

