Over the past few years, there has been growing interest in using FPGAs alongside CPUs and GPUs in high-performance computing systems and data centers. This trend has led to a push toward the use of high-level programming models and libraries, such as OpenCL, both to lower the barriers to the adoption of FPGAs by programmers unfamiliar with hardware description languages, and to allow a single code base to be deployed seamlessly on different devices. Today, both Intel and Xilinx offer toolchains to compile OpenCL code onto FPGAs. However, using OpenCL on FPGAs is complicated by performance portability issues, since different devices have fundamental differences in architecture and in the nature of the hardware parallelism they offer. Hence, platform-specific optimizations are crucial to achieving good performance across devices. In this paper, we propose a code transformation to improve the performance of OpenCL code running on FPGAs. The proposed method uses pipes to separate the memory accesses and core computation within OpenCL kernels. We analyze the benefits of the approach as well as the restrictions on its applicability. Using OpenCL applications from popular benchmark suites, we show that this code transformation can result in higher utilization of the available global memory bandwidth and increased instruction concurrency, thus improving the overall throughput of OpenCL kernels at the cost of a modest resource utilization overhead. Further concurrency can be achieved by using multiple memory and compute kernels.
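To make the transformation concrete, below is a minimal, hypothetical OpenCL C sketch using the standard OpenCL 2.0 pipe built-ins (Intel's FPGA toolchain exposes an equivalent channels extension). The kernel names, the blocking-by-spinning style, and the placeholder computation are our assumptions for illustration, not code from the paper.

```c
// Memory kernel 1: streams inputs from global memory into a pipe.
__kernel void mem_reader(__global const float *restrict src,
                         __write_only pipe float in_pipe,
                         const uint n)
{
    for (uint i = 0; i < n; ++i) {
        float v = src[i];
        while (write_pipe(in_pipe, &v) != 0)
            ;  // spin until the pipe has room
    }
}

// Compute kernel: touches no global memory at all.
__kernel void compute(__read_only pipe float in_pipe,
                      __write_only pipe float out_pipe,
                      const float alpha, const uint n)
{
    for (uint i = 0; i < n; ++i) {
        float v;
        while (read_pipe(in_pipe, &v) != 0)
            ;  // spin until an element arrives
        float r = alpha * v + 1.0f;  // placeholder for the real computation
        while (write_pipe(out_pipe, &r) != 0)
            ;
    }
}

// Memory kernel 2: drains results from a pipe back to global memory.
__kernel void mem_writer(__global float *restrict dst,
                         __read_only pipe float out_pipe,
                         const uint n)
{
    for (uint i = 0; i < n; ++i) {
        float r;
        while (read_pipe(out_pipe, &r) != 0)
            ;
        dst[i] = r;
    }
}
```

The host would create the two pipes with clCreatePipe and enqueue all three kernels concurrently (e.g., on separate in-order queues), so that global memory traffic and computation overlap; replicating the reader and writer kernels is one way to realize the multiple memory kernels mentioned above.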
Fast FPGA Reverse Engineering for Hardware Metering and Fingerprinting
We describe a fast, abstract method for reverse engineering (RE) field programmable gate array (FPGA) look-up tables (LUTs). Our method has direct applications to hardware (HW) metering and FPGA fingerprinting, and our approach allows easy portability and application to most LUT-based FPGAs. Unlike conventional RE methodologies that rely on vendor-specific code (like Xilinx XDL), tools, configuration files, components, etc., our methodology is not dependent on any specific FPGA or FPGA computer-aided design (CAD) tool. We use generic hardware description language (HDL) code based on specially connected CASE statements to program the LUTs on a target FPGA. Our specially connected CASE statements allow us to guide the placement of LUT functions on successive synthesis runs. This enables us to quickly determine which bits in the FPGA's configuration file correspond to FPGA LUT bits. Once we know which bits are LUT bits, we can go further and match specific LUT bits to specific bits in the configuration file, thereby creating a one-to-one mapping between every LUT memory cell and its matching bit in the configuration file. In this paper, we present our CASE statement functions for performing a one-to-one mapping of all FPGA LUT memory cell bits to specific configuration file bits. We have successfully applied our methods to several 7000-series Xilinx and Intel (Altera) FPGAs.
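To illustrate the core idea, the sketch below is a plain-C analogue of the CASE-statement construction rather than the authors' HDL: a fully specified case statement over a LUT's inputs fixes each of its memory cells to a known constant, so flipping a single case arm, re-synthesizing, and diffing the resulting configuration files localizes the file bit that stores that cell. All names and truth-table constants here are illustrative assumptions.

```c
#include <stdio.h>

/* Plain-C analogue of programming a 4-input LUT with a CASE statement.
 * A 4-input LUT has 16 memory cells; the constant returned by case arm k
 * plays the role of the content of LUT cell k. (In the paper, the CASE
 * statements are written in generic HDL and specially connected so that
 * placement can be steered across successive synthesis runs.)
 */
static int lut4(unsigned a, unsigned b, unsigned c, unsigned d)
{
    switch ((d << 3) | (c << 2) | (b << 1) | a) {  /* cell index 0..15 */
    case  0: return 0;   case  1: return 1;
    case  2: return 0;   case  3: return 0;
    case  4: return 1;   case  5: return 0;
    case  6: return 0;   case  7: return 1;
    case  8: return 0;   case  9: return 1;
    case 10: return 1;   case 11: return 0;
    case 12: return 1;   case 13: return 0;
    case 14: return 0;   case 15: return 1;
    default: return 0;   /* unreachable: all 16 input patterns covered */
    }
}

int main(void)
{
    /* Dump the truth table this "LUT" realizes, cell by cell. */
    for (unsigned i = 0; i < 16; ++i)
        printf("cell %2u = %d\n", i,
               lut4(i & 1, (i >> 1) & 1, (i >> 2) & 1, (i >> 3) & 1));
    return 0;
}
```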
- Award ID(s):
- 1916722
- PAR ID:
- 10536734
- Publisher / Repository:
- IEEE
- Date Published:
- ISBN:
- 979-8-3503-3878-2
- Page Range / eLocation ID:
- 147 to 151
- Format(s):
- Medium: X
- Location:
- Dayton, OH, USA
- Sponsoring Org:
- National Science Foundation
More Like this
- While FPGAs have been traditionally considered hard to program, there have recently been efforts to enable the use of high-level programming models and libraries intended for multi-core CPUs and GPUs to program FPGAs. For example, both Intel and Xilinx now provide toolchains to deploy OpenCL code onto FPGAs. However, because the nature of the parallelism offered by GPU and FPGA devices is fundamentally different, OpenCL code optimized for GPU can prove very inefficient on FPGA, in terms of both performance and hardware resource utilization. This paper explores this problem on finite automata traversal. In particular, we consider an OpenCL NFA traversal kernel optimized for GPU but exhibiting FPGA-friendly characteristics, namely: limited memory requirements, lack of synchronization, and SIMD execution. We explore a set of structural code changes, custom and best-practice optimizations to retarget this code to FPGA. We showcase the effect of these optimizations on an Intel Stratix V FPGA board using various NFA topologies from different application domains. Our evaluation shows that, while the resource requirements of the original code exceed the capacity of the FPGA in use, our optimizations lead to significant resource savings and allow the transformed code to fit the FPGA for all considered NFA topologies. In addition, our optimizations lead to speedups of up to 4x over an already optimized code variant designed to fit the NFA traversal kernel on the FPGA. Some of the proposed optimizations can be generalized for other applications and introduced into OpenCL-to-FPGA compilers.
- Reverse engineering (RE) is a widespread practice within engineering, and it is particularly relevant for discovering malicious functionality in digital hardware components. In this paper, we discuss bitstream or firmware RE for field programmable gate arrays (FPGAs). A bitstream establishes the configuration of the FPGA device fabric. Complete knowledge of both the physical device fabric and a specific bitstream should be sufficient to determine the complete configuration of the programmed FPGA. However, a significant challenge to bitstream RE arises because information about the FPGA fabric and interpretation of the bitstream is typically incomplete. These uncertainties limit the confidence in the correctness of any configuration determined through the RE process. This paper identifies representative sources of uncertainty in bitstream RE of FPGA devices.
- The two largest barriers to adoption of FPGA platforms for HPC applications are the difficulty of programming FPGAs and the performance gap when compared to GPUs. To address the first barrier, new ecosystems like Intel oneAPI and Xilinx Vitis HLS aim to improve programmability for FPGA platforms. From a performance aspect, FPGAs trade off lower compute frequencies for more customized hardware acceleration and power efficiency when compared to GPUs. The performance of memory-bound applications on recent GPU platforms like NVIDIA's H100 and AMD's MI210 has also improved due to the inclusion of high-bandwidth memory (HBM), and newer FPGA platforms are also starting to include HBM in addition to traditional DRAM. To understand the current state of the art and the performance differences between FPGAs and GPUs, we consider realized memory bandwidth for recent FPGA and GPU platforms. We utilize a custom STREAM benchmark (a minimal triad-style sketch appears after this list) to evaluate two Intel FPGA platforms, the Stratix 10 SX PAC and Bittware 520N-MX, two AMD/Xilinx FPGA platforms, the Alveo U250 and Alveo U280, as well as GPU platforms from NVIDIA and AMD. We also extract power measurements and estimate memory bandwidth per watt ((GB/s)/W) on these platforms to evaluate how FPGAs compare against GPU execution. While the GPUs far exceed the FPGAs in raw performance, the HBM-equipped FPGAs demonstrate a competitive performance-power balance for larger data sizes that can be easily implemented with oneAPI and Vitis HLS kernels. These findings suggest a potential sweet spot for this emerging FPGA ecosystem to serve bandwidth-limited applications in an energy-efficient fashion.
- Cloud deployments now increasingly provision FPGA accelerators as part of virtual instances. While FPGAs are still essentially single-tenant, the growing demand for hardware acceleration will inevitably lead to the need for methods and architectures supporting FPGA multi-tenancy. In this paper, we propose an architecture supporting space-sharing of FPGA devices among multiple tenants in the cloud. The proposed architecture implements a network-on-chip (NoC) designed for fast data movement and low hardware footprint. Prototyping the proposed architecture on a Xilinx Virtex UltraScale+ demonstrated near-specification maximum frequency for on-chip data movement and high throughput in virtual instance access to hardware accelerators. We demonstrate performance similar to single-tenant deployment while increasing FPGA utilization (we achieved 6× higher FPGA utilization in our case study), which is one of the major goals of virtualization. Overall, our NoC interconnect achieved about 2× higher maximum frequency than the state of the art and a bandwidth of 25.6 Gbps.
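As promised above, here is a minimal, hypothetical C rendering of the triad loop at the heart of a STREAM-style bandwidth benchmark; the function name and byte-count arithmetic are illustrative, and the paper's actual kernels are written with oneAPI and Vitis HLS.

```c
#include <stddef.h>

/* STREAM triad: the classic memory-bandwidth probe. Each iteration performs
 * two 8-byte reads and one 8-byte write with trivial compute, so realized
 * bandwidth can be estimated as (24 * n) bytes over the measured kernel time.
 */
void stream_triad(double *restrict a,
                  const double *restrict b,
                  const double *restrict c,
                  double scalar, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + scalar * c[i];  /* memory-bound: no data reuse */
}
```

On HBM-equipped devices, such a kernel is typically replicated so that independent copies drive separate memory channels, with aggregate realized bandwidth summed across the copies.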