Field programmable gate arrays (FPGAs) provide both performance and power benefits to heterogeneous systems. In this work, we used a closely-coupled CPU-FPGA heterogeneous system to accelerate the Canny edge detector algorithm and compared the performance of the hybrid implementation with that of the optimized separate CPU and FPGA implementations. Our results show up to 4.8X speedup for the hybrid implementation over the CPU only implementation and up to 2.1X over the FPGA only implementation.
more »
« less
Efficient OpenCL Accelerators for Canny Edge Detection Algorithm on a CPU-FPGA Platform
The processing demands of current and emerging applications, such as image/video processing, are increasing due to the deluge of data, generated by mobile and edge devices. This raises challenges for a vast range of computing systems, starting from smart-phones and reaching cloud and data centers. Heterogeneous computing demonstrates its ability as an efficient computing model due to its capability to adapt to various workload requirements. Field programmable gate arrays (FPGAs) provide power and performance benefits and have been used in many application domains from embedded systems to the cloud. In this paper, we used a closely coupled CPU-FPGA heterogeneous system to accelerate a sliding window based image processing algorithm, Canny edge detector. We accelerated Canny using two different implementations: Code partitioned and data partitioned. In the data partitioned implementation, we proposed a weighted round-robin based algorithm that partitions input images and distributes the load between the CPU and the FPGA based on latency. The paper also compares the performance of the proposed accelerators with separate CPU and FPGA implementations. Using our hybrid CPU-FPGA based algorithm, we achieved a speedup of up to 4.8× over a CPU-only and up to 2.1× over a FPGA-only implementations. Moreover, the estimated total energy consumption of our algorithm is more efficient than a CPU-only implementation. Our results show a significant reduction in energy-delay product (EDP) compared to the CPU-only implementation, and comparable EDP results to the FPGA-only implementation.
more »
« less
- Award ID(s):
- 1821691
- PAR ID:
- 10148551
- Date Published:
- Journal Name:
- International Conference on ReConFigurable Computing and FPGAs (ReConFig)
- Page Range / eLocation ID:
- 1 to 5
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Heterogeneous CPU-FPGA systems have been shown to achieve significant performance gains in domain-specific computing. However, contrary to the huge efforts invested on the performance acceleration, the community has not yet investigated the security consequences due to incorporating FPGA into the traditional CPU-based architecture. In fact, the interplay between CPU and FPGA in such a heterogeneous system may introduce brand new attack surfaces if not well controlled. We propose a hardware isolation-based secure architecture, namely HISA, to mitigate the identified new threats. HISA extends the CPU-based hardware isolation primitive to the heterogeneous FPGA components and achieves security guarantees by enforcing two types of security policies in the isolated secure environment, namely the access control policy and the output verification policy. We evaluate HISA using four reference FPGA IP cores together with a variety of reference security policies targeting representative CPU-FPGA attacks. Our implementation and experiments on real hardware prove that HISA is an effective security complement to the existing CPU-only and FPGA-only secure architectures.more » « less
-
Various hardware accelerators have been developed for energy-efficient and real-time inference of neural networks on edge devices. However, most training is done on high-performance GPUs or servers, and the huge memory and computing costs prevent training neural networks on edge devices. This paper proposes a novel tensor-based training framework, which offers orders-of-magnitude memory reduction in the training process. We propose a novel rank-adaptive tensorized neural network model, and design a hardware-friendly low-precision algorithm to train this model. We present an FPGA accelerator to demonstrate the benefits of this training method on edge devices. Our preliminary FPGA implementation achieves 59× speedup and 123× energy reduction compared to embedded CPU, and 292× memory reduction over a standard full-size training.more » « less
-
With the proliferation of low-cost sensors and the Internet of Things, the rate of producing data far exceeds the compute and storage capabilities of today’s infrastructure. Much of this data takes the form of time series, and in response, there has been increasing interest in the creation of time series archives in the last decade, along with the development and deployment of novel analysis methods to process the data. The general strategy has been to apply a plurality of similarity search mechanisms to various subsets and subsequences of time series data in order to identify repeated patterns and anomalies; however, the computational demands of these approaches renders them incompatible with today’s power-constrained embedded CPUs. To address this challenge, we present FA-LAMP, an FPGA-accelerated implementation of the Learned Approximate Matrix Profile (LAMP) algorithm, which predicts the correlation between streaming data sampled in real-time and a representative time series dataset used for training. FA-LAMP lends itself as a real-time solution for time series analysis problems such as classification. We present the implementation of FA-LAMP on both edge- and cloud-based prototypes. On the edge devices, FA-LAMP integrates accelerated computation as close as possible to IoT sensors, thereby eliminating the need to transmit and store data in the cloud for posterior analysis. On the cloud-based accelerators, FA-LAMP can execute multiple LAMP models on the same board, allowing simultaneous processing of incoming data from multiple data sources across a network. LAMP employs a Convolutional Neural Network (CNN) for prediction. This work investigates the challenges and limitations of deploying CNNs on FPGAs using the Xilinx Deep Learning Processor Unit (DPU) and the Vitis AI development environment. We expose several technical limitations of the DPU, while providing a mechanism to overcome them by attaching custom IP block accelerators to the architecture. We evaluate FA-LAMP using a low-cost Xilinx Ultra96-V2 FPGA as well as a cloud-based Xilinx Alveo U280 accelerator card and measure their performance against a prototypical LAMP deployment running on a Raspberry Pi 3, an Edge TPU, a GPU, a desktop CPU, and a server-class CPU. In the edge scenario, the Ultra96-V2 FPGA improved performance and energy consumption compared to the Raspberry Pi; in the cloud scenario, the server CPU and GPU outperformed the Alveo U280 accelerator card, while the desktop CPU achieved comparable performance; however, the Alveo card offered an order of magnitude lower energy consumption compared to the other four platforms. Our implementation is publicly available at https://github.com/aminiok1/lamp-alveo.more » « less
-
null (Ed.)We develop and study FPGA implementations of algorithms for charged particle tracking based on graph neural networks. The two complementary FPGA designs are based on OpenCL, a framework for writing programs that execute across heterogeneous platforms, and hls4ml, a high-level-synthesis-based compiler for neural network to firmware conversion. We evaluate and compare the resource usage, latency, and tracking performance of our implementations based on a benchmark dataset. We find a considerable speedup over CPU-based execution is possible, potentially enabling such algorithms to be used effectively in future computing workflows and the FPGA-based Level-1 trigger at the CERN Large Hadron Collider.more » « less