skip to main content


Title: An OpenCL-Based Acceleration for Canny Algorithm Using a Heterogeneous CPU-FPGA Platform
Field programmable gate arrays (FPGAs) provide both performance and power benefits to heterogeneous systems. In this work, we used a closely-coupled CPU-FPGA heterogeneous system to accelerate the Canny edge detector algorithm and compared the performance of the hybrid implementation with that of the optimized separate CPU and FPGA implementations. Our results show up to 4.8X speedup for the hybrid implementation over the CPU only implementation and up to 2.1X over the FPGA only implementation.  more » « less
Award ID(s):
1821691
PAR ID:
10148545
Author(s) / Creator(s):
;
Date Published:
Journal Name:
IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
Page Range / eLocation ID:
322 to 322
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The processing demands of current and emerging applications, such as image/video processing, are increasing due to the deluge of data, generated by mobile and edge devices. This raises challenges for a vast range of computing systems, starting from smart-phones and reaching cloud and data centers. Heterogeneous computing demonstrates its ability as an efficient computing model due to its capability to adapt to various workload requirements. Field programmable gate arrays (FPGAs) provide power and performance benefits and have been used in many application domains from embedded systems to the cloud. In this paper, we used a closely coupled CPU-FPGA heterogeneous system to accelerate a sliding window based image processing algorithm, Canny edge detector. We accelerated Canny using two different implementations: Code partitioned and data partitioned. In the data partitioned implementation, we proposed a weighted round-robin based algorithm that partitions input images and distributes the load between the CPU and the FPGA based on latency. The paper also compares the performance of the proposed accelerators with separate CPU and FPGA implementations. Using our hybrid CPU-FPGA based algorithm, we achieved a speedup of up to 4.8× over a CPU-only and up to 2.1× over a FPGA-only implementations. Moreover, the estimated total energy consumption of our algorithm is more efficient than a CPU-only implementation. Our results show a significant reduction in energy-delay product (EDP) compared to the CPU-only implementation, and comparable EDP results to the FPGA-only implementation. 
    more » « less
  2. Heterogeneous CPU-FPGA systems have been shown to achieve significant performance gains in domain-specific computing. However, contrary to the huge efforts invested on the performance acceleration, the community has not yet investigated the security consequences due to incorporating FPGA into the traditional CPU-based architecture. In fact, the interplay between CPU and FPGA in such a heterogeneous system may introduce brand new attack surfaces if not well controlled. We propose a hardware isolation-based secure architecture, namely HISA, to mitigate the identified new threats. HISA extends the CPU-based hardware isolation primitive to the heterogeneous FPGA components and achieves security guarantees by enforcing two types of security policies in the isolated secure environment, namely the access control policy and the output verification policy. We evaluate HISA using four reference FPGA IP cores together with a variety of reference security policies targeting representative CPU-FPGA attacks. Our implementation and experiments on real hardware prove that HISA is an effective security complement to the existing CPU-only and FPGA-only secure architectures. 
    more » « less
  3. null (Ed.)
    We develop and study FPGA implementations of algorithms for charged particle tracking based on graph neural networks. The two complementary FPGA designs are based on OpenCL, a framework for writing programs that execute across heterogeneous platforms, and hls4ml, a high-level-synthesis-based compiler for neural network to firmware conversion. We evaluate and compare the resource usage, latency, and tracking performance of our implementations based on a benchmark dataset. We find a considerable speedup over CPU-based execution is possible, potentially enabling such algorithms to be used effectively in future computing workflows and the FPGA-based Level-1 trigger at the CERN Large Hadron Collider. 
    more » « less
  4. null (Ed.)
    Abstract Given two datasets (or tables) A and B and a search distance $$\epsilon$$ ϵ , the distance similarity join, denoted as $$A \ltimes _\epsilon B$$ A ⋉ ϵ B , finds the pairs of points ( $$p_a$$ p a , $$p_b$$ p b ), where $$p_a \in A$$ p a ∈ A and $$p_b \in B$$ p b ∈ B , and such that the distance between $$p_a$$ p a and $$p_b$$ p b is $$\le \epsilon$$ ≤ ϵ . If $$A = B$$ A = B , then the similarity join is equivalent to a similarity self-join, denoted as $$A \bowtie _\epsilon A$$ A ⋈ ϵ A . We propose in this paper Heterogeneous Epsilon Grid Joins ( HEGJoin ), a heterogeneous CPU-GPU distance similarity join algorithm. Efficiently partitioning the work between the CPU and the GPU is a challenge. Indeed, the work partitioning strategy needs to consider the different characteristics and computational throughput of the processors (CPU and GPU), as well as the data-dependent nature of the similarity join that accounts in the overall execution time (e.g., the number of queries, their distribution, the dimensionality, etc.). In addition to HEGJoin , we design in this paper a dynamic and two static work partitioning strategies. We also propose a performance model for each static partitioning strategy to perform the distribution of the work between the processors. We evaluate the performance of all three partitioning methods by considering the execution time and the load imbalance between the CPU and GPU as performance metrics. HEGJoin achieves a speedup of up to $$5.46\times$$ 5.46 × ( $$3.97\times$$ 3.97 × ) over the GPU-only (CPU-only) algorithms on our first test platform and up to $$1.97\times$$ 1.97 × ( $$12.07\times$$ 12.07 × ) on our second test platform over the GPU-only (CPU-only) algorithms. 
    more » « less
  5. Prompted by the ever-growing demand for high-performance System-on-Chip (SoC) and the plateauing of CPU frequencies, the SoC design landscape is shifting. In a quest to offer programmable specialization, the adoption of tightly-coupled FPGAs co-located with traditional compute clusters has been embraced by major vendors. This CPU+FPGA architectural paradigm opens the door to novel hardware/software co-design opportunities. The key principle is that CPU-originated memory traffic can be re-routed through the FPGA for analysis and management purposes. Albeit promising, the side-effect of this approach is that time-critical operations—such as cache-line refills—are fulfilled by moving data over slower interconnects meant for I/O traffic. In this article, we introduce a novel principle named Cache Coherence Backstabbing to precisely tackle these shortcomings. The technique leverages the ability to include the FGPA in the same coherence domain as the core processing elements. Importantly, this enables Coherence-Aided Elective and Seamless Alternative Routing (CAESAR), i.e., seamless inspection and routing of memory transactions, especially cache-line refills, through the FPGA. CAESAR allows the definition of new memory programming paradigms. We discuss the intrinsic potentials of the approach and evaluate it with a full-stack prototype implementation on a commercial platform. Our experiments show an improvement of up to 29% in read bandwidth, 23% in latency, and 13% in pragmatic workloads over the state of the art. Furthermore, we showcase the first in-coherence-domain runtime profiler design as a use-case of the CAESAR approach. 
    more » « less