Title: Graph Neural Network High-Level Synthesis Benchmark Suite V1
With the ever-growing popularity of Graph Neural Networks (GNNs), efficient GNN inference is gaining tremendous attention. Field-Programmable Gate Arrays (FPGAs) are a promising execution platform due to their fine-grained parallelism, low power consumption, reconfigurability, and support for concurrent execution. Better still, High-Level Synthesis (HLS) tools help bridge the gap between the non-trivial effort of FPGA development and the rapid emergence of new GNN models. To enable investigation into how effectively modern HLS tools can accelerate GNN inference, we present GNNHLS, a benchmark suite containing a software stack for data generation and baseline deployment, together with FPGA implementations of six well-tuned GNN HLS kernels.
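To give a concrete sense of the kind of kernel such a suite targets, below is a minimal, illustrative HLS C++ sketch of neighbor sum-aggregation over a CSR graph. The function name, feature width, and pragmas are assumptions made for this example only; they are not taken from the GNNHLS kernels themselves.

#include <cstdint>

constexpr int FEAT_DIM = 64;   // assumed feature width (hypothetical)

// Sum-aggregation over a CSR graph: out[v] = sum of feat[u] over all neighbors u of v.
// row_ptr has num_nodes + 1 entries; col_idx holds the neighbor indices.
void aggregate_sum(const int32_t *row_ptr,
                   const int32_t *col_idx,
                   const float feat[][FEAT_DIM],
                   float out[][FEAT_DIM],
                   int num_nodes) {
nodes:
    for (int v = 0; v < num_nodes; ++v) {
        float acc[FEAT_DIM];
#pragma HLS ARRAY_PARTITION variable=acc complete
init:
        for (int d = 0; d < FEAT_DIM; ++d) {
#pragma HLS UNROLL
            acc[d] = 0.0f;
        }
neighbors:
        for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {
#pragma HLS PIPELINE II=1
            const int u = col_idx[e];
            for (int d = 0; d < FEAT_DIM; ++d) {
#pragma HLS UNROLL
                acc[d] += feat[u][d];
            }
        }
write_back:
        for (int d = 0; d < FEAT_DIM; ++d) {
#pragma HLS UNROLL
            out[v][d] = acc[d];
        }
    }
}

Partitioning the accumulator and unrolling the feature loop let the pipelined edge loop update all feature lanes in parallel; real kernels would additionally tune datatypes and memory interfaces per model.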
Award ID(s):
1763503
NSF-PAR ID:
10475122
Author(s) / Creator(s):
Publisher / Repository:
Washington University in St. Louis
Date Published:
Subject(s) / Keyword(s):
["FOS: Computer and information sciences","Graph Neural Networks","GNN","Field-Programmable Gate Arrays","FPGA","High-Level Synthesis","HLS","benchmark"]
Format(s):
Medium: X Size: 6.0 MB
Size(s):
["6.0 MB"]
Sponsoring Org:
National Science Foundation
More Like this
  1. Probabilistic Sentential Decision Diagrams (PSDDs) provide efficient methods for modeling and reasoning with probability distributions in the presence of massive logical constraints. PSDDs can also be synthesized from graphical models such as Bayesian networks (BNs), therefore offering a new set of tools for performing inference on these models (in time linear in the PSDD size). Despite these favorable characteristics of PSDDs, we have found multiple challenges in accelerating PSDDs on FPGAs, including limited parallelism, data dependencies, and small pipeline iteration counts. In this article, we propose several optimization techniques to address these issues with novel pipeline scheduling and parallelization schemes. We designed the PSDD kernel with a high-level synthesis (HLS) tool for ease of implementation and verified it on the Xilinx Alveo U250 board. Experimental results show that our methods improve performance over the baseline FPGA HLS implementation by 2,200X and over the multicore CPU implementation by 20X. The proposed design also outperforms state-of-the-art BN and Sum Product Network (SPN) accelerators that store the graph information in memory.
  2. Deep Neural Networks (DNNs) have been successfully applied in many fields. Considering performance, flexibility, and energy efficiency, Field-Programmable Gate Array (FPGA)-based accelerators for DNNs are a promising solution. Existing frameworks, however, lack reusability and make it difficult to design a new network with minimal effort. Modern high-level synthesis (HLS) tools greatly reduce the turnaround time of designing and implementing complex FPGA-based accelerators. This paper presents a framework for building hardware accelerators for DNNs from a high-level specification. A novel architecture is introduced that maximizes data reuse and external memory bandwidth. The framework generates scalable HLS code for a given pre-trained model that can be mapped to different FPGA platforms. Various HLS compiler optimizations have been applied to the code to produce an efficient implementation with high resource utilization. The framework achieves a peak performance of 23 frames per second for SqueezeNet on the Xilinx Alveo U250 board.
  3. Designs generated by high-level synthesis (HLS) tools typically achieve a lower frequency compared to manual RTL designs. In this work, we study the timing issues in a diverse set of realistic and complex FPGA HLS designs. (1) We observe that in almost all cases the frequency degradation is caused by the broadcast structures generated by the HLS compiler. (2) We classify three major types of broadcasts in HLS-generated designs, including high-fanout data signals, pipeline flow control signals, and synchronization signals for concurrent modules. (3) We reveal a number of limitations of the current HLS tools that result in those broadcast-related timing issues. (4) We propose a set of effective yet easy-to-implement approaches, including broadcast-aware scheduling, synchronization pruning, and skid-buffer-based flow control. Our experimental results show that our methods can improve the maximum frequency of a set of nine representative HLS benchmarks by 53% on average. In some cases, the frequency gain is more than 100 MHz.
  4. Adopting FPGAs as accelerators in datacenters is becoming mainstream for customized computing, but the fact that FPGAs are hard to program creates a steep learning curve for software programmers. Even with the help of high-level synthesis (HLS), accelerator designers still have to manually perform code reconstruction and cumbersome parameter tuning to achieve optimal performance. While many learning models have been leveraged by existing work to automate the design of efficient accelerators, the unpredictability of modern HLS tools becomes a major obstacle for them to maintain high accuracy. To address this problem, we propose an automated DSE framework, AutoDSE, that leverages a bottleneck-guided coordinate optimizer to systematically find a better design point (a simplified sketch of such a search appears after this list). AutoDSE detects the bottleneck of the design in each step and focuses on high-impact parameters to overcome it. The experimental results show that AutoDSE is able to identify a design point that achieves, on the geometric mean, 19.9× speedup over one CPU core for the MachSuite and Rodinia benchmarks. Compared to the manually optimized HLS vision kernels in the Xilinx Vitis libraries, AutoDSE can reduce their optimization pragmas by 26.38× while achieving similar performance. With less than one optimization pragma per design on average, we are making progress towards democratizing customizable computing by enabling software programmers to design efficient FPGA accelerators.
  5. FPGAs are promising platforms for accelerating irregular applications due to their ability to implement highly specialized hardware designs for each kernel. However, the design and implementation of FPGA-accelerated kernels can take several months using hardware description languages. High-Level Synthesis (HLS) tools provide fast, high-quality results for regular applications, but lack support for effectively accelerating more irregular, complex workloads. This work analyzes the challenges and benefits of using a commercial state-of-the-art HLS tool and its available optimizations to accelerate graph sampling. We evaluate the resulting designs and their effectiveness when deployed in a state-of-the-art heterogeneous framework that implements the Influence Maximization with Martingales (IMM) algorithm, a complex graph analytics algorithm. We discuss future opportunities for improvement in hardware, HLS tools, and hardware/software co-design methodology to better support complex irregular applications such as IMM.
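The bottleneck-guided coordinate search mentioned in item 4 can be illustrated with a short sketch. The C++ below greedily tunes one pragma parameter at a time, always the one reported as the current bottleneck, and keeps the best design point found. The DesignPoint layout, the Estimate struct, and the cost model are hypothetical placeholders standing in for an HLS tool's reports; this is not AutoDSE's actual interface.

#include <functional>
#include <map>
#include <string>
#include <vector>

// A design point: one value per tunable pragma parameter (parameter names are hypothetical).
using DesignPoint = std::map<std::string, int>;

// Placeholder for an HLS estimate: estimated cycles plus the name of the parameter
// currently limiting performance. A real flow would parse the HLS synthesis report.
struct Estimate {
    double cycles;
    std::string bottleneck;
};
using CostModel = std::function<Estimate(const DesignPoint&)>;

// Greedy, bottleneck-guided coordinate search: in each step, sweep only the candidate
// values of the bottleneck parameter and keep the best design point found so far.
DesignPoint optimize(DesignPoint point,
                     const std::map<std::string, std::vector<int>>& candidates,
                     const CostModel& cost,
                     int max_steps = 10) {
    Estimate best = cost(point);
    for (int step = 0; step < max_steps; ++step) {
        // Assumes every reported bottleneck has a candidate value list.
        const std::vector<int>& values = candidates.at(best.bottleneck);
        DesignPoint best_point = point;
        Estimate best_est = best;
        for (int v : values) {                      // coordinate sweep on one parameter
            DesignPoint trial = point;
            trial[best.bottleneck] = v;
            Estimate e = cost(trial);
            if (e.cycles < best_est.cycles) {
                best_point = trial;
                best_est = e;
            }
        }
        if (best_est.cycles >= best.cycles) break;  // no further improvement: stop
        point = best_point;
        best = best_est;
    }
    return point;
}

The appeal of this strategy is that each step evaluates only a handful of design points for the single parameter that currently matters most, rather than exploring the full cross-product of pragma settings.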