skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Data Pruning-enabled High Performance and Reliable Graph Neural Network Training on ReRAM-based Processing-in-Memory Accelerators
Graph Neural Networks (GNNs) have achieved remarkable accuracy in cognitive tasks such as predictive analytics on graph-structured data. Hence, they have become very popular in diverse real-world applications. However, GNN training with large real-world graph datasets in edge-computing scenarios is both memory- and compute-intensive. Traditional computing platforms such as CPUs and GPUs do not provide the energy efficiency and low latency required in edge intelligence applications due to their limited memory bandwidth. Resistive random-access memory (ReRAM)-based processing-in-memory (PIM) architectures have been proposed as suitable candidates for accelerating AI applications at the edge, including GNN training. However, ReRAM-based PIM architectures suffer from low reliability due to their limited endurance, and low performance when they are used for GNN training in real-world scenarios with large graphs. In this work, we propose a learning-for-data-pruning framework, which leverages a trained Binary Graph Classifier (BGC) to reduce the size of the input data graph by pruning subgraphs early in the training process to accelerate the GNN training process on ReRAM-based architectures. The proposed light-weight BGC model reduces the amount of redundant information in input graph(s) to speed up the overall training process, improves the reliability of the ReRAM-based PIM accelerator, and reduces the overall training cost. This enables fast, energy-efficient, and reliable GNN training on ReRAM-based architectures. Our experimental results demonstrate that using this learning for data pruning framework, we can accelerate GNN training and improve the reliability of ReRAM-based PIM architectures by up to 1.6×, and reduce the overall training cost by 100× compared to state-of-the-art data pruning techniques.  more » « less
Award ID(s):
2308530
PAR ID:
10613044
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
ACM
Date Published:
Journal Name:
ACM Transactions on Design Automation of Electronic Systems
Volume:
29
Issue:
5
ISSN:
1084-4309
Page Range / eLocation ID:
1 to 29
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Graph application workloads are dominated by random memory accesses with poor locality. To tackle the irregular and sparse nature of computation, ReRAM-based Processing-in-Memory (PIM) architectures have been proposed recently. Most of these ReRAM architecture designs have focused on mapping graph computations into a set of multiply-and-accumulate (MAC) operations. ReRAMs also offer a key advantage in reducing memory latency between cores and memory by allowing for processing-in-memory (PIM). However, when implemented on a ReRAM-based manycore architecture, graph applications still pose two key challenges – significant storage requirements (particularly due to wasted zero cell storage), and significant amount of on-chip traffic. To tackle these two challenges, in this paper we propose the design of a 3D NoC-enabled ReRAM-based manycore architecture. Our proposed architecture incorporates a novel crossbar-aware node reordering to reduce ReRAM storage requirements. Secondly, its 3D NoC-enabled design reduces on-chip communication latency. Our architecture outperforms the state-of-the-art in ReRAM-based graph acceleration by up to 5x in performance while consuming up to 10.3x less energy for a range of graph inputs and workloads. 
    more » « less
  2. Resistive random-access memory (ReRAM)-based processing-in-memory (PIM) architectures are used extensively to accelerate inferencing/training with convolutional neural networks (CNNs). Three-dimensional (3D) integration is an enabling technology to integrate many PIM cores on a single chip. In this work, we propose the design of athermallyefficient dataflow-aware monolithic 3D (M3D)NoC architecture referred to asTEFLONto accelerate CNN inferencing without creating any thermal bottlenecks.TEFLONreduces the Energy-Delay-Product (EDP) by 42%, 46%, and 45% on an average compared to a conventional 3D mesh NoC for systems with 36-, 64-, and 100-PIM cores, respectively.TEFLONreduces the peak chip temperature by 25Kand improves the inference accuracy by up to 11% compared to sole performance-optimized SFC-based counterpart for inferencing with diverse deep CNN models using CIFAR-10/100 datasets on a 3D system with 100-PIM cores. 
    more » « less
  3. Abstract Recent work has demonstrated that geometric deep learning methods such as graph neural networks (GNNs) are well suited to address a variety of reconstruction problems in high-energy particle physics. In particular, particle tracking data are naturally represented as a graph by identifying silicon tracker hits as nodes and particle trajectories as edges, given a set of hypothesized edges, edge-classifying GNNs identify those corresponding to real particle trajectories. In this work, we adapt the physics-motivated interaction network (IN) GNN toward the problem of particle tracking in pileup conditions similar to those expected at the high-luminosity Large Hadron Collider. Assuming idealized hit filtering at various particle momenta thresholds, we demonstrate the IN’s excellent edge-classification accuracy and tracking efficiency through a suite of measurements at each stage of GNN-based tracking: graph construction, edge classification, and track building. The proposed IN architecture is substantially smaller than previously studied GNN tracking architectures; this is particularly promising as a reduction in size is critical for enabling GNN-based tracking in constrained computing environments. Furthermore, the IN may be represented as either a set of explicit matrix operations or a message passing GNN. Efforts are underway to accelerate each representation via heterogeneous computing resources towards both high-level and low-latency triggering applications. 
    more » « less
  4. Graph processing recently received intensive interests in light of a wide range of needs to understand relationships. It is well-known for the poor locality and high memory bandwidth requirement. In conventional architectures, they incur a significant amount of data movements and energy consumption which motivates several hardware graph processing accelerators. The current graph processing accelerators rely on memory access optimizations or placing computation logics close to memory. Distinct from all existing approaches, we leverage an emerging memory technology to accelerate graph processing with analog computation. This paper presents GRAPHR, the first ReRAM-based graph processing accelerator. GRAPHR follows the principle of near-data processing and explores the opportunity of performing massive parallel analog operations with low hardware and energy cost. The analog computation is suitable for graph processing because: 1) The algorithms are iterative and could inherently tolerate the imprecision; 2) Both probability calculation (e.g., PageRank and Collaborative Filtering) and typical graph algorithms involving integers (e.g., BFS/SSSP) are resilient to errors. The key insight of GRAPHR is that if a vertex program of a graph algorithm can be expressed in sparse matrix vector multiplication (SpMV), it can be efficiently performed by ReRAM crossbar. We show that this assumption is generally true for a large set of graph algorithms. GRAPHR is a novel accelerator architecture consisting of two components: memory ReRAM and graph engine (GE). The core graph computations are performed in sparse matrix format in GEs (ReRAM crossbars). The vector/matrix-based graph computation is not new, but ReRAM offers the unique opportunity to realize the massive parallelism with unprecedented energy efficiency and low hardware cost. With small subgraphs processed by GEs, the gain of performing parallel operations overshadows the wastes due to sparsity. The experiment results show that GRAPHR achieves a 16.01X (up to 132.67X) speedup and a 33.82X energy saving on geometric mean compared to a CPU baseline system. Compared to GPU, GRAPHR achieves 1.69X to 2.19X speedup and consumes 4.77X to 8.91X less energy. GRAPHR gains a speedup of 1.16X to 4.12X, and is 3.67X to 10.96X more energy efficiency compared to PIM-based architecture. 
    more » « less
  5. The ending of Moore’s Law makes domain-specific architecture as the future of computing. The most representative is the emergence of various deep learning accelerators. Among the proposed solutions, resistive random access memory (ReRAM) based process-inmemory (PIM) architecture is anticipated as a promising candidate because ReRAM has the capability of both data storage and in-situ computation. However, we found that existing solutions are unable to efficiently support the computational needs required by the training of unsupervised generative adversarial networks (GANs), due to the lack of the following two features: 1) Computation efficiency: GAN utilizes a new operator, called transposed convolution. It inserts massive zeros in its input before a convolution operation, resulting in significant resource under-utilization; 2) Data traffic: The data intensive training process of GANs often incurs structural heavy data traffic as well as frequent massive data swaps. Our research follows the PIM strategy by leveraging the energy-efficiency of ReRAM arrays for vector-matrix multiplication to enhance the performance and energy efficiency. Specifically, we propose a novel computation deformation technique that can skip zero-insertions in transposed convolution for computation efficiency improvement. Moreover, we explore an efficient pipelined training procedure to reduce on-chip memory access. The implementation of related circuits and architecture is also discussed. At the end, we present our perspective on the future trend and opportunities of deep learning accelerators. 
    more » « less