<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>SCALE: A Structure-Centric Accelerator for Message Passing Graph Neural Networks</title></titleStmt>
			<publicationStmt>
				<publisher>IEEE</publisher>
				<date>11/02/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10634854</idno>
					<idno type="doi">10.1109/MICRO61859.2024.00050</idno>
					
					<author>Lingxiang Yin</author><author>Sanjay Gandham</author><author>Mingjie Lin</author><author>Hao Zheng</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Message passing paradigm has been widely used in developing complex Graph Neural Network (GNN) models, allowing for concise representations of edge and vertex-wise operations. Despite its pivotal role in theoretical advancement, the respective expression of edge and vertex operations, along with evolving GNN variants and datasets, has inevitably led to enormous computational complexity due to heterogeneous computation kernels. In particular, such inconsistent computation characteristics present new challenges in leveraging intermediate data reuse, ensuring both edge and vertex-wise workload balance, and sustaining system scalability.In this paper, we propose a structure-centric accelerator, SCALE, that can support a variety of message passing GNN models with improved parallelism, data reuse, and scalability. The central idea is to find latent similarities among GNN primitives such as shared dataflow structure, rather than strictly adhering to heterogeneous model structure. This serves as a hinge to homogenize inconsistencies in various GNN computation kernels. To accomplish this concept, SCALE consists of three unique designs, a novel systolic array-like architecture, a degree and vertex-aware scheduling, and a coherent dataflow tailored for fused graph and neural operations. The proposed systolic array-like architecture can support varying dataflows such as allreduce, of distinct GNN operations improving parallelism, data reuse, and throughput. The degree and vertex-aware scheduling can remedy the workload imbalance encountered in vertex and edge-wise operations. Moreover, the proposed dataflow can unify the data movement of both graph and neural operators without extra communication and storage overheads. Our simulation results show that SCALE achieves 1.82× speedup and 38.9% energy reduction on average over the state-of-the-art GNN accelerators [1]-[4].]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Graph Neural Networks (GNNs) have emerged as a promising model to comprehend complex graph-structured data, showing satisfactory performance in various applications such as social networks <ref type="bibr">[5]</ref>- <ref type="bibr">[7]</ref>, recommendation systems <ref type="bibr">[8]</ref>- <ref type="bibr">[10]</ref>, and bioinformatics <ref type="bibr">[11]</ref>- <ref type="bibr">[14]</ref>. In light of recent theoretical advancements, GNN models now incorporate complex edge and vertex-wise operations to capture nuanced graph structures and semantics. Such complex operations are typically expressed using the message passing paradigm in deep learning libraries such as Deep Graph Library and Pytorch Geometric. Yet, despite its importance, message passing GNN poses escalating computational and communication challenges on the hardware due to intricate vertex and edge-wise operations.</p><p>Even though significant efforts <ref type="bibr">[15]</ref>- <ref type="bibr">[21]</ref> have been devoted to facilitating Graph Convolutional Networks (GCN), they are inefficient in handling message passing GNNs with explicit edge operations as shown in Table <ref type="table">I</ref>. Specifically, GCN computations can be simply accelerated in the form of sparse-dense matrix multiplication (SpMM) and general matrix multiplication (GEMM). Prior works, such as AWB-GCN <ref type="bibr">[2]</ref> and I-GCN <ref type="bibr">[22]</ref>, either relied on runtime workload distribution or matrix preprocessing to overcome the sparsity issue involved in SpMMs. GCNAX <ref type="bibr">[1]</ref> utilized loop fusion and reordering to optimize SpMM kernels, thus reducing excessive off-chip memory access. However, emerging GNN models like Graph Attention Networks (GAT) <ref type="bibr">[23]</ref> involve complex attention score calculation over edges represented by sampled dense-dense matrix multiplication (SDDMM) <ref type="bibr">[24]</ref>. Higher-Order Graph Convolutional Architectures <ref type="bibr">[25]</ref> require information aggregation from non-adjacent vertices. Consequently, such intricate graph operations make it difficult to expedite GNN computations through graph reordering and SpMM optimizations.</p><p>To support message passing GNNs, FlowGNN <ref type="bibr">[3]</ref> introduced a dataflow architecture to support edge and vertex operations in a pipeline fashion. These computation units are connected by a complex interconnection network to withstand erratic communication, leading to scalability concerns. Similarly, ReGNN <ref type="bibr">[4]</ref> presented a dynamic redundancy-eliminated neighborhood message passing to improve graph data locality. However, its parallelism is also restricted by separate graph and neural operations with considerable communication overheads. More importantly, heterogeneous architectures emerge as a key bottleneck hindering the data locality of intermediate results and scalability.</p><p>Considering constantly evolving GNN models, we posit that adhering to their model structure is the primary issue that limits data locality and scalability, as each GNN operation exhibits distinct computation and dataflow patterns. To this end, we propose SCALE, a structure-centric accelerator that unifies the computation via a shared dataflow. The key idea is to uncover latent similarities among heterogeneous GNN kernels through exploiting dataflow structure. Tailoring dedicated dataflows to be consistent for both graph and neural operations enables their execution using a single computation fabric, thereby unifying the computation patterns. Consequently, it simplifies the communication complexity, improving intermediate data reuse, and system scalability.</p><p>The major contributions of this paper are as follows:</p><p>&#8226; SCALE Accelerator Design: The proposed SCALE accelerator can efficiently parallelize graph and neural operations with a coherent dataflow, thus obviating the communication complexity and storage overheads. Specifically, SCALE introduces a novel systolic array-like architecture to simultaneously fuse feature aggregation (in the form of reduce operations) and vertex update operations to eliminate graph irregularity with much-improved performance. In other words, our proposed architecture repurposes conventional systolic array architecture for the chained reduction and matrix multiplication with a coherent data movement pattern, enabling intermediate data reuse between heterogeneous operators with minimized communication distance. The proposed architecture compromises a shift-register array to meet the increased data bandwidth required by the fused operations, a novel PE architecture to directly accommodate the dependency between feature aggregation and vertex update, and a flexible ring interconnect to provide adaptive dataflow for both operations. &#8226; Degree and Vertex-aware Scheduling: Unlike conventional graph processing, GNN computations require considerable execution time for both feature aggregation and vertex update. As such, both degree and vertex information should be considered for workload balance. SCALE incorporates a degree and vertex-aware scheduling policy, which can dynamically allocate equivalent edge and vertex quantity to different processing units while reforming the workload for message reduce and vertex update. This ensures the workload balance for both aggregation and update phases for GNNs. &#8226; Flexible Dataflow for Fused Graph and Neural Operations: Given the dynamic variations in graph connectivity, feature size, and weight matrix, it requires a flexible dataflow to seamlessly orchestrate the data movement of the graph and neural operations. We propose a unique dataflow to fuse the inconsistent graph and neural operations to achieve unified data movement across computation units, optimizing the parallelism and data reuse. We evaluate the proposed SCALE, and the evaluation result shows that our design achieves 1.82&#215; speedup and 38.9% energy reduction on average over the state-of-the-art GNN accelerators <ref type="bibr">[1]</ref>- <ref type="bibr">[4]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. BACKGROUND AND MOTIVATION</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Graph Neural Networks</head><p>Graph Neural Networks (GNNs) are a category of neural networks developed to handle graph-structured data. This category includes several variants, such as Graph Convolutional Networks (GCNs) <ref type="bibr">[26]</ref>, Gated Graph Convolutional Networks (G-GCN) <ref type="bibr">[27]</ref>, GraphSAGE <ref type="bibr">[28]</ref>, and Graph Isomorphism Networks (GINs) <ref type="bibr">[29]</ref>. Given the complex and diverse graph operations, such as updating the edge and vertex embeddings, GNN computation patterns cannot be simply formulated as sparse-dense (SpMM) or general matrix (GEMM) multiplications. Instead, GNN models rely on a message passing scheme to perform such complex operations. In essence, most GNN models can likely be expressed using this message passing approach <ref type="bibr">[30]</ref>- <ref type="bibr">[33]</ref>.</p><p>Message Passing in GNNs: Message passing in GNNs is a key mechanism to incorporate local neighborhood information and is generally structured into two main stages: the aggregation and update steps. Given a graph G = (V, E), where V is the set of vertices and E is the set of edges, the aggregation step computes a representation for each vertex based on its neighbors' features:</p><p>Here, m k v is the aggregated message for vertex v at layer k, N (v) represents the neighbors of v, and h k-1 u are the features of node u from the previous layer. The update step then integrates the aggregated message with its own features:</p><p>These two stages form the core of message passing and are foundational to various GNN models <ref type="bibr">[34]</ref>, <ref type="bibr">[35]</ref>. The aggregation phase workload (AGGREGATE k ) is dependent on the degree of each vertex, while the update phase workload (UPDATE k ) relies solely on the number of vertices.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Motivation</head><p>Despite recent efforts <ref type="bibr">[2]</ref>, <ref type="bibr">[3]</ref>, <ref type="bibr">[36]</ref>, <ref type="bibr">[37]</ref>, several challenges remain unsolved in accelerating message passing GNNs, limiting their parallelism, scalability, and data reuse. To fully uncover such challenges, we conduct an initial study across different message passing GNN models and graph datasets.</p><p>Structure-aware Workload Partitioning: Message passing paradigm involves extensive edge and vertex-wise operations, highly contingent on graph structure. Consequently, unlike edge or vertex-centric graph partitioning <ref type="bibr">[3]</ref>, <ref type="bibr">[36]</ref>, <ref type="bibr">[38]</ref>, it is imperative to ensure even distribution of edges and vertices among processing elements respectively. Based on our application profiling, as shown in Fig. <ref type="figure">1</ref>(a), FlowGNN <ref type="bibr">[3]</ref> and PowerGraph <ref type="bibr">[36]</ref> present 40-50% PE under-utilization of both compute engines, in which only vertex or edge quantity is considered at the workload scheduling.</p><p>Scalability Concern due to Heterogeneous Kernels: Message passing GNN phases such as aggregation and update involve operating on irregular graph data and regular computation on tensors respectively. Given their distinct characteristics, architectures have to be tailored separately for each operation phase. These compute engines are connected by either a multistage or a crossbar network to handle the data communication of intermediate data. Therefore, as the network scales, the communication latency increases. For example, in Benes network <ref type="bibr">[39]</ref>, the hop count of each intermediate data is 2log 2 N . However, the computation time for each intermediate result remains constant regardless of network size. Conventional architectures <ref type="bibr">[3]</ref>, <ref type="bibr">[4]</ref>, <ref type="bibr">[40]</ref> typically employ pipelining to mask communication overheads with the computation. However, it remains difficult to efficiently overlap communication latency with computation due to their disproportionate scaling <ref type="bibr">[41]</ref>. Based on our study, we observed that the communication time will not be overlapped when the PE size exceeds 128. This will eventually increase the overall execution time by 2-3&#215; as shown in Fig. <ref type="figure">1(b</ref>). Note that exposed communication refers to the portion of the communication latency that hinders the execution of the update phase.</p><p>Compromised Intermediate Data Reuse: In addition, reusing intermediate data is critical in GNN acceleration as illustrated in Fig. <ref type="figure">1</ref>(c), as intermediate data is predominant (approximately 50%) in overall GNN data including graph, input, weight, and output. Theoretically, the intermediate data reuse is proportional to graph data and feature size, whereas graph data reuse is subject to mutually-shared nodes <ref type="bibr">[4]</ref>. While previous study <ref type="bibr">[22]</ref> emphasized data reuse optimization in SpMM kernels, it remains a significant challenge to exploit intermediate data reuse in heterogeneous computation kernels at the register level. This requires a strategic design from both dataflow and architecture aspects.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. PROPOSED ARCHITECTURE DESIGN</head><p>The key idea of SCALE is to identify a shared dataflow among distinct message passing operations. As such, a coherent dataflow can be leveraged to fuse multiple operations, enabling fine-grained parallelism and increased data reuse at the register level. Given that all operations are performed upon a shared graph structure, graph structure serves as the cornerstone to unify dataflows of GNN operations. Fig. <ref type="figure">2</ref> illustrates SCALE's coherent dataflow for both GNN phases. Unlike conventional dataflows in the systolic array architectures, the proposed dataflow is similar to allreduce operation in collective communication, which consists of 'reduce-scatter' and 'all-gather' operations <ref type="bibr">[42]</ref>. SCALE first reorganizes the irregular computation involved in aggregating information from neighboring vertices during the aggregation phase into a linear chain of reduce computations. This enables the aggregation engine of SCALE to perform aggregation by shifting data in the forward direction adhering to a regular communication pattern, corresponding to the 'reduce-scatter'. The results of the aggregated vertices are then directly processed by the update engines while communicating</p><p>Aggregation Phase</p><p>8 4 9 Update Phase 1 4 6 10 w 1 w 2 w 3 w 0 Destination Vertex Source Vertex Edges 3 11 7 5 2 0 1 10 4 6 9 8 0 9 6 2 4 7 1 3 11 8 5 10 3 6 5 7 10 11 w 0 w 1 w 2 w 3 Edge Parallelism, Forward Direction Vertex Parallelism, Backward Direction Fused Operator 1 4 6 10 Aggregated Features Weight Filters Aggr. Engine Update Engine Intra-PE Comm.</p><p>Inter-PE Comm.</p><p>Figure <ref type="figure">2</ref>: The proposed edge and vertex parallelism in SCALE. in the backward direction, referring to the 'all-gather'. For example, vertices 0, 2, and 1 are loaded to PE0, PE1, and PE2 respectively. Vertex 0 will be forwarded to PE1 and aggregated with Vertex 2, and the intermediate result will be forwarded to PE2. Once the aggregation is completed at PE2, the aggregated feature will be forwarded to the update engine within PE2.</p><p>The aggregation feature will be multiplied with distributed weight elements (w 0 , w 1 , w 2 , and w 3 ) via backward direction.</p><p>As such, all the intermediate results are exchanged and stored at the register level, increasing the data reuse. Meanwhile, the communication distance for each operation is one hop. SCALE exploits the edge parallelism at the aggregation phase, where multiple edge reduce operations are distributed and performed in parallel. On the other hand, it parallelizes the vertex update by distributing the weight matrix among processing units. Both edge and vertex parallelism are unified in one architecture via a coherent dataflow. As such, the aggregated features will be forwarded for vertex update within each PE to enable operator parallelism (e.g., aggregation and update).</p><p>A. Overall SCALE Architecture Fig. <ref type="figure">3</ref>(a) depicts the architecture of the proposed SCALE accelerator. SCALE consists of the following main components: a multi-bank global buffer, a task controller, data loaders, task dispatchers, and a flexible processing element (PE) array. The multi-bank global buffer is used to store the graph information (vertex and edge) and the weight matrix. The task controller schedules edge and vertex tasks and assigns them to the array of task dispatch units. The task dispatch unit orchestrates the data movement of their respective tasks from the global buffer to the PE array using the data loader to forward the data to the PE array. The PEs are connected by a flexible network, which can be dynamically sized to multiple PE sub-arrays. Each PE can support both message aggregation and vertex update phases at the same time. We will detail the design of the PE array to simultaneously support both message aggregation and vertex update phases in the following.</p><p>Task Dispatch 1 Task Dispatch 3 Task Dispatch 0 Task Dispatch 2 Control Signal Data Path Interconnect Link &#192; Data Loader &#192; Data Loader &#192; Data Loader &#192; Data Loader Link Switch PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE Off-chip DRAM Task Controller Global Buffer Data Prefetcher (a) (b) Aggregation Engine Reg. MUX 4 MUX 6 7 5 Reg. MUX Reg. MUX MUX 6 7 Reg. MUX MUX MUX Update Engine Weight&#192; Buffer ARB. 0 1 3 2 ARB. Register File 6 6 7 7 DEMUX Reg. Reg. Activation Unit Output&#192; Buffer Scalar Buffer Port # Description Aggregated Features From East PE Update Engine 0 Aggregated Features To West PE 1 2 Updated Vertex From South PE 3 Updated Vertex To&#192; North PE/Global Buffer Aggregation Engine 4 Partial Aggregated Features From West PE 5 Partial Aggregated Features To East PE 6 Vertex/Edge Feature From West Register&#192; 7 Vertex/Edge Feature To&#192; East Register&#192; </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. SCALE PE Array Architecture</head><p>The proposed PE array is a systolic array-like architecture, which simultaneously enables three types of parallelism, namely edge (i.e., feature aggregation), vertex (i.e., vertex update), and operator (i.e., the dependency between aggregation and update) parallelism. Each row of the PE array is interconnected in a ring-like topology. The wrap-up link consists of a set of link switches (i.e., transistors) to disable the signal propagation between segmented short links. By doing so, a long ring can be divided into several shorter rings. These small rings can be configured to handle various graph mapping and dataflow. The forward ring direction within this structure facilitates the reduce operation required for message aggregation. On the other hand, the ring's backward direction supports the vertex update. Any operator dependency is managed within each PE once the feature aggregations are completed. Consequently, the proposed PE array can achieve high parallelism without incurring extra storage or communication overheads.</p><p>However, several challenges arise when designing the proposed PE due to the irregular nature of the graph structure, coupled with varying degree numbers. Even though EnGN <ref type="bibr">[15]</ref> attempts to perform ring-based reduce, it requires distributing the vertex features over the full length of the PE array because of the vertex-based mapping. Additionally, conventional designs often require aggregated features to be sent to global buffers or a centralized update engine. To overcome these limitations, we present two distinctive designs, aggregation and update engines within each PE, as illustrated in Fig. <ref type="figure">3(b)</ref>.</p><p>The aggregation engine is equipped with a dedicated multiply-and-accumulate (MAC) unit, along with a register array. The adder, multiplier, and scalar buffer within this structure are configurable, thereby enabling various aggregation operations essential for GNN models. Not only can each aggregation engine transmit the aggregated feature to an adjacent PE, but it can also forward them to the update engine housed within the same PE. In addition, each aggregation engine has a shift register array, comprising an array of double buffers. It has the capacity either to provide feature or weight data to MAC units or to forward the information to adjacent register arrays. The function of the update engine is to execute vector-vector multiplication, and it includes a weight buffer, an output buffer, and an activation unit for non-linear functions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>1) Aggregation Phase:</head><p>The aggregation phase either leverages the edge or feature parallelism to perform multiple reduce operations simultaneously. To support this idea, the difficulty is two-fold: <ref type="bibr">(1)</ref> The vertex features need to be distributed to the PE array, and each feature should be aggregated with its associated reduce operation; (2) the reduce operations increase data bandwidth requirements, where multiple features result in just one aggregated feature. To solve the mentioned challenges, we operate on the individual feature of the vertices rather than the entire feature vector reducing the bandwidth requirement across the aggregating PEs. Then, we utilize a task dispatcher to distribute the feature workload to the PEs and a shift register array with double buffers, storing feature values for multiple vertices and edges, to overlap the latency of feature distribution with feature aggregation.</p><p>For example, as shown in Fig. <ref type="figure">4</ref>(a), four reduce operations will be performed on two PEs. Each reduce operation includes various vertices, in which source vertices send the features to the destination vertices (a, b, c, and d). These feature aggregations will be performed on the PE ring hop by hop. We first map features of each reduce operation to the aggregation engines. As such, the shift registers at each PE will receive two features. For example, features (a 0 0 and c 0 1 ) are loaded to the shift registers at PE0. After that, the register array pops the features up to the MAC units in the aggregation unit. Each aggregation unit will perform the reduce operation, receiving one operand from its upstream PE and another from its local register array. For example, a 0 0 is popped to the MAC units at cycle 1, and it will be forwarded to PE1 and added with a 0 1 at the next cycle. The accumulated result, m 0 a , will be further forwarded to the next PE until it finishes aggregating with all the source vertices. The final accumulated result corresponding to vertex a, M 0 a , will be sent to the update engine at PE1 during cycle 4. As such, the aggregated features can be directly sent to the next operation without incurring data 1 b d 0 b 0 c 3 Dispatch Queues b 1 c a c 2 Circular Shift Register a 1 c 1 a 2 c 1 a 1 Reg. Reg. MUX MUX Task Dispatcher Vertex Feature Reader Data Loader Task List PE 0 PE 1 b 0 0 c 3 0 a 2 0 c 1 0 a 0 0 d 0 0 c 0 a 0 c 2 0 a 1 0 b 1 0 c 0 0 t = Dispatch Controller task 0 task 1 Control Signal Global Buffer Access 0 Task ID a 0 Task List b 0 a b 1 b c 0 d 0 c d c 3 c 2 a 2 a 1 c 1 0 1 2 3 4 5 6 d d 0 b 0 Figure 5: An example of task dispatcher feeding vertex features to the PE array.</p><p>movement across the memory hierarchy. This eliminates the communication complexity and memory operations. Note that using a ring interconnect between the PEs allows arbitrarylength subgraphs to be mapped onto the PE ring as large workloads (vertices with high degrees) can wrap around the PE ring multiple times. Fig. <ref type="figure">4</ref>(b) shows that subgraphs 3 with four aggregations can be mapped to a PE ring of size 2.</p><p>Note that the SCALE microarchitecture supports a wider variety of aggregation functions employed by GNN architectures. A key property of the aggregation functions is permutation invariance <ref type="bibr">[29]</ref>. This indicates that the ordering of the input features does not affect the output. Reduction operation is often associative and commutative, ensuring permutation invariance over the inputs <ref type="bibr">[43]</ref>. By representing aggregation functions as a reduction operation over the features, SCALE ensures support for emerging GNN models.</p><p>Task Dispatcher: Fig. <ref type="figure">5</ref> illustrates the operation of a task dispatcher. The task dispatcher iterates over every source and destination vertex of all tasks in the task list and loads a portion of their respective features with the data loader. These features are sent through a circular shift register, similar to a barrel shifter, to reorder the vertex features. Note that the circular shift register is not the same as the shift register in the register array. The circular shift register is used to reorder the vertices, whereas the shift registers in the register array are used to supply the data to the PEs. For example, vertex features a 0 1 and c 0 1 are read from the global buffer by the data loader. This pair of features is sent through the circular shift register where it is shifted i%T n ring times where i is the index in the task list and T n ring is the number of tasks assigned to a PE ring. In this example, i is 1, and T n ring is 2, leading to a shift Register array 0 loads data from task dispatch.</p><p>Register array 1&#192; sends data to the MAC unit. Shift Register Array: Naturally, the fused graph and neural operations, when performed simultaneously, increase the data bandwidth requirements. To overcome this problem, we propose a double buffer design in the shift register array. Fig. <ref type="figure">6</ref> illustrates the shift register array architecture, in which the register arrays of two PEs are interconnected. Specifically, the proposed shift register array is arranged in a mesh-like network and supports both horizontal and vertical data transfer. The features are propagated horizontally to fill up local registers. Once the features are fully loaded into PEs, they are delivered to the MAC units in the aggregation engine vertically. For example, as shown in Fig. <ref type="figure">6</ref>, the task dispatcher has already loaded the features (a i and c i ) associated with two reduce operations into a 2&#215;2 register array. At cycle 0, features a 0 0 and c 0 0 are supplied to the MAC units. Afterward, features c 0 1 and a 0 1 will be propagated to the MAC units. In the meantime, the register array 1 starts preloading the feature data corresponding to the next set of computations. For example, at cycle 0, features c 0 2 and a 0 are loaded to the first column of the register array 1, and all of them will be shifted horizontally at the next cycle. To maintain the full utilization of the PE ring, the register array size should be set to at least N when generalizing the register array size to an N &#215; N array. It is important to</p><p>Weight Buffer Input Buffer Output Buffer PE1 Weight Buffer Input Buffer Output Buffer w 1 0 w 1 1 w 1 2 w 1 3 w 0 0 w 0 1 w 0 2 w 0 3 PE0 Weight Buffer Input Buffer Output Buffer PE3 Weight Buffer Input Buffer Output Buffer PE2 w 3 0 w 3 1 w 3 2 w 3 3 w 2 0 w 2 1 w 2 2 w 2 3 Weight Buffer Input Buffer Output Buffer PE1 Weight Buffer Input Buffer Output Buffer PE0 Weight Buffer Input Buffer Output Buffer PE3 Weight Buffer Input Buffer Output Buffer PE2 Cycle 0 Weight Matrix Aggregated Features Updated Features w 0 0 w 0 2 w 0 3 w 1 0 w 1 1 w 1 2 w 1 3 w 0 1 w 2 0 w 2 1 w 2 2 w 2 3 w 3 0 w 3 1 w 3 2 w 3 3 Vertex a Vertex b Vertex c Vertex d Vertex u a Vertex u b Vertex u c Vertex u d Weight Buffer Input Buffer Output Buffer PE1 Weight Buffer Input Buffer Output Buffer PE0 Weight Buffer Input Buffer Output Buffer PE3 Weight Buffer Input Buffer Output Buffer PE2 Weight Buffer Input Buffer Output Buffer PE1 Weight Buffer Input Buffer Output Buffer PE0 Weight Buffer Input Buffer Output Buffer PE3 Weight Buffer Input Buffer Output Buffer PE2 w 1 0 w 1 1 w 1 2 w 1 3 w 3 0 w 3 1 w 3 2 w 3 3 w 2 0 w 2 1 w 2 2 w 2 3 w 1 0 w 1 1 w 1 2 w 1 3 w 3 0 w 3 1 w 3 2 w 3 3 w 2 0 w 2 1 w 2 2 w 2 3 w 1 0 w 1 1 w 1 2 w 1 3 w 3 0 w 3 1 w 3 2 w 3 3 note that the task dispatcher is only connected to the PEs in the PE array's leftmost column. To move data across the array, the dispatcher starts a shift operation while simultaneously pushing data into the register array.</p><p>2) Update Phase: The vertex update occurs once the accumulation of a vertex's features is complete. The fundamental computations of the vertex update are matrix-vector multiplications. In this process, feature vectors are multiplied with a weight matrix. Since the weight matrix is identical and shared across feature vectors, a weight-stationary dataflow is employed for the update phase. Specifically, we evenly distribute the computation of the update phase across a set of PEs by partitioning the weight matrix into equal chunks, where each PE holds one chunk. The weight matrix mapping is determined by matrix dimension and ring size, which will be handled by the task controller. This approach helps eliminate data duplication, enhancing the effective memory capacity. The weight matrix is pre-loaded into the PE array, while the aggregated feature is passed across PEs to facilitate computation with each weight partition. The data movement for vertex updates is orchestrated in a direction opposite to that of aggregation, fully utilizing both directions of the ring.</p><p>For example, as shown in Fig. <ref type="figure">7</ref>, the feature vectors (a, b, c, and d) are received from the aggregation engine and are stored in the update engine. Each feature vector includes four elements (M 0 a , M 1 a , M 2 a , and M 3 a ) and needs to be multiplied by the weight matrix. As the weight matrix is partitioned and distributed into the PE array, the weight vectors will be temporally stored at each PE for vertex update. The feature vector will move along the PE array to be multiplied with each weight vector. For example, the aggregated features (M 0 a , M 1 a , M 2 a , and M 3 a ) will be multiplied by a weight vector (w 0 0 , w 1 0 , w 2 0 , and w 3 0 ). The results (u j i , i &#8712; {a, b, c, d} and j &#8712; {0, 1, 2, 3}) will be eventually sent back to global buffers through the vertical links in the PE arrays. The vertical links connect each PE with the corresponding PE in the row above it. As such, only the PEs of the topmost row are connected to the global buffer to write the updated features back to the global buffer. While the PEs of the other rows perform shift operations to send the updated feature to the row above it. This ensures design scalability as not all PEs need a direct connection to the global buffer. The aggregated feature vectors will be forwarded to the next hop. We note that each aggregated feature has to traverse N -1 hops (N is the ring size) to accomplish the update operation.</p><p>3) Workload Imbalance and Ring Size: The efficiency of the proposed architecture is clearly influenced by the variability in vertex degree and weight matrix dimension size. Specifically, the edge parallelism in reduce operations is linked to the vertex degree since the number of features corresponds to the vertex degree. With a fixed PE array size, accommodating varying sizes of reduce operations becomes a complex task. Additionally, the size of the weight matrix determines the spatial parallelism, where each weight vector is divided among multiple PEs. These factors collectively contribute to the challenges of fusing the aggregation and update phases into a cohesive dataflow. Such integration demands meticulous design in terms of workload scheduling and PE array size, which will be discussed in the following sections.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. PROPOSED SCHEDULING POLICY</head><p>As mentioned, the message passing GNNs follows a neighborhood aggregation scheme, where the feature vector of a vertex is computed by recursively aggregating and transforming feature vectors of its neighboring vertices <ref type="bibr">[29]</ref>. This indicates that the computation associated with the aggregation phase depends on the number of neighboring vertices (represented by the edges of a vertex). Further, the update phases transform the feature representation of each vertex, indicating that the workload is proportional to the number of vertices. In distributing the workload across the proposed PE array, the ideal scenario would involve an even task allocation of edge and vertex to avoid workload imbalance at different stages. However, the varying degrees of vertices in power-law graphs pose a complicated challenge to the workload distribution.</p><p>In the context of workload scheduling, prior work typically falls into one of two categories: vertex-aware scheduling <ref type="bibr">[44]</ref>- <ref type="bibr">[46]</ref> and degree-aware scheduling <ref type="bibr">[15]</ref>, <ref type="bibr">[47]</ref>, <ref type="bibr">[48]</ref>. Vertexaware scheduling allocates an equal number of vertices to computing units. To illustrate this, we consider the example graph shown in Fig. <ref type="figure">8(a)</ref>. Vertex-aware partitioning forms four tasks, each containing an equal number of vertices but with differing degrees, as depicted in Fig. <ref type="figure">8(b</ref>) with each task assigned to one PE. While this method balances the workload during the update phase, it leads to an unbalanced load in the aggregation phase, resulting in unbalanced PE utilization. However, degree-aware scheduling organizes vertices into tasks in such a way that the total degree of the vertices</p><p>(a) An Example Graph and Its Edge List Representation (d) Degree and Vertex (DV)-aware Scheduling Balance Update Workload between Task Groups Task Group # of Edges a,e b,f c,g 0 4 8 d,h 0 1 2 3 # of Vertices a,e b,f c,g 0 2 4 d,h 0 1 2 3 Vertex ID 8 3 (c) Degree-aware Scheduling # of Edges Task ID a,b,h c,d,e f 0 4 8 g 0 1 2 3 # of Vertices Task ID a,b,h c,d,e f 0 2 4 g 0 1 2 3 Balanced&#192; Edges # of Edges Task ID a,b,h f c,d,e 0 4 8 g 0 1 2 3 # of Vertices Task ID a,b,h f c,d,e 0 2 4 g 0 1 2 3 1 b(2), f(6) 2 c(2), g(6) Task ID Vertex ID (Edges #) 0 a(3), e(2) 3 d(2), h(1) 1 c(2), d(2), e(2) 2 f(6) Task ID Vertex ID (Edges #) 0 a(3), b(2) 3 g(6), h(1) 1 f(6) 2 c(2), d(2), e(2) Task ID Vertex ID (Edges #) 0 a(3), b(2), h(1) 3 g(6) 1 c(2), d(2), e(2) 2 f(6) Task ID Vertex ID (Edges #) 0 a(3), b(2), h(1) 3 g(6)</p><p>Task List within each task is balanced. 8(c) provides an example of degree-aware scheduling, with four tasks created to have an equal number of edges. Although this approach ensures balance during the aggregation phase, it may create imbalances during the update phase due to the varying numbers of vertices in each task, which leads to unbalanced PE utilization.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>(b) Vertex-aware Scheduling</head><note type="other">Task List Imbalanced Edges 5 Balanced Vertices 4 4 Task List Task List</note></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Degree and Vertex-aware Scheduling</head><p>Finding a balance between the number of edges and vertices can be particularly challenging, as graph data often exhibit high irregularities. To tackle this challenge, we introduce a degree and vertex-aware task scheduling algorithm, in which a greedy heuristic is proposed for this bin-packing problem <ref type="bibr">[49]</ref>. The process begins with degree-aware scheduling to create edge-balanced tasks. These tasks are then combined into equally distributed task groups using a modified vertex-aware scheduling approach. This two-step method ensures balanced edge and vertex workloads across all task groups.</p><p>Algorithm 1 outlines the pseudo-code for the proposed task scheduling method, which consists of two main steps. In the first phase, our goal is to organize vertices into tasks that are balanced in terms of edges. We approach this as a modified version of the bin-packing problem, where each "bin" or task has a size equivalent to the number of edges in that task. Unlike the traditional bin-packing problem, which seeks to minimize the number of bins, our version fixes the number of bins to the maximum number of tasks the hardware can handle, a limit set by the register array size. As outlined in lines 17-30 of the algorithm, we apply the first-fit heuristic, a common approach used for runtime bin packing. For SCALE, we set the number of tasks, T n , and number of task groups, G n , to be equal to the number of PEs and rings, respectively.</p><p>The second phase involves grouping these edge-balanced tasks into task groups, ensuring that the number of vertices within each group is balanced. It's important to note that the number of task groups is constrained by the number of available rings. To achieve the balance, we first sort the tasks according to the number of vertices in each task, as indicated in line 4. We then use a simple modulo operation in line 13 to identify the index for the task group and place the task there. This method allows us to pair tasks with high vertex workloads with those having low vertex workloads, resulting in task groups that have balanced vertex workloads.</p><p>Fig. <ref type="figure">8</ref>(d) illustrates the degree and vertex-aware scheduling for a 2 &#215; 2 PE array. We create four tasks (Task 0 -3), each containing a similar number of edges (i.e., 6). Take Task 0 as an example; it includes 6 edges from three vertices (vertex a, b, and h). These edges represent the reduce operations to be performed at each PE during the aggregation phase. We further combine the tasks into pairs, referred to as "task groups". Each group of tasks is assigned to one PE ring. For instance, Task 0 and Task 1 are grouped together as group 0, which has four vertices (vertex a, b, h, and f). This number represents the vertex updates for the group. With two PE rings, we form two task groups, ensuring that both the aggregation and update phases have a similar workload in both rings. It's worth noting that Task 0 includes three small-degree vertices and produces three aggregated feature vectors. Meanwhile, Task 1 includes a large-degree vertex and produces one aggregated feature. While the edge and vertex workloads among the task groups are balanced, there may be some vertex imbalance within a task group due to the greedy policy. Our architecture and mapping strategies, however, can tolerate this imbalance. As the weight matrix is distributed across PEs and the aggregated vertex features circulate through all PEs in a ring during the update phase, it ultimately equalizes the workload among PEs.</p><p>Following Algorithm 1, we implement the proposed scheduling algorithm in hardware allowing SCALE to schedule workloads during runtime. Task scheduling involves creating workloads and writing them to the task lists of each task dispatcher unit. To vary the scheduling latency, the task scheduler creates tasks over a subset of the graph vertices defined by batch size B. To minimize task scheduling overheads, we decouple the task scheduling and execution by using a double-buffered task list in the task dispatcher, allowing the task controller to schedule new tasks while the previously scheduled tasks are being executed. This allows SCALE to overlap the task scheduling latency with the task execution. By employing such a runtime mechanism, we avoid preprocessing the graph data. To hide the task scheduling latency with the task execution, SCALE uses a performance model, as described in IV-B, to estimate their latency and restricts the batch size parameter, B to ensure that the task scheduling does not hinder task execution.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Performance Model</head><p>The batch size, B, indicates the number of vertices operated by the task scheduler to allocate tasks during one pipeline phase. SCALE selects a suitable value for B such that the task scheduling does not hinder the first pipeline stage of task execution. Specifically, the task scheduling latency, t ts must be less than the aggregation latency, t agg of the tasks. SCALE implements an analytical model of task scheduling and task aggregation to determine their respective execution time for a given subgraph with an average degree of D avg .</p><p>The task scheduling latency, t ts , from Algorithm 1 can be written as the sum of time to create tasks and task groups. Operations involved in creating tasks, as indicated by the nested for-loops in Algorithm 1 (lines 24-29), require, on average, B &#215; Log(T n ) on-chip memory access. Task group creation, as indicated by sorting tasks and a for loop in lines 4-15, takes (T n &#215; Log(T n ) + T n ) on-chip memory accesses. Thus the task scheduling latency can be written as</p><p>Here, t ocm is the access latency for on-chip memory. Similarly, the aggregation of a task involves the PEs performing B &#215; D avg reduce operation and inter-PE communication for each feature. Moreover, the same task list can be reused to perform the computation associated with F n features of the vertex. Since these reduce operations are performed in parallel by T n PE, the task scheduling latency can be written as</p><p>Here, t reduce and t comm are the latency to perform a reduce operation and inter-PE communication, respectively. If the t ts is larger than t agg , the aggregation engines of the PE will remain idle, leading to resource underutilization. Therefore, for a given accelerator configuration, SCALE chooses the batch size B carefully such that t ts &lt; t agg . The sensitivity study is provided in Section VII-F.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. PROPOSED DATAFLOW AND MAPPING</head><p>Both dataflow and mapping choices affect the performance by exploiting the temporal and spatial data locality via loop interchanging and spatial parallelism <ref type="bibr">[50]</ref>- <ref type="bibr">[52]</ref>. The weight matrix is of a relatively lower dimensionality compared to CNNs. Therefore, conventional CNN dataflow and mapping methods have limited applicability to message passing-based GNN acceleration.</p><p>In this work, the principal objective of the mapping strategy is to parallelize both feature aggregation and vertex update in a coherent dataflow. To attain this, the proposed flexible ring plays a critical role in achieving spatial parallelism. For instance, the weight matrix must be distributed across different PEs within the same PE ring, thereby mitigating data duplication and resource under-utilization. However, the dimensionality of the weight matrix differs among datasets. In the update phase of the GCN's second layer on the Cora dataset, a weight matrix of dimension 16&#215;7 is utilized, while for Nell, the dimensions are 64&#215;168. This necessitates a ring size adaptable to diverse matrix dimensions. For example, if the ring size S ring is small, then the sum of weight buffers B weight in the ring may be smaller than the entire weight matrix W . This would incur off-chip memory accesses to load portions corresponding to the weight matrix that is not present in the weight buffers. On the other hand, employing a ring size S ring larger than the total available computations in the update phase would reduce the PE utilization rate. This is caused by certain update engines being idle due to the lack of available computations. Formally, we show the optimal range of ring size as</p><p>where R weight and C weight are the row and column dimensions of the weight matrix, respectively. Upon determining the ring size, the flexible PE array will be configured accordingly. As depicted in Fig. <ref type="figure">9</ref>(a), if the ring size is set to 2, a row of the array is configured into two PE rings, resulting in the formation of eight rings to facilitate the concurrent execution of feature aggregation and vertex update. To illustrate this, Fig. <ref type="figure">9</ref>(b) shows aggregated feature vectors, such as M 0 a , M 1 a , M 2 a , and M 3 a , are distributed among PEs. In parallel, each column of the weight matrix, like w 0 0 , w 1 0 , w 2 0 , and w 3 0 , is also distributed to PEs by the task controller. The aggregated feature vectors only need to pass through two PEs, where they are then multiplied by each corresponding weight partition. The resulting products, such as u 0 a , u 0 b , u 1 b , and u 1 a , can be directly written back to the global buffer. If an outer product were chosen, it would require additional buffers at each PE to hold the intermediate data. This method would cause the data movement to be inconsistent with the aggregation phase and would also introduce additional computations to accumulate all the partial results.</p><p>Once the optimal ring size is determined, the PE array will be configured. As the size of the ring is determined by the dimension of the weight matrix used in the update phase of each layer, we reconfigure the rings between the execution of layers. Note that GNNs have fewer layers than other neural networks (e.g., CNN), so the reconfiguration overheads, involving simple switch toggling, are negligible even if SCALE reconfigures the ring size for each layer.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. EXPERIMENTAL SETUP</head><p>We implemented SCALE using a validated cycle-accurate C++ simulator to model the behavior of the hardware. We integrate our simulator with Ramulator <ref type="bibr">[53]</ref> configured as Highbandwidth memory (HBM) with a bandwidth of 256 GB/s to model the off-chip storage. We used single-precision floating point datatype according to IEEE 754 for our evaluation. We evaluated the performance of SCALE using three citation graphs (Cora, CiteSeer, and PubMed) <ref type="bibr">[54]</ref>, one knowledge graph (Nell) <ref type="bibr">[55]</ref>, and one large post graph (Reddit) <ref type="bibr">[28]</ref>. The detailed size and length of features in different layers of data sets are summarised in Table <ref type="table">II</ref>. Additionally, we used the most popular representative GNN models: GCN <ref type="bibr">[26]</ref>, Gated-GCN (G-GCN) <ref type="bibr">[27]</ref>, GraphSage-Pool (GS-PL) <ref type="bibr">[28]</ref>, and GIN <ref type="bibr">[29]</ref>. The programming model that we used is similar to Deep Graph Library (DGL) and PyTorch Geometric.</p><p>Table II: Structure of the Graph Datasets for the 2-layer GNN. Datasets Vertices Edges Average Feature Length Degree Cora 2,708 10,556 3.9 1,433 -16 -7 CiteSeer 3,327 9,104 2.7 3,703 -16 -6 PubMed 19,717 88,648 4.5 500 -16 -3 Nell 65,755 251,550 3.8 61,278 -64 -210 Reddit 232,965 114,615,892 492 602 -64 -41</p><p>CAD Tools and Methodology: To estimate the power, area, and timing characteristics of SCALE as an ASIC accelerator, we implemented it using Verilog. We synthesized it using Synopsys Design Compiler with the TSMC 32 nm standard library to learn its timing characteristics that we used in our simulator validating the accuracy of our model. We set the clock frequency at 1GHz. We performed RTL simulations to generate the waveform activity file to observe the switching activity of the logic gates. Next, we used Synopsys PrimeTime PX with the waveform activity file to measure the dynamic and static power consumption of SCALE. For the area and power of on-chip buffers, we employed Cacti 6.0 with 32 nm technology <ref type="bibr">[56]</ref>.</p><p>Baseline Platforms: To evaluate the efficiency and scalability of SCALE, we compare it with prior state-of-theart GNN accelerators such as ReGNN <ref type="bibr">[4]</ref>, FlowGNN <ref type="bibr">[3]</ref>, AWB-GCN <ref type="bibr">[2]</ref>, and GCNAX <ref type="bibr">[1]</ref>. We model the baseline architectures using our C++ simulator employing their respective optimizations. For a fair comparison, all the baseline accelerators are scaled to have the same clock frequency and the same number of single precision MAC units as SCALE. As FlowGNN utilizes a hybrid engine with node transform units performing the update phase and message passing units performing the aggregation phase, we use twice as many message passing units as node transform units. Although GCNAX and FlowGNN perform optimizations such as parallelization strategies and loop fusion, they suffer from imbalanced workloads in their processing units when scaling up the number of MAC units. Our reported latency of GCNAX and FlowGNN accounts for this workload imbalance. Additionally, we have scaled the bandwidth and on-chip memory to match SCALE.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VII. EVALUATION RESULTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Performance Analysis</head><p>We configure SCALE as 32 &#215; 16 PE array with a 4 MB global buffer. Each PE has 6 KB local buffers (4 KB update engine buffer and 2 KB aggregation engine buffer) and 2 MAC units, so the total MAC units of SCALE is 1024. We utilize the number of execution cycles as the performance metric. Fig. <ref type="figure">10</ref> shows the speedup compared to other state-of-the-art GCN and GNN accelerators. SCALE, on average, is 1.62&#215; and 2.01&#215; faster than AWB-GCN and GCNAX for the GCN model, respectively. For other models such as G-GCN, GS-PL, and GIN, SCALE, on average, is 1.57&#215; and 1.80&#215; faster than FlowGNN and ReGNN, respectively. The performance improvement primarily stems from the balanced workload with improved PE utilization and the coherent dataflow with simplified communication patterns. As shown in Fig. <ref type="figure">11</ref>, the Cora CiteSeer PubMed Nell Reddit Figure 11: Latency Breakdown.</p><p>exposed communication latency is reduced by up to 87.56% as compared to baseline architectures. Next, we evenly distribute the workload that considers both edge and vertex variations, leading to better PE underutilization, which contributes up to 50.35% latency reduction in both aggregation and update phases. FlowGNN and GCNAX suffer from high workload imbalance in the aggregation phase due to their fixed workload assignment strategy, which limits their performance. On the other hand, although AWB-GCN mitigates the workload balance issue, they do not pipeline both phases of GNN computation. This comes with a considerable amount of redundant memory accesses. ReGNN eliminates redundant computation, but it employs disjoint engines for both phases and suffers from workload imbalance in the aggregation phase.</p><p>Although SCALE shows considerable improvement over the prior art for Cora, CiteSeer, PubMed, and Nell, its performance is slightly worse than ReGNN on Reddit. The reason is twofold. First, Reddit graph dataset presents better regularity in graph connectivity, in which edge and vertex degrees show high similarity. Coupled with a low feature length, Reddit inherently exhibits a balanced workload on baseline accelerators. Additionally, Reddit has high graph-level data reuse, in which a pair of vertices is connected to a set of mutually shared vertices. Based on our dataset profiling, 75.5% of aggregation operations can be eliminated in Reddit. As such, ReGNN, on the other hand, outperforms SCALE as it primarily aims to minimize the large amount of redundant computation in Reddit. However, it must be noted that such redundancy removal techniques are orthogonal to SCALE. To further understand the performance benefits of SCALE, we implement the redundancy removal strategy to reprocess the graph dataset, and</p><p>Table. III shows that SCALE with redundancy removal can outperform ReGNN by an average of 1.23&#215; on the Reddit dataset across different models. Table III: Normalized speedup of SCALE with redundancy removal compared to ReGNN. Models Datasets Cora CiteSeer PubMed Nell Reddit GCN 2.15 2.31 1.98 2.07 1.13 G-GCN 2.22 2.36 1.92 1.85 1.34</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Scalability Analysis</head><p>To compare the scalability, we normalize the accelerator's speedup over AWB-GCN configured with 512 MACs. The array dimensions of the PE array in our design are set as 16&#215;16, 32&#215;16, 32&#215;32, and 64&#215;32 for 512, 1K, 2K, and 4K MACs, respectively. We prefer to increase the row dimension rather than the column dimension, as increasing the column dimension will lead to a larger shift register array. As shown in Fig. <ref type="figure">12</ref>, SCALE shows better scalability than prior accelerators showing an average speedup of 12.07&#215; compared to speed up of 7.61&#215;, 6.49&#215;, 7.3&#215;, and 6.68&#215; for AWB-GCN, GCNAX, ReGNN, and FlowGNN respectively when configured for 4K MACs. SCALE shows better scalability primarily due to three reasons -a unified dataflow architecture, a balanced workload, and simplified interconnects. To be specific, even though AWB-GCN resolves the workload balance, it relies on an all-to-all network to redirect computations to underutilized computation resources. The disjoint aggregation and update engines of all the prior works require a complicated network to ensure data movement, which all suffer from bandwidth or latency issues. SCALE eliminates the need for complicated networks by having the proposed degree and vertex-aware scheduling and coherent dataflow. Given this, SCALE shows the best improvement for Nell, as it exhibits high irregularity in the graph structure and large feature length exacerbating workload imbalance in baseline accelerators. Even for such irregular graphs, SCALE shows good workload balance and a high degree of parallelism with a large accelerator size.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Workload Balance Analysis</head><p>To evaluate the proposed workload balancing techniques, we analyze the PE utilization of SCALE and compare it with the PE utilization of state-of-the-art GNN and GCN accelerators, FlowGNN (FG) and AWB-GCN (AWB). Note that all accelerators are configured with 1K MACs. We use performance measurement counters to measure the active cycles of the PEs during both the aggregation and update phase of all the models and report the average utilization.   <ref type="figure">13(a)</ref> shows SCALE efficiently balances the workload in both phases with an average PE utilization of 98.7% and 97.3% in aggregation and update, respectively. FlowGNN employs a scheduling similar to vertex-aware scheduling, which shows an average PE utilization of 99.1% in the update phase and 62.8% in the aggregation phase. AWB-GCN employs a runtime workload balancing scheme to distribute the workload evenly across the PEs, achieving a utilization close to SCALE with a PE utilization of 86.4% for the aggregation phase and 88.5% for the update phase on average. However, AWB-GCN's runtime balancing policy cannot be directly used for other GNN models that cannot be represented as SpMM. In contrast, SCALE, with the proposed scheduling, can balance the workload in both phases.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Ablation Study of Scheduling Policy</head><p>We include an ablation study of various scheduling policies, such as degree-aware scheduling (S+DS), vertex-aware scheduling (S+VS), and degree and vertex-aware scheduling (S+DVS). Fig. <ref type="figure">13(b</ref>) shows that S+DVS achieves high PE utilization in both phases. On the other hand, S+DS has a high utilization of 99.1% in the aggregation phase but a lower utilization of 58.7% in the update phase. This is due to the even distribution of edges, which results in a balanced workload in the aggregation phase. S+VS has a high utilization of 99.2% in the update phase but a lower utilization of 54.7% in the aggregation phase due to the uneven distribution of edges. However, either scheduling is inefficient in balancing the workload in both phases.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Sensitivity Study of Ring Size</head><p>To study the effects of varying ring sizes on performance, we perform a sensitivity study in a 2-layer GCN model using Cora and PubMed Datasets, where the scheduling policy is consistent across all configurations. Fig. <ref type="figure">14</ref> shows that the performance varies across datasets. For example, the first layer of the GCN model would prefer a ring size of 64. This is because a small ring size may not be able to accommodate the entire weight matrix, resulting in excessive off-chip memory access. Conversely, a ring size that is too large may require a longer initial data load time and could result in a less balanced workload. For the second layer in Cora and PubMed datasets, the dimension of the weight matrix is 16&#215;7 and 16&#215;3, respectively. In such a case, the update phase would suffer from significant PE under-utilization because of the small weight size. To solve this, we configure SCALE with multiple small rings and duplicate the weight matrix, thereby increasing spatial parallelism.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F. Task Scheduling Overhead Analysis</head><p>Fig. <ref type="figure">16</ref>(a) shows the ratio of task scheduling latency (t ts ) and task aggregation latency (t agg ) for various batch sizes. The task scheduling overhead is negligible when t ts /t agg &lt; 1 (TS-Negligible) and the bottleneck when t ts /t agg &gt; 1 (TS-Bound). We observe that the aggregation engines transition from TS-Bound to TS-Negligible as the batch size increases and graphs with a low feature length or low average degree require a larger batch size to cover the overhead. When batch size is above 500, t ts is smaller than t agg for all the datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G. Energy and Area Analysis</head><p>Fig. <ref type="figure">15</ref> shows the energy breakdown of SCALE compared to baselines normalized to AWB-GCN. SCALE reduces the average DRAM and Global Buffer energy consumption by 36.8% and 53.2%, respectively.</p><p>The energy reduction results are from higher intermediate data reuse at the register level, thereby reducing read/write operations to Global Buffer and DRAM. This allows for efficient usage of on-chip storage leading to higher input data reuse. The intermediate data reuse will increase the number of read/write at the register level in SCALE, as such it exhibits a 5.72&#215; Local Buffer energy consumption on average as compared to prior works. Meanwhile, Reddit presents a Cora CiteSeer PubMed Nell Reddit</p><p>Figure <ref type="figure">15</ref>: Energy Breakdown. large portion of mutually shared vertices, which translates to a higher reduction in DRAM and Global Buffer access in ReGNN. For Nell, with a large feature length, the benefits of intermediate data reuse become more significant. Fig. <ref type="figure">16(b)</ref> shows the breakdown of the total die area of SCALE. Storage components like Global Buffer and Local Buffer account for 81.4% of the total area, whereas the MACs and Task Control occupy 12.2% and 6.4% of the total die area, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VIII. RELATED WORK</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. GCN Accelerators</head><p>GCN is one of the most prevalent GNN variants which allows the application of convolutional layers directly on graph-structured data. A variety of accelerators have been developed to enhance GCN performance. For example, AWB-GCN <ref type="bibr">[2]</ref> employs a runtime workload rebalancing scheme to redistribute uneven workload through an all-to-all network. GCNAX <ref type="bibr">[1]</ref> exploits loop fusion and reordering to unify the computation characteristics of both GCN phases while reducing off-chip memory access. I-GCN <ref type="bibr">[22]</ref> employs a breadthsearch algorithm to extract dense matrix regions, thus improving computation efficiency and data locality. GCoD <ref type="bibr">[57]</ref> decouples dense and sparse regions of the adjacency matrix and accelerates them in different computing engines. RE-FLIP <ref type="bibr">[58]</ref> and PIM-GCN <ref type="bibr">[59]</ref> accelerate GCN in the form of matrix-vector multiplication in crossbar-based processing-inmemory (PIM) architecture. However, prior research mostly focuses on analog computing, and their proposed design is unable to support complex GNN models with search and comparison functions. Furthermore, these GCN accelerators employ spatial or SIMD architectures adapting to graph irregularity, so intermediate and partial results are hardly reused at the register level.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Graph Quantization, Sampling, and Pruning</head><p>To further enhance GNN model performance, various optimization techniques have been proposed. DBQ <ref type="bibr">[60]</ref> introduces a degree-based quantization, in which only insensitive nodes are quantized with low precision. MEGA <ref type="bibr">[61]</ref> proposes a mixed-precision quantization method depending on vertex's in-degree. GNNSampler <ref type="bibr">[62]</ref> exploits the data locality among nodes and their neighbors, eliminating irregular memory access. DyGNN <ref type="bibr">[63]</ref> proposes EdgePrune and VertexPrune to eliminate redundant computations caused by message aggregation. PruneGNN <ref type="bibr">[64]</ref> develops a dimension-pruning-aware </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Sparse Tensor Algebra Accelerators</head><p>Given that the aggregation phase of several GNN models can be abstracted as SpMM, several accelerator architectures have been proposed to facilitate SpMM computations. OuterSpace <ref type="bibr">[65]</ref> proposes an outer product-based SpMM to achieve higher input reuse compared to an inner productbased method. ExTensor <ref type="bibr">[66]</ref> proposes a novel approach to eliminate all the unnecessary computations associated with zero elements. SIGMA <ref type="bibr">[67]</ref> adopts a Benes network to dynamically pair non-zero elements and distribute them to all the multipliers. Flexagon <ref type="bibr">[68]</ref> exploits a flexible dataflow design for Sparse-Sparse Matrix Multiplication (SpMSpM) adapting to various matrix dimensions. However, prior works typically are inefficient in handling GNN models due to the following issues. First, the mentioned sparse tensor accelerators typically target SpMM or SpMSpM with lower sparsity, whereas graph data exhibits a much higher sparsity ratio. In addition, the intermediate data reuse is not considered in the sparse tensor accelerators, as they are optimized only for SpMM, not for chained reduce and matrix multiplications.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IX. CONCLUSION</head><p>In this paper, we propose an elastic accelerator, SCALE, that can support a wide range of GNN models and graph irregularities with much-improved performance and energy efficiency. Specifically, SCALE consists of three designs, a novel systolic array-like architecture, a degree and vertexaware scheduling, and a new dataflow tailored for fused graph and neural operations. The proposed systolic array-like architecture is robust to edge and vertex variations and can accommodate edge, vertex, feature, and operator parallelism simultaneously. The degree and vertex-aware scheduling can alleviate the workload imbalance issues in the message aggregation and vertex update phases. Additionally, the proposed dataflow can unify the data movement of both graph and neural operators without extra communication overheads. Our simulation results show that SCALE achieves 1.82&#215; speedup with 38.9% energy reduction on average over the state-of-theart GNN accelerators.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Authorized licensed use limited to: University of Central Florida. Downloaded on September 09,2025 at 03:49:09 UTC from IEEE Xplore. Restrictions apply.</p></note>
		</body>
		</text>
</TEI>
