<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>LoAS: Fully Temporal-Parallel Dataflow for Dual-Sparse Spiking Neural Networks</title></titleStmt>
			<publicationStmt>
				<publisher>IEEE</publisher>
				<date>11/02/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10613291</idno>
					<idno type="doi">10.1109/MICRO61859.2024.00084</idno>
					
					<author>Ruokai Yin</author><author>Youngeun Kim</author><author>Di Wu</author><author>Priyadarshini Panda</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Spiking Neural Networks (SNNs) have gained significant research attention over the past decade due to their potential for enabling resource-constrained edge devices. While existing SNN accelerators efficiently process sparse spikes with dense weights, the opportunities for accelerating SNNs with sparse weights, referred to as dual-sparsity, remain underexplored. In this work, we focus on accelerating dual-sparse SNNs, particularly on their core operation: sparse-matrix-sparse-matrix multiplication (spMspM). Our observations reveal that executing a dual-sparse SNN on existing spMspM accelerators designed for dual-sparse Artificial Neural Networks (ANNs) results in suboptimal efficiency. The main challenge is that SNNs naturally process multiple timesteps, which introduces an additional loop into ANN spMspM, leading to longer latency and more memory traffic. To address this issue, we propose a fully temporal-parallel (FTP) dataflow that minimizes data movement across timesteps and reduces the end-to-end latency of dual-sparse SNNs. To enhance the efficiency of the FTP dataflow, we introduce an FTP-friendly spike compression mechanism that efficiently compresses single-bit spikes and ensures contiguous memory access. Additionally, we propose an FTP-friendly inner-join circuit that reduces the cost of expensive prefix-sum circuits with negligible throughput penalties. These innovations are encapsulated in LoAS, a Low-latency inference Accelerator for dual-sparse SNNs. Running dual-sparse SNN workloads on LoAS demonstrates significant speedup (up to 8.51×) and energy reduction (up to 3.68×) compared to prior dual-sparse accelerators.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Spiking Neural Networks (SNNs) have attracted considerable interest as potential energy-efficient alternatives to Artificial Neural Networks (ANNs) <ref type="bibr">[5]</ref>, <ref type="bibr">[11]</ref>, <ref type="bibr">[41]</ref>. Inspired by biological neurons, SNNs utilize highly sparse unary-coded <ref type="bibr">[52]</ref> ({0,1}) spikes to compute and communicate information. Consequently, running SNNs on hardware significantly reduces computation and data movement, making them well-suited for edge computing. As a result, SNNs have been widely employed in computer vision tasks such as image classification <ref type="bibr">[44]</ref>, <ref type="bibr">[54]</ref>, optical flow estimation <ref type="bibr">[27]</ref>, semantic segmentation <ref type="bibr">[21]</ref>, and object detection <ref type="bibr">[20]</ref>.</p><p>This work was supported in part by CoCoSys, a JUMP2.0 center sponsored by DARPA and SRC, the National Science Foundation (CAREER Award, Grant #2312366, Grant #2318152), the DoE MMICC center SEA-CROGS (Award #DE-SC0023198), and the University of Central Florida.</p><p>Fig. 1. The FTP dataflow is compared with prior dataflow designs for SNNs. The temporal-sequential tick-batch approach is adapted from SpinalFlow <ref type="bibr">[35]</ref>, while the partially temporal-parallel approach is adapted from PTB <ref type="bibr">[28]</ref>. Each arrow loop represents the processing of one timestep, and the vertical line indicates parallel processing.</p><p>Opportunity. With the growing demand for edge devices with limited memory capacity, recent research on SNNs has underscored the importance of dual-sparsity, where both spikes and weights are sparse, which can be achieved through pruning techniques <ref type="bibr">[5]</ref>, <ref type="bibr">[22]</ref>. 
Pruning weight connections in SNNs has been explored both during training <ref type="bibr">[4]</ref>, <ref type="bibr">[47]</ref> and inference <ref type="bibr">[37]</ref>. Some studies have achieved approximately 98% weight sparsity and 90% spike sparsity for SNNs at iso-accuracy <ref type="bibr">[22]</ref>, <ref type="bibr">[57]</ref> by leveraging the lottery ticket hypothesis <ref type="bibr">[13]</ref>. These works highlight the potential of dual-sparse SNNs to achieve unprecedented energy efficiency and memory footprint savings with minimal to no loss in accuracy.</p><p>Challenge. Despite the algorithmic advancements in dual-sparse SNNs, hardware development has not yet caught up to fully exploit such dual-sparsity. Generally, existing SNN accelerators can be classified into two main categories. First, multi-core neuromorphic systems<ref type="foot">foot_0</ref> employ a plethora of cores, and even chips, to leverage the inherent parallelism in spiking neuron dynamics <ref type="bibr">[1]</ref>, <ref type="bibr">[7]</ref>, <ref type="bibr">[14]</ref>, <ref type="bibr">[46]</ref>. While these systems are capable of capturing the extensive parallelism and sparse activity among neurons, they require all neurons (including all weight connections) to be mapped on-chip. This approach inevitably leads to a significant waste of hardware resources on neurons that are not involved in any computations due to dual-sparsity <ref type="bibr">[39]</ref>. Second, dataflow-based SNN accelerators are inspired by dataflow-oriented ANN accelerators, taking advantage of the extensive data reuse across an array of</p><p>TABLE I COMPARISON OF LOAS WITH PRIOR SNN ACCELERATORS. S AND T DENOTE THE SPATIAL AND TEMPORAL DIMENSIONS, RESPECTIVELY. SPATIAL PARALLELISM REFERS TO PE-LEVEL PARALLELISM. 
Accelerator | Spike Sparsity | Weight Sparsity | Parallelism support | Neuron support
SpinalFlow [35] | &#10004; | &#10008; | S | LIF [8]
PTB [28] | &#10004; | &#10008; | S + partial-T | LIF
Stellar [32] | &#10004; | &#10008; | S + fully-T | FS [50]
LoAS (ours) | &#10004; | &#10004; | S + fully-T | LIF</p><p>processing elements <ref type="bibr">[28]</ref>, <ref type="bibr">[32]</ref>, <ref type="bibr">[35]</ref>. However, these designs have primarily focused on processing dense SNN workloads. Currently, there is a lack of dataflow architectures specifically designed to target dual-sparsity in SNNs. Table <ref type="table">I</ref> summarizes existing dataflow SNN accelerators.</p><p>Insight. Although operands in dual-sparse SNNs have varying bitwidths, their interactions follow the pattern of sparse-matrix-sparse-matrix multiplication (spMspM), a concept extensively studied in ANNs <ref type="bibr">[9]</ref>, <ref type="bibr">[15]</ref>, <ref type="bibr">[18]</ref>, <ref type="bibr">[19]</ref>, <ref type="bibr">[38]</ref>- <ref type="bibr">[40]</ref>, <ref type="bibr">[49]</ref>, <ref type="bibr">[60]</ref>, <ref type="bibr">[62]</ref>. However, running dual-sparse SNNs on existing spMspM accelerators is inefficient for several reasons. First, the presence of timesteps in SNNs complicates the dataflow design for existing spMspM accelerators. In ANNs, spMspM operations are typically implemented as triple-nested for-loops <ref type="bibr">[39]</ref>, <ref type="bibr">[53]</ref>, with different spMspM dataflows being derived by permuting the order of these loops. However, in SNNs, the inclusion of timesteps adds an additional level of looping, resulting in increased latency and memory traffic. Furthermore, this extra loop introduces dataflow dependencies and effectively doubles the dataflow design space, thereby delaying time-to-solution. Second, the asymmetric bitwidths of spikes and weights in SNNs make it inefficient to employ conventional compression formats used in ANN spMspM accelerators. 
Existing ANN spMspM accelerators typically store sparse matrices using popular compression formats like compressed sparse row (CSR), which require multiple bits to encode the coordinates of non-zero values. Consequently, using multiple bits to compress single-bit spikes (which can only be 0 or 1) is highly inefficient in the context of dual-sparse SNNs.</p><p>Proposal. To address these challenges and unleash the potential of dual-sparse SNNs in the context of spMspM, we propose a fully temporal-parallel (FTP) dataflow, as illustrated in Figure <ref type="figure">1</ref>. FTP dataflow parallelizes all timesteps, thereby avoiding complex dataflow dependencies and minimizing both latency and memory traffic. To further enhance the efficiency of FTP dataflow in terms of memory and computation, we introduce an FTP-friendly spike compression mechanism and an inner-join mechanism. The proposed compression method packs spikes across timesteps, allowing for contiguous memory access. Additionally, the proposed inner-join mechanism nearly halves the cost of the cumbersome prefix-sum circuits while maintaining throughput comparable to prior inner-join designs.</p><p>To validate the effectiveness of FTP dataflow, we developed LoAS, a Low-latency Inference Accelerator for Dual-Sparse Spiking Neural Networks. Our contributions are outlined below: 1) We observe that SNNs with rich dual-sparsity from both input spikes and weight connections perform suboptimally on existing hardware. Current SNN hardware does not support dual-sparsity, while ANN spMspM hardware struggles to efficiently process timesteps in SNNs, leading to high latency and memory traffic. 2) To enhance the efficiency of processing timesteps, we propose a fully temporal-parallel (FTP) dataflow. FTP eliminates extra memory traffic across timesteps and minimizes the latency associated with processing timesteps sequentially. 
3) To fully leverage FTP, we introduce FTP-friendly spike compression for efficient, contiguous memory access, and an FTP-friendly inner-join mechanism that offers low-cost computation with negligible latency penalties. 4) We developed LoAS, a novel architecture that embodies the FTP dataflow. With FTP dataflow and both FTP-friendly compression and inner-join, LoAS achieves significant speedup and energy efficiency compared to other sequential spMspM baselines. The remainder of this paper is organized as follows: Section II reviews the background and justifies the motivation. Sections III and IV articulate our proposed FTP dataflow and LoAS architecture. Sections V and VI then evaluate our design. Finally, Sections VII and VIII present our discussion and conclusions.</p></div>
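The Insight paragraph above describes ANN spMspM as a triple-nested loop that gains a fourth, timestep-level loop in SNNs. A dense toy model makes that extra loop level concrete (a sketch with illustrative names and shapes; all sparsity handling is omitted):

```python
# Dense reference of the loop nests described above (no sparsity handling).
# ANN spMspM: triple-nested loop over (m, n, k).
# SNN spMspM: the timestep dimension adds a fourth loop level over t.

def ann_matmul(A, B, M, N, K):
    O = [[0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            for k in range(K):
                O[m][n] += A[m][k] * B[k][n]
    return O

def snn_matmul(A, B, M, N, K, T):
    # A[t][m][k] holds single-bit spikes; B holds multi-bit weights.
    O = [[[0] * N for _ in range(M)] for _ in range(T)]
    for t in range(T):  # the extra loop: T times more iterations and fetches
        for m in range(M):
            for n in range(N):
                for k in range(K):
                    O[t][m][n] += A[t][m][k] * B[k][n]
    return O
```

Executed sequentially, the SNN version traverses B a full T times, which is exactly the latency and memory-traffic overhead that the FTP dataflow targets.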
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. BACKGROUND AND MOTIVATION</head><p>A. Preliminary of SNNs 1) Leaky-Integrate-and-Fire Neuron: The Leaky-Integrate-and-Fire (LIF) neuron is a classical neuron model <ref type="bibr">[8]</ref>, widely adopted in prior SNN research <ref type="bibr">[22]</ref>, <ref type="bibr">[23]</ref>, <ref type="bibr">[61]</ref>, <ref type="bibr">[63]</ref> due to its biological plausibility and high accuracy. In this work, we focus on accelerating the workloads of dual-sparse SNNs that utilize LIF neurons. During inference, each layer has a sparse input spike tensor A &#8712; U M &#215;K&#215;T , where U = {0, 1}, and a sparse weight matrix B &#8712; Z K&#215;N . Here, T represents the number of total timesteps, while M , N , and K correspond to the spatial dimensions of the input and weight matrix, respectively. The behavior of an SNN layer can be described in three steps:</p><p>Step 1: Sparse Matrix Multiplication Sparse-matrix-sparse-matrix multiplication (spMspM) between A and B is performed at each timestep t i to obtain the full output matrix O for that timestep, where t i represents the current timestep.</p><p>Step 2: LIF Firing LIF neurons take a snapshot of O at timestep t i and generate a snapshot of the output spike tensor C &#8712; U M &#215;N &#215;T for the current timestep t i : a neuron fires (outputs 1) only when its membrane potential, which carries over temporal information from the previous timestep t i-1 , exceeds the firing threshold v th , a predefined scalar value.</p><p>Step 3: Membrane Potential Update After the output spikes are generated, the membrane potential is updated to carry residual information to the next timestep, attenuated by the leaky factor &#964; &#8712; (0, 1).<ref type="foot">foot_1</ref> From the above steps, we observe that generating the output spike matrix C for timestep t i requires information from the previous timestep t i-1 </p><p>
This introduces a temporal dependency between output spike matrices across timesteps. The behavior of a LIF neuron is illustrated in Figure <ref type="figure">2</ref>.</p><p>2) Spike Encoding and SNN Training: A critical step in utilizing SNNs for conventional machine learning tasks is encoding the input source data (e.g., image pixels) into spike trains across multiple timesteps. These input spike trains are then sequentially fed into the SNN for processing. Recent SNN research has adopted direct encoding (a special form of rate encoding) to achieve high accuracy on conventional computer vision tasks with very few timesteps (&#8804;4) <ref type="bibr">[22]</ref>, <ref type="bibr">[24]</ref>, <ref type="bibr">[55]</ref>, <ref type="bibr">[57]</ref>, <ref type="bibr">[63]</ref>. In this work, we focus on accelerating dual-sparse SNNs that use direct encoding. These SNNs are trained using backpropagation-through-time (BPTT) <ref type="bibr">[51]</ref> with surrogate gradients <ref type="bibr">[36]</ref>, achieving performance very close to that of ANNs on many complex tasks <ref type="bibr">[55]</ref>, <ref type="bibr">[63]</ref>.</p></div>
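The three steps above can also be read as code. The sketch below is a minimal behavioral model of a single LIF neuron; it assumes a reset-by-subtraction rule after firing (the exact reset rule is not reproduced here), and all names are illustrative:

```python
# Behavioral model of one LIF neuron over T timesteps.
# Assumption (not from the exact equations): reset-by-subtraction after
# firing; the leaky factor tau attenuates the carried-over potential (Step 3).

def lif_neuron(weighted_sums, v_th=1.0, tau=0.5):
    """weighted_sums[t] is this neuron's spMspM result O at timestep t."""
    u_mem = 0.0
    spikes = []
    for o_t in weighted_sums:
        u_mem = tau * u_mem + o_t          # leak previous potential, add input
        spike = 1 if u_mem >= v_th else 0  # Step 2: fire on threshold crossing
        u_mem -= v_th * spike              # assumed reset-by-subtraction
        spikes.append(spike)
    return spikes
```

The output at timestep t depends on u_mem carried over from t-1, which is precisely the temporal dependency discussed above.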
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Distinctive Features and Challenges of SNNs</head><p>Several distinctive features make SNNs well-suited for low-power edge deployment, along with one potential challenge.</p><p>Feature 1: Unary Activation One of the most distinctive features of SNNs is their unary spike activation. Specifically, SNNs utilize single-bit, non-weighted activations to propagate information through layers. The primary advantage of unary activation is the simplified, low-power arithmetic units it enables. As illustrated in Figure <ref type="figure">2</ref>, compared to the multiply-accumulate (MAC) operations in ANNs, SNNs only require simple bitwise-AND and accumulate (AC) operations. <ref type="foot">3</ref> Without the need for expensive multipliers <ref type="bibr">[16]</ref>, the computations in SNNs demand extremely low power and area.</p><p>Feature 2: Sparse Spike Activity Another key feature of SNNs is their highly sparse spike-firing activity. In ANNs, after the MAC operations are completed, a ReLU unit filters out non-positive outputs. In contrast, SNNs use an AC operation, followed by the Leaky-Integrate-and-Fire (LIF) unit, which only fires (producing an output of 1) when the input exceeds a preset threshold. Consequently, the output sparsity in SNNs is typically much higher (around 90%) <ref type="bibr">[57]</ref>, <ref type="bibr">[58]</ref>, <ref type="bibr">[61]</ref>, <ref type="bibr">[63]</ref> compared to that in ReLU-based ANNs (around 50%) <ref type="bibr">[39]</ref>, <ref type="bibr">[43]</ref>. This higher sparsity translates into greater computation and memory savings in the context of spMspM acceleration.</p><p>Challenge: Repeated Timesteps Despite the aforementioned hardware-friendly features, a significant challenge in deploying SNNs on hardware is their intrinsic reliance on repeated timesteps. A timestep is the smallest discrete unit of time in SNNs. 
<ref type="foot">4</ref> Within each timestep, every neuron must complete the AC operations for all non-zero inputs, fire a spike if necessary, and update its membrane potential. To capture the temporal dynamics of the input data, SNNs must operate across multiple timesteps, as illustrated in Figure <ref type="figure">2</ref>. However, running across multiple timesteps increases latency and reduces energy efficiency, diminishing the benefits of low-power circuits unless a specialized architecture is employed <ref type="bibr">[35]</ref>.</p></div>
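The AC-versus-MAC contrast of Feature 1 can be stated in a few lines of code (a toy sketch with illustrative values, not the actual datapath of any accelerator):

```python
# Feature 1 in code: with unary {0,1} activations, the multiply inside a MAC
# degenerates into a gate on the weight, leaving only an accumulate (AC).

def ann_mac(acts, weights):
    # multiply-accumulate: needs a multiplier for every product
    return sum(a * w for a, w in zip(acts, weights))

def snn_ac(spikes, weights):
    # bitwise-AND + accumulate: a spike of 1 simply passes the weight through
    return sum(w for s, w in zip(spikes, weights) if s == 1)

# With unary spikes, both paths agree while the SNN path needs no multiplier.
assert ann_mac([1, 0, 1, 1], [3, 9, 6, 5]) == snn_ac([1, 0, 1, 1], [3, 9, 6, 5]) == 14
```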
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. spMspM Dataflows in SNNs</head><p>There are various ways to map spMspM onto hardware, each with distinct efficiency <ref type="bibr">[30]</ref>, <ref type="bibr">[33]</ref>. Three different spMspM dataflows have been proposed in existing dual-sparse ANN accelerators: Inner-product (IP) <ref type="bibr">[15]</ref>, <ref type="bibr">[18]</ref>, <ref type="bibr">[19]</ref>, <ref type="bibr">[40]</ref>, Outer-product (OP) <ref type="bibr">[9]</ref>, <ref type="bibr">[38]</ref>, <ref type="bibr">[39]</ref>, <ref type="bibr">[62]</ref>, and Gustavson's (Gust) <ref type="bibr">[49]</ref>, <ref type="bibr">[60]</ref>. Figure <ref type="figure">3</ref> illustrates these three dataflows in SNNs for two input matrices, A and B, and an output matrix, C. Their abstract loop nests are shown on the right-hand side. As discussed in Section II-B, it is essential to account for timesteps when performing spMspM operations in SNNs. Inside the black box in Figure <ref type="figure">3</ref>, the dataflow corresponds to one timestep and is thus identical to ANN dataflow. Outside the black box, the multiple input matrices A (blurred) represent the input spike matrices across different timesteps. Simultaneously, multiple output spike matrices C, which have temporal dependencies, are generated. To accommodate the timesteps in SNNs, an additional loop dimension (t dimension) must be considered in the original triple-nested for-loop. The t dimension (highlighted in the blue box) introduces temporal dependency to each output pixel in SNNs. For example, when processing an SNN using the IP dataflow, as shown in Figure <ref type="figure">3</ref>, we first calculate the output cell at position</p><p>Fig. 3. Comparison of different spMspM dataflows for SNNs. For illustration, C[t i ] represents the spMspM result between A[t i ] and B, similar to spMspM in ANNs. In reality, an additional LIF step (Equation (2)) is required to obtain C[t i ]. 
Circled numbers indicate the computation order for each spMspM dataflow. The t dimension is fixed here for illustration; in practice, 16 possible permutations of spMspM dataflow in SNNs exist, discussed in Section III.</p><p>(0,0) for timestep 0 (C[0,0,0]). Then, instead of moving to position (0,1), we proceed to calculate the output cell at (0,0) for timestep 1 (C[0,0,1]). Since the output cell C[0,0,1] is temporally dependent on the result of C[0,0,0], we must process C[0,0,0] before C[0,0,1].</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. ANN spMspM Hardware for Dual-Sparse SNNs</head><p>We review existing ANN spMspM accelerators to understand why naively running dual-sparse SNNs on these accelerators is sub-optimal.</p><p>Inner-join Design: For the IP dataflow, prior accelerators typically employ an inner-join-based design <ref type="bibr">[9]</ref>, <ref type="bibr">[15]</ref>. In these designs, non-zero values in the rows of matrix A and the columns of matrix B are compressed using bitmask representation, where a bit string has 1's for positions with non-zero values and 0's otherwise. An inner-join unit scans two bitmasks on the fly to determine if there is a matching position (where both multiplicands are non-zero) and then sends the matched pairs to the compute units. Running dual-sparse SNNs on an inner-join-based design does not require additional bitmasks for the input spike matrix A, as the unary spike train itself can be viewed as a bitmask. However, as shown in Figure <ref type="figure">4</ref>, the timesteps introduce multiple extra rounds of running the expensive inner-join units (which can occupy roughly 46% of the system-level power <ref type="bibr">[15]</ref>), leading to high energy costs. Furthermore, since the spike trains are used as bitmasks, all spikes-whether 1 or 0-must be fetched from off-chip DRAM, resulting in no memory traffic savings for the sparse spike matrix A.</p><p>Merger-based Design: Unlike IP dataflow designs that exhibit full output matrix (C) reuse, OP and Gust dataflow designs focus on reusing input matrices A and B. In OP, each column of A and each row of B are traversed only once, leading to efficient input data reuse. However, only one partial sum is generated at a time and is merged later. While these two dataflows achieve better data reuse for the</p><p>Fig. 4. Inner-join design for ANNs: the bitmasks of A and B are scanned to find matching non-zero pairs; the inner-join unit alone accounts for ~45% of system power.</p></div>
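The bitmask matching performed by an inner-join unit can be modeled behaviorally as follows (a sketch; the function name and list-based format are assumptions, not the cited designs' actual interfaces):

```python
# Behavioral model of an inner-join unit: intersect the bitmask of a row of A
# (the spike train itself serves as the bitmask) with the bitmask of a column
# of B, and emit the k positions where both operands are non-zero.

def inner_join(spike_row, b_bitmask):
    """Both arguments are lists of 0/1 bits; returns the matched positions."""
    return [k for k, (a, b) in enumerate(zip(spike_row, b_bitmask))
            if a == 1 and b == 1]

# Only matched positions feed the compute units; note that every 0 spike in
# spike_row must still be fetched from memory just to form the mask.
matched = inner_join([1, 0, 1, 1, 0], [1, 1, 0, 1, 0])  # -> [0, 3]
```

With T timesteps, this scan must be repeated once per timestep, which is the extra inner-join cost described above.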
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Fig. <ref type="figure">5</ref>. Off-chip traffic of partial sum matrices across different SNN layers (AlexNet-L1, VGG16-L8, ResNet19-L8), with SNNs running on GoSPA <ref type="bibr">[9]</ref>, an OP dataflow spMspM accelerator, using 1 and 4 timesteps.</p><p>input matrices, the partial sum matrices (rows) potentially cause increased off-chip data traffic. To mitigate the high memory traffic associated with partial sums, some designs implement large and costly mergers (e.g., 38&#215; more area than multipliers <ref type="bibr">[62]</ref>) to merge as many partial sum matrices (rows) as possible before sending them back to off-chip DRAM. Due to the additional t dimension, running dual-sparse SNNs on a merger-based design either requires a more complex merger capable of handling the extra partial sum traffic or results in increased off-chip memory traffic. As shown in Figure <ref type="figure">5</ref>, with four timesteps, on average, 4&#215; more partial sum traffic is induced compared to a single timestep.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Dataflow Architecture for SNNs</head><p>SpinalFlow: Temporal-Sequential Design. SpinalFlow <ref type="bibr">[35]</ref> is the first SNN-specific accelerator designed to exploit the</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Fig. <ref type="figure">6</ref>. Example of PTB's partially temporal-parallel design on a systolic array of PEs. Each column of the PE array processes a time-window of multiple timesteps. Time-windows run in parallel, but timesteps within each window are processed sequentially.</p><p>efficiency of single-bit activation and extremely sparse spike activity. The authors identified the challenge of sequentially processing the entire SNN network through timesteps. To address this challenge, SpinalFlow processes all timesteps for one layer before moving on to the next layer, as illustrated in Figure <ref type="figure">1</ref>. SpinalFlow dispatches LIF neurons across different processing elements (PEs) and parallelizes the computation. However, within each layer, the timesteps are still processed sequentially, as shown in Figure <ref type="figure">1</ref>. SpinalFlow is optimized specifically for temporally-coded SNNs, which may lag in accuracy performance compared to rate-coded SNNs <ref type="bibr">[28]</ref>. In this work, we focus on accelerating spMspM for general rate-coded SNNs, which achieve accuracy competitive with ANNs across various tasks.</p><p>PTB: Partially Temporal-Parallel. While SpinalFlow's design is tailored to temporally-coded SNNs, PTB <ref type="bibr">[28]</ref> proposes a general architecture design for rate-coded SNNs. By leveraging the high data-reuse pattern across different PEs in a systolic array architecture <ref type="bibr">[26]</ref>, PTB breaks the processing of all timesteps into multiple time-windows, each consisting of several contiguous timesteps, and runs these time-windows in parallel, as shown in Figure <ref type="figure">1</ref>. PTB parallelizes the mapping of multiple time-windows across different columns of the systolic array, while the computation of different LIF neurons is parallelized across the rows of the array. 
This hardware mapping strategy is illustrated in detail in Figure <ref type="figure">6</ref>. Although PTB attempts to parallelize the processing of timesteps, the parallelization occurs at the granularity of the time-window. Within each time-window (column of PEs), timesteps are still processed sequentially. Consequently, we categorize PTB as a partially temporal-parallel design. One unique aspect of LoAS compared to PTB is that LoAS leverages a different loop ordering (Sections III and VII), enabling comprehensive optimizations.</p><p>Prior SNN accelerators with LIF neurons process timesteps in a sequential or partially parallel manner. As discussed in Sections II-C and II-D, it is challenging for these existing SNN designs to achieve high performance in spMspM SNN acceleration. Therefore, a spMspM-friendly strategy for processing timesteps is needed.</p><p>Algorithm 1 Fully Temporal-Parallel Dataflow (FTP). Input: A, B.
1: for m &#8712; M do
2: for n &#8712; N do
3: for k &#8712; K do
4: parallel-for t &#8712; T do &#9655; Spatially unrolled
5: O[m, n, t] += A[m, k, t] &#183; B[k, n]
6: end parallel-for
7: end for
8: end for
9: end for</p><p>Stellar: Temporal-Parallel with non-LIF Neurons. Stellar <ref type="bibr">[32]</ref> is another systolic array SNN accelerator that attempts to process timesteps in a fully parallel manner. However, Stellar focuses on optimizing for the Few Spikes (FS) neuron <ref type="bibr">[50]</ref>, as shown in Table <ref type="table">I</ref>. FS neurons differ from LIF neurons by decoupling the spike accumulation and firing stages. As a result, FS neurons naturally lack temporal dependency during the spike accumulation stage, making fully parallel temporal processing straightforward in Stellar. In contrast, as discussed in Section II-A, temporal dependency is inherent in the input data for LIF neurons, which makes their design space different from that of Stellar for fully temporal-parallel processing. 
Unlike the widely adopted LIF neurons, supporting FS neurons requires non-trivial algorithm-hardware co-design, which is beyond the scope of this work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. FULLY TEMPORAL-PARALLEL DATAFLOW</head><p>We propose a fully temporal-parallel dataflow (FTP) designed to mitigate the negative effects of repeatedly processing timesteps on spMspM accelerators (Section II-D). The proposed FTP is outlined in Algorithm 1.</p><p>An SNN-friendly spMspM dataflow should satisfy three key objectives: (1) minimize data refetching across timesteps; (2) generate as few partial sums as possible in the temporal dimension (timesteps); and (3) reduce latency in the temporal dimension to minimize the additional cost of sparsity-handling units.</p><p>Our first observation is that for all three spMspM dataflows (Section II-C), unless the temporal dimension (t-dim) is placed at the innermost loop, it will result in at least T times more data refetching in the dimensions below, compared to the original dataflow. For example, in OP, if the t-dim is placed between m and n, T times more access to B's rows is required. If the t-dim is placed between k and m, T times more access to A's columns and B's rows is required. Depending on the on-chip buffer capacity, repeated memory access may lead to more expensive off-chip memory access, which contradicts objective (1).</p><p>Our second observation is that both OP and Gust dataflows are unsuitable for dual-sparse SNNs because they conflict with objective (2). In the OP dataflow, we find that regardless of where the t-dim is inserted into the original triple-nested loop, it always produces T times more partial sum matrices compared to the original OP dataflow. These partial sums need to be stored in an on-chip cache until all partial sums along both the spatial (k) and temporal dimensions (t-dim) are accumulated, which adds extra memory overhead in OP. The same issue arises in the Gust dataflow. The t-dim will either generate T times more partial sum rows or require T times more access to both the k and n dimensions. 
Our final observation is that, regardless of the position of the t-dim, processing it sequentially will always result in T times more processing latency, which conflicts with objective (3).</p><p>Our solution is straightforward yet effective. We position the t-dim at the innermost level of the IP dataflow and fully parallelize it, as detailed in Algorithm 1. This design choice offers several advantages. Firstly, placing the t-dim at the innermost loop ensures that no additional data movement is incurred, satisfying objective (1). Secondly, since the IP dataflow efficiently reuses outputs, no extra partial sums are generated along the t-dim, addressing objective (2). Lastly, by fully parallelizing the t-dim, we eliminate the latency associated with sequentially processing timesteps, achieving objective (3). This is equivalent to transforming the for-loop of the t-dim into a parallel-for loop <ref type="bibr">[53]</ref>. The parallel-for loop parallelizes operations across different spatial instances; the hardware overhead is minimal because only inexpensive accumulators are duplicated, and the number of timesteps in direct-coded SNNs is small (Section II-A). We later demonstrate in the ablation studies that FTP scales well with increasing timesteps.</p></div>
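A behavioral sketch of the FTP loop nest follows. The parallel-for over t is simulated with an ordinary loop here, whereas in hardware it maps to T parallel accumulators; all names are illustrative:

```python
# Behavioral model of the FTP dataflow: IP loop ordering (m, n, k) with the
# t dimension innermost. In hardware the innermost loop is spatially unrolled
# into T parallel accumulators; here it is simulated sequentially.

def ftp_spmspm(A, B, M, N, K, T):
    # A[m][k] packs the T single-bit spikes of one operand; B is the weights.
    O = [[[0] * T for _ in range(N)] for _ in range(M)]
    for m in range(M):
        for n in range(N):
            for k in range(K):
                if B[k][n] != 0 and any(A[m][k]):  # inner-join on non-zeros
                    w = B[k][n]          # fetched once, shared by all timesteps
                    for t in range(T):   # parallel-for in hardware
                        O[m][n][t] += A[m][k][t] * w
    return O
```

Note that each weight is fetched once and reused across all T timesteps, so no refetching is added along t (objective (1)), and the output-stationary IP ordering produces no extra partial sums (objective (2)).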
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. LOAS</head><p>An overview of LoAS is shown in Figure <ref type="figure">7</ref>. LoAS consists of multiple temporal-parallel processing elements (TPPEs) and parallel Leaky-Integrate-Fire units (P-LIFs) that are specifically designed to execute the FTP dataflow. It also includes a scheduler that distributes workloads across the TPPEs and a compressor that compresses the output spikes from the P-LIFs and writes them back to the on-chip memory. An on-chip SRAM is integrated to facilitate data reuse. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Spikes Compression</head><p>We first discuss how sparse input spikes (matrix A) across timesteps are compressed in LoAS. Efficiently compressing matrix A in SNNs involves addressing two key challenges:</p><p>How to maximize compression efficiency of 1-bit spikes? Assume that the input spike matrix A has a size of 128&#215;128 for each timestep. In either CSR or CSC format, two 7-bit coordinates are needed to compress each 1-bit non-zero spike. <ref type="foot">5</ref>Furthermore, since SNNs naturally operate across multiple timesteps, different spike values may occur at the same coordinate across different timesteps (e.g., 0 for T=1&amp;3, and 1 for T=2&amp;4). To accurately capture all non-zero spikes, separate coordinate values would be required for each timestep.</p><p>How to maintain contiguous memory access of non-zero spikes across timesteps? The FTP dataflow proposed in Section III requires spatial unrolling of the input spike matrix A across all timesteps beneath the k dimension. Consequently, a non-contiguous memory layout of A along the t dimension would lead to fragmented memory access at all levels of the memory hierarchy, resulting in higher data movement costs.</p><p>To illustrate these two challenges, consider the example in Figure <ref type="figure">8</ref>. Suppose that the input spikes sent to the system have the pre-synaptic neuron a 0,0 (the first element of row-0 in matrix A) firing a spike at t 0 and t 2 . As shown in step 1 , to represent this pre-synaptic neuron behavior, a single-bit 1 needs to be stored at row-0, column-0 of matrix A for both timesteps 0 and 2, as depicted in the "unpacked real data" box. For each non-zero spike in row-0 of matrix A at each timestep, if a coordinate value (e.g., 4-bit for CSR) is used to record its position, then 2 &#215; 4 = 8 bits are needed to compress 2 bits (2 spikes). The compression efficiency in this case is only 25%. 
Furthermore, memory access to spikes across different timesteps is discontinuous (sequentially accessing different t rows of the spatial row-0 of A).</p><p>We propose the following spike compression format for LoAS to address these two challenges. As shown in step 2, we pack all the spikes (both 0 and 1) across all timesteps into a single contiguous data block for each pre-synaptic neuron. In the example in Figure <ref type="figure">8</ref>, we store a 4-bit value 1010 at the first position of row-0 in matrix A for a0,0 and 0111 at the fourth position for a0,3. Since neurons a0,1 and a0,2 do not spike at any timestep, their packed value would be 0000 (as shown in the "packed real data" box). We define these neurons as silent neurons.<ref type="foot">6</ref> With this strategy, only the non-silent neurons are treated as non-zero values and stored in memory for matrix A, as shown in step 3. To accommodate our FTP dataflow, we compress the input spike matrix A in a row-wise manner and use the bitmask format <ref type="bibr">[9]</ref>, <ref type="bibr">[15]</ref>, <ref type="bibr">[40]</ref> to represent the coordinates of the non-zero values. The bitmask format uses a 1-bit coordinate value for each position in the row. In our example, the bitmask is 1001, indicating that the first and fourth elements in the row are non-zero, while the second and third elements are silent neurons (represented by 0 in the bitmask). Following the bitmask, a pointer is stored to indicate the starting location of the non-zero values in the row. We refer to this compressed row as a fiber <ref type="bibr">[33]</ref>, <ref type="bibr">[60]</ref>.</p><p>In our example, we use 4 bits to compress 5 bits (5 non-zero spikes), resulting in a compression efficiency of 125%.</p><p>The key to our compression method is the ratio of silent neurons in the SNN. Fortunately, empirical studies have shown that SNNs have a significant fraction of silent neurons (60%-70%, as shown in Table <ref type="table">II</ref>). 
We also apply a similar bitmask-based technique to compress weights in a column-wise manner, with each compressed weight column also referred to as a fiber.</p></div>
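The packing-plus-bitmask scheme above can be sketched in a few lines of Python (the paper's cycle-level simulator is also written in Python). This is a behavioral sketch only, not the hardware format; the names `Fiber` and `pack_row` are illustrative, and T=4 follows the paper's default timestep count.

```python
# Behavioral sketch of LoAS's FTP-friendly spike compression (Section IV-A).
# Names and layout are illustrative assumptions, not the paper's RTL.
from dataclasses import dataclass
from typing import List

T = 4  # timesteps (paper default)

@dataclass
class Fiber:
    bitmask: List[int]   # 1 bit per pre-synaptic neuron (1 = non-silent)
    pointer: int         # start of this row's packed values in shared storage
    values: List[int]    # T-bit packed spike trains of non-silent neurons

def pack_row(spikes, storage, base):
    """spikes[t][j] = 1 if neuron j fires at timestep t; returns a Fiber."""
    n = len(spikes[0])
    bitmask, packed = [], []
    for j in range(n):
        word = 0
        for t in range(T):               # pack t0..t3 into one T-bit value
            word = (word << 1) | spikes[t][j]
        silent = (word == 0)             # never fires across all timesteps
        bitmask.append(0 if silent else 1)
        if not silent:
            packed.append(word)
    storage.extend(packed)               # contiguous across timesteps
    return Fiber(bitmask, base, packed)

# Figure-8 example: a0,0 fires at t0 and t2 -> 1010; a0,3 -> 0111.
spikes = [[1, 0, 0, 0],   # t0
          [0, 0, 0, 1],   # t1
          [1, 0, 0, 1],   # t2
          [0, 0, 0, 1]]   # t3
store = []
fiber = pack_row(spikes, store, 0)
assert fiber.bitmask == [1, 0, 0, 1]     # a0,1 and a0,2 are silent neurons
assert fiber.values == [0b1010, 0b0111]
# Coordinate overhead: 4 bitmask bits for 5 non-zero spikes -> 125% efficiency.
```

The key property the sketch shows is that one T-bit word per non-silent neuron replaces per-timestep coordinates, which both shrinks metadata and keeps the t dimension contiguous in memory.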
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Temporal-Parallel Processing Elements</head><p>The fundamental building blocks of LoAS's compute engine are the Temporal-Parallel Processing Elements (TPPEs) and Parallel Leaky-Integrate-Fire units (P-LIFs), which we describe next. Figure <ref type="figure">7</ref> details the design of the TPPE. Each TPPE computes the full sum for one output neuron across all timesteps (Line 5 in Algorithm 1). Before computation begins, the bitmask (bm-B) of a fiber from weight matrix B (fiber-B) and its non-zero data are read from SRAM and broadcast into the small bitmask buffers (128 bits in our design) within each TPPE. The bitmask (bm-A) of the fiber from the input spike matrix A (fiber-A) is also fetched and sent to the TPPEs. Each TPPE holds the bitmask for a distinct fiber along the row of A. Once the data are loaded, an inner-join operation <ref type="bibr">[9]</ref>, <ref type="bibr">[15]</ref>, <ref type="bibr">[18]</ref> is performed between the two bitmasks. Depending on the inner-join result, the matched non-zero data of fiber-A are fetched from the global cache and sent to the pseudo-accumulator (to be discussed shortly) to perform the accumulation (AC) operation. After the TPPE completes the full computation for one output neuron, it sends the result to the P-LIF unit, which generates output spikes for all timesteps in one operation.</p></div>
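Functionally, what one TPPE computes can be sketched as follows: an inner-join (bitmask AND) between fiber-A and fiber-B, with the t dimension as the innermost loop so that one pass produces the full sums for all timesteps. This is a functional sketch under the compression format above; the FIFO/prefix-sum machinery is abstracted away, and all names are illustrative.

```python
# Functional sketch of one TPPE's work: the full sum for one output neuron
# across all T timesteps (Line 5 of Algorithm 1). Hardware timing details
# are abstracted away; names are illustrative assumptions.
T = 4

def tppe_full_sums(bm_a, vals_a, bm_b, vals_b):
    """bm_a/bm_b: bitmasks of fiber-A (spikes) and fiber-B (weights);
    vals_a: T-bit packed spike trains of the non-silent neurons;
    vals_b: non-zero weights. Returns one full sum per timestep."""
    sums = [0] * T
    ia = ib = 0                                 # offsets into vals_a/vals_b
    for a_bit, b_bit in zip(bm_a, bm_b):
        if a_bit and b_bit:                     # inner-join: both non-zero
            spikes, w = vals_a[ia], vals_b[ib]
            for t in range(T):                  # t-dim is the innermost loop
                if (spikes >> (T - 1 - t)) & 1: # did the neuron fire at t?
                    sums[t] += w                # AC only (no multiply) in SNNs
        ia += a_bit                             # prefix-sum role: count 1s
        ib += b_bit
    return sums

sums = tppe_full_sums([1, 0, 0, 1], [0b1010, 0b0111],
                      [1, 0, 1, 1], [3, -2, 5])
assert sums == [3, 5, 8, 5]
```

The running `ia`/`ib` counters play the role of the prefix-sum circuits: they turn a matched bitmask position into an offset into the compressed value arrays.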
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Inner-join Unit</head><p>The inner-join operation has been extensively studied in prior works <ref type="bibr">[9]</ref>, <ref type="bibr">[15]</ref>, <ref type="bibr">[18]</ref> for spMspM acceleration in ANNs. The inner-join mechanism, coupled with a prefix-sum circuit, has been efficiently implemented using bitmask representation <ref type="bibr">[15]</ref>. In <ref type="bibr">[15]</ref>, a logical AND operation is first applied to two bitmasks to obtain the AND-result, which represents the locations where both data are non-zero. This AND-result is then sent to a priority encoder to convert the matched positions into integer values. The matched positions are subsequently sent to two separate prefix-sum circuits to determine the number of 1s preceding each matched position in each bitmask, which provides the offset of each non-zero element in memory.</p><p>During this process, the use of two fast prefix-sum circuits<ref type="foot">7</ref> is costly, consuming more than 45% of the power and area in <ref type="bibr">[15]</ref>. To reduce the overhead associated with the prefix-sum circuits, we propose an FTP-friendly inner-join unit, which is detailed in Figure <ref type="figure">9</ref>.</p><p>We first observe that in ANNs, the MAC operation requires both inputs to be explicitly known at computation time. Therefore, two fast prefix-sum circuits are necessary to match the processing speed of both inputs. However, this is not the case with SNNs. In SNNs, the input has only two possible states (1 or 0), meaning we either accumulate or discard the weight. 
This distinction allows for an imbalanced processing speed between the two inputs at the prefix-sum stage.</p><p>In our design, instead of using two fast prefix-sum circuits as in ANNs, we employ one fast and one laggy prefix-sum circuit (explained shortly), as shown in Figure <ref type="figure">9</ref>. Recall that our compression method only fetches the non-silent neurons (those that fire at least once across timesteps) from DRAM for A. Thus, as soon as we find a matched position in the AND-result, we can confidently accumulate the corresponding non-zero value in fiber-B at least once, without immediately knowing the exact spike information from fiber-A. This approach ensures that the throughput of processing fiber-B remains high, regardless of the processing speed of fiber-A.</p><p>In our efficient inner-join unit, each time the fast prefix-sum circuit generates an offset, the corresponding non-zero value of fiber-B is directly sent to a pseudo-accumulator for accumulation. This mechanism opportunistically assumes that the matched non-zero value in fiber-A is all 1s (i.e., the pre-synaptic neuron fires at all timesteps) to fully leverage the throughput of the fast prefix-sum circuit. Since the non-zero value in fiber-A is not always all 1s, we need a mechanism to ensure that the accumulation results are accurate. Instead of using the expensive fast prefix-sum circuit to access and verify the matched non-zero value in fiber-A, we use a much simpler circuit to generate the offset for fiber-A. We refer to this simpler prefix-sum circuit as the laggy prefix-sum circuit, illustrated on the left side of Figure <ref type="figure">9</ref>. We use a group of adders to sequentially add up the prefix-sum results and store them in a small buffer. These adders operate in parallel, so the latency of generating all the offsets is len(bm-A) divided by the number of adders. We provide a simple walk-through example in Figure <ref type="figure">10</ref>. 
First, we run the fast prefix-sum circuit; in every cycle, we accumulate the matched non-zero value of fiber-B and buffer it together with the matched position in small FIFOs. When the laggy prefix-sum circuit finishes running, a ready signal is sent out. Next, we check the non-zero value in fiber-A according to the buffered position from FIFO-mp. If the matched value is all 1s, we simply discard the current value in FIFO-B. Otherwise, we send the buffered non-zero values of fiber-B from FIFO-B to the correction accumulators. As illustrated in Figure <ref type="figure">10</ref>, at cycle 4, we check a2 and find its value is 1111, so we discard b2. At cycle 5, we check a4 and find its value is 1010, so we send b4 to the correction accumulator for t1 and t3.</p><p>This example highlights the motivation and benefits of using a combination of fast and laggy prefix-sums. By utilizing a fast prefix-sum, we can consume fiber-B as early as possible by first accumulating it into the pseudo-accumulator. While waiting for the laggy prefix-sum to correct the accumulation results, we can proceed to fetch the next fiber-B's data into the buffer. This approach allows the latency of fetching fiber-B to be overlapped with the laggy prefix-sum and correction, thereby improving overall throughput. Additionally, replacing one fast prefix-sum with a laggy one reduces the overall power and area of our TPPE.</p></div>
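The accumulate-then-correct mechanism can be captured in a short functional sketch. The timing of the two circuits is abstracted into two passes (the fast prefix-sum path, then the laggy correction pass); function and variable names are illustrative, not from the paper.

```python
# Sketch of the FTP-friendly inner-join: the pseudo-accumulator optimistically
# assumes every matched spike train is all 1s, and correction accumulators
# later subtract the weight for each timestep where the neuron did NOT fire.
# Two-pass structure is an abstraction of the fast/laggy prefix-sum timing.
T = 4
ALL_ONES = (1 << T) - 1  # an all-1s packed spike train (fires every timestep)

def ftp_inner_join(bm_a, vals_a, bm_b, vals_b):
    # Pass 1 (fast prefix-sum path): accumulate every matched weight once,
    # buffering (fiber-A offset, weight) pairs in a FIFO for later checking.
    pseudo = 0
    fifo = []
    ia = ib = 0
    for a_bit, b_bit in zip(bm_a, bm_b):
        if a_bit and b_bit:
            pseudo += vals_b[ib]          # assume spike train = 1111
            fifo.append((ia, vals_b[ib]))
        ia += a_bit
        ib += b_bit
    # Pass 2 (once the laggy prefix-sum is ready): correct per timestep.
    correction = [0] * T
    for offset_a, w in fifo:
        spikes = vals_a[offset_a]
        if spikes == ALL_ONES:
            continue                      # guess was right; discard entry
        for t in range(T):
            if not (spikes >> (T - 1 - t)) & 1:
                correction[t] += w        # over-counted at timestep t
    return [pseudo - correction[t] for t in range(T)]

# Figure-10-style check: a2 = 1111 needs no correction; a4 = 1010 sends its
# weight to the correction accumulators for t1 and t3.
sums = ftp_inner_join([1, 0, 1, 0, 1], [0b0110, 0b1111, 0b1010],
                      [0, 0, 1, 0, 1], [7, 2])
assert sums == [9, 7, 9, 7]
```

Note how the pseudo-accumulator's total depends only on fiber-B throughput, which is why the fiber-A side can afford a slower (laggy) offset generator.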
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Other Units</head><p>After the pseudo-accumulator completes its computation, the accumulation results are duplicated and sent to each correction accumulator. The correction value inside each accumulator is then subtracted from the pseudo-accumulation results for each timestep. Finally, the corrected results are sent to the P-LIF units to generate the output spikes. As shown in the purple box in Figure <ref type="figure">7</ref>, we spatially unroll the LIF operations so that the output spikes for all timesteps are generated simultaneously.</p><p>LoAS utilizes a unified global buffer to hold the compressed fiber-A and fiber-B along with their bitmask representations. We adopt a FiberCache design <ref type="bibr">[60]</ref>, as a unified shared cache offers better utilization than separate ones. Each line in the global cache consists of two parts: the first part contains the bitmask representation of a fiber followed by a pointer, and the second part holds the non-zero values of that fiber. If the line can hold all the non-zero values, the pointer is NULL; otherwise, it points to the location of the line where the remaining data are stored. Each PE is responsible for generating one output neuron, so we use a highly banked global cache to ensure multiple PEs can access their data concurrently. Within each bank, we fetch as many chunks as possible for one fiber in matrix A and hold them as long as possible to maximize data reuse. This can be achieved by adopting a replacement policy for the global cache, as in <ref type="bibr">[30]</ref>, <ref type="bibr">[60]</ref>. Only one compressed row fiber of matrix B is fetched into the global cache and broadcast to all TPPEs. We employ a compression unit similar to the one in <ref type="bibr">[15]</ref>, where an inverted prefix-sum circuit is used to compress the output spikes and generate their bitmask representations. 
As observed in <ref type="bibr">[15]</ref>, this compression step does not need to be fast, so we use an inverted laggy prefix-sum circuit to handle the compression. The scheduler is responsible for distributing the data to each TPPE through a simple swizzle-switch-based crossbar <ref type="bibr">[45]</ref>.</p></div>
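The P-LIF step that consumes the corrected full sums can be sketched behaviorally as follows. The hard-reset LIF dynamics follow Section II of the paper; the `threshold` and `leak` values are illustrative assumptions, and the loop is written sequentially even though the hardware unrolls it spatially.

```python
# Behavioral sketch of the P-LIF unit: given the corrected full sums for all
# T timesteps of one output neuron, emit the full output spike train.
# Hard-reset LIF per Section II; threshold/leak values are illustrative.
def p_lif(full_sums, threshold=4, leak=1.0):
    u = 0.0                      # membrane potential
    spikes = []
    for s in full_sums:          # spatially unrolled in hardware
        u = leak * u + s         # integrate the full sum for this timestep
        if u >= threshold:
            spikes.append(1)
            u = 0.0              # hard reset on spike
        else:
            spikes.append(0)
    return spikes

assert p_lif([3, 5, 8, 5], threshold=4) == [0, 1, 1, 1]
```

The resulting spike train would then go to the compressor, which packs it across timesteps and rebuilds the bitmask for the next layer, mirroring the input format of Section IV-A.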
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. EXPERIMENTAL METHODOLOGY</head><p>Software Configuration<ref type="foot">8</ref>: For the dual-sparse SNNs, we train and compress AlexNet <ref type="bibr">[25]</ref>, VGG16 <ref type="bibr">[48]</ref>, and ResNet19 <ref type="bibr">[17]</ref>. We use open-source toolchains for lottery-ticket-hypothesis (LTH)-based SNN pruning <ref type="bibr">[13]</ref>, <ref type="bibr">[22]</ref>. The default number of timesteps T is set to 4 across all experiments. We perform 15 rounds of LTH searching, and all SNNs are trained to convergence, achieving accuracy comparable to state-of-the-art dense baselines <ref type="bibr">[22]</ref>. Additionally, we select representative layers from each network to provide single-layer insights. A summary of the workloads is provided in Table <ref type="table">II</ref>. We also employ a simple yet effective preprocessing technique: zeroing out all pre-synaptic neurons that exhibit low firing activity to further increase the number of silent neurons. Specifically, we take the trained SNN and mask the neurons that produce only one output spike across all timesteps. We find that with minimal fine-tuning (&#8804;5 epochs), the original accuracy can be fully recovered, as shown in Figure <ref type="figure">11</ref>. It is important to note that this preprocessing technique is designed to maintain the accuracy of the original workload, not to improve it. During hardware execution, the compressor will discard output neurons that have 0 or only 1 output spike. As shown in Table <ref type="table">II</ref>, this preprocessing effectively increases the number of silent neurons by up to 1.1&#215;.</p><p>TABLE II SNN WORKLOADS. NL = NUMBER OF LAYERS. T = TIMESTEPS. AVSP{A, B} = AVERAGE SPARSITY OF MATRICES {A, B} IN (%). AVSPA-ORIGIN = ORIGINAL SPIKE SPARSITY ACROSS TIMESTEPS. AVSPA-PACKED = DENSITY OF SILENT NEURONS. AVSPA-PACKED+FT = DENSITY AFTER FINE-TUNED PREPROCESSING. M/N/K DENOTES MATRIX DIMENSIONS. AlexNet(A): NL=7, T=4, AvSpA-origin=81.2, AvSpA-packed(+FT)=71.3(76.7), AvSpB=98.2. VGG16(V): NL=14, T=4, 82.3, 74.1(79.6), 98.2. ResNet19(R): NL=19, T=4, 68.6, 59.6(66.1), 96.8. A-L4: T,M,N,K=4,64,256,3456; 75.8, 63.2(69.7), 98.9. V-L8: 4,16,512,2304; 88.1, 76.5(86.8), 96.8. R-L19: 4,16,512,2304; 57.9, 51.4(55.7), 99.1. T-HFF: 4,784,3072,3072; -, -(86.8), 96.8.</p><p>Fig. <ref type="figure">11</ref>. Accuracy trends with fine-tuned preprocessing. 'Mask' refers to masking out all pre-synaptic neurons that fire only once during inference. FT-e{x} denotes fine-tuning for x epochs.</p><p>Hardware Configuration: We evaluate LoAS using the configuration detailed in Table <ref type="table">III</ref>. In our experiments, LoAS is configured to support SNNs running with 4 timesteps. We use 16 TPPEs, each equipped with 5 accumulators (1 12-bit pseudo-accumulator and 4 10-bit correction accumulators) and 1 inner-join unit. Each inner-join unit contains 1 fast prefix-sum circuit and 1 laggy prefix-sum circuit. The fast prefix-sum circuit can generate offsets in a single cycle, while the laggy prefix-sum circuit, which contains 16 adders and a 128-bit buffer, generates offset results in 8 cycles. The TPPE also includes 2 depth-8 FIFOs (for correction purposes) and 2 128-bit buffers (for holding bitmasks). Additionally, a 128-byte buffer is integrated into the TPPE to hold the non-zero weights from fiber-B. We allocate 256 KB (double-buffered) for the global cache. For our workloads, this memory size is sufficient to capture good on-chip data reuse and keep all TPPEs busy.</p><p>Baseline: As discussed earlier, there are currently very few spMspM accelerators available for dual-sparse SNNs. 
To construct our baselines, we take the following approach: We first select three popular ANN spMspM accelerators that use the IP, OP, and Gust dataflows: SparTen <ref type="bibr">[15]</ref>, GoSPA <ref type="bibr">[9]</ref>, and Gamma <ref type="bibr">[60]</ref>. We then envision a scenario where a dual-sparse SNN (with 4 timesteps and 8-bit weights) is naively run (sequentially processing its timesteps) on these accelerators.</p><p>To be conservative, we place the t dimension at the innermost loop of the original IP, OP, and Gust dataflows. <ref type="foot">9</ref></p><p>TABLE III CONFIGURATION OF THE LOAS SYSTEM. TPPEs: 16 TPPEs, 8-bit weight. Inner-join units: 16. Global cache: 256 KB, 16 banks, 16-way associative. Crossbars: 16 &#215; 16 and 16 &#215; 16, swizzle-switch based. Main memory: 128 GB/s over 16 64-bit HBM channels.</p><p>We then make essential modifications to the three accelerators, such as removing the multipliers in these designs. To ensure a fair comparison, we configure all designs to have 16 PEs and the same global SRAM size. We refer to these three baselines as SparTen-SNN, GoSPA-SNN, and Gamma-SNN. We implemented the key components of LoAS and our hardware baselines in RTL and synthesized them using the Synopsys Design Compiler at 800 MHz in a 32 nm technology. A 128 GB/s High-Bandwidth Memory (HBM) module is connected to LoAS as the off-chip memory. We used CACTI 7.0 <ref type="bibr">[34]</ref> to model the memory components. Additionally, we built a simulator in Python to model the cycle-level behavior of LoAS and the baselines by tiling and ordering the spMspM loop and mapping it to hardware.</p></div>
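The firing-activity preprocessing described above reduces, over the packed representation of Section IV-A, to a popcount filter: neurons with at most one spike across all timesteps are forced silent and the network is briefly fine-tuned afterward. A minimal sketch, with illustrative names and the paper's default T=4:

```python
# Sketch of the low-firing-activity preprocessing (Section V): pre-synaptic
# neurons that spike at most once across all timesteps are masked to zero,
# raising the silent-neuron density; the SNN is then fine-tuned (<=5 epochs)
# to recover accuracy. Names are illustrative assumptions.
def mask_low_activity(packed_rows, max_spikes=1):
    """packed_rows: per-neuron T-bit packed spike trains. Neurons with
    <= max_spikes total spikes become silent (0)."""
    return [0 if bin(word).count("1") <= max_spikes else word
            for word in packed_rows]

row = [0b1010, 0b0100, 0b0000, 0b0111]
assert mask_low_activity(row) == [0b1010, 0, 0, 0b0111]
```

After masking, the compressor simply drops the newly silent neurons from the bitmask, which is what yields the up-to-1.1× increase in silent-neuron density reported in Table II.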
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. EXPERIMENTAL RESULTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Hardware Evaluation</head><p>Overall Performance: Figure <ref type="figure">12</ref> compares the performance of the three dual-sparse SNN accelerator baselines (SparTen-SNN, GoSPA-SNN, and Gamma-SNN) and LoAS (with and without fine-tuned preprocessing). The speedup is measured with respect to the cycle counts of SparTen-SNN.</p><p>The first observation is that LoAS significantly outperforms the other three accelerator baselines in all cases, achieving average speedups of 6.79&#215; (vs. SparTen-SNN), 5.99&#215; (vs. GoSPA-SNN), and 3.25&#215; (vs. Gamma-SNN). This is due to LoAS's use of the FTP dataflow, which eliminates the intra-PE latency penalty associated with sequentially processing timesteps. Additionally, LoAS reduces both on-chip and off-chip data communications across timesteps.</p><p>The second observation is that LoAS's performance gain is highly correlated with the sparsity of matrix A. This relationship is expected since our workloads are extremely sparse in matrix B, making the overall computation matrix-A-bounded. As a result, the performance of the three baselines suffers more when sequentially processing timesteps through matrix A with lower sparsity. However, LoAS does not experience this penalty, resulting in speedups ranging from 4.08&#215; (vs. SparTen-SNN) on VGG16 (highest A sparsity) to 8.51&#215; (vs. SparTen-SNN) on ResNet19 (lowest A sparsity).</p><p>Finally, we observe that with the help of preprocessing (removing neurons that spike only once), LoAS further improves performance by an average of 20%. This improvement occurs because the preprocessing technique increases the density of silent neurons (Section IV-A), allowing LoAS to completely avoid the data communications and computations associated with those neurons. Fig. 13. 
Off-chip traffic (KB) and on-chip memory traffic (MB) for SparTen-SNN, GoSPA-SNN, Gamma-SNN, and LoAS (with and without preprocessing) across three SNN workloads.</p><p>Energy: LoAS achieves (3.68&#215;, 3.09&#215;, 2.40&#215;), (3.17&#215;, 1.50&#215;, 2.33&#215;), and (3.54&#215;, 1.34&#215;, 2.47&#215;) higher energy efficiency over SparTen-SNN, GoSPA-SNN, and Gamma-SNN on AlexNet, VGG16, and ResNet19, respectively.</p><p>Detailed Analysis: We now explain the performance gains of LoAS. Thanks to the FTP dataflow, LoAS experiences significantly less on-chip and off-chip memory traffic than the three baselines. As shown in Figure 13, compared to SparTen-SNN (IP), LoAS achieves 3.93&#215; (3.70&#215;), 3.57&#215; (2.22&#215;), and 4.07&#215; (2.24&#215;) less on-chip SRAM (off-chip DRAM) access on AlexNet, VGG16, and ResNet19, respectively. This outcome is expected since IP dataflow designs like SparTen are known for poor input data reuse, a problem exacerbated by the extra temporal dimension (t-dim) in SNN workloads. While the FTP dataflow is a variant of the inner-product dataflow, it does not incur additional executions on the t-dim because it parallelizes the t-dim at the innermost loop.</p><p>Fig. 14. Normalized off-chip traffic with breakdown for SparTen-SNN, GoSPA-SNN, Gamma-SNN, and LoAS (with preprocessing) across three SNN layer workloads. The normalized SRAM cache miss rate is also provided for the ResNet19 layer workload. All values are normalized to LoAS.</p><p>Not surprisingly, compared to GoSPA-SNN (OP), LoAS still achieves 2.87&#215; (4.49&#215;), 2.19&#215; (2.78&#215;), and 2.98&#215; (3.03&#215;) less on-chip SRAM (off-chip DRAM) access on AlexNet, VGG16, and ResNet19, respectively. Although OP dataflow designs like GoSPA-SNN are known for excellent input data reuse (on average, GoSPA-SNN has 1.45&#215; less SRAM traffic than SparTen-SNN), their inefficiency arises from the partial sum (psum) matrices. 
Due to the extra t-dim in SNNs, the size of psum matrices expands with the number of timesteps. GoSPA's design allocates limited on-chip memory for psum, so the psum matrices that cannot fit on-chip must be written to off-chip DRAM and later read back for reduction, leading to significant off-chip memory traffic.</p><p>Finally, compared to Gamma-SNN (Gust), LoAS achieves 2.16&#215;, 1.76&#215;, and 1.91&#215; less DRAM access on AlexNet, VGG16, and ResNet19, respectively. This result aligns with Gust dataflow's ability to reduce off-chip partial row access through on-chip SRAM and mergers. However, while reducing DRAM access, Gamma-SNN's SRAM traffic is exacerbated by the t-dim in SNNs, resulting in an average of 13.4&#215; more SRAM traffic than LoAS.</p><p>To better visualize the analysis above, we provide a memory traffic breakdown in Figure <ref type="figure">14</ref> for the three SNN layers in Table <ref type="table">II</ref>. As shown in the figure, SparTen-SNN has the largest input off-chip traffic, while GoSPA-SNN has the largest psum off-chip traffic across all workloads. Among the three baselines, Gamma-SNN has the smallest off-chip traffic footprint due to the Gust dataflow's on-chip reuse of partial rows. GoSPA-SNN also has the largest off-chip traffic for the compressed format due to its use of the CSR format for each spike. We note that LoAS has slightly larger (2.1&#215;) off-chip traffic for the compressed format compared to SparTen-SNN, as we need extra bitmasks to mark the positions of non-silent neurons, whereas SparTen-SNN can directly leverage the input spike trains. However, this overhead is negligible compared to LoAS's savings on off-chip traffic for other quantities.</p><p>Figure <ref type="figure">14</ref> also provides the normalized SRAM cache miss rate for the layer workload in ResNet19. SparTen-SNN has a 16&#215; higher miss rate (1.47%) compared to LoAS. 
GoSPA-SNN has the lowest miss rate due to its output-stationary dataflow, though this comes at the cost of higher off-chip traffic for psums. Gamma-SNN has a higher SRAM miss rate than GoSPA-SNN and LoAS due to the extra t-dim enlarging the partial row traffic by t times, causing some of the extra traffic to be evicted from the on-chip SRAM. Overall, the cache miss rate results align with the off-chip traffic trends. Since all baselines have the same global cache size, the reduction in memory traffic reflects LoAS's improvements in both speedup and energy efficiency.</p><p>TABLE IV AREA AND POWER BREAKDOWN OF LOAS (LEFT) AND ONE TPPE (RIGHT). LoAS: 16 TPPEs, 0.96 mm&#178;, 45.1 mW; 16 P-LIFs, 0.02 mm&#178;, 1.2 mW; global cache, 0.80 mm&#178;, 124.5 mW; others, 0.30 mm&#178;, 18.1 mW; total, 2.08 mm&#178;, 188.9 mW. One TPPE: accumulators, 2e-3 mm&#178;, 0.16 mW; fast prefix-sum, 0.04 mm&#178;, 1.46 mW; laggy prefix-sum, 5e-3 mm&#178;, 0.32 mW; others, 0.01 mm&#178;, 0.88 mW; TPPE total, 0.06 mm&#178;, 2.82 mW.</p><p>Fig. 15. On-chip power breakdown for LoAS (global cache 65.9%, TPPEs 23.9%, others 10.2%; within a TPPE: fast prefix-sum 51.8%, laggy prefix-sum 11.4%, accumulators 5.6%, others 31.2%). 'Accs' refers to the accumulators, including 1 pseudo-accumulator and 4 correction accumulators.</p><p>Area and Power: Table <ref type="table">IV</ref> shows the area and power breakdown of LoAS with the configuration detailed in Table III. Inside each TPPE, a single fast prefix-sum circuit dominates both area (66.7%) and power (51.8%). The original SparTen <ref type="bibr">[15]</ref> requires two fast prefix-sum circuits for both inputs and weights. <ref type="foot">10</ref> Thanks to the laggy prefix-sum circuits we proposed (which account for 8.3% of area and 11.4% of power), LoAS requires only one fast prefix-sum circuit inside each TPPE. At the system level, the global SRAM cache dominates both power and area, consistent with previous works <ref type="bibr">[30]</ref>, <ref type="bibr">[33]</ref>, <ref type="bibr">[60]</ref>. 
Figure <ref type="figure">15</ref> provides a visualization of the power distribution.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Ablation Studies</head><p>Temporal Scalability Studies: In our experimental settings, we configured the TPPE inside LoAS to run SNNs with 4 timesteps. Most state-of-the-art SNN algorithms <ref type="bibr">[10]</ref>, <ref type="bibr">[12]</ref> typically use a fairly small number of timesteps (&#8804;8). Therefore, we aim to understand how the TPPE scales with the number of timesteps. Figure <ref type="figure">16</ref>(a) demonstrates that the TPPE scales well with the timesteps. This scalability arises because all TPPE components, aside from the accumulators and the input data buffer, are agnostic to the number of timesteps. Even at 16 timesteps, the TPPE only increases its area (power) by 1.37&#215; (1.25&#215;) compared to 4 timesteps. We also examine how the ratio of silent neurons in VGG16 scales with the number of timesteps. Figure <ref type="figure">16</ref>(b) shows that with the aid of the preprocessing technique, even at 8 timesteps, we can maintain a similar ratio of silent neurons as with 4 timesteps. However, it is likely that the number of silent neurons will decrease when using even larger timestep counts (&gt;8). This presents a challenge for LoAS when scaling up to a relatively large number of timesteps.</p><p>Scalability Study: Figure <ref type="figure">17</ref> further illustrates how the overall performance of LoAS scales with different factors. We first test LoAS running on VGG16 with average sparsity levels of B (weight) at 98.2% (High), 68.4% (Medium), and 25% (Low). The results show that LoAS's performance is highly sensitive to the sparsity level of B. When the sparsity scales from 98.2% to 25%, LoAS's performance decreases by approximately 88%. We also observe that LoAS's performance scales well with the number of timesteps, with only a 14% reduction in performance when the number of timesteps is doubled. 
Finally, we assess LoAS's scalability with layer size by comparing one layer from VGG16 to the hidden feed-forward (HFF) layer from SpikeTransformer [56] (V-L8 and T-HFF in Table II). The results demonstrate that LoAS scales effectively, even with layers of larger parameter sizes. Dual-sparse SNN vs. Dual-sparse ANN:</p><p>In this work, we aim to provide insights for the community on how to accelerate spMspM for SNNs. However, it is also important to discuss the performance comparison between SNNs and ANNs. Figure <ref type="figure">18</ref> shows the comparison of normalized energy efficiency and memory traffic between SNNs (LoAS) and ANNs (SparTen <ref type="bibr">[15]</ref> and Gamma <ref type="bibr">[60]</ref>) running the VGG16 workload. We use the VGG16 workload in Table <ref type="table">II</ref> for LoAS. The ANN version of VGG16 has 8-bit weights (98.2% sparsity) and activations (43.9% sparsity).</p><p>Overall, the SNN running on LoAS achieves roughly 2.5&#215; and 1.2&#215; greater energy efficiency compared to the ANNs running on SparTen and Gamma, respectively. We observe that around 60% of energy consumption is due to data movement in both networks. Therefore, we also include a comparison of DRAM and SRAM traffic in Figure <ref type="figure">18</ref>. The figure shows that SNNs, on average, have approximately 60% less memory traffic compared to SparTen-ANN. This reduction in memory traffic is due to the lower input bitwidth (4-bit vs. 8-bit) and higher input sparsity (79.6% vs. 43.9%), thanks to SNN's features of unary activation and sparse spike activity (Section II-B). Not surprisingly, Gamma-ANN has lower overall DRAM access compared to LoAS due to its Gust dataflow <ref type="bibr">[60]</ref>. The tradeoff, however, is 3.5&#215; more SRAM traffic, which explains why LoAS has slightly higher overall energy efficiency.</p><p>Dual-sparse SNN vs. 
Dense SNN: To demonstrate the benefits of dual-sparsity in SNNs, we compare LoAS with prior dense SNN systolic-array accelerators, PTB <ref type="bibr">[28]</ref> and Stellar <ref type="bibr">[32]</ref>, running dense VGG16 with 4 timesteps. For a fair comparison, we set the array size for PTB to 16 &#215; 4, which generates 16 full-sum outputs for 4 timesteps in parallel, the same as LoAS. We also configure Stellar with the same array size. We leverage ScaleSim <ref type="bibr">[42]</ref> to estimate both baselines' memory traffic and cycle counts. The comparison is shown in Figure <ref type="figure">19</ref>.</p><p>We first observe that LoAS has roughly 6&#215; higher energy efficiency compared to PTB, primarily due to 3&#215; (12.5&#215;) less DRAM (SRAM) traffic. Compared to Stellar, LoAS achieves approximately 2.5&#215; higher energy efficiency, along with 2.7&#215; (6.6&#215;) less DRAM (SRAM) traffic. Additionally, LoAS shows a 46.9&#215; speedup over PTB, mainly due to the data sparsity and the difference between PTB's partially temporal-parallel mechanism (Section II-E) and LoAS's fully temporal-parallel mechanism. We also observe that Stellar outperforms PTB across all matrices, primarily due to Stellar's optimized spatiotemporal row-stationary dataflow and spike-skipping technique. However, compared to Stellar, LoAS still achieves approximately 7.1&#215; speedup due to its ability to leverage dual-sparsity. It is important to note that we do not compare with SpinalFlow <ref type="bibr">[35]</ref> due to its temporal encoding achieving limited accuracy on challenging tasks <ref type="bibr">[6]</ref>, <ref type="bibr">[28]</ref>. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VII. RELATED WORK</head><p>In addition to the prior dense SNN accelerator works discussed in Section II-E, there are also prior works that attempt to leverage sparsity in SNNs. In <ref type="bibr">[3]</ref>, a neuron filter unit is used to fetch weights only when there is a 1-spike. However, dual-sparsity (both spike and weight sparsity) is not considered. In <ref type="bibr">[2]</ref>, dual-sparsity in SNNs is considered to skip unmatched computations, but the weights and spikes are fetched in dense format without any compression from off-chip memory, thereby failing to reduce data movement costs. In contrast, LoAS leverages dual-sparsity in SNNs for both computation and data movement.</p><p>As discussed earlier, PTB processes timesteps in a partially parallel manner. Even if PTB is reconfigured to run all timesteps in parallel (time-window=1), it still differs from LoAS in loop ordering. In PTB's loop ordering, the t-dim is placed between the m-dim and n-dim, while LoAS places the t-dim in the innermost loop. As discussed in Section III, LoAS's loop ordering brings greater efficiency in spMspM operations. Moreover, PTB is designed to accelerate workloads with time-series data from DVS sensors <ref type="bibr">[29]</ref>, where the number of timesteps is typically large (&gt;100). For our workloads, where the number of timesteps is small (&#8804;8), PTB experiences low hardware utilization. In <ref type="bibr">[31]</ref>, processing timesteps in parallel is also explored. However, the focus is on temporal-coded SNN workloads, and loop ordering is not discussed. Finally, as mentioned in Section II-E, Stellar <ref type="bibr">[32]</ref> is another work that attempts to process timesteps in parallel. However, it targets non-LIF, FS-coded SNNs and does not support dual-sparsity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VIII. CONCLUSION</head><p>In this work, we observed that naively running dual-sparse SNNs on existing spMspM accelerators results in suboptimal efficiency due to the latency and memory traffic penalties associated with processing timesteps. To address this issue, we proposed a fully temporal-parallel dataflow (FTP) that eliminates these problems. To maximize the benefits of FTP, we introduced FTP-friendly spike compression and an inner-join mechanism. Additionally, we developed LoAS, a novel architecture that exemplifies the FTP dataflow. With the support of both FTP-friendly compression and inner-join, LoAS achieves significant speedup (up to 8.51&#215;) and energy reduction (up to 3.68&#215;) compared to prior dual-sparse accelerator baselines.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>We do not compare with these systems due to our focus on single-core dataflow SNN accelerator designs in this work.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>In this work, we focus on the hard reset scheme, where the membrane potential is reset to zero if an output spike of one is generated. Although other reset schemes exist, using one of them does not affect the generality of the hardware design.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>Other implementations using multiplexers exist <ref type="bibr">[28]</ref>, <ref type="bibr">[35]</ref>.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3"><p>Timestep is also referred to as a tick <ref type="bibr">[35]</ref> or time-point <ref type="bibr">[28]</ref> in other works. We follow the naming convention used in the latest SNN algorithm research.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4"><p>For 128 columns, log&#8322;(128) = 7 bits are required for the coordinates. We neglect the offsets in this discussion, which would further increase the number of bits used for coordinates.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6"><p>In <ref type="bibr">[15]</ref>, the design of the prefix-sum circuit is not described in detail. We assume it to be a tree-like prefix-sum circuit with O(log(n)) complexity that can run in one clock cycle. We define this design as the fast prefix-sum circuit. The size of the input and output for the prefix-sum circuit is set to 128 in both <ref type="bibr">[15]</ref> and our work.</p></note>
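The tree-like prefix sum assumed in the footnote above can be sketched in software as a Hillis-Steele scan: log&#8322;(n) stages, where every stage's additions would run in parallel in a hardware circuit (this is a hedged software illustration of the assumed circuit structure, not a description of the design in [15]; the 8-wide input here stands in for the 128-wide circuit):

```python
def tree_prefix_sum(bits):
    """Inclusive prefix sum via a Hillis-Steele scan: log2(n) stages,
    each stage combining elements at a doubling stride. In hardware,
    all additions within a stage happen in parallel."""
    n = len(bits)
    out = list(bits)
    stride = 1
    while stride < n:
        out = [out[i] + (out[i - stride] if i >= stride else 0)
               for i in range(n)]
        stride *= 2
    return out

# Example: scanning a small bitmask of matched positions.
print(tree_prefix_sum([1, 0, 1, 1, 0, 1, 0, 0]))
# -> [1, 1, 2, 3, 3, 4, 4, 4]
```

The running counts produced by such a scan are what an inner-join unit uses to locate each matched nonzero's position in the compressed stream.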
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_7"><p>Source codes: https://github.com/Intelligent-Computing-Lab-Yale/LoAS</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_8"><p>Adding the t dimension anywhere else would increase data traffic, thus worsening performance.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_9"><p>This is not the case in SparTen-SNN. Since the input spikes function as both bitmasks and data, SparTen-SNN only requires one fast prefix-sum circuit.</p></note>
		</body>
		</text>
</TEI>
