<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>ACES: Accelerating Sparse Matrix Multiplication with Adaptive Execution Flow and Concurrency-Aware Cache Optimizations</title></titleStmt>
			<publicationStmt>
				<publisher>ACM</publisher>
				<date>04/27/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10543841</idno>
					<idno type="doi">10.1145/3620666.3651381</idno>
					
					<author>Xiaoyang Lu</author><author>Boyu Long</author><author>Xiaoming Chen</author><author>Yinhe Han</author><author>Xian-He Sun</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Sparse matrix-matrix multiplication (SpMM) is a critical computational kernel in numerous scientific and machine learning applications. SpMM involves massive irregular memory accesses and poses great challenges to conventional cache-based computer architectures. Recently dedicated SpMM accelerators have been proposed to enhance SpMM performance. However, current SpMM accelerators still face challenges in adapting to varied sparse patterns, fully exploiting inherent parallelism, and optimizing cache performance. To address these issues, we introduce ACES, a novel SpMM accelerator in this study. First, ACES features an adaptive execution flow that dynamically adjusts to diverse sparse patterns. The adaptive execution flow balances parallel computing efficiency and data reuse. Second, ACES incorporates locality-concurrency co-optimizations within the global cache. ACES utilizes a concurrency-aware cache management policy, which considers data locality and concurrency for optimal replacement decisions. Additionally, the integration of a non-blocking buffer with the global cache enhances concurrency and reduces computational stalls. Third, the hardware architecture of ACES is designed to integrate all innovations. The architecture ensures efficient support across the adaptive execution flow, advanced cache optimizations, and fine-grained parallel processing. Our performance evaluation demonstrates that ACES significantly outperforms existing solutions, providing a 2.1× speedup and marking a substantial advancement in SpMM acceleration.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>all innovations. The architecture ensures efficient support across the adaptive execution flow, advanced cache optimizations, and fine-grained parallel processing. Our performance evaluation demonstrates that ACES significantly outperforms existing solutions, providing a 2.1&#215; speedup and marking a substantial advancement in SpMM acceleration.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Sparse matrix-matrix multiplication (SpMM) is a computational process where two sparse matrices are multiplied. SpMM is a cornerstone in scientific simulations <ref type="bibr">[7]</ref>, linear algebra <ref type="bibr">[44,</ref><ref type="bibr">57]</ref>, graph analytics <ref type="bibr">[5,</ref><ref type="bibr">6,</ref><ref type="bibr">8,</ref><ref type="bibr">24,</ref><ref type="bibr">47,</ref><ref type="bibr">59]</ref>, and the rapidly evolving fields of deep learning <ref type="bibr">[14,</ref><ref type="bibr">19,</ref><ref type="bibr">20,</ref><ref type="bibr">33,</ref><ref type="bibr">42]</ref>. Its crucial role in efficiently processing large-scale data structures and complex algorithms makes the effective acceleration of SpMM not just a computational challenge, but a key enabler in advancing researches and applications in these diverse and impactful domains.</p><p>The processing efficiency of SpMM is heavily influenced by the characteristics of the input sparse matrices. The high proportion of zero elements in these matrices leads to challenges such as low utilization of computational and memory resources. These irregular sparse patterns, coupled with unpredictable memory access patterns, present significant obstacles for conventional cache-based computing architectures, which are typically optimized for dense and regular data patterns. As a result, SpMM often becomes a performance bottleneck, particularly in an era where processing large datasets is increasingly crucial <ref type="bibr">[15]</ref>. Given the critical role of SpMM in various computational domains, developing specialized accelerators to enhance SpMM performance is important.</p><p>Existing SpMM accelerators predominantly utilize fixed execution flows, such as inner-product (InP) <ref type="bibr">[22,</ref><ref type="bibr">45]</ref>, outerproduct (OutP) <ref type="bibr">[43,</ref><ref type="bibr">61]</ref>, or row-by-row (ROW) <ref type="bibr">[50,</ref><ref type="bibr">60]</ref>, each tailored to optimize either input or output data reuse. However, the efficiency of each execution flow is determined by sparse patterns, resulting in inconsistent performance across different sparse matrices. For instance, in InP, the sparse pattern critically influences the index intersection between the two input matrices, affecting input data fetching efficiency. In OutP, the sparse pattern determines the size of the resulting partial matrices, making output data reuse optimization crucial. In the case of ROW, the distribution of non-zeros in the input matrices influences the effectiveness of input data reuse, impacting overall data access efficiency. This variability in performance due to differing sparse patterns underscores the need for SpMM execution flow that can dynamically adapt to optimize performance across a range of matrix structures.</p><p>Additionally, the existing SpMM accelerators exhibit limited capabilities in exploiting the inherent parallelism in SpMM, leading to several inefficiencies. First, the inherent dependencies in the execution flow can lead to a collective dependency of hardware processing elements (PEs) on completing a batch of tasks. In other words, multiple PEs processing the same batch must wait for all tasks to be completed before proceeding, typically leading to the underutilization of PEs and prolonged pipeline latency. Second, synchronization challenges <ref type="bibr">[50]</ref> frequently arise in existing accelerators, especially when multiple mergers are tasked with working on the same region of the output matrix. In such situations, the mergers must carefully sequence their work to ensure both the correctness and coherence of the merging process. Therefore, the merge process, responsible for consolidating the multiplication results, often becomes a bottleneck due to the need for synchronization. These inefficiencies impede the effective utilization of hardware resources and limit the overall performance of SpMM operations. This underscores the need for designs that better align execution flows with the architecture, enhance fine-grained parallelism, and mitigate the synchronization overhead, thereby maximizing the potential of parallel processing.</p><p>Furthermore, cache performance is crucial for the efficiency of SpMM accelerators, but is often overlooked. Conventional cache replacement policies used in SpMM accelerators <ref type="bibr">[32,</ref><ref type="bibr">60,</ref><ref type="bibr">61]</ref> aim to reduce the number of cache misses but overlook the importance of concurrency. In SpMM operations, it is common for multipliers to request cache lines of a row or column concurrently. This concurrency means that even a single cache miss can cause delays in the entire processing chain. Moreover, the design of caches in current accelerators does not incorporate non-blocking features. As a result, a single cache miss causes delays in subsequent accesses, thereby exacerbating performance bottlenecks. These challenges underscore the pressing need for advanced cache optimizations in SpMM accelerators. Such optimizations must be tailored to efficiently handle the unique access demands of SpMM, considering concurrent accesses and ensuring non-blocking cache operations to optimize the overall performance of the accelerator.</p><p>In this paper, we introduce ACES, an innovative accelerator for SpMM, specifically designed to dynamically adapt its execution flow to accommodate varying sparsity patterns, optimize parallel execution, and implement localityconcurrency co-optimizations for on-chip cache. ACES has the following unique features.</p><p>&#8226; Adaptive execution flow. ACES is equipped with an adaptive execution flow that intelligently adjusts to varying sparsity patterns of input matrices. This adaptability is achieved through a spectrum of condensing degrees, implemented with minimal overhead and without altering the original encoding formats of the matrices. This ensures optimal performance across diverse matrix structures. &#8226; Balanced data reuse and synchronization. ACES considers the reuse of input data and the synchronization needed for merging partial output results from each execution flow. By balancing memory access behavior with parallelism, it selects the most suitable condensing degree for various sparse patterns, enhancing efficiency. &#8226; Concurrency-aware cache management. ACES employs PureFiber, a concurrency-aware cache replacement policy, to optimize cache management. This policy accounts for both reuse distance and potential concurrent accesses, aiming to minimize total stalls caused by cache accesses. &#8226; Non-blocking buffer integration. ACES incorporates a non-blocking buffer to manage cache miss accesses, ensuring that cache misses do not significantly disrupt subsequent accesses, thereby improving data access concurrency. &#8226; Tailored hardware architecture. The hardware architecture of ACES is specifically designed to complement its adaptive execution flow and cache optimizations. The architecture dedicates an addition processing element (APE) to each multiplication processing element (MPE), providing fine-grained parallelism and minimizing inter-PE dependencies.</p><p>We evaluate ACES against three state-of-the-art SpMM accelerators, including SIGMA <ref type="bibr">[45]</ref>, SpArch <ref type="bibr">[61]</ref>, and SPADA <ref type="bibr">[32]</ref>, across a variety of workloads with diverse sparse patterns. Overall, ACES not only consistently provides optimal performance across all workloads, with average speedups of 25.5&#215; over SIGMA, 8.9&#215; over SpArch, and 2.1&#215; over SPADA, but also achieves this with the lowest area cost, further underscoring its efficiency and innovation in SpMM acceleration.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Background</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Execution Flows for SpMM</head><p>The multiplication of two sparse matrices involves specific execution flows, dictating how the matrices are accessed and processed. The three primary execution flows in SpMM are InP <ref type="bibr">[22,</ref><ref type="bibr">45]</ref>, OutP <ref type="bibr">[43,</ref><ref type="bibr">61]</ref>, and ROW <ref type="bibr">[50,</ref><ref type="bibr">60]</ref>. Figure <ref type="figure">1</ref> demonstrates how matrices A and B are multiplied to produce output matrix C using each execution flow. Each execution flow exhibits distinct characteristics in terms of input and output reuse, index intersection efficiency, and synchronization requirements.</p><p>InP computes each element of matrix C by calculating the inner product of a corresponding row of matrix A and a column of matrix B. InP achieves full reuse of matrix C, as all partial results for each element are immediately accumulated. Additionally, InP allows multiple PEs to compute different elements of the output matrix simultaneously, thus minimizing synchronization issues. However, it faces challenges in efficiently reusing matrix B, since each column of matrix B must be refetched multiple times for every row in matrix A. Moreover, InP often encounters inefficiencies in index intersection due to the sparsity of matrices A and B. When fetching an entire row of A and an entire column of B for computation, only non-zero elements with matching indices contribute to C. Given the sparsity, many fetched elements lack matching indices, leading to numerous redundant operations. This inefficiency is exemplified in Figure <ref type="figure">1(a)</ref>, where the index intersection between the last row of matrix A and the last column of matrix B is illustrated.</p><p>OutP computes the outer product of a row of matrix A with a column of matrix B, forming one partial matrix of C at a time. These partial matrices are subsequently merged to produce the final output matrix C. OutP allows for efficient reuse of the input matrices, as each row of A and each column of B are fetched only once. Additionally, OutP mitigates the inefficiencies in index intersection seen in InP, since only entirely zero columns of A and rows of B do not contribute to the output, a scenario that is relatively rare. However, OutP faces challenges in managing the size and number of partial output matrices, particularly when dealing with matrices that have highly irregular sparsity patterns. The cumulative size of all partial matrices often surpasses that of the final output matrix, leading to significant memory traffic and computational complexity during the merging process. This issue becomes even more pronounced when handling large matrices, where the memory and computational demands are substantially increased. Furthermore, when multiple PEs are involved in generating and merging these partial matrices, synchronization becomes necessary to ensure a correct merging process. A large number of synchronization requirements can significantly limit the efficiency of the accelerator and the utilization of PEs.</p><p>ROW operates based on row-wise partitioning of the input matrices. For each row C &#287; of the output matrix C, ROW calculates the result by merging the scalar-vector products of each non-zero element &#279; &#287;,&#289; in row A &#287; with the corresponding row B &#289; of matrix B. ROW generates and stores partial results for a single output row at a time, enabling efficient on-chip management and good reuse of matrix C. In terms of index intersection between the two input matrices, ROW is efficient, as it fetches only those pairs of non-zeros that are already matched. However, ROW faces challenges in efficiently reusing the rows of matrix B due to irregular access patterns driven by the non-zero distribution in matrix A. Moreover, when PEs work concurrently to generate partial results for the same output row, synchronization becomes necessary, as depicted in Figure <ref type="figure">1</ref>(c), potentially creating bottlenecks in highly parallel systems.</p><p>While conventional execution flows in SpMM provide distinct characteristics to matrix multiplication, there is no universally optimal solution. The complexity of handling highly irregular sparsity patterns, coupled with the demands of parallel computing, underscores the need for innovative, adaptive execution flows that can efficiently balance data reuse, index intersection, and parallelism.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">SpMM Accelerators</head><p>SIGMA <ref type="bibr">[45]</ref> is an InP-based SpMM accelerator that enhances index intersection efficiency through a bitmap format for sparse matrix representation. This format ensures that only essential computations are conducted. Additionally, SIGMA employs flexible interconnections among PEs to optimize their utilization. SpArch <ref type="bibr">[61]</ref> adopts an OutP execution flow for SpMM, specifically focusing on enhancing the efficiency of the merging process. It proposes an aggressively condensed matrix representation for matrix A, which reduces the number of partial matrices produced during multiplication but may compromise input reuse. Additionally, SpArch integrates a high-radix merger, pipelining the production and merging of partial matrices. SPADA <ref type="bibr">[32]</ref> inherits ROW and introduces a window-based adaptive (WA) execution flow to effectively adapt SpMM to various sparse patterns, thereby addressing the limitations of traditional fixed execution flows. The specialized hardware architecture of SPADA includes multiple multipliers, which are responsible for concurrently executing scalar-vector multiplications within a WA window. However, WA introduces a collective dependency among the multipliers. This dependency dictates that a multiplier can only proceed to the next window once all the multiplication tasks in the current window are completed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Challenges and Opportunities</head><p>SpMM accelerators with fixed execution flows, such as SIGMA <ref type="bibr">[45]</ref> and OuterSPACE <ref type="bibr">[43]</ref>, have introduced innovations in accelerating SpMM. However, they also encounter challenges due to the limitations of their fixed execution flows. In contrast, the WA execution flow of SPADA <ref type="bibr">[32]</ref> offers an innovative approach to adapt to diverse sparse patterns. However, WA relies on collective dependencies among multipliers for coordinated execution, which, while essential, introduces potential bottlenecks in hardware parallelism. This dependency can limit the efficiency and scalability of the system, especially in cases where high parallel processing is required.</p><p>Moreover, the performance of on-chip caches presents a significant challenge in the context of concurrent data demands inherent in SpMM operations. Current accelerators <ref type="bibr">[32,</ref><ref type="bibr">60,</ref><ref type="bibr">61]</ref> optimize cache performance on data locality. However, data concurrency is equally important <ref type="bibr">[36,</ref><ref type="bibr">38,</ref><ref type="bibr">46,</ref><ref type="bibr">52]</ref>. Effective cache management should consider data locality and concurrency together. In addition, the on-chip cache should handle cache misses in a non-blocking manner, allowing the cache to continue servicing other requests efficiently. The non-blocking design could enhance the data concurrency and reduce computational stalls.</p><p>Leveraging insights from the existing accelerators and the characteristics of SpMM, ACES differentiates itself with an adaptive execution flow that balances data reuse and parallelism, which effectively overcomes the limitations of fixed execution flows and the collective dependencies of WA. Furthermore, ACES emphasizes the co-optimizations of locality and concurrency in on-chip cache design and management. We introduce the details of ACES in the subsequent section.  Memory Global Cache NB Buffer Mat A Fetcher Mat B Fetcher Global Buffer MPE MPE . . . ! !,# ! !,$ Mat B Row Fiber Mat A Element " %,&amp; " %,$ Cached Mat C Partial Row Fiber Mat C Partial Row Fiber from MPE Condensing Adapter PureFiber Policy a 3,5 0 1 2 3 4 5 0 1 2 3 Mat A 0 1 2 3 0 1 2 3 4 5 Condensed A 0 1 2 3 4 5 0 1 2 3 Mat B " %,# ' " %,$ ' " %,# ' " %,$ ' " %,&amp; " %,# ' " %,$ + " %,$ ' Mat C Partial Row Fiber APE . . . APE APE B 5 Sync Scheduler Merging Scheduler SQ SQ Figure 2. Overview of ACES.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">ACES Architecture</head><p>global cache, which is integrated with a non-blocking buffer (NB buffer) to organize a non-blocking cache system. These components synergize to build the key features of ACES: an adaptive execution flow, a tailored hardware architecture, and advanced locality-concurrency co-optimizations in cache management.</p><p>Adaptive execution flow. ACES incorporates a condensation adapter that dynamically tunes the condensed matrix representation for matrix A to optimize the execution flow, as depicted in the top left of Figure <ref type="figure">2</ref>. This feature allows for flexibility in handling varying sparsity patterns of matrix A (more details in Section 3.2). A fetcher for matrix A retrieves elements from the condensed columns, placing them in a global buffer, specifically designed as a lightweight buffer to store elements of matrix A, while another fetcher fetches necessary rows of matrix B into the global cache.</p><p>Tailored hardware architecture. ACES employs parallel computing, utilizing MPEs and APEs to support its adaptive execution flow (detailed in Section 3.3). Each MPE, illustrated in the bottom left of Figure <ref type="figure">2</ref>, loads a distinct non-zero element from the condensed column of matrix A and a corresponding row from matrix B. These MPEs execute scalarvector multiplications in parallel, as shown in the bottom middle of Figure <ref type="figure">2</ref>. In conjunction with each MPE, each APE independently performs immediate merging of the partial output rows produced by the MPE with the corresponding partial result stored in the global cache, as demonstrated in the bottom right of Figure <ref type="figure">2</ref>. A synchronization scheduler is designed to mitigate synchronization conflicts between APEs when processing immediate merging (more details in Section 3.4). After completing all multiplication operations, the APEs, under the coordination of a merging scheduler (detailed in Section 3.4), engage in the final merging stage to merge the remaining partial results into the final output matrix.</p><p>Locality-concurrency co-optimizations for global cache. The global cache in ACES is a critical component, designed to efficiently manage both matrix B rows and matrix C partial output rows. The PureFiber cache replacement policy is employed to manage cache lines effectively by considering both data locality and concurrency (more details in Section 3.5). The integration of a NB buffer in the global cache (more details in Section 3.6) supports the PureFiber policy and enhances performance by facilitating non-blocking accesses.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Adaptive Execution Flow</head><p>Inspired by ROW, in ACES, each MPE is tasked with a scalarvector multiplication, involving a single non-zero element from matrix A and the corresponding row in matrix B. This design addresses the inefficiencies of index intersection in InP and reduces the collective dependency among MPEs. Moreover, generating partial results at row granularity offers several advantages. It supports immediate merging, which pipelines both multiplication and merging processes. Additionally, it alleviates the challenges in OutP, where transferring entire partial matrices during the merging stage can significantly increase memory traffic. However, the ROW method compromises the reuse of matrix B, as non-zero elements from the same row of matrix A are often processed concurrently by MPEs, leading to requests for different rows of matrix B. To optimize the reuse of matrix B, we traverse the non-zero elements of matrix A column by column, similar to OutP. Figure <ref type="figure">3</ref>(a) demonstrates this execution flow when no condensing is applied to matrix A. While this execution flow enhances the reuse of matrix B, it can also lead to multiple MPEs generating partial results for the same output matrix row, particularly when matrix A has high sparsity. This situation introduces synchronization challenges among APEs during the immediate merging.</p><p>An aggressive condensed matrix representation for matrix A involves shifting all non-zero elements to the leftmost columns, which results in much denser condensed columns. Despite condensation, each non-zero element retains a record of its original column index to ensure computational accuracy <ref type="bibr">[61]</ref>. With the aggressive condensed matrix representation, traversal of non-zero elements of matrix A occurs through condensed columns, impacting the execution flow of SpMM. In terms of implementation, matrix A is stored in the CSR format, where each row is an ordered list of column indices. The elements sharing the same index in each ordered list are fetched and assigned to MPEs for parallel computing. As shown in Figure <ref type="figure">3</ref>(b), when four MPEs concurrently perform scalar-vector multiplications using elements from a condensed column of matrix A and corresponding rows from matrix B, the potential for reusing rows from matrix B diminishes due to simultaneous requests for different rows. However, aggressive condensing often generates distinct partial rows for the output matrix during parallel scalar-vector multiplications. This results in a diminished need for synchronization among APEs during immediate merging.</p><p>In an effort to strike a balance between data reuse and parallel computing efficiency, we introduce a moderately condensed matrix representation for matrix A. As depicted in Figure <ref type="figure">3</ref>(c), the columns of matrix A are divided into two distinct groups: the first half in one group and the second half in the other. Within each group, non-zero elements of matrix A are shifted to the leftmost columns, which is the same as in aggressive condensing. For the example in Figure <ref type="figure">3</ref>, the moderate condensing of matrix A enhances the reuse of matrix B when compared with aggressive condensing. Furthermore, it reduces synchronization challenges that might emerge without condensing.</p><p>Each condensing degree tries to offer a trade-off between data access and parallelism. However, given the complex and varied sparsity patterns in matrices, no single condensing degree can consistently outperform others across all workloads. This highlights the critical need for an adaptive mechanism, tailored to dynamically adjust to the unique sparse pattern of each input, thereby optimizing overall performance. ACES introduces such a mechanism, offering a spectrum of condensing degrees to ensure an optimized execution flow tailored to each workload.</p><p>In ACES, we employ a condensing adapter to dynamically adjust the representation of matrix A among three condensing degrees: none, moderate, and aggressive. We first partition the entire matrix into bands. Drawing on insights from <ref type="bibr">[32]</ref>, we recognize that adjacent rows with similar distributions of non-zero elements tend to have a stable row length (number of non-zero elements). Leveraging this observation, the partitioning of the matrix into bands is primarily based on row lengths, as indicated by the CSR offsets array, ensuring a fast and lightweight partitioning. A new band is established whenever the absolute difference in row length between adjacent rows surpasses a specified threshold, which we empirically set to be 10. Considering that each band displays a similar sparse pattern, we select the optimal condensing degree for each. For large bands, consisting of at least 256 rows, we identify the optimal condensing degree through the sampling phase. In the sampling phase, we execute three sample passes, each containing 32 rows. During these passes, we execute the SpMM with different condensing degrees and monitor the overall execution time, including both multiplication and immediate merging tasks. The condensing degree resulting in the best execution time is then applied to the remaining rows in the band, as it offers the most efficient balance between data reuse and parallelism. For small bands, we apply moderate condensing by default due to insufficient data for sampling, striking a balance between no and aggressive condensing.</p><p>Once the execution flow is determined, the fetcher for matrix A loads non-zero elements of matrix A into the global buffer ahead of execution. Concurrently, the fetcher for matrix B rows retrieves the row corresponding to the matrix A element being fetched and places it in the global cache. Each row of matrix B, stored as a fiber <ref type="bibr">[32,</ref><ref type="bibr">60]</ref>, is a list sorted by coordinate, consisting of the coordinates and values of each non-zero element. Then, the element from matrix A and the corresponding fiber of matrix B are dispatched to an available MPE.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Processing Elements</head><p>Each MPE in ACES is specifically designed for executing scalar-vector multiplications, a fundamental operation in SpMM. The partial output fiber produced by an MPE is filled into a corresponding selective queue (SQ), buffers the sequentially generated products from the multiplier. Due to the independence of each scalar-vector multiplication, every MPE can operate concurrently, avoiding collective dependencies and thereby maximizing processing element utilization. ACES adopts a distinctive one-to-one pairing of MPEs with APEs, facilitating efficient processing of different rows of the output matrix in SpMM. The bottom middle of Figure <ref type="figure">2</ref> shows the example of MPE executes the scalar-vector multiplication between &#279; 3,5 with the corresponding row B 5 .</p><p>In tandem with MPEs, APEs play a crucial role, particularly in the immediate merging of partial output fibers. Each APE handles the merging of newly produced partial output fibers from SQ with corresponding partial output fibers previously generated and stored in the global cache. This design allows APEs to initiate the merging process as soon as a partial fiber becomes available in the corresponding SQ, thus enhancing the efficiency of the system. The merging of each fiber is achieved by walking two pointers over the two fibers, comparing the corresponding coordinates, merging them accordingly if the coordinates match, and then advancing the pointers based on the comparison results. The outcome of this merging is a new fiber of matrix C, which is a sorted merge of the two input fibers and is then written back to the global cache for further processing. The bottom right of Figure <ref type="figure">2</ref> illustrates an APE executing the merging between two partial fibers. In instances where there is no partial fiber in the global cache that can be merged with the partial fiber loaded from the SQ, the APE will write the partial fiber into the cache directly. This approach is implemented to prevent the stalls that could arise from waiting for a matching fiber to be fetched from memory. Immediate merging at row granularity in ACES facilitates the easy storage of partial fibers in the global cache and ensures effective reuse of matrix C, consequently reducing memory traffic. The independent and concurrent processing by the APEs, in collaboration with the synchronization scheduler, enhances system parallelism and optimizes the utilization of MPEs. Meanwhile, the MPEs continue with subsequent multiplication tasks, further amplifying the parallel processing capabilities of ACES.</p><p>In immediate merging, APEs focus on merging newly generated partial fibers with those currently stored in the global cache. However, due to the limited capacity of the global cache, it is not feasible to keep all partial fibers there. Consequently, some partial fibers are periodically written back to DRAM. This necessitates a final merging stage after the completion of all scalar-vector multiplications. During the final merging, every APE in the system participates, with each APE merging two partial fibers at a time. ACES employs a merging scheduler to optimize this final merging process, ensuring the final output matrix is assembled efficiently.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Schedulers</head><p>In ACES, two schedulers are introduced: the synchronization scheduler, which strives to streamline the immediate merging process and reduce synchronization delays, and the merging scheduler, managing the final merging stage to minimize memory accesses.</p><p>Synchronization scheduler. Once the partial fibers are generated by MPEs, they are buffered in the SQs. ACES utilizes the synchronization scheduler to efficiently assign fibers to APEs for initiating the immediate merging process. The synchronization scheduler coordinates with SQs and APEs. It tracks the rows of partial fibers that the APEs are currently merging to schedule subsequent merge tasks that minimize stalls of the APEs due to synchronization issues.</p><p>When an APE becomes available, the synchronization scheduler evaluates the head fiber of the corresponding SQ. The design of the SQ acts similarly to a normal FIFO, but allows for selective access to stored fibers, enabling the scheduler to select an appropriate fiber to minimize synchronization conflicts. If the top fiber in an SQ cannot be processed due to another MPE currently updating the corresponding row in the cache, the scheduler selects the next available fiber that does not have a synchronization issue. This approach ensures continuous processing and minimizes idle time for the APEs. In cases where the top fibers in different SQs correspond to the same row, the scheduler randomly picks an available APE from the corresponding APEs to first merge these two partial fibers. After this initial merge, the APE then merges the result with the corresponding partial fibers stored in the cache. For scenarios without synchronization conflicts, the scheduler assigns the top fibers from the SQs directly to their corresponding APEs. With the help of the synchronization scheduler, execution stalls caused by synchronization issues are mitigated, and the parallelism of APEs is guaranteed.</p><p>Merging scheduler. In the final merging stage, the merging scheduler aims to minimize memory accesses. Drawing inspiration from SpArch <ref type="bibr">[61]</ref>, ACES adapts the Huffman tree, traditionally used in data compression for minimizing the total weighted path length of encoded symbols <ref type="bibr">[26]</ref>, to orchestrate the merging process for each row of the output matrix. Each Huffman tree in ACES is a binary tree, with leaf nodes representing partial fibers from a specific row of the output matrix. The weight of each node equates to the number of non-zero elements in the fiber it represents. The merging of fibers with the lowest weights forms internal nodes, each representing a merged result. The root node represents the fully merged fiber for that row. The bottom-up construction of the Huffman tree in ACES prioritizes the merging of fibers with fewer elements first. This approach not only reduces the number of memory loads and stores required during the final merging phase, but also decreases the total number of comparisons and operations needed, leading to a more efficient merging process. Figure <ref type="figure">4</ref> provides a simplified representation that assumes no intersection of non-zero elements between the leaf nodes, where the weight of each internal node is the sum of the weights of its child nodes. In practice, however, the actual number of non-zero elements post-merging is often lower due to the presence of intersections.</p><p>In the practical implementation of ACES, the Huffman tree is constructed using a priority queue. Initially, weights of leaf nodes, each representing a partial fiber, are entered into the queue. In each iteration, the two fibers with the lowest weights are extracted from the queue and merged, forming a new task encapsulated as a vector of fibers. With the updated weight, the merged fiber is then reinserted into the priority queue for potential future merging operations. This process repeats until all partial fibers corresponding to an output row are merged into a single fiber. Upon completing the merging tasks for one row, the priority queue is efficiently repurposed for the next, reducing the need to maintain a separate queue for each row. The merging scheduler holds the determined tasks temporarily in a small buffer, preserving the order in</p><p>Step 0 Fetch B 0</p><p>Step 1 Fetch B 1</p><p>Step 2 Fetch B 2</p><p>Step 3 Fetch B 4</p><p>Step 4 Fetch B 1</p><p>Step 5 Fetch B 0</p><p>Step 6 Fetch B 2</p><p>Step 7 Fetch B 3</p><p>Step</p><p>Step 0 Fetch B 0</p><p>Step 1 Fetch B 1</p><p>Step 2 Fetch B 2</p><p>Step 3 Fetch B 4</p><p>Step 4 Fetch B 1</p><p>Step 5 Fetch B 0</p><p>Step 6 Fetch B 2</p><p>Step 7 Fetch B 3</p><p>Step 8 Fetch B 0 Stall Stall Stall Replace B 1 Stall Replace B 4 Stall Stall Replace B 1 then B 2 Pure Fiber Pure Fiber Pure Fiber Pure Fiber which they were created. When an APE becomes available, the merging scheduler prioritizes selecting the next task that minimizes synchronization conflicts. The merging scheduler adheres to the optimal merging order prescribed by the Huffman trees and mitigates synchronization risks, enhancing the overall processing efficiency in the final stage of SpMM.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5">Global Cache and PureFiber Policy</head><p>The global cache in ACES, organized as a multi-banked, setassociative cache, stores fibers of matrix B and partial output fibers of matrix C, supporting cache line granularity accesses for flexible capacity sharing among various fibers. During MPE operation in ACES, multiplying an element from matrix A with elements from the corresponding row in matrix B leads to concurrent requests for cache lines of matrix B. A single cache miss among these requests can stall the scalarvector multiplication process, as the MPE needs all required data from matrix B to produce a complete partial fiber. We define pure fiber as a scenario where cache lines of a fiber are accessed concurrently without any cache misses. Achieving a high number of pure fibers in SpMM is crucial for mitigating cache stalls. Contrasting with traditional accelerators <ref type="bibr">[32,</ref><ref type="bibr">61]</ref>, which mainly focus on reducing cache misses, we have developed PureFiber, a concurrency-aware cache replacement policy designed to prioritize achieving more pure fibers in SpMM. PureFiber integrates data locality and concurrency considerations for each cache line when making eviction decisions. It employs the Next Request Distance (RD), dynamically capturing the reuse distance to the fiber, which indicates the expected time until the fiber is next requested. The RD value is initialized upon cache line insertion or a cache hit and is decremented until the line is reused, thereby serving as a measure of temporal locality. Additionally, PureFiber evaluates the Fiber Density (FD), representing the number of cache lines in the corresponding fiber and serving as an indicator of potential concurrent accesses. When making eviction decisions, PureFiber selects the cache line with the highest combined sum of RD and FD for eviction. If multiple candidates are present, PureFiber prioritizes evicting the line with higher fiber density. By balancing data locality with concurrency, PureFiber optimizes cache management, focusing on increasing the number of pure fibers to support uninterrupted computations and enhance overall performance.</p><p>Figure <ref type="figure">5</ref> presents a simplified case study, all the cache capacity is used to store lines of matrix B. The leftmost column represents a condensed column from the condensed A matrix. Each element is annotated with its original column number. For instance, the element labeled &#255;0 originates from column 0 in the original matrix. Figure <ref type="figure">5</ref> top displays row fibers of matrix B, segmented according to the cache line length, as indicated at the top of the figure (B 0 to B 4 ). In this example, the cache can store at most 8 lines.</p><p>Figure <ref type="figure">5</ref>(a) illustrates Belady's OPT policy <ref type="bibr">[4]</ref>, which focuses solely on temporal locality. This policy prioritizes evicting cache lines with the largest RD values. From timestamps 0 to 2, lines from B 0 , B 1 , and B 2 are loaded into the cache. At timestamp 3, to accommodate lines from B 4 , the policy first evicts lines from B 2 and B 0 because they have larger RD values than those from B 1 . Then, to fully accommodate B 4 , line B 1-0 is also evicted. At timestamp 4, the miss of a single line from B 1 , despite four hits, leads to a stall in output fiber production. As demonstrated in Figure <ref type="figure">5</ref>(a), Belady's policy results in only one pure fiber. Figure <ref type="figure">5</ref>(b) illustrates the cache replacement decisions made by PureFiber under the same access pattern. In a non-blocking cache system capable of handling concurrent misses (detailed in Section 3.6), Pure-Fiber achieves three pure fibers. By considering data locality and prioritizing the retention of fibers with lower density, such as B 0 and B 2 , PureFiber secures two additional pure fibers at timestamps 5 and 6. This example underscores the ability of PureFiber to enhance performance by balancing considerations of locality and concurrency.</p><p>In ACES, PureFiber manages the cache lines for fibers in matrix B as well as the partial output fibers of matrix C. By prioritizing the retention of output fibers with smaller RD values, PureFiber increases the probability that corresponding partial results previously generated remain available in the cache for immediate use. Considering the locality of the partial output fibers of matrix C, APEs are able to participate in immediate merging, which enhances their utilization and reduces the number of tasks needed in the final merging stage. Additionally, the retention of fibers with smaller FD values helps prevent the APEs from engaging in timeconsuming immediate merging tasks, especially when an</p><p>NB Buffer Tag Subentries Global Cache Tag Data Cache Request A B Cache Miss C D E Memory Request Return Data Response Pending Misses Cache Hit MPE generates a low-density fiber that requires to be merged immediately with a much denser partial fiber already in the cache. By prioritizing the eviction of higher-density fibers, PureFiber improves the overall pipeline efficiency in ACES.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.6">Non-Blocking Buffer</head><p>Traditional accelerators strive for speedups through extensive parallel computation. However, the effectiveness of parallelism is often constrained by the performance limitations of the memory system. In particular, a conventional blocking global cache stalls upon a cache miss, waiting until the missing cache line is retrieved from memory. This behavior significantly hinders SpMM performance due to its irregular access patterns and frequent cache misses. In response, ACES integrates an NB buffer with the global cache, creating an efficient non-blocking cache system. The non-blocking cache <ref type="bibr">[29]</ref>, widely adopted in modern processors, ensures that cache misses do not stall subsequent cache accesses. A non-blocking cache can handle a number of outstanding misses, significantly improving data concurrency <ref type="bibr">[2,</ref><ref type="bibr">38,</ref><ref type="bibr">46]</ref>. Figure <ref type="figure">6</ref> shows the organization and workflow of the non-blocking cache in ACES. For each global cache request ( A ), when a miss occurs, the corresponding miss information is stored in an entry of the NB buffer ( B ). NB buffer tracks all outstanding cache misses, with each entry referring to one missing cache line and containing multiple subentries for handling multiple misses to the same line. If it is the first miss to a specific cache line, a memory request is issued ( C ); subsequent misses for the same line are consolidated within the corresponding entry. While the cache continues to service other requests, the NB buffer fetches the missing data from the main memory. Once the data is retrieved, the buffer updates the global cache with the new data and resolves all associated misses ( D ). Once the misses are resolved, the corresponding entry in the NB buffer is released for future cache misses. All PEs or fetchers waiting for that cache line are then notified that the data is available in the cache ( E ). By allowing the cache to handle other requests during miss processing concurrently, the non-blocking design in ACES significantly reduces memory stall cycles and enhances concurrency. Table 1 presents the detailed configuration of ACES. The area of ACES is measured by writing RTL for core components, including MPEs, APEs, the condensing adapter, and schedulers, and synthesizing them using Synopsys Design Compiler on the TSMC 28 nm technology. CACTI 7.0 [3] is used to model the overhead of global cache, global buffer, and NB buffer.</p><p>For interconnections, we follow the approach used in previous work <ref type="bibr">[32]</ref> and model swizzle-switch networks <ref type="bibr">[48]</ref> to connect banks of the global cache with PEs, thereby facilitating concurrent accesses. Furthermore, a cycle-accurate simulator is built to accurately measure performance, PE utilization, and memory traffic. This simulator is utilized to model interactions between hardware components and implement ACES adaptive execution flow. We evaluate the performance of ACES using the SuiteSparse matrix collection <ref type="bibr">[12]</ref>, which is a widely recognized benchmark in prior works <ref type="bibr">[32,</ref><ref type="bibr">61]</ref>. Table <ref type="table">2</ref> presents a selection of 18 sparse matrices from this dataset, chosen for their wide range of sparse patterns and densities. The diverse set of matrices provides a comprehensive basis for evaluating the adaptability and efficiency of ACES across varying matrix characteristics. To construct SpMM workloads, we adhere to the methodology outlined in SPADA, wherein a square matrix is multiplied by itself and a non-square matrix is multiplied by its transpose, ensuring consistency and fairness in our evaluation.</p><p>For comparison, we selected three state-of-the-art SpMM accelerators: SIGMA <ref type="bibr">[45]</ref>, SpArch <ref type="bibr">[61]</ref>, and SPADA <ref type="bibr">[32]</ref>. SIGMA, an InP-based accelerator, and SpArch, which adopts the OutP execution flow, are chosen to represent two fundamental execution flows in SpMM accelerators. SPADA is included for its adaptive execution flow, which incorporates the ROW execution flow as one of its modes, offering a comprehensive perspective for comparison. To ensure fair comparisons among all accelerators, we standardized the hardware configurations. The number of multipliers was aligned to 16, following the configurations of SpArch and SPADA. For SIGMA, we scale down the Flex-DPE to a width of 16, reduce the SRAM buffer size to 1.5 MB, and increase the operating frequency to 1 GHz. Furthermore, each accelerator utilizes the same HBM module for off-chip data storage. The data precision across all accelerators is standardized to 64-bit double-precision, which is a common requirement in scientific computing applications <ref type="bibr">[32]</ref>. Regarding input formats, while ACES processes inputs in the CSR format, the other accelerators use their respective recommended formats, ensuring optimal operational conditions for each. When evaluating the performance of ACES, we include the cost associated with the sampling phases, ensuring a comprehensive and transparent performance comparison.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Experiment Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Performance Comparisons</head><p>Figure <ref type="figure">7</ref> illustrates the performance of various SpMM accelerators, normalized to that of SpArch, across all evaluated workloads. We make four major observations. First, ACES consistently outperforms all state-of-the-art accelerators across every workload, highlighting its exceptional adaptability to diverse sparse patterns. On average, ACES achieves significant performance gains of 25.5&#215;, 8.9&#215;, and 2.1&#215; over SIGMA, SpArch, and SPADA, respectively. Second, SIGMA faces challenges in most workloads, despite utilizing a bitmap format for sparse matrix representation. Third, SpArch achieves a 2.9&#215; speedup over SIGMA on average, which is attributed to the better input reuse of OutP execution flow and its efforts to reduce off-chip traffic during the merging phase. Fourth, SPADA, with its WA execution flow, provides an adaptive execution flow and better performance compared with accelerators with a fixed execution flow. On average, SPADA leads to a 4.2&#215; speedup over SpArch and a 12.0&#215; speedup over SIGMA.</p><p>Figure <ref type="figure">8</ref> presents a comparison of the total off-chip memory traffic for matrix B and partial outputs across various workloads. The memory traffic associated with matrix B is crucial for the effectiveness of multiplications. Similarly, efficient handling of reads and writes for partial outputs is vital for the merging processes. Figure <ref type="figure">8</ref> demonstrates that ACES incurs the lowest off-chip memory traffic compared to recent SpMM accelerators. On average, the off-chip memory traffic of SIGMA is 11.6&#215; higher than that of ACES, while the traffic for SpArch and SPADA is 4.4&#215; and 3.1&#215; higher, respectively. SIGMA achieves efficient output reuse but faces significant challenges in input memory traffic. SpArch strives to reduce off-chip traffic during the merging phase through 0.0 0.5 1.0 1.5 2.0 2.5 3.0 cs az cc cg ca ee f3 mb m2 of pg pm p3 rc sc wg w1 wv GM Normalized Traffic SIGMA SpArch SPADA ACES 3.4 3.2 12.0 Figure 8. Memory traffic comparison among SIGMA, SpArch, SPADA, and ACES. 0.0 0.2 0.4 0.6 0.8 1.0 cs az cc cg ca ee f3 mb m2 of pg pm p3 rc sc wg w1 wv GM PE Utilization SIGMA SpArch SPADA ACES Figure 9. PE utilization comparison among SIGMA, SpArch, SPADA, and ACES.</p><p>an aggressive condensing representation of matrix A. However, it disrupts input reuse, leading to excessive fetching of different rows of matrix B and causing cache thrashing. SPADA, relying on its adaptive execution flow, reduces memory traffic compared with SIGMA and SpArch; however, it still exhibits a significant gap when compared with ACES. ACES, with its adaptive execution flow, effectively balances input and output reuse. The design of immediate merging, in conjunction with the PureFiber cache replacement policy, ensures finer granularity in merging partial results. Furthermore, the merging scheduler efficiently manages the final merging stage, collectively contributing to optimal traffic management.</p><p>Figure <ref type="figure">9</ref> compares PE utilization among four SpMM accelerators. On average, GAMMA, SpArch, SPADA, and ACES exhibit PE utilization rates of 54.8%, 80.0%, 86.9%, and 95.1%, respectively. Notably, ACES consistently maintains PE utilization rates above 90.0% in 15 of the 18 evaluated workloads. The high PE utilization of ACES is attributable to two main factors: First, its adaptive execution flow effectively mitigates synchronization risks. Second, the architecture of ACES features a one-to-one pairing of MPEs with APEs. This architecture, combined with efficient task scheduling by the synchronization scheduler during the immediate merging phase, reduces the collective dependency of MPEs and enhances the parallelism of APEs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Performance with Different Condensing Degrees</head><p>To assess the effectiveness of the novel adaptive execution flow designed for ACES, we conduct a performance comparison across three static condensing degrees: none (No), aggressive (Ag), and moderate (Mo), complemented by adaptive condensing (Ad). For each condensing degree, we implement two distinct cache replacement policies: the traditional Least Recently Used (LRU) policy and the PureFiber (PF) policy. Figure <ref type="figure">10</ref> depicts the overall speedup of various ACES implementations over SpArch, highlighting the performance for each specific combination of condensing degrees and cache replacement policies, such as Ag-LRU (aggressive condensing with LRU policy) and Mo-PF (moderate condensing with PureFiber policy). Importantly, as the original ACES incorporates adaptive condensing with the PureFiber policy, the performance of the Ad-PF combination is distinctly identified as ACES in Figure <ref type="figure">10</ref>.</p><p>The results presented in Figure <ref type="figure">10</ref> demonstrate the superior performance of adaptive condensing when compared with static condensing configurations under different cache replacement policies. Specifically, with the traditional LRU policy, adaptive condensing (Ad-LRU) achieves a 7.6&#215; speedup over SpArch, outperforming No-LRU, Ag-LRU, and Mo-LRU by 29.6%, 13.8%, and 11.7%, respectively. Similarly, when utilizing the concurrency-aware PureFiber cache replacement policy, adaptive condensing (ACES) achieves an 8.9&#215; speedup over SpArch, surpassing the No-PF, Ag-PF, and Mo-PF configurations by 27.4%, 12.6%, and 7.4%, respectively.</p><p>Adaptive condensing, whether working with the LRU or the PureFiber cache replacement policy, consistently delivers significant speedup gains over static condensing configurations. This performance advantage highlights the effectiveness of adaptive condensing in dynamically balancing data reuse and parallelism efficiency, providing an optimal execution flow finely tuned to the diverse sparse patterns of matrices. Such adaptability not only enhances the computational efficiency of ACES, but also establishes adaptive condensing as a pivotal feature for its versatility in diverse applications and system configurations. The ability of adaptive condensing to work effectively with various cache policies further underscores its broad applicability and flexibility in different system configurations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Performance with Different Cache Replacement Policies</head><p>To highlight the innovation of the concurrency-aware cache replacement policy PureFiber in ACES, we compare it with two conventional cache replacement policies, LRU and RD, for managing the global cache. Specifically, the RD policy is designed to focus exclusively on the next request distance, prioritizing the eviction of cache lines that are furthest from being requested again. Both the LRU and RD policies aim to enhance data locality with the primary objective of minimizing cache misses. This comparative analysis is intended to showcase the distinctive advantages and potential of the concurrency-aware cache replacement strategy embodied by PureFiber. Unlike conventional policies that focus mainly on data locality, PureFiber considers both data locality and concurrency, potentially offering a more nuanced approach to cache management.</p><p>Figure <ref type="figure">11</ref> illustrates the overall speedup of various ACES implementations compared with SpArch, emphasizing the performance enhancements achieved with three distinct cache replacement policies: LRU, RD, and PureFiber. These policies are evaluated across three static condensing degrees: none, moderate, and aggressive, providing a comprehensive view of their impact on the performance of ACES. The key observation from Figure <ref type="figure">11</ref> is that across all static condensing degrees, the PureFiber cache replacement policy consistently leads to the highest speedups. Specifically, without condensing, the PureFiber policy (No-PF) achieves a speedup of 6.9&#215;, which exceeds the performance of No-LRU and No-RD by 18.0% and 10.7%, respectively. With aggressive condensing, a 7.9&#215; speedup of PureFiber (Ag-PF) is observed, surpassing Ag-LRU and Ag-RD by 17.2% and 10.0%, respectively. When utilizing the moderate condensing degree, PureFiber (Mo-PF) shows the most significant improvement with an 8.2&#215; speedup, outperforming Mo-LRU and Mo-RD by 20.5% and 10.0%, respectively. These results indicate that utilizing the concurrency-aware PureFiber policy in cache replacement not only offers the best performance in the absence of condensing, but also provides superior results when working with both moderate and aggressive static condensing, underscoring its effectiveness.</p><p>Figure <ref type="figure">12</ref> displays the performance of ACES implementations with adaptive condensing across various cache replacement policies: LRU, RD, and PureFiber. Figure <ref type="figure">12</ref>(a) illustrates the speedup achieved by the original ACES implementation employing the PureFiber policy, as well as the speedup of Ad-RD, both in comparison with Ad-LRU. We observe that, when working with adaptive condensing, the consideration of both data locality and concurrency enables ACES with the PureFiber policy to achieve an average improvement of 15.9% over the LRU baseline. In contrast, the average speedup for Ad-RD, which does not consider data concurrency, is 14.8%. 0 10 20 30 40 8 16 32 64 96 128 Speedup over no NB Buffer (%) Figure 13. Performance of ACES across various NB buffer sizes.</p><p>Figure <ref type="figure">12</ref>(b) offers a detailed analysis of the PureFiber performance. This figure highlights that the PureFiber policy effectively reduces cache stalls compared with other policies. Specifically, ACES with the PureFiber policy, achieves an average cache stall reduction of 21.8% over Ad-LRU, whereas Ad-RD, which considers only reuse distance, sees a cache stall reduction of 13.0%. These observations highlight that although reuse distance considerations can improve the global cache performance, the integration of concurrency awareness into the global cache management can yield additional benefits. By considering both data locality and concurrency, the PureFiber policy substantially diminishes cache stalls, thereby enhancing the overall efficiency of the cache system in SpMM accelerators like ACES.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Performance with Different NB Buffer Sizes</head><p>To showcase the effectiveness of integrating a NB buffer with the global cache in ACES, Figure <ref type="figure">13</ref>  only mitigates global cache stalls due to cache misses but also enables the global cache to support higher concurrency, thereby increasing parallelism. Second, Figure <ref type="figure">13</ref> reveals that the performance of ACES tends to stabilize once the number of subentries of the NB buffer reaches 64. At this point, a substantial performance improvement of 35.8% is observed. Additionally, it is important to consider that a larger NB buffer also introduces additional overhead. Therefore, in balancing performance gains with potential overheads, we set the default number of subentries in the NB buffer for ACES to 64.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5">Area and Power</head><p>In ACES, the breakdown of the area is detailed in Table <ref type="table">3</ref>, showing a total area of 3.52mm 2 . Similar to other SpMM accelerators, a significant portion of the area is occupied by the global cache. Specifically, the 1MB global cache in ACES constitutes 59.4% of the total area. Compared to existing works like SPADA and SpArch, the global cache capacity in ACES is relatively modest. By integrating a lightweight NB buffer with the global cache, ACES achieves reduced area overhead while improving performance relative to accelerators with larger caches. For comparison, the total areas of SPADA and SpArch at 28 nm technology are 6.32mm 2 and 13.96mm 2 , respectively. Additionally, the power consumption evaluation of ACES reveals that it consumes a total of 2.83W of power.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Additional Related Works</head><p>Accelerating SpMM on CPU and GPU. Prior research has focused on accelerating SpMM on both CPU and GPU architectures <ref type="bibr">[9,</ref><ref type="bibr">10,</ref><ref type="bibr">34,</ref><ref type="bibr">41,</ref><ref type="bibr">55]</ref>. MKL <ref type="bibr">[55]</ref> offers math routines optimized for parallel computation using OpenMP on CPUs. For GPUs, cuSPARSE <ref type="bibr">[41]</ref> enhances SpMM efficiency by parallelizing computations across matrix rows and using a hash table for merging partial results. CUSP <ref type="bibr">[9]</ref> also adopts a parallel approach but employs a sorting algorithm for merging computations from different rows. SpMM Accelerators. In this work, ACES is compared against three state-of-the-art SpMM accelerators: SIGMA <ref type="bibr">[45]</ref>, SpArch <ref type="bibr">[61]</ref>, and SPADA <ref type="bibr">[32]</ref>. Additionally, a variety of other accelerators have been designed specifically to optimize SpMM <ref type="bibr">[17,</ref><ref type="bibr">22,</ref><ref type="bibr">25,</ref><ref type="bibr">39,</ref><ref type="bibr">43,</ref><ref type="bibr">50,</ref><ref type="bibr">60]</ref>. Most of these accelerators adopt a fixed execution flow: for instance, Ex-Tenor <ref type="bibr">[22]</ref> utilizes an InP execution flow; OuterSPACE <ref type="bibr">[43]</ref> and Spaghetti <ref type="bibr">[25]</ref> adopt an OutP execution flow, while GAMMA <ref type="bibr">[60]</ref> and MatRaptor <ref type="bibr">[50]</ref> employ a ROW execution flow. ACES distinguishes itself by dynamically adjusting its execution flow in response to the sparsity patterns of the input matrices. Moreover, ACES innovatively leverages concurrent data accesses to develop a cache replacement policy specifically designed for the global cache of an SpMM accelerator. These novel features, including the adaptive execution flow and concurrency-aware cache management, significantly accelerate SpMM processing.</p><p>Sparse DNN Accelerators. As the processing of diverse sparse matrices in DNNs becomes increasingly critical, the efficiency of sparse DNN accelerators <ref type="bibr">[13,</ref><ref type="bibr">18,</ref><ref type="bibr">31,</ref><ref type="bibr">39,</ref><ref type="bibr">49,</ref><ref type="bibr">62]</ref> grows more paramount. Flexagon <ref type="bibr">[39]</ref>, a configurable DNN accelerator, adapts its matrix multiplication execution flow to accommodate various DNN layers. It introduces a Merger-Reduction Network (MRN) to modify the execution flow across InP, OutP, and ROW based on o ine analysis of matrix dimensions and sparsity patterns. CANDLES <ref type="bibr">[18]</ref> adopts a channel-first architecture that traverses activations and weights to enhance the temporal locality of partial sums updated for an output neuron. It also incorporates a pixelfirst compression method where matrices are segmented into groups for compression, storing non-zero values in tiles. Unlike Flexagon <ref type="bibr">[39]</ref>, which determines the execution flow through o ine analysis, and CANDLES <ref type="bibr">[18]</ref>, which employs a static strategy, ACES dynamically adjusts its execution flow in response to the sparsity patterns of incoming matrices. This real-time adaptability allows ACES to balance parallel computing efficiency with data reuse, while simultaneously reducing overhead.</p><p>Systolic Array-based Sparse DNN Accelerators. The success of the Tensor Processing Unit (TPU) <ref type="bibr">[27]</ref>, which introduces the systolic array architecture, has inspired many recent systolic array-based DNN accelerators <ref type="bibr">[11,</ref><ref type="bibr">16,</ref><ref type="bibr">21,</ref><ref type="bibr">30,</ref><ref type="bibr">56]</ref>. In order to enhance the utilization and computational efficiency of systolic arrays in handling matrices with diverse sparsity and sizes, column packing has been proposed. Kung et al. <ref type="bibr">[30]</ref> address this by grouping the columns of a sparse matrix, combining multiple sparse columns into a single dense column, and mapping each group to a single column of the systolic array, significantly improving utilization efficiency. Inspired by <ref type="bibr">[30]</ref>, Sparse-TPU <ref type="bibr">[21]</ref> introduces partition-wise packing, which divides the matrix into bands and packs columns within each band to minimize data conflicts. Moreover, Sparse-TPU employs a collision-aware algorithm to enhance the packing density, even in the presence of conflicts. While column packing significantly enhances the computational efficiency, it requires sparse matrix reordering and is performed o ine, introducing extra pre-processing overhead. Conversely, ACES determines and adjusts adaptive condensing degrees dynamically during runtime, eliminating the need for pre-processing and the associated overheads.</p><p>Concurrent Data Accesses. Modern architectures now widely support data concurrency. For example, out-of-order execution <ref type="bibr">[53]</ref> and simultaneous multithreading <ref type="bibr">[54]</ref> have been instrumental in improving pipeline utilization. Additionally, advancements in memory and cache design like multi-port <ref type="bibr">[63]</ref>, pipelined <ref type="bibr">[1]</ref>, and non-blocking <ref type="bibr">[28]</ref> caches allow more access to coexist within the same cycle, which substantially improves throughput and reduce memory access delays. Concurrent data accesses are utilized for performance modeling <ref type="bibr">[40,</ref><ref type="bibr">51,</ref><ref type="bibr">52]</ref> and optimizations <ref type="bibr">[35,</ref><ref type="bibr">36,</ref><ref type="bibr">38,</ref><ref type="bibr">46,</ref><ref type="bibr">58]</ref>. The C-AMAT model <ref type="bibr">[52]</ref> enhances the traditional AMAT model <ref type="bibr">[23]</ref> by quantitatively evaluating the combined impact of memory access locality and concurrency, accounting for data access overlaps. In the realm of cache management, the MLP-aware cache replacement policy <ref type="bibr">[46]</ref> and the CARE framework <ref type="bibr">[38]</ref> analyze the cost of each cache miss amid multiple concurrent outstanding accesses, guiding cache replacement decisions to reduce stalls and enhance performance. CHROME <ref type="bibr">[35]</ref> offers a holistic solution by integrating cache replacement and bypassing with pattern-based prefetching, applying concurrency-aware system-level feedback to refine decision-making. In contrast to these concurrency-aware cache management studies <ref type="bibr">[35,</ref><ref type="bibr">37,</ref><ref type="bibr">38,</ref><ref type="bibr">46]</ref> that focus on general cache management enhancements, PureFiber is specifically designed for SpMM, leveraging the unique data access patterns of SpMM to optimize global cache usage in accelerators.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusion</head><p>In this paper, we introduced ACES, an innovative SpMM accelerator. ACES supports an adaptive execution flow, adept at efficiently processing matrices with a wide range of sparse patterns. It also integrates co-optimizations of data locality and concurrency within its global cache, effectively reducing memory stalls and enhancing data access concurrency. The hardware architecture of ACES is meticulously tailored to complement its adaptive execution capabilities and cache optimizations, facilitating fine-granularity parallelism. Our comprehensive evaluations indicate that ACES consistently outperforms current state-of-the-art SpMM accelerators, underscoring its significant potential in enabling efficient computing solutions for diverse applications.</p></div></body>
		</text>
</TEI>
