Sparse matrices are very common types of information used in scientific and machine learning applications including deep neural networks. Sparse data representations lead to storage efficiencies by avoiding storing zero values. However, sparse representations incur metadata computational overheads – soft- ware first needs to find row/column locations of non-zero val- ues before performing necessary computations. Such metadata accesses involve indirect memory accesses (of the form a[b[i]] where a[.] and b[.] are large arrays) and they are cache and prefetch-unfriendly, resulting in frequent load stalls. In this paper, we will explore a dedicated hardware for a memory-side accelerator called Hardware Helper Thread (HHT) that performs all the necessary index computations to fetch only the nonzero elements from sparse matrix and sparse vector and supply those values to the primary core, creating heterogeneity within a single CPU core. We show both performance gains and energy savings of HHT for sparse matrix-dense vector multiplication (SpMV) and sparse matrix- sparse vector multiplication (SpMSpV). The ASIC HHT shows average performance gains ranging between 1.7 and 3.5 de- pending on the sparsity levels, vector-widths used by RISCV vector instructions and if the Vector (in Matrix-Vector multi- plication) is sparse or dense. We also show energy savingsmore »
This content will become publicly available on October 30, 2023
Hardware accelerator thread for unstructured sparse data processing
Sparse matrix-dense vector (SpMV) multiplication is inherent in
most scientific, neural networks and machine learning algorithms.
To efficiently exploit sparsity of data in the SpMV computations,
several compressed data representations have been used. However,
the compressed data representations of sparse date can result in
overheads for locating nonzero values, requiring indirect memory
accesses and increased instruction count and memory access delays.
We call these translations of compressed representations as
metadata processing. We propose a memory-side accelerator for
metadata (or indexing) computations and supplying only the required
nonzero values to the processor, additionally permitting an
overlap of indexing with core computations on nonzero elements.
In this contribution, we target our accelerator for low-end microcontrollers
with very limited memory and processing capabilities.
In this paper we will explore two dedicated ASIC designs of the
proposed accelerator that handles the indexed memory accesses for
compressed sparse row (CSR) format working alongside a simple
RISC-like programmable core. One version of the the accelerator
supplies only vector values corresponding to nonzero matrix values
and the second version supplies both nonzero matrix and matching
vector values for SpMV computations. Our experiments show
speedups ranging between 1.3 and 2.1 times for SpMV for different
levels of sparsities. Our accelerator also results in energy savings
ranging between 15.8% and 52.7% over different matrix sizes, when
compared to the baseline system with primary more »
- Award ID(s):
- 1828105
- Publication Date:
- NSF-PAR ID:
- 10347473
- Journal Name:
- International Conference on Computer Aided Design
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Graph processing recently received intensive interests in light of a wide range of needs to understand relationships. It is well-known for the poor locality and high memory bandwidth requirement. In conventional architectures, they incur a significant amount of data movements and energy consumption which motivates several hardware graph processing accelerators. The current graph processing accelerators rely on memory access optimizations or placing computation logics close to memory. Distinct from all existing approaches, we leverage an emerging memory technology to accelerate graph processing with analog computation. This paper presents GRAPHR, the first ReRAM-based graph processing accelerator. GRAPHR follows the principle of near-data processing and explores the opportunity of performing massive parallel analog operations with low hardware and energy cost. The analog computation is suitable for graph processing because: 1) The algorithms are iterative and could inherently tolerate the imprecision; 2) Both probability calculation (e.g., PageRank and Collaborative Filtering) and typical graph algorithms involving integers (e.g., BFS/SSSP) are resilient to errors. The key insight of GRAPHR is that if a vertex program of a graph algorithm can be expressed in sparse matrix vector multiplication (SpMV), it can be efficiently performed by ReRAM crossbar. We show that this assumption is generally true formore »
-
The record-breaking performance of deep neural networks (DNNs) comes with heavy parameter budgets, which leads to external dynamic random access memory (DRAM) for storage. The prohibitive energy of DRAM accesses makes it nontrivial for DNN deployment on resource-constrained devices, calling for minimizing the movements of weights and data in order to improve the energy efficiency. Driven by this critical bottleneck, we present SmartDeal, a hardware-friendly algorithm framework to trade higher-cost memory storage/access for lower-cost computation, in order to aggressively boost the storage and energy efficiency, for both DNN inference and training. The core technique of SmartDeal is a novel DNN weight matrix decomposition framework with respective structural constraints on each matrix factor, carefully crafted to unleash the hardware-aware efficiency potential. Specifically, we decompose each weight tensor as the product of a small basis matrix and a large structurally sparse coefficient matrix whose nonzero elements are readily quantized to the power-of-2. The resulting sparse and readily quantized DNNs enjoy greatly reduced energy consumption in data movement as well as weight storage, while incurring minimal overhead to recover the original weights thanks to the required sparse bit-operations and cost-favorable computations. Beyond inference, we take another leap to embrace energy-efficient training, by introducingmore »
-
SpMV, the product of a sparse matrix and a dense vector, is emblematic of a new class of applications that are memory bandwidth and communication, not flop, driven. Sparsity and randomness in such computations play havoc with performance, especially when strong, instead of weak, scaling is attempted. In this study we develop and evaluate a hybrid implementation for strong scaling of the Compressed Vectorization-oriented sparse Row (CVR) approach to SpMV on a cluster of Intel Xeon Phi Knights Landing (KNL) processors. We show how our hybrid SpMV implementation achieves increased computational performance, yet does not address the dominant communication overhead factor at extreme scale. Issues with workload distribution, data placement, and remote reductions are assessed over a range of matrix characteristics. Our results indicate that as P ! 1 communication overhead is by far the dominant factor despite improved computational performance.
-
SpMV, the product of a sparse matrix and a dense vector, is emblematic of a new class of applications that are memory bandwidth and communication, not flop, driven. Sparsity and randomness in such computations play havoc with performance, especially when strong, instead of weak, scaling is attempted. In this study we develop and evaluate a hybrid implementation for strong scaling of the Compressed Vectorization-oriented sparse Row (CVR) approach to SpMV on a cluster of Intel Xeon Phi Knights Landing (KNL) processors. We show how our hybrid SpMV implementation achieves increased computational performance, yet does not address the dominant communication overhead factor at extreme scale. Issues with workload distribution, data placement, and remote reductions are assessed over a range of matrix characteristics. Our results indicate that as P ! 1 communication overhead is by far the dominant factor despite improved computational performance.