Sparse matrices are a very common data type in scientific and machine learning applications, including deep neural networks. Sparse data representations improve storage efficiency by not storing zero values. However, sparse representations incur metadata processing overheads: software must first compute the row/column locations of nonzero values before performing the actual computations. These metadata accesses involve indirect memory accesses (of the form a[b[i]], where a[.] and b[.] are large arrays) that are cache- and prefetch-unfriendly, resulting in frequent load stalls.
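The a[b[i]] pattern can be seen in a plain CSR-based SpMV kernel; the sketch below is illustrative (not the paper's code) and uses conventional CSR array names. Every access x[col_idx[k]] depends on a prior metadata load, which is what defeats hardware prefetchers:

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """SpMV over a CSR matrix: y = A @ x.

    col_idx is the metadata array; x[col_idx[k]] is the
    a[b[i]]-style indirect load described above.
    """
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]  # indirect, cache-unfriendly
    return y
```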
In this paper, we explore dedicated hardware for a memory-side accelerator called the Hardware Helper Thread (HHT), which performs all the index computations needed to fetch only the nonzero elements from a sparse matrix and a sparse vector and supplies those values to the primary core, creating heterogeneity within a single CPU core. We show both performance gains and energy savings of the HHT for sparse matrix-dense vector multiplication (SpMV) and sparse matrix-sparse vector multiplication (SpMSpV). The ASIC HHT shows average speedups ranging between 1.7x and 3.5x, depending on the sparsity level, the vector widths used by the RISC-V vector instructions, and whether the vector (in the matrix-vector multiplication) is sparse or dense. We also show energy savings of 19% on average when the ASIC HHT is used compared to the baseline (for SpMV), and the HHT requires only 38.9% of a RISC-V core's area.
Hardware accelerator thread for unstructured sparse data processing
Sparse matrix-dense vector (SpMV) multiplication is inherent in most scientific, neural network, and machine learning algorithms. To efficiently exploit the sparsity of data in SpMV computations, several compressed data representations have been used. However, compressed representations of sparse data can incur overheads for locating nonzero values, requiring indirect memory accesses and increasing instruction counts and memory access delays. We call this translation of compressed representations metadata processing. We propose a memory-side accelerator that performs the metadata (or indexing) computations and supplies only the required nonzero values to the processor, additionally permitting an overlap of indexing with core computations on the nonzero elements. In this contribution, we target the accelerator at low-end microcontrollers with very limited memory and processing capabilities.
In this paper, we explore two dedicated ASIC designs of the proposed accelerator that handle the indexed memory accesses for the compressed sparse row (CSR) format, working alongside a simple RISC-like programmable core. One version of the accelerator supplies only the vector values corresponding to nonzero matrix values; the second version supplies both the nonzero matrix values and the matching vector values for SpMV computations. Our experiments show speedups ranging between 1.3x and 2.1x for SpMV at different sparsity levels. Our accelerator also yields energy savings ranging between 15.8% and 52.7% over different matrix sizes, compared to a baseline system in which the primary RISC-V core performs all computations. We use smaller synthetic matrices with different sparsities and larger real-world matrices with higher sparsities (below 1% nonzeros) in our experimental evaluations.
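The division of labor between the helper and the core can be sketched as follows. This is a hypothetical software model of the second accelerator version (all names are ours, not the paper's): the helper resolves CSR metadata and streams matched operand pairs, so the core performs only dense multiply-accumulates with no index arithmetic:

```python
def helper_gather(row_ptr, col_idx, vals, x):
    """Model of the helper: per row, stream (nonzero matrix value,
    matching vector value) pairs, hiding all CSR index computation."""
    for i in range(len(row_ptr) - 1):
        yield [(vals[k], x[col_idx[k]])
               for k in range(row_ptr[i], row_ptr[i + 1])]

def core_compute(streams):
    """Model of the core: consume operand pairs and multiply-accumulate;
    it never touches row_ptr or col_idx."""
    return [sum(a * b for a, b in row) for row in streams]
```

In hardware, the gather loop runs concurrently with the core's accumulation, which is the indexing/compute overlap the abstract describes.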
- Award ID(s): 1828105
- NSF-PAR ID: 10347473
- Date Published:
- Journal Name: International Conference on Computer Aided Design
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Graph processing has recently received intensive interest in light of a wide range of needs to understand relationships. It is well known for poor locality and high memory-bandwidth requirements. On conventional architectures, graph workloads incur a significant amount of data movement and energy consumption, which has motivated several hardware graph-processing accelerators. Current graph-processing accelerators rely on memory access optimizations or on placing computation logic close to memory. Distinct from all existing approaches, we leverage an emerging memory technology to accelerate graph processing with analog computation. This paper presents GRAPHR, the first ReRAM-based graph processing accelerator. GRAPHR follows the principle of near-data processing and explores the opportunity of performing massively parallel analog operations with low hardware and energy cost. Analog computation is suitable for graph processing because: 1) the algorithms are iterative and can inherently tolerate imprecision; and 2) both probability calculations (e.g., PageRank and collaborative filtering) and typical graph algorithms involving integers (e.g., BFS/SSSP) are resilient to errors. The key insight of GRAPHR is that if a vertex program of a graph algorithm can be expressed as sparse matrix-vector multiplication (SpMV), it can be performed efficiently by a ReRAM crossbar. We show that this assumption generally holds for a large set of graph algorithms. GRAPHR is a novel accelerator architecture consisting of two components: memory ReRAM and graph engines (GEs). The core graph computations are performed in sparse matrix format in the GEs (ReRAM crossbars). Vector/matrix-based graph computation is not new, but ReRAM offers a unique opportunity to realize massive parallelism with unprecedented energy efficiency and low hardware cost. With small subgraphs processed by GEs, the gain from performing parallel operations outweighs the waste due to sparsity.
The experimental results show that GRAPHR achieves a 16.01X (up to 132.67X) speedup and a 33.82X energy saving on geometric mean compared to a CPU baseline system. Compared to a GPU, GRAPHR achieves a 1.69X to 2.19X speedup and consumes 4.77X to 8.91X less energy. GRAPHR gains a speedup of 1.16X to 4.12X and is 3.67X to 10.96X more energy efficient compared to a PIM-based architecture.
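GRAPHR's key insight, that an iterative vertex program reduces to SpMV, can be illustrated with PageRank. The sketch below is ours, not GRAPHR's code: a dense Python list stands in for the ReRAM crossbar holding the (column-normalized) adjacency matrix, and one iteration is a matrix-vector product plus the damping term:

```python
def pagerank_step(M, r, d=0.85):
    """One PageRank iteration expressed as a matrix-vector product:
    r' = (1 - d)/n + d * (M @ r). In GRAPHR, M @ r would be computed
    in analog by a ReRAM crossbar."""
    n = len(r)
    return [(1 - d) / n + d * sum(M[i][j] * r[j] for j in range(n))
            for i in range(n)]
```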
-
The record-breaking performance of deep neural networks (DNNs) comes with heavy parameter budgets, which leads to the use of external dynamic random access memory (DRAM) for storage. The prohibitive energy of DRAM accesses makes DNN deployment on resource-constrained devices nontrivial, calling for minimizing the movement of weights and data in order to improve energy efficiency. Driven by this critical bottleneck, we present SmartDeal, a hardware-friendly algorithmic framework that trades higher-cost memory storage/access for lower-cost computation, in order to aggressively boost storage and energy efficiency for both DNN inference and training. The core technique of SmartDeal is a novel DNN weight matrix decomposition framework with structural constraints on each matrix factor, carefully crafted to unleash the hardware-aware efficiency potential. Specifically, we decompose each weight tensor as the product of a small basis matrix and a large structurally sparse coefficient matrix whose nonzero elements are readily quantized to powers of two. The resulting sparse and readily quantized DNNs enjoy greatly reduced energy consumption in data movement as well as weight storage, while incurring minimal overhead to recover the original weights, thanks to the sparse bit operations and cost-favorable computations required. Beyond inference, we take another leap to embrace energy-efficient training by introducing several customized techniques that address the unique roadblocks arising in training while preserving the SmartDeal structures. We also design a dedicated hardware accelerator that fully utilizes the new weight structure to improve real energy efficiency and latency. We conduct experiments on both vision and language tasks, with nine models, four datasets, and three settings (inference-only, adaptation, and fine-tuning).
Our extensive results show that 1) applied to inference, SmartDeal achieves up to 2.44x improvement in energy efficiency as evaluated using real hardware implementations, and 2) applied to training, SmartDeal can lead to 10.56x and 4.48x reductions in storage and training energy cost, respectively, with usually negligible accuracy loss, compared to state-of-the-art training baselines. Our source code is available at: https://github.com/VITA-Group/SmartDeal.
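The decomposition idea can be sketched in a few lines. This is an illustrative model (ours, not SmartDeal's code): a weight matrix is reconstructed from a small basis matrix B and a sparse coefficient matrix C whose nonzeros are powers of two, so in hardware each multiply by a C entry reduces to a shift:

```python
def reconstruct(B, C):
    """Recover W = B @ C, skipping zero coefficients.

    Assumed layout: B is rows x inner (small, dense), C is
    inner x cols (large, sparse, entries of the form +/- 2^e).
    """
    rows, inner, cols = len(B), len(B[0]), len(C[0])
    W = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for k in range(inner):
            for j in range(cols):
                if C[k][j]:  # structural sparsity: most entries are zero
                    W[i][j] += B[i][k] * C[k][j]  # power-of-2 scale => shift
    return W
```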
-
SpMV, the product of a sparse matrix and a dense vector, is emblematic of a new class of applications that are memory-bandwidth and communication driven, not flop driven. Sparsity and randomness in such computations play havoc with performance, especially when strong, instead of weak, scaling is attempted. In this study we develop and evaluate a hybrid implementation for strong scaling of the Compressed Vectorization-oriented sparse Row (CVR) approach to SpMV on a cluster of Intel Xeon Phi Knights Landing (KNL) processors. We show how our hybrid SpMV implementation achieves increased computational performance, yet does not address the dominant communication overhead factor at extreme scale. Issues with workload distribution, data placement, and remote reductions are assessed over a range of matrix characteristics. Our results indicate that as P → ∞, communication overhead is by far the dominant factor despite improved computational performance.