Title: NESTA: Hamming Weight Compression-Based Neural Proc. Engine
In this paper, we present NESTA, a specialized neural engine that significantly accelerates the computation of convolution layers in a deep convolutional neural network while reducing computational energy. NESTA reformats convolutions into 3 × 3 batches and uses a hierarchy of Hamming weight compressors to process each batch. In addition, when processing a convolution across multiple channels, NESTA, rather than computing the precise result of the convolution per channel, quickly computes an approximation of its partial sum along with a residual value such that, if added to the approximate partial sum, it yields the accurate output. Then, instead of immediately adding the residual, NESTA consumes it when processing the next batch, using Hamming weight compressors with available capacity. This mechanism shortens the critical path by avoiding the need to propagate carry signals during each round of computation and speeds up the convolution of each channel. In the last stage of computation, when the partial sum of the last channel has been computed, NESTA terminates by adding the residual bits to the approximate output to generate the correct result.
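The deferred-residual idea admits a compact behavioral model. Below is a minimal Python sketch of our own (not NESTA's actual datapath; csa_step and accumulate_channels are hypothetical names) that uses classic 3:2 carry-save compression — a full adder is the simplest Hamming weight compressor — to accumulate per-channel partial sums without propagating carries until the final stage.

```python
# A behavioral sketch of deferred-carry accumulation, assuming the
# carry-save view of the paper's approximate-sum + residual mechanism.

def csa_step(x, y, z):
    """3:2 compression (bitwise full adder): returns (sum, carry) with
    x + y + z == sum + carry, but no carry propagation across bit positions."""
    s = x ^ y ^ z
    c = (x & y) | (x & z) | (y & z)
    return s, c << 1

def accumulate_channels(partials):
    """Accumulate per-channel partial sums, deferring the residual (carry)."""
    approx, residual = 0, 0
    for p in partials:
        # Consume the previous residual together with the new partial sum.
        approx, residual = csa_step(approx, residual, p)
    # Final stage: one real addition resolves all deferred residual bits.
    return approx + residual

# Example: four per-channel partial results accumulate to the exact sum.
assert accumulate_channels([13, 7, 22, 5]) == 13 + 7 + 22 + 5
```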
Award ID(s):
1718538 2146726
PAR ID:
10182706
Author(s) / Creator(s):
Ali Mirzaeian
Date Published:
Journal Name:
ASP-DAC
Page Range / eLocation ID:
530 to 537
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Sparse linear algebra is an important kernel in many different applications. Among various sparse general matrix-matrix multiplication (SpGEMM) algorithms, Gustavson's column-wise SpGEMM has good locality when reading the input matrix and can be easily parallelized by distributing the computation of different columns of the output matrix to different processors. However, the sparse accumulation (SPA) step in column-wise SpGEMM, which merges partial sums from each of the multiplications by row index, remains a performance bottleneck. The state-of-the-art software implementation uses a hash table for the partial-sum search in the SPA, which makes the SPA the largest contributor to the execution time of SpGEMM. Three factors make the SPA the bottleneck: (1) hash probing requires data-dependent branches that are difficult for a branch predictor to predict correctly; (2) the accumulation of partial sums depends on the results of the hash probing, which makes it difficult to hide the hash-probing latency; and (3) hash collisions require time-consuming linear search, and optimizations to reduce these collisions require an accurate estimate of the number of non-zeros in each column of the output matrix. This work proposes the ASA architecture to accelerate the SPA. ASA overcomes the challenges of the SPA by (1) executing the partial-sum search and accumulation with a single instruction through an ISA extension, eliminating data-dependent branches in hash probing; (2) using a dedicated on-chip cache to perform the search and accumulation in a pipelined fashion; (3) relying on the parallel search capability of a set-associative cache to reduce search latency; and (4) delaying the merging of overflowed entries. As a result, ASA achieves an average of 2.25× and 5.05× speedup compared to the state-of-the-art software implementation of a Markov clustering application and its SpGEMM kernel, respectively. Compared to a state-of-the-art hashing accelerator design, ASA achieves an average of 1.95× speedup in the SpGEMM kernel.
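For reference, the column-wise algorithm and the SPA it bottlenecks on look roughly like the Python sketch below (our illustration: the {col: {row: val}} storage layout and function names are assumptions, and the dict stands in for the hash table that ASA replaces with an ISA extension and an on-chip cache).

```python
# A minimal sketch of Gustavson's column-wise SpGEMM; 'spa' plays the role
# of the hash-table sparse accumulator (SPA) that ASA accelerates.

def spgemm_columnwise(A, B):
    """C = A * B, with A and B stored column-wise as {col: {row: val}}."""
    C = {}
    for j, b_col in B.items():                 # for each column j of B
        spa = {}                               # sparse accumulator for C(:, j)
        for k, b_kj in b_col.items():          # each nonzero B[k, j]
            for i, a_ik in A.get(k, {}).items():   # scale column k of A
                # Hash probe + accumulate: the step ASA fuses into one instruction.
                spa[i] = spa.get(i, 0.0) + a_ik * b_kj
        if spa:
            C[j] = spa
    return C

# Example: A = [[1, 0], [2, 3]], B = [[0, 4], [5, 0]] (column-wise dicts)
A = {0: {0: 1.0, 1: 2.0}, 1: {1: 3.0}}
B = {0: {1: 5.0}, 1: {0: 4.0}}
print(spgemm_columnwise(A, B))   # {0: {1: 15.0}, 1: {0: 4.0, 1: 8.0}}
```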
  2. We analyze deep ReLU neural networks trained with mini-batch stochastic gradient descent (SGD) and weight decay. We prove that the source of the SGD noise is an implicit low-rank constraint across all of the weight matrices within the network. Furthermore, we show, both theoretically and empirically, that when training a neural network using SGD with a small batch size, the resulting weight matrices are expected to be of small rank. Our analysis relies on a minimal set of assumptions, and the neural networks may include convolutional layers, residual connections, and batch normalization layers.
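The low-rank claim is straightforward to test on trained weights. A minimal sketch, assuming NumPy; numerical_rank is a hypothetical helper of ours, and the 1e-3 threshold is an arbitrary choice:

```python
# Measure the numerical rank of a weight matrix via its singular values.

import numpy as np

def numerical_rank(W, rel_tol=1e-3):
    """Number of singular values above rel_tol * largest singular value."""
    s = np.linalg.svd(W, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

# Under the paper's claim, a weight matrix trained with small-batch SGD and
# weight decay should report a rank well below min(W.shape); a random
# matrix is full rank and serves as the baseline.
W_random = np.random.randn(256, 256)
print(numerical_rank(W_random))      # ~256 (full rank)
W_lowrank = np.random.randn(256, 8) @ np.random.randn(8, 256)
print(numerical_rank(W_lowrank))     # 8
```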
  3. Network function computation is an active topic in network coding, with much recent progress for linear (over a finite field) computations over broadcast (LCBC) and multiple access (LCMAC) channels. Over a quantum multiple access channel (QMAC) with quantum entanglement shared among transmitters, the linear computation problem (LC-QMAC) is non-trivial even when the channel is noiseless, because of the challenge of optimally exploiting transmit-side entanglement through distributed coding. Given an arbitrary $\mathbb{F}_d$-linear function $f(W_1, \dots, W_K) = V_1 W_1 + V_2 W_2 + \cdots + V_K W_K \in \mathbb{F}_d^{m \times 1}$, the LC-QMAC problem seeks the optimal communication cost (the minimum number of qudits $\Delta_1, \dots, \Delta_K$ that need to be sent by transmitters Alice$_1$, Alice$_2$, $\dots$, Alice$_K$, respectively, to the receiver, Bob, per computation instance) over a noise-free QMAC, when the input data streams $W_k \in \mathbb{F}_d^{m_k \times 1}$, $k \in [K]$, originate at the corresponding transmitters, who share quantum entanglement in advance. As our main result, we fully solve this problem for $K = 3$ transmitters ($K \geq 4$ settings remain open). Coding schemes based on the N-sum box protocol (along with time-sharing and batch processing) are shown to be information-theoretically optimal in all cases. As an example, in the symmetric case where $\mathrm{rk}(V_1) = \mathrm{rk}(V_2) = \mathrm{rk}(V_3) \triangleq r_1$, $\mathrm{rk}([V_1, V_2]) = \mathrm{rk}([V_2, V_3]) = \mathrm{rk}([V_3, V_1]) \triangleq r_2$, and $\mathrm{rk}([V_1, V_2, V_3]) \triangleq r_3$ ($\mathrm{rk}$ = rank), the minimum total download cost is $\max\{1.5\,r_1 + 0.75\,(r_3 - r_2),\; r_3\}$.
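As a quick numeric check of the symmetric-case formula quoted above, the snippet below (the helper name is ours) evaluates $\max\{1.5\,r_1 + 0.75\,(r_3 - r_2),\, r_3\}$ for a couple of rank profiles.

```python
# Evaluate the symmetric-case download cost from the abstract above.

def lc_qmac_symmetric_cost(r1, r2, r3):
    """Total qudits downloaded per computation instance (symmetric case)."""
    return max(1.5 * r1 + 0.75 * (r3 - r2), r3)

# Example: individual rank 2, pairwise rank 3, joint rank 4.
print(lc_qmac_symmetric_cost(2, 3, 4))   # max(3.75, 4) = 4.0
# When the V_k overlap completely (r1 = r2 = r3 = r), the cost is 1.5*r:
print(lc_qmac_symmetric_cost(2, 2, 2))   # 3.0
```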
  4. In this paper, we study the bias of Stochastic Gradient Descent (SGD) to learn low-rank weight matrices when training deep ReLU neural networks. Our results show that training neural networks with mini-batch SGD and weight decay causes a bias towards rank minimization over the weight matrices. Specifically, we show, both theoretically and empirically, that this bias is more pronounced when using smaller batch sizes, higher learning rates, or increased weight decay. Additionally, we predict and observe empirically that weight decay is necessary to achieve this bias. Finally, we empirically investigate the connection between this bias and generalization, finding that it has a marginal effect on generalization. Our analysis is based on a minimal set of assumptions and applies to neural networks of any width or depth, including those with residual connections and convolutional layers. 
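The claimed batch-size and weight-decay dependence invites a direct, if toy, probe. Below is a minimal experiment sketch, assuming PyTorch; the architecture, random regression data, hyperparameters, and rank threshold are all our choices rather than the paper's setup, so the size of the effect here is only suggestive.

```python
# Toy probe of the trend: smaller batches / larger weight decay => lower rank.

import torch

def train_and_rank(batch_size, weight_decay, steps=2000, lr=0.05):
    torch.manual_seed(0)
    X, Y = torch.randn(512, 32), torch.randn(512, 32)
    model = torch.nn.Sequential(
        torch.nn.Linear(32, 128), torch.nn.ReLU(), torch.nn.Linear(128, 32))
    opt = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=weight_decay)
    for _ in range(steps):
        idx = torch.randint(0, 512, (batch_size,))
        loss = ((model(X[idx]) - Y[idx]) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    s = torch.linalg.svdvals(model[0].weight.detach())
    return int((s > 1e-2 * s[0]).sum())   # numerical rank of the first layer

for bs, wd in [(256, 0.0), (8, 0.0), (8, 5e-3)]:
    print(f"batch={bs:3d} wd={wd}: rank={train_and_rank(bs, wd)}")
```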
  5. Promising for digital signal processing applications, approximate computing has been extensively explored to trade off limited accuracy for improvements in other circuit metrics such as area, power, and performance. In this paper, approximate arithmetic circuits are proposed using emerging nanoscale spintronic devices. Leveraging the intrinsic current-mode thresholding operation of spintronic devices, we first present a hybrid spin-CMOS majority gate design based on a composite spintronic device structure consisting of a magnetic domain wall motion stripe and a magnetic tunnel junction. We further propose a compact and energy-efficient accuracy-configurable adder design based on the majority gate. Unlike most previous approximate circuit designs that hardwire a constant degree of approximation, this design adapts to the inherent resilience of various applications to different degrees of accuracy. Subsequently, we propose two new approximate compressors for use in fast multiplier designs. Device-circuit SPICE simulation shows 34.58% and 66% improvements in power consumption for the accurate and approximate modes of the accuracy-configurable adder, respectively, compared to the recently reported domain wall motion-based full adder design. In addition, the proposed accuracy-configurable adder and approximate compressors can be efficiently utilized in the discrete cosine transform (DCT), a widely used digital image processing algorithm. The results indicate that the DCT and inverse DCT (IDCT) using the approximate multiplier achieve ~2x energy savings and a 3x speed-up compared to an exactly designed circuit, while achieving comparable output quality.
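For intuition about what an approximate compressor trades away, here is a behavioral sketch of a textbook-style approximate 4:2 compressor (our example, not the paper's spintronic design): it emits two output bits instead of three, which is exact on 11 of 16 input patterns and underestimates on the remaining 5.

```python
# Exact vs. approximate 4-input compression, enumerated exhaustively.

from itertools import product

def exact_count(x1, x2, x3, x4):
    return x1 + x2 + x3 + x4          # true bit count needs 3 output bits

def approx_compressor(x1, x2, x3, x4):
    # Two-output approximation: drops the third output bit.
    s = (x1 ^ x2) | (x3 ^ x4)
    c = (x1 & x2) | (x3 & x4)
    return s + 2 * c

errors = sum(exact_count(*bits) != approx_compressor(*bits)
             for bits in product((0, 1), repeat=4))
print(f"{errors}/16 input patterns are approximated")   # 5/16
```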