skip to main content

Title: ESRU: Extremely Low-Bit and Hardware-Efficient Stochastic Rounding Unit Design for Low-Bit DNN Training
Award ID(s):
Author(s) / Creator(s):
; ; ; ; ; ; ; ; ; ; ;
Date Published:
Journal Name:
Design, Automation & Test in Europe Conference & Exhibition (DATE)
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Deep Convolution Neural Network (CNN) has achieved outstanding performance in image recognition over large scale dataset. However, pursuit of higher inference accuracy leads to CNN architecture with deeper layers and denser connections, which inevitably makes its hardware implementation demand more and more memory and computational resources. It can be interpreted as `CNN power and memory wall'. Recent research efforts have significantly reduced both model size and computational complexity by using low bit-width weights, activations and gradients, while keeping reasonably good accuracy. In this work, we present different emerging nonvolatile Magnetic Random Access Memory (MRAM) designs that could be leveraged to implement `bit-wise in-memory convolution engine', which could simultaneously store network parameters and compute low bit-width convolution. Such new computing model leverages the `in-memory computing' concept to accelerate CNN inference and reduce convolution energy consumption due to intrinsic logic-in-memory design and reduction of data communication. 
    more » « less
  2. Sorting is a fundamental function in many applications from data processing to database systems. For high performance, sorting-hardware based sorting designs are implemented by conventional binary or emerging stochastic computing (SC) approaches. Binary designs are fast and energy-efficient but costly to implement. SC-based designs, on the other hand, are area and power-efficient but slow and energy-hungry. So, the previous studies of the hardware-based sorting further faced scalability issues. In this work, we propose a novel scalable low-cost design for implementing sorting networks. We borrow the concept of SC for the area- and power efficiency but use weighted stochastic bit-streams to address the high latency and energy consumption issue of SC designs. A new lock and swap (LAS) unit is proposed to sort weighted bit-streams. The LAS-based sorting network can determine the result of comparing different input values early and then map the inputs to the corresponding outputs based on shorter weighted bit-streams. Experimental results show that the proposed design approach achieves much better hardware scalability than prior work. Especially, as increasing the number of inputs, the proposed scheme can reduce the energy consumption by about 3.8% - 93% compared to prior binary and SC-based designs. 
    more » « less
  3. null (Ed.)
  4. Deep Neural Network (DNN) acceleration with digital Processing-in-Memory (PIM) platforms at the edge is an actively-explored domain with great potential to not only address memory-wall bottlenecks but to offer orders of performance improvement in comparison to the von-Neumann architecture. On the other side, FPGA-based edge computing has been followed as a potential solution to accelerate compute-intensive workloads. In this work, adopting low-bit-width neural networks, we perform a solid and comparative inference performance analysis of a recent processing-in-SRAM tape-out with a low-resource FPGA board and a high-performance GPU to provide a guideline for the research community. We explore and highlight the key architectural constraints of these edge candidates that impact their overall performance. Our experimental data demonstrate that the processing-in-SRAM can obtain up to ~160x speed-up and up to 228x higher efficiency (img/s/W) compared to the under-test FPGA on the CIFAR-10 dataset. 
    more » « less