RRAM-based in-memory computing (IMC) effectively accelerates deep neural networks (DNNs) and other machine learning algorithms. However, in the presence of RRAM device variations and lower precision, mapping DNNs to RRAM-based IMC suffers severe accuracy loss. In this work, we propose a novel hybrid IMC architecture that integrates an RRAM-based IMC macro with a digital SRAM macro through a programmable shifter to compensate for the RRAM variations and recover the accuracy. The digital SRAM macro consists of a small SRAM memory array and an array of multiply-and-accumulate (MAC) units. The non-ideal output of the RRAM macro, caused by device and circuit non-idealities, is compensated by adding the precise output of the SRAM macro. In addition, the programmable shifter allows for different scales of compensation by shifting the SRAM macro output relative to the RRAM macro output. On the algorithm side, we develop a framework for training DNNs to support the hybrid IMC architecture through ensemble learning. The proposed framework performs quantization (of weights and activations), pruning, and RRAM IMC-aware training, and it employs ensemble learning across different compensation scales by utilizing the programmable shifter. Finally, we design a silicon prototype of the proposed hybrid IMC architecture in the 65nm SUNY process to demonstrate its efficacy. Experimental evaluation shows that the SRAM compensation enables a practical IMC architecture with multi-level RRAM cells (MLCs) even though they suffer from high variations. The hybrid IMC architecture achieves up to 21.9%, 12.65%, and 6.52% improvement in post-mapping accuracy over state-of-the-art techniques, at minimal overhead, for ResNet-20 on CIFAR-10, VGG-16 on CIFAR-10, and ResNet-18 on ImageNet, respectively.
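To make the compensation scheme concrete, here is a minimal behavioral sketch of the idea described above; the function name, signature, and Gaussian variation model are our own illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def hybrid_imc_output(x, w_rram, w_sram, shift, rram_sigma=0.05, rng=None):
    """Model one hybrid MAC: noisy RRAM partial sum + shifted exact SRAM partial sum.

    x          : input activation vector (integers)
    w_rram     : weights mapped to the RRAM IMC macro (integers)
    w_sram     : compensation weights held in the digital SRAM macro
    shift      : programmable-shifter setting selecting the compensation scale
    rram_sigma : std-dev of multiplicative RRAM conductance variation (assumed model)
    """
    rng = rng or np.random.default_rng()
    # RRAM macro: analog MAC corrupted by device variation.
    noisy_w = w_rram * (1.0 + rng.normal(0.0, rram_sigma, w_rram.shape))
    y_rram = int(round(float(x @ noisy_w)))
    # SRAM macro: exact digital MAC, scaled by the programmable shifter
    # before being added to the non-ideal RRAM output.
    y_sram = int(x @ w_sram)
    return y_rram + (y_sram << shift)
```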
A 1.23-GHz 16-kb Programmable and Generic Processing-in-SRAM Accelerator in 65nm
We present a generic and programmable Processing-in-SRAM (PSRAM) accelerator chip based on an 8T-SRAM array that accommodates, for the first time, a complete set of Boolean logic operations (e.g., 2- and 3-input NOR/NAND/XOR), majority, and full-adder operations, all in a single cycle. PSRAM provides the programmability required for in-memory computing platforms serving applications such as parallel vector operations, neural networks, and data encryption. The prototype is implemented in a 16-kb SRAM macro and, operating at 1.23 GHz, is one of the fastest programmable in-memory computing systems to date. The 65nm prototype chip achieves a system-level peak throughput of 1.2 TOPS and an energy efficiency of 34.98 TOPS/W at 1.2 V.
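As a rough illustration of the single-cycle operation set, the sketch below models the bitwise results of asserting three rows at once; the word width and function name are our assumptions, not the chip's actual read-out circuit:

```python
MASK = 0xFF  # model an 8-bit word stored per activated row (assumed width)

def psram_ops(a, b, c):
    """Bitwise single-cycle operations over three simultaneously activated rows."""
    nor3  = ~(a | b | c) & MASK           # 3-input NOR per bit position
    nand3 = ~(a & b & c) & MASK           # 3-input NAND per bit position
    xor3  = (a ^ b ^ c) & MASK            # 3-input XOR per bit position
    maj   = (a & b) | (b & c) | (a & c)   # per-bit majority of the three rows
    # A per-bit full adder falls out directly: sum = XOR3, carry = majority.
    return nor3, nand3, xor3, maj, (xor3, maj)
```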
- PAR ID: 10389143
- Journal Name: 2022 IEEE 48th European Solid-State Circuits Conference (ESSCIRC)
- Page Range / eLocation ID: 153 to 156
- Sponsoring Org: National Science Foundation
More Like this
This article presents C3SRAM, an in-memory-computing SRAM macro. The macro is an SRAM module with circuits embedded in the bitcells and peripherals to perform hardware acceleration for neural networks with binarized weights and activations. It utilizes analog mixed-signal (AMS) capacitive-coupling computing to evaluate the main computation of binary neural networks: binary multiply-and-accumulate operations. Rather than accessing the stored weights row by row, the macro asserts all of its rows simultaneously and forms an analog voltage at the read-bitline node through capacitive voltage division. With one analog-to-digital converter (ADC) per column, the macro realizes fully parallel vector-matrix multiplication in a single cycle. The network type the macro supports and the computing mechanism it utilizes are determined by the robustness and error tolerance necessary in AMS computing. The C3SRAM macro is prototyped in 65-nm CMOS. It demonstrates an energy efficiency of 672 TOPS/W and a speed of 1638 GOPS (20.2 TOPS/mm²), achieving a 3975× better energy-delay product than a conventional digital baseline performing the same operation. The macro achieves 98.3% accuracy on MNIST and 85.5% on CIFAR-10, which is among the best energy-efficiency/inference-accuracy trade-offs reported for in-memory computing.
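The XNOR-and-average mechanism can be sketched behaviorally; everything below (names, ADC resolution, ideal capacitor matching) is an assumed model, not the C3SRAM netlist:

```python
import numpy as np

def c3sram_column(weights_bin, acts_bin, vdd=1.0, adc_bits=4):
    """Behavioral model of one column: capacitive voltage division plus ADC.

    weights_bin, acts_bin : arrays of +/-1 binarized weights and activations.
    All rows are asserted at once; each bitcell couples VDD or 0 onto the read
    bitline through an equal capacitor, so the bitline settles at the average.
    """
    xnor = (weights_bin * acts_bin + 1) // 2            # +/-1 product -> 1/0 match bit
    v_bl = vdd * xnor.mean()                            # capacitive voltage division
    code = int(round(v_bl / vdd * (2**adc_bits - 1)))   # per-column ADC sample
    # Recover the signed binary-MAC value: matches minus mismatches.
    n = len(weights_bin)
    bmac = 2 * round(code / (2**adc_bits - 1) * n) - n
    return v_bl, code, bmac
```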
Systems-on-chip (SoCs) are now designed with their own artificial intelligence (AI) accelerator segment to accommodate the ever-increasing demand of deep learning (DL) applications. With powerful multiply-and-accumulate (MAC) engines for matrix multiplication, these accelerators show high computing performance. However, because of limited memory resources (i.e., bandwidth and capacity), they fail to achieve optimum system performance during large-batch training and inference. In this work, we propose a memory system with high on-chip capacity and bandwidth that moves AI accelerators out of the memory-bound regime and toward system-level peak performance. We build the memory system around design-technology co-optimization (DTCO)-enabled, customized spin-orbit torque (SOT) MRAM as large on-chip memory, through system-technology co-optimization (STCO) and detailed characterization of the DL workloads. Our workload-aware memory system achieves 8× energy and 9× latency improvement on computer vision (CV) training benchmarks and 8× energy and 4.5× latency improvement on natural language processing (NLP) training benchmarks, while consuming only around 50% of the SRAM area at iso-capacity.
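The memory-bound argument can be made concrete with a standard roofline estimate; the numbers below are illustrative assumptions, not figures from the paper:

```python
def attainable_tops(peak_tops, bandwidth_gbs, ops_per_byte):
    """Roofline model: achieved throughput is capped by compute or by memory.

    ops_per_byte (arithmetic intensity) depends on batch size and data reuse;
    raising on-chip capacity/bandwidth lifts the memory-bound ceiling.
    """
    memory_bound_tops = bandwidth_gbs * ops_per_byte / 1000.0  # GB/s * ops/B -> TOPS
    return min(peak_tops, memory_bound_tops)

# A hypothetical 100-TOPS MAC engine behind 256 GB/s off-chip DRAM:
print(attainable_tops(100, 256, 50))    # ~12.8 TOPS: memory-bound, far from peak
# The same engine fed by 2 TB/s of large on-chip memory:
print(attainable_tops(100, 2048, 50))   # 100 TOPS: compute-bound at system peak
```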
This work presents the first resistive random access memory (RRAM)-based compute-in-memory (CIM) macro design tailored for genome processing. We analyze and demonstrate two key genome processing applications on our CIM chip prototype: state-of-the-art (SOTA) Burrows-Wheeler transform (BWT)-based DNA short-read alignment and alignment-free mRNA quantification. Our CIM macro is designed and optimized to support the major functions essential to these algorithms, e.g., parallel XNOR operations, count, addition, and parallel bitwise AND operations. The CIM macro prototype is fabricated with monolithic integration of HfO2 RRAM and 65-nm CMOS, achieving 2.07 TOPS/W (tera-operations per second per watt) and 2.12 G suffixes/J (suffixes per joule) at 1.0 V, which is the most energy-efficient solution to date for genome processing.
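A behavioral sketch of the match-and-count primitive that these functions enable follows; the 2-bit base encoding, tile layout, and names are our assumptions, not the macro's actual datapath:

```python
import numpy as np

def cim_match_count(ref_tile, query, base_bits=2):
    """Parallel XNOR compare of a query against many stored patterns, then count.

    ref_tile : (rows, query_len) array of 2-bit-encoded bases stored in RRAM rows
    query    : (query_len,) array of 2-bit-encoded bases broadcast to all rows
    """
    ones = (1 << base_bits) - 1
    # Parallel XNOR per cell: all-ones where the stored base equals the query base.
    xnor = ~(ref_tile ^ query) & ones
    # A row matches when every position XNORs to all-ones; count matching rows.
    row_match = (xnor == ones).all(axis=1)
    return int(row_match.sum())
```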
As the increasing complexity of neural network (NN) models leads to high demand for computation, AMD has introduced a heterogeneous programmable system-on-chip (SoC), the Versal ACAP architecture, featuring programmable logic (PL), CPUs, and dedicated AI Engine (AIE) ASICs, with a theoretical peak throughput of 6.4 TFLOPS for FP32, 25.6 TOPS for INT16, and 102.4 TOPS for INT8. However, this higher level of complexity makes it non-trivial to reach the theoretical performance even for well-studied applications like matrix-matrix multiply (MM). In this paper, we provide AutoMM, an automatic white-box framework that systematically generates MM accelerator designs on Versal, achieving 3.7 TFLOPS, 7.5 TOPS, and 28.2 TOPS for the FP32, INT16, and INT8 data types, respectively. Tested on board, our designs achieve energy-efficiency gains of 7.20× (FP32), 3.26× (INT16), and 6.23× (INT8) over the AMD U250, 2.32× (FP32) over the Nvidia Jetson TX2, and 1.06× (FP32) and 1.70× (INT8) over the Nvidia A100.
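At its core, the mapping such a framework automates reduces to tiling a matrix-matrix multiply so each tile fits a compute unit's local memory; the toy blocked matmul below is our illustration of that tiling, not AutoMM's generated code:

```python
import numpy as np

def tiled_matmul(A, B, tile_m=32, tile_n=32, tile_k=32):
    """Blocked matrix multiply: each (i, j, k) tile is the unit of work that a
    framework like AutoMM would map onto an AIE core, with PL buffers feeding tiles.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile_m):
        for j in range(0, N, tile_n):
            for k in range(0, K, tile_k):
                # One tile-level MAC; accumulation over k mirrors on-chip reduction.
                C[i:i+tile_m, j:j+tile_n] += (
                    A[i:i+tile_m, k:k+tile_k] @ B[k:k+tile_k, j:j+tile_n]
                )
    return C
```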