skip to main content

Title: C3SRAM: An In-Memory-Computing SRAM Macro Based on Robust Capacitive Coupling Computing Mechanism
This article presents C3SRAM, an in-memory-computing SRAM macro. The macro is an SRAM module with the circuits embedded in bitcells and peripherals to perform hardware acceleration for neural networks with binarized weights and activations. The macro utilizes analog-mixed-signal (AMS) capacitive-coupling computing to evaluate the main computations of binary neural networks, binary-multiply-and-accumulate operations. Without the need to access the stored weights by individual row, the macro asserts all its rows simultaneously and forms an analog voltage at the read bitline node through capacitive voltage division. With one analog-to-digital converter (ADC) per column, the macro realizes fully parallel vector–matrix multiplication in a single cycle. The network type that the macro supports and the computing mechanism it utilizes are determined by the robustness and error tolerance necessary in AMS computing. The C3SRAM macro is prototyped in a 65-nm CMOS. It demonstrates an energy efficiency of 672 TOPS/W and a speed of 1638 GOPS (20.2 TOPS/mm 2 ), achieving 3975 × better energy–delay product than the conventional digital baseline performing the same operation. The macro achieves 98.3% accuracy for MNIST and 85.5% for CIFAR-10, which is among the best in-memory computing works in terms of energy efficiency and inference accuracy tradeoff.
; ; ;
Award ID(s):
1919147 1652866
Publication Date:
Journal Name:
IEEE Journal of Solid-State Circuits
Page Range or eLocation-ID:
1888 - 1897
Sponsoring Org:
National Science Foundation
More Like this
  1. In-memory-computing (IMC) SRAM architecture has gained significant attention as it achieves high energy efficiency for computing a convolutional neural network (CNN) model [1]. Recent works investigated the use of analog-mixed-signal (AMS) hardware for high area and energy efficiency [2], [3]. However, AMS hardware output is well known to be susceptible to process, voltage, and temperature (PVT) variations, limiting the computing precision and ultimately the inference accuracy of a CNN. We reconfirmed, through the simulation of a capacitor-based IMC SRAM macro that computes a 256D binary dot product, that the AMS computing hardware has a significant root-mean-square error (RMSE) of 22.5% across the worst-case voltage, temperature (Fig. 16.1.1 top left) and 3-sigma process variations (Fig. 16.1.1 top right). On the other hand, we can implement an IMC SRAM macro using robust digital logic [4], which can virtually eliminate the variability issue (Fig. 16.1.1 top). However, digital circuits require more devices than AMS counterparts (e.g., 28 transistors for a mirror full adder [FA]). As a result, a recent digital IMC SRAM shows a lower area efficiency of 6368F2/b (22nm, 4b/4b weight/activation) [5] than the AMS counterpart (1170F2/b, 65nm, 1b/1b) [3]. In light of this, we aim to adopt approximate arithmetic hardware tomore »improve area and power efficiency and present two digital IMC macros (DIMC) with different levels of approximation (Fig. 16.1.1 bottom left). Also, we propose an approximation-aware training algorithm and a number format to minimize inference accuracy degradation induced by approximate hardware (Fig. 16.1.1 bottom right). We prototyped a 28nm test chip: for a 1b/1b CNN model for CIFAR-10 and across 0.5-to-1.1V supply, the DIMC with double-approximate hardware (DIMC-D) achieves 2569F2/b, 932-2219TOPS/W, 475-20032GOPS, and 86.96% accuracy, while for a 4b/1b CNN model, the DIMC with the single-approximate hardware (DIMC-S) achieves 3814F2/b, 458-990TOPS/W« less
  2. RRAM-based in-memory computing (IMC) effectively accelerates deep neural networks (DNNs) and other machine learning algorithms. On the other hand, in the presence of RRAM device variations and lower precision, the mapping of DNNs to RRAM-based IMC suffers from severe accuracy loss. In this work, we propose a novel hybrid IMC architecture that integrates an RRAM-based IMC macro with a digital SRAM macro using a programmable shifter to compensate for the RRAM variations and recover the accuracy. The digital SRAM macro consists of a small SRAM memory array and an array of multiply-and-accumulate (MAC) units. The non-ideal output from the RRAM macro, due to device and circuit non-idealities, is compensated by adding the precise output from the SRAM macro. In addition, the programmable shifter allows for different scales of compensation by shifting the SRAM macro output relative to the RRAM macro output. On the algorithm side, we develop a framework for the training of DNNs to support the hybrid IMC architecture through ensemble learning. The proposed framework performs quantization (weights and activations), pruning, RRAM IMC-aware training, and employs ensemble learning through different compensation scales by utilizing the programmable shifter. Finally, we design a silicon prototype of the proposed hybrid IMC architecturemore »in the 65nm SUNY process to demonstrate its efficacy. Experimental evaluation of the hybrid IMC architecture shows that the SRAM compensation allows for a realistic IMC architecture with multi-level RRAM cells (MLC) even though they suffer from high variations. The hybrid IMC architecture achieves up to 21.9%, 12.65%, and 6.52% improvement in post-mapping accuracy over state-of-the-art techniques, at minimal overhead, for ResNet-20 on CIFAR-10, VGG-16 on CIFAR-10, and ResNet-18 on ImageNet, respectively.« less
  3. Vector-matrix multiplication (VMM) is a core operation in many signal and data processing algorithms. Previous work showed that analog multipliers based on nonvolatile memories have superior energy efficiency as compared to digital counterparts at low-to-medium computing precision. In this paper, we propose extremely energy efficient analog mode VMM circuit with digital input/output interface and configurable precision. Similar to some previous work, the computation is performed by gate-coupled circuit utilizing embedded floating gate (FG) memories. The main novelty of our approach is an ultra-low power sensing circuitry, which is designed based on translinear Gilbert cell in topological combination with a floating resistor and a low-gain amplifier. Additionally, the digital-to-analog input conversion is merged with VMM, while current-mode algorithmic analog-to-digital circuit is employed at the circuit backend. Such implementations of conversion and sensing allow for circuit operation entirely in a current domain, resulting in high performance and energy efficiency. For example, post-layout simulation results for 400×400 5-bit VMM circuit designed in 55 nm process with embedded NOR flash memory, show up to 400 MHz operation, 1.68 POps/J energy efficiency, and 39.45 TOps/mm2 computing throughput. Moreover, the circuit is robust against process-voltage-temperature variations, in part due to inclusion of additional FG cells thatmore »are utilized for offset compensation.« less
  4. Nanopore genome sequencing is the key to enabling personalized medicine, global food security, and virus surveillance. The state-of-the-art base-callers adopt deep neural networks (DNNs) to translate electrical signals generated by nanopore sequencers to digital DNA symbols. A DNN-based base-caller consumes 44.5% of total execution time of a nanopore sequencing pipeline. However, it is difficult to quantize a base-caller and build a power-efficient processing-in-memory (PIM) to run the quantized base-caller. Although conventional network quantization techniques reduce the computing overhead of a base-caller by replacing floating-point multiply-accumulations by cheaper fixed-point operations, it significantly increases the number of systematic errors that cannot be corrected by read votes. The power density of prior nonvolatile memory (NVM)-based PIMs has already exceeded memory thermal tolerance even with active heat sinks, because their power efficiency is severely limited by analog-to-digital converters (ADC). Finally, Connectionist Temporal Classification (CTC) decoding and read voting cost 53.7% of total execution time in a quantized base-caller, and thus became its new bottleneck. In this paper, we propose a novel algorithm/architecture co-designed PIM, Helix, to power-efficiently and accurately accelerate nanopore base-calling. From algorithm perspective, we present systematic error aware training to minimize the number of systematic errors in a quantized base-caller. From architecturemore »perspective, we propose a low-power SOT-MRAM-based ADC array to process analog-to-digital conversion operations and improve power efficiency of prior DNN PIMs. Moreover, we revised a traditional NVM-based dot-product engine to accelerate CTC decoding operations, and create a SOT-MRAM binary comparator array to process read voting. Compared to state-of-the-art PIMs, Helix improves base-calling throughput by 6x, throughput per Watt by 11.9x and per mm2 by 7.5x without degrading base-calling accuracy.« less
  5. Latest algorithmic development has brought competitive classification accuracy for neural networks despite constraining the network parameters to ternary or binary representations. These findings show significant optimization opportunities to replace computationally-intensive convolution operations (based on multiplication) with more efficient and less complex operations such as addition. In hardware implementation domain, processing-in-memory architecture is becoming a promising solution to alleviate enormous energy-hungry data communication between memory and processing units, bringing considerable improvement for system performance and energy efficiency while running such large networks. In this paper, we review several of our recent works regarding Processing-in-Memory (PIM) accelerator based on Magnetic Random Access Memory computational sub-arrays to accelerate the inference mode of quantized neural networks using digital non-volatile memory rather than using analog crossbar operation. In this way, we investigate the performance of two distinct in-memory addition schemes compared to other digital methods based on processing-in-DRAM/GPU/ASIC design to tackle DNN power and memory wall bottleneck.