Title: DIMC: 2219TOPS/W 2569F²/b Digital In-Memory Computing Macro in 28nm Based on Approximate Arithmetic Hardware
In-memory-computing (IMC) SRAM architecture has gained significant attention because it achieves high energy efficiency when computing a convolutional neural network (CNN) model [1]. Recent works have investigated analog-mixed-signal (AMS) hardware for high area and energy efficiency [2], [3]. However, AMS hardware output is well known to be susceptible to process, voltage, and temperature (PVT) variations, which limits computing precision and ultimately the inference accuracy of a CNN. Through simulation of a capacitor-based IMC SRAM macro that computes a 256-D binary dot product, we reconfirmed that AMS computing hardware has a significant root-mean-square error (RMSE) of 22.5% across worst-case voltage and temperature (Fig. 16.1.1 top left) and 3-sigma process variations (Fig. 16.1.1 top right). On the other hand, an IMC SRAM macro implemented with robust digital logic [4] can virtually eliminate the variability issue (Fig. 16.1.1 top). However, digital circuits require more devices than their AMS counterparts (e.g., 28 transistors for a mirror full adder (FA)). As a result, a recent digital IMC SRAM shows a lower area efficiency of 6368F²/b (22nm, 4b/4b weight/activation) [5] than its AMS counterpart (1170F²/b, 65nm, 1b/1b) [3]. In light of this, we adopt approximate arithmetic hardware to improve area and power efficiency and present two digital IMC macros (DIMC) with different levels of approximation (Fig. 16.1.1 bottom left). We also propose an approximation-aware training algorithm and a number format to minimize the inference accuracy degradation induced by approximate hardware (Fig. 16.1.1 bottom right).
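The abstract does not say which approximations DIMC-S and DIMC-D use. The Python sketch below is therefore only a hedged illustration of the general idea. It compares an exact full adder with one textbook-style approximation: the carry is computed exactly, and the sum is taken as NOT carry-out, which is wrong for 2 of the 8 input patterns. Either adder is then used to accumulate a 1b/1b dot product via XNOR and popcount. The following are all assumptions, not the paper's design:

- the approximation itself;
- the bit encoding (1 → +1, 0 → −1);
- the choice to approximate only the low-order bits;
- every function name.

```python
from itertools import product


def exact_fa(a: int, b: int, cin: int) -> tuple[int, int]:
    """Exact 1-bit full adder; returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def approx_fa(a: int, b: int, cin: int) -> tuple[int, int]:
    """Illustrative approximate full adder (hypothetical, not DIMC's design):
    the carry is exact, but the sum is taken as NOT carry_out."""
    cout = (a & b) | (cin & (a ^ b))
    return 1 - cout, cout


# Input patterns (a, b, cin) where the approximate adder disagrees with the exact one.
ERROR_CASES = [bits for bits in product((0, 1), repeat=3)
               if approx_fa(*bits) != exact_fa(*bits)]


def add(x: int, y: int, width: int, approx_lsbs: int = 0) -> int:
    """Ripple-carry addition of non-negative ints over `width` bits (overflow dropped).
    The lowest `approx_lsbs` bit positions use approx_fa; the rest use exact_fa."""
    carry, result = 0, 0
    for i in range(width):
        fa = approx_fa if i < approx_lsbs else exact_fa
        s, carry = fa((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result


def binary_dot(w_bits: list[int], a_bits: list[int], approx_lsbs: int = 0) -> int:
    """1b/1b dot product with the encoding bit 1 -> +1, bit 0 -> -1.
    XNOR flags positions whose signs agree; with popcount k, dot = 2k - N."""
    n = len(w_bits)
    width = n.bit_length() + 1  # enough bits to hold a popcount of up to n
    k = 0
    for w, a in zip(w_bits, a_bits):
        k = add(k, 1 - (w ^ a), width, approx_lsbs)
    return 2 * k - n


if __name__ == "__main__":
    import random

    random.seed(0)
    w = [random.randint(0, 1) for _ in range(256)]
    a = [random.randint(0, 1) for _ in range(256)]
    reference = sum((2 * x - 1) * (2 * y - 1) for x, y in zip(w, a))
    print("approximate-FA error cases:", ERROR_CASES)
    print("reference:", reference)
    print("exact adders:", binary_dot(w, a))
    print("approximate 2 LSBs:", binary_dot(w, a, approx_lsbs=2))
```

The hypothetical `approx_lsbs` parameter restricts the approximation to a few least-significant positions, which keeps the error bounded. Applying it to every bit position would corrupt the accumulator badly, because 0 + 0 + 0 then yields a sum bit of 1.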
We prototyped a 28nm test chip: for a 1b/1b CNN model for CIFAR-10 and across a 0.5-to-1.1V supply, the DIMC with double-approximate hardware (DIMC-D) achieves 2569F²/b, 932-2219TOPS/W, 475-20032GOPS, and 86.96% accuracy, while for a 4b/1b CNN model, the DIMC with single-approximate hardware (DIMC-S) achieves 3814F²/b, 458-990TOPS/W …
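The F²/b and TOPS/W figures quoted above are normalized metrics. The short Python sketch below converts them to physical units so the macros can be compared across process nodes. It assumes the feature size F equals the nominal technology node; the function names are hypothetical.

```python
def f2_per_bit_to_um2(f2_per_bit: float, node_nm: float) -> float:
    """Convert an area-efficiency figure in F^2/b to um^2 per bit, taking the
    feature size F to be the nominal technology node (an assumption)."""
    f_um = node_nm / 1000.0
    return f2_per_bit * f_um ** 2


def tops_per_w_to_fj_per_op(tops_per_w: float) -> float:
    """1 TOPS/W = 1e12 operations per joule, so energy per operation is
    1 / (tops_per_w * 1e12) J = 1e3 / tops_per_w fJ."""
    return 1e3 / tops_per_w


# Figures quoted in the abstract: (label, F^2/b, node in nm).
MACROS = [
    ("DIMC-D (1b/1b)", 2569, 28),
    ("DIMC-S (4b/1b)", 3814, 28),
    ("digital IMC [5] (4b/4b)", 6368, 22),
    ("AMS IMC [3] (1b/1b)", 1170, 65),
]

if __name__ == "__main__":
    for label, f2b, node in MACROS:
        print(f"{label}: {f2b} F^2/b at {node} nm = {f2_per_bit_to_um2(f2b, node):.2f} um^2/b")
    for tops_w in (932, 2219):
        print(f"{tops_w} TOPS/W = {tops_per_w_to_fj_per_op(tops_w):.3f} fJ/op")
```

For example, 2569F²/b at 28nm corresponds to about 2.01 µm² per bit, and 2219TOPS/W to about 0.45 fJ per operation. The macros use different precisions (1b/1b, 4b/1b, 4b/4b), so the node-normalized areas are not a like-for-like comparison.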
Award ID(s):
1919147
NSF-PAR ID:
10342205
Journal Name:
2022 IEEE International Solid-State Circuits Conference (ISSCC)
Page Range / eLocation ID:
266 to 268
Sponsoring Org:
National Science Foundation
More Like this
  1. This article presents C3SRAM, an in-memory-computing SRAM macro. The macro is an SRAM module with the circuits embedded in bitcells and peripherals to perform hardware acceleration for neural networks with binarized weights and activations. The macro utilizes analog-mixed-signal (AMS) capacitive-coupling computing to evaluate the main computations of binary neural networks, binary-multiply-and-accumulate operations. Without the need to access the stored weights by individual row, the macro asserts all its rows simultaneously and forms an analog voltage at the read bitline node through capacitive voltage division. With one analog-to-digital converter (ADC) per column, the macro realizes fully parallel vector–matrix multiplication in a single cycle. The network type that the macro supports and the computing mechanism it utilizes are determined by the robustness and error tolerance necessary in AMS computing. The C3SRAM macro is prototyped in a 65-nm CMOS. It demonstrates an energy efficiency of 672 TOPS/W and a speed of 1638 GOPS (20.2 TOPS/mm²), achieving 3975× better energy–delay product than the conventional digital baseline performing the same operation. The macro achieves 98.3% accuracy for MNIST and 85.5% for CIFAR-10, which is among the best in-memory computing works in terms of energy efficiency and inference accuracy tradeoff.
  2. RRAM-based in-memory computing (IMC) effectively accelerates deep neural networks (DNNs) and other machine learning algorithms. However, RRAM device variations and limited precision cause severe accuracy loss when DNNs are mapped to RRAM-based IMC. In this work, we propose a novel hybrid IMC architecture that integrates an RRAM-based IMC macro with a digital SRAM macro, using a programmable shifter to compensate for RRAM variations and recover accuracy. The digital SRAM macro consists of a small SRAM memory array and an array of multiply-and-accumulate (MAC) units. The non-ideal output of the RRAM macro, caused by device and circuit non-idealities, is compensated by adding the precise output of the SRAM macro. In addition, the programmable shifter allows different scales of compensation by shifting the SRAM macro output relative to the RRAM macro output. On the algorithm side, we develop a framework for training DNNs to support the hybrid IMC architecture through ensemble learning. The proposed framework performs quantization (weights and activations), pruning, and RRAM-IMC-aware training, and employs ensemble learning across different compensation scales by utilizing the programmable shifter. Finally, we design a silicon prototype of the proposed hybrid IMC architecture in the 65nm SUNY process to demonstrate its efficacy. Experimental evaluation shows that SRAM compensation enables a realistic IMC architecture with multi-level RRAM cells (MLC) even though they suffer from high variations. The hybrid IMC architecture achieves up to 21.9%, 12.65%, and 6.52% improvement in post-mapping accuracy over state-of-the-art techniques, at minimal overhead, for ResNet-20 on CIFAR-10, VGG-16 on CIFAR-10, and ResNet-18 on ImageNet, respectively.
  3. Resonant tunneling diodes (RTDs) have come full circle in the past 10 years after their demonstration in the early 1990s as the fastest room-temperature semiconductor oscillator, with experimental results up to 712 GHz and $f_{max}$ values exceeding 1.0 THz [1]. Now the RTD is once again the preeminent electronic oscillator above 1.0 THz and is being implemented as a coherent source [2] and a self-oscillating mixer [3], among other applications. This paper concerns RTD electroluminescence, an effect that has been studied very little in the past 30+ years of RTD development, and not at room temperature. We present experiments and modeling of an n-type In0.53Ga0.47As/AlAs double-barrier RTD operating as a cross-gap light emitter at ~300 K. The MBE growth stack is shown in Fig. 1(a). A 15-μm-diameter mesa device was defined by standard planar processing, including a top annular ohmic contact with a 5-μm-diameter pinhole in the center to couple out enough of the internal emission for accurate free-space power measurements [4]. The emission spectra, parameterized by bias voltage ($V_B$), behave as displayed in Fig. 1(b). The long-wavelength emission edge is at $\lambda$ = 1684 nm, close to the In0.53Ga0.47As bandgap energy of $U_g \approx 0.75$ eV at 300 K. The spectral peaks for $V_B$ = 2.8 and 3.0 V both occur around $\lambda$ = 1550 nm ($h\nu \approx 0.80$ eV), so they are blue-shifted relative to the peak of the “ideal” bulk InGaAs emission spectrum shown in Fig. 1(b) [5]. These results are consistent with the model displayed in Fig. 1(c), whereby the broad emission peak is attributed to radiative recombination between electrons accumulated on the emitter side and holes generated on the emitter side by interband tunneling with current density $J_{inter}$. The blue-shifted main peak is attributed to the quantum-size effect on the emitter side, which creates a radiative recombination rate $R_{N,2}$ comparable to the band-edge cross-gap rate $R_{N,1}$.
Further support for this model is provided by the weaker, shorter-wavelength emission peak shown in Fig. 1(b) around $\lambda$ = 1148 nm. Our quantum mechanical calculations attribute this peak to radiative recombination $R_{R,3}$ in the RTD quantum well between the electron ground-state level $E_{1,e}$ and the hole level $E_{1,h}$. To further test the model and estimate quantum efficiencies, we conducted optical power measurements using a large-area Ge photodiode located ≈3 mm from the RTD pinhole, with spectral response between 800 and 1800 nm and a peak responsivity of ≈0.85 A/W at $\lambda$ = 1550 nm. Simultaneous I-V and L-V curves were obtained and are plotted in Fig. 2(a) with positive bias on the top contact (emitter on the bottom). The I-V curve displays a pronounced NDR region with a peak-to-valley current ratio of 10.7 (typical for In0.53Ga0.47As RTDs). The external quantum efficiency (EQE) was calculated from $\mathrm{EQE} = e I_P/(\Re\, I_E\, h\nu)$, where $I_P$ is the photodiode dc current, $I_E$ the RTD current, and $\Re$ the photodiode responsivity. The EQE is plotted in Fig. 2(b), where we see a very rapid rise with $V_B$ but a maximum value (at $V_B$ = 3.0 V) of only ≈$2\times10^{-5}$. To extract the internal quantum efficiency (IQE), we use the expression $\mathrm{EQE} = \eta_c \eta_i \eta_r \equiv \eta_c \cdot \mathrm{IQE}$, where $\eta_c$, $\eta_i$, and $\eta_r$ are the optical-coupling, electrical-injection, and radiative-recombination efficiencies, respectively [6]. Our separate optical calculations yield $\eta_c \approx 3.4\times10^{-4}$ (limited primarily by the small pinhole), from which we obtain the IQE curve plotted in Fig. 2(b) (right-hand scale). The maximum IQE (again at $V_B$ = 3.0 V) is 6.0%. From the implicit definition of IQE in terms of $\eta_i$ and $\eta_r$ given above, and because the recombination efficiency in In0.53Ga0.47As is likely limited by Auger scattering, this IQE suggests that $\eta_i$ might be quite high. To estimate $\eta_i$, we used the experimental total current of Fig. 2(a), the Kane two-band model of interband tunneling [7] computed in conjunction with a solution to Poisson’s equation across the entire structure, and a rate-equation model of Auger recombination on the emitter side [6], assuming a free-electron density of $2\times10^{18}$ cm$^{-3}$. We focus on the high-bias regime above $V_B$ = 2.5 V in Fig. 2(a), where most of the interband tunneling should occur in the depletion region on the collector side [$J_{inter,2}$ in Fig. 1(c)]. Because of the high quality of the InGaAs/AlAs heterostructure (very few traps or deep levels), most of the holes should reach the emitter side by some combination of drift, diffusion, and tunneling through the valence-band double barriers (Type-I offset) between InGaAs and AlAs. The computed interband current density $J_{inter}$ is shown in Fig. 3(a) along with the total current density $J_{tot}$. At the maximum $J_{inter}$ of $7.4\times10^{2}$ A/cm$^2$ (at $V_B$ = 3.0 V), we get $\eta_i = J_{inter}/J_{tot} = 0.18$, which is surprisingly high considering there is no p-type doping in the device. Combined with the Auger-limited $\eta_r$ of 0.41, this gives a model value of $\mathrm{IQE} = \eta_i \eta_r \approx 7.4\%$, in good agreement with experiment. Multiplying by $\eta_c \approx 3.4\times10^{-4}$ then gives the model EQE values plotted in Fig. 2(b), also in good agreement with experiment. Finally, we address the high $J_{inter}$ and consider a possible universal nature of the light-emission mechanism. Fig. 3(b) shows the tunneling probability $T$ according to the Kane two-band model in three materials (In0.53Ga0.47As, GaAs, and GaN), following our observation of a similar electroluminescence mechanism in GaN/AlN RTDs (due to the strong polarization field of wurtzite structures) [8]. The expression is $T_{inter} = (\pi^2/9)\exp[-\pi^2 U_g^2 m_e/(2hPE)]$, where $U_g$ is the bandgap energy, $P$ is the valence-to-conduction-band momentum matrix element, and $E$ is the electric field. Values for the highest calculated internal $E$ fields in the InGaAs and GaN structures are also shown, indicating that $T_{inter}$ in those structures approaches ~$10^{-5}$.
As shown, a GaAs RTD would require an internal field of ~$6\times10^{5}$ V/cm, which is rarely realized in standard GaAs RTDs, perhaps explaining why there have been few if any reports of room-temperature electroluminescence in GaAs devices.
[1] E. R. Brown et al., Appl. Phys. Lett., vol. 58, 2291, 1991.
[2] M. Feiginov et al., Appl. Phys. Lett., vol. 99, 233506, 2011.
[3] Y. Nishida et al., Sci. Rep., vol. 9, 18125, 2019.
[4] P. Fakhimi et al., 2019 DRC Conference Digest.
[5] S. Sze, Physics of Semiconductor Devices, 2nd ed., Sec. 12.2.1 (Wiley, 1981).
[6] L. Coldren, Diode Lasers and Photonic Integrated Circuits (Wiley, 1995).
[7] E. O. Kane, J. Appl. Phys., vol. 32, 83, 1961.
[8] T. Growden et al., Light: Science & Applications, vol. 7, 17150, 2018.
  4. In this work, a high-speed and energy-efficient comparator-based Near-Sensor Local Binary Pattern accelerator architecture (NS-LBP) is proposed to execute a novel local binary pattern deep neural network. First, inspired by recent LBP networks, we design an approximate, hardware-oriented, and multiply-accumulate (MAC)-free network named Ap-LBP for efficient feature extraction, further reducing the computation complexity. Then, we develop NS-LBP as a processing-in-SRAM unit and a parallel in-memory LBP algorithm to process images near the sensor in a cache, substantially reducing the power consumption of data transmission to an off-chip processor. Our circuit-to-application co-simulation results on the MNIST and SVHN datasets demonstrate minor accuracy degradation compared to baseline CNN and LBP-network models, while NS-LBP achieves 1.25 GHz and an energy efficiency of 37.4 TOPS/W. NS-LBP reduces energy consumption by 2.2× and execution time by 4× compared to the best recent LBP-based networks.
  5. In-memory computing (IMC) provides energy-efficient solutions for deep neural networks (DNNs). Most IMC designs for DNNs employ fixed-point precision. However, floating-point precision is still required for DNN training and for complex inference models to maintain high accuracy. No prior IMC work in the literature has immersed floating-point computation into the weight memory storage. In this work, we propose a novel floating-point precision IMC macro with a configurable architecture that supports both normal 8-bit floating point (FP8) and 8-bit block floating point (BF8) with a shared exponent. The proposed FP-IMC macro, implemented in 28nm CMOS, demonstrates 12.1 TOPS/W for FP8 precision and 66.6 TOPS/W for BF8 precision, improving energy efficiency beyond state-of-the-art FP IMC macros.
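Item 1 (C3SRAM) forms an analog read-bitline voltage by capacitive voltage division and digitizes it with one ADC per column. The Python sketch below models that path under stated assumptions. It shows how the XNOR popcount maps to a bitline voltage and then to a ±1 dot-product estimate. The following are all hypothetical choices, not C3SRAM's published parameters:

- ideal, equal capacitors and an ideal ADC;
- VDD = 1.0 V and a 5-bit ADC;
- the bit encoding (1 → +1, 0 → −1);
- the function names.

```python
def rbl_voltage(w_bits: list[int], a_bits: list[int], vdd: float = 1.0) -> float:
    """Ideal capacitive divider: each of N equal cell capacitors is driven to VDD
    when XNOR(weight, activation) = 1 and to 0 V otherwise; charge sharing on the
    read bitline then settles at VDD * k / N, where k is the XNOR popcount."""
    n = len(w_bits)
    k = sum(1 - (w ^ a) for w, a in zip(w_bits, a_bits))
    return vdd * k / n


def adc(v: float, vdd: float = 1.0, bits: int = 5) -> int:
    """Ideal uniform ADC mapping 0..VDD to codes 0..2**bits - 1."""
    levels = 2 ** bits - 1
    return max(0, min(levels, round(v / vdd * levels)))


def code_to_dot(code: int, n: int, bits: int = 5) -> float:
    """Estimate the +/-1 dot product (2k - N) from an ADC code."""
    k_est = code / (2 ** bits - 1) * n
    return 2 * k_est - n


if __name__ == "__main__":
    import random

    random.seed(1)
    n = 256
    w = [random.randint(0, 1) for _ in range(n)]
    a = [random.randint(0, 1) for _ in range(n)]
    exact = sum((2 * x - 1) * (2 * y - 1) for x, y in zip(w, a))
    estimate = code_to_dot(adc(rbl_voltage(w, a)), n)
    print(f"exact dot = {exact}, ADC-based estimate = {estimate:.1f}")
```

With 256 rows there are 257 possible values of k, but a 5-bit ADC offers only 32 codes. The ADC resolution (assumed here) therefore bounds the precision even before PVT variation shifts the bitline voltage.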
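Item 2 compensates the non-ideal RRAM output by adding the SRAM macro's output, shifted by a programmable amount. The Python sketch below illustrates why the shift matters. It makes the following assumptions:

- the SRAM holds the residual between target weights and the weights the RRAM cells actually realize;
- that residual is quantized to 4-bit signed integers at scale 2^shift;
- the shifter scales the SRAM result by 2^-shift.

The paper instead learns the compensation through RRAM-IMC-aware training and ensemble learning. All numbers and names below are hypothetical.

```python
def dot(x: list[float], w: list[float]) -> float:
    return sum(xi * wi for xi, wi in zip(x, w))


def sram_correction(w_target, w_rram, shift: int, bits: int = 4) -> list[int]:
    """Quantize the residual (w_target - w_rram) to signed `bits`-bit integers at
    scale 2**shift, clipping to the SRAM word range."""
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return [max(lo, min(hi, round((t - r) * 2 ** shift)))
            for t, r in zip(w_target, w_rram)]


def hybrid_output(x, w_rram, correction, shift: int) -> float:
    """RRAM output plus the SRAM output scaled by 2**-shift, i.e. the SRAM
    result shifted right by `shift` bits relative to the RRAM result."""
    return dot(x, w_rram) + dot(x, correction) / 2 ** shift


# Hypothetical numbers: target weights vs. weights actually realized by RRAM cells.
x = [3, 1, 2, 4]
w_target = [0.50, -0.25, 0.75, -1.00]
w_rram = [0.60, -0.40, 0.70, -0.90]

y_true = dot(x, w_target)
errors = {
    s: abs(hybrid_output(x, w_rram, sram_correction(w_target, w_rram, s), s) - y_true)
    for s in range(8)
}
best_shift = min(errors, key=errors.get)

if __name__ == "__main__":
    print(f"uncompensated error: {abs(dot(x, w_rram) - y_true):.4f}")
    for s, e in errors.items():
        print(f"shift={s}: error={e:.4f}")
    print("best shift:", best_shift)
```

Both extremes of the shift perform poorly. A small shift rounds the residual away, and a large shift clips it to the narrow SRAM word. This is one way to see why a programmable shifter, rather than a fixed scale, is useful.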
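The quantum-efficiency bookkeeping in item 3 is simple arithmetic, and the Python sketch below reproduces it. It reads the symbol lost from the extracted EQE formula as the photodiode responsivity ℜ, giving EQE = e·I_P/(ℜ·I_E·hν). This is an assumption, consistent with the quoted 0.85 A/W figure. The efficiency values are the ones quoted in the item; the function names are hypothetical.

```python
H = 6.62607015e-34   # Planck constant, J*s (exact in SI)
C = 2.99792458e8     # speed of light, m/s (exact in SI)
Q = 1.602176634e-19  # elementary charge, C (exact in SI)


def photon_energy_ev(wavelength_nm: float) -> float:
    """Photon energy h*nu = h*c/lambda, in eV."""
    return H * C / (wavelength_nm * 1e-9) / Q


def eqe_from_currents(i_pd: float, i_rtd: float, responsivity: float,
                      wavelength_nm: float) -> float:
    """EQE = e*I_P / (R * I_E * h*nu): detected photons per second (I_P / R / h*nu)
    divided by electrons per second through the RTD (I_E / e)."""
    h_nu = photon_energy_ev(wavelength_nm) * Q  # joules
    return Q * i_pd / (responsivity * i_rtd * h_nu)


# Efficiencies quoted in item 3 (values at V_B = 3.0 V).
ETA_C = 3.4e-4   # optical coupling
ETA_I = 0.18     # electrical injection, J_inter / J_tot
ETA_R = 0.41     # Auger-limited radiative recombination

iqe_model = ETA_I * ETA_R         # IQE = eta_i * eta_r
eqe_model = ETA_C * iqe_model     # EQE = eta_c * IQE
iqe_from_measured = 2e-5 / ETA_C  # IQE backed out of the measured EQE of ~2e-5

if __name__ == "__main__":
    print(f"h*nu at 1550 nm = {photon_energy_ev(1550):.3f} eV")
    print(f"model IQE = {iqe_model:.4f}, model EQE = {eqe_model:.2e}")
    print(f"IQE from measured EQE = {iqe_from_measured:.4f}")
```

With the quoted values, η_i·η_r ≈ 0.074. Dividing the measured EQE of about 2×10⁻⁵ by η_c gives roughly 5.9%, consistent with the reported 6.0% given that the EQE figure is itself approximate.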
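Item 4 describes Ap-LBP only at a high level. As an assumption, the Python sketch below shows the classic 3×3 local binary pattern operator that LBP networks build on. It replaces multiply-accumulate with per-neighbor comparisons, which is why a comparator-based in-SRAM design can be MAC-free. The neighbor order, the ≥ tie rule, and the function names are conventional or hypothetical choices, not taken from NS-LBP.

```python
def lbp_code(patch: list[list[int]]) -> int:
    """8-bit LBP code for the center pixel of a 3x3 patch: one comparison per
    neighbor (neighbor >= center gives a 1 bit), walking clockwise from the
    top-left corner. No multiplications or accumulations are needed."""
    center = patch[1][1]
    neighbors = [patch[0][0], patch[0][1], patch[0][2], patch[1][2],
                 patch[2][2], patch[2][1], patch[2][0], patch[1][0]]
    return sum((1 if v >= center else 0) << i for i, v in enumerate(neighbors))


def lbp_image(img: list[list[int]]) -> list[list[int]]:
    """Apply lbp_code at every interior pixel (borders are skipped)."""
    h, w = len(img), len(img[0])
    return [[lbp_code([row[x - 1:x + 2] for row in img[y - 1:y + 2]])
             for x in range(1, w - 1)]
            for y in range(1, h - 1)]


if __name__ == "__main__":
    image = [
        [10, 20, 30, 40],
        [50, 60, 70, 80],
        [90, 100, 110, 120],
    ]
    print(lbp_image(image))
```

Each output bit needs only one comparator, so all eight bits of a code can be produced in parallel.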
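Item 5 contrasts FP8 with BF8, an 8-bit block floating-point format with a shared exponent. The abstract gives no bit layout. The Python sketch below therefore assumes 8-bit signed integer mantissas (sign included) and one shared power-of-two exponent per block; the function names are hypothetical. It shows why a shared exponent makes the dot product cheap: the inner loop becomes a plain integer MAC, followed by a single exponent addition.

```python
import math


def to_block_fp(values: list[float], mant_bits: int = 8) -> tuple[list[int], int]:
    """Encode a block as signed integer mantissas sharing one exponent.
    Element i is approximated by mants[i] * 2**scale_exp, where the shared
    exponent is set by the largest-magnitude element in the block."""
    nonzero = [v for v in values if v != 0]
    if not nonzero:
        return [0] * len(values), 0
    e_max = max(math.frexp(v)[1] for v in nonzero)  # |v| < 2**e_max for all v
    scale_exp = e_max - (mant_bits - 1)             # leave one bit for the sign
    limit = 2 ** (mant_bits - 1) - 1
    mants = [max(-limit, min(limit, round(v / 2 ** scale_exp))) for v in values]
    return mants, scale_exp


def from_block_fp(mants: list[int], scale_exp: int) -> list[float]:
    return [m * 2.0 ** scale_exp for m in mants]


def bfp_dot(a: list[float], b: list[float], mant_bits: int = 8) -> float:
    """Dot product of two blocks: an integer multiply-accumulate over the
    mantissas followed by a single exponent addition for the whole block."""
    ma, ea = to_block_fp(a, mant_bits)
    mb, eb = to_block_fp(b, mant_bits)
    acc = sum(x * y for x, y in zip(ma, mb))  # integer MAC, no per-element alignment
    return acc * 2.0 ** (ea + eb)


if __name__ == "__main__":
    a = [1.0, 0.5, -0.25, 0.001]
    b = [0.5, 0.5, 0.5, 0.5]
    exact = sum(x * y for x, y in zip(a, b))
    print("mantissas, exponent:", to_block_fp(a))
    print(f"exact = {exact}, block FP = {bfp_dot(a, b)}")
```

In the example, 0.001 is far below the block's largest element and quantizes to a zero mantissa. This illustrates the precision a shared exponent trades for cheaper integer MACs. A per-element FP8 format keeps such small values but must align exponents for every product.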