Abstract: We first propose an ultra-compact energy-efficient time-domain vector-by-matrix multiplier (VMM) based on commercial 3D-NAND flash memory structure. The proposed 3D-VMM uses a novel resistive successive integrate and re-scaling (RSIR) scheme to eliminate the stringent requirement of a bulky load capacitor which otherwise dominates the area- and energy-landscape of the conventional time-domain VMMs. Our rigorous analysis, performed at the 55 nm technology node, shows that RSIR-3D-VMM achieves a record-breaking area efficiency of ∼0.02 μm²/Byte and the energy efficiency of ∼6 fJ/Op for a 500 × 500 4-bit VMM, representing 5× and 1.3× improvements over the previously reported 3D-VMM approach. Moreover, unlike the previous approach, the proposed VMM can be efficiently tailored to work in a smaller current output range. Our second major contribution is the development of 3D-aCortex, a multi-purpose neuromorphic inference processor that utilizes the proposed 3D-VMM block as its core processing unit. Rigorous performance modeling of the 3D-aCortex targeting several state-of-the-art neural network benchmarks has shown that it may provide a record-breaking 30.7 MB/mm² storage efficiency, 113.3 TOp/J peak energy efficiency, and 10.66 TOp/s computational throughput. The system-level analysis indicates that the gain in the area-efficiency of RSIR leads to a smaller data transfer delay, which compensates for the reduction in the VMM throughput due to an increased input time window.
Mixed-Signal Vector-by-Matrix Multiplier Circuits Based on 3D-NAND Memories for Neurocomputing
We propose extremely dense, energy-efficient mixed-signal vector-by-matrix multiplication (VMM) circuits based on existing 3D-NAND flash memory blocks, without any need for their modification. Such compatibility is achieved using a time-domain-encoded VMM design. We have performed rigorous simulations of such a circuit, taking into account non-idealities such as drain-induced barrier lowering, capacitive coupling, charge injection, parasitics, process variations, and noise. Our results show, for example, that a 4-bit VMM of 200-element vectors, using commercially available 64-layer gate-all-around macaroni-type 3D-NAND memory blocks designed in the 55-nm technology node, may provide an unprecedented area efficiency of 0.14 µm²/byte and energy efficiency of ~11 fJ/Op, including the input/output and other peripheral circuitry overheads.
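As a rough illustration of what time-domain encoding means for a VMM, the sketch below models one common idealization: each input is converted to a pulse duration, each weight to a cell current, and each output integrates charge on a load capacitor. This is an assumed textbook-style model, not the circuit described in the abstract. It ignores every non-ideality listed above, and the function names and parameter values (`t_max`, `i_max`, `c_load`) are hypothetical.

```python
def quantize(value, bits=4):
    """Clip value to [0, 1] and snap it to one of 2**bits uniform levels."""
    levels = 2**bits - 1
    clipped = min(max(value, 0.0), 1.0)
    return round(clipped * levels) / levels


def time_domain_vmm(x, w, t_max=1e-6, i_max=1e-8, c_load=1e-12, bits=4):
    """Idealized time-domain VMM (illustrative assumptions only).

    x: list of n inputs in [0, 1]
    w: n-by-m list of weights in [0, 1]
    Returns the voltage accumulated on each of the m output capacitors.
    """
    n_out = len(w[0])
    voltages = []
    for j in range(n_out):
        charge = 0.0
        for i, x_i in enumerate(x):
            t_i = quantize(x_i, bits) * t_max       # input -> pulse duration (s)
            i_ij = quantize(w[i][j], bits) * i_max  # weight -> cell current (A)
            charge += i_ij * t_i                    # Q_j = sum_i I_ij * t_i
        voltages.append(charge / c_load)            # V_j = Q_j / C
    return voltages


if __name__ == "__main__":
    x = [1.0, 0.0, 1.0]
    w = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
    print(time_domain_vmm(x, w))
```

In this model each output voltage is proportional to the dot product of the quantized input vector with one quantized weight column, with scale factor `t_max * i_max / c_load`.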
- Award ID(s):
- 1740352
- PAR ID:
- 10192395
- Date Published:
- Journal Name:
- 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE)
- Page Range / eLocation ID:
- 696 to 701
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Instant data deletion (or sanitization) in NAND flash devices is essential for achieving data privacy, but it remains challenging due to the mismatch between erase and write granularities, which leads to high overhead and accelerated wear. While page-overwrite-based instant data sanitization has proven effective for 2D NAND, its applicability to 3D NAND is limited due to the unique sub-block architecture. In this study, we experimentally evaluate page-overwrite-based sanitization on commercial 3D NAND flash memory chips and uncover significant threshold voltage disturbances in erased cells on adjacent pages within the same layer but across different sub-blocks. Our key findings reveal that page-overwrite sanitization increases the median raw bit error rate (RBER) beyond correction limits (exceeding 0.93%) in Floating-Gate (FG) Single-Level Cell (SLC) technology, whereas Charge-Trap (CT) SLC 3D NAND flash memories exhibit higher robustness. In Triple-Level Cell (TLC) 3D NAND, page-overwrite sanitization proves impractical, with the median RBER of ∼13% for FG and ∼5% for CT devices. To overcome these challenges, we propose PULSE, a low-disturbance sanitization technique that balances sanitization efficiency ($\eta_{san}$) and data integrity (RBER). Experimental results show that PULSE eliminates RBER increases in SLC devices and reduces the median RBER to below 0.57% for FG and 0.79% for CT in fresh TLC blocks, demonstrating its practical viability for 3D NAND flash sanitization.
-
Vector-matrix multiplication (VMM) is a core operation in many signal and data processing algorithms. Previous work showed that analog multipliers based on nonvolatile memories have superior energy efficiency as compared to digital counterparts at low-to-medium computing precision. In this paper, we propose an extremely energy-efficient analog-mode VMM circuit with a digital input/output interface and configurable precision. Similar to some previous work, the computation is performed by a gate-coupled circuit utilizing embedded floating gate (FG) memories. The main novelty of our approach is an ultra-low-power sensing circuitry, which is designed based on a translinear Gilbert cell in topological combination with a floating resistor and a low-gain amplifier. Additionally, the digital-to-analog input conversion is merged with the VMM, while a current-mode algorithmic analog-to-digital circuit is employed at the circuit backend. Such implementations of conversion and sensing allow for circuit operation entirely in the current domain, resulting in high performance and energy efficiency. For example, post-layout simulation results for a 400×400 5-bit VMM circuit designed in a 55 nm process with embedded NOR flash memory show up to 400 MHz operation, 1.68 POps/J energy efficiency, and 39.45 TOps/mm² computing throughput. Moreover, the circuit is robust against process-voltage-temperature variations, in part due to the inclusion of additional FG cells that are utilized for offset compensation.
-
The Von Neumann bottleneck, a fundamental challenge in conventional computer architecture, arises from the inability to execute fetch and data operations simultaneously due to a shared bus linking processing and memory units. This bottleneck significantly limits system performance, increases energy consumption, and exacerbates computational complexity. Emerging technologies such as Resistive Random Access Memories (RRAMs), leveraging crossbar arrays, offer promising alternatives for addressing the demands of data-intensive computational tasks through in-memory computing of analog vector-matrix multiplication (VMM) operations. However, the propagation of errors due to device and circuit-level imperfections remains a significant challenge. In this study, we introduce MELISO (In-Memory Linear Solver), a comprehensive end-to-end VMM benchmarking framework tailored for RRAM-based systems. MELISO evaluates the error propagation in VMM operations, analyzing the impact of RRAM device metrics on error magnitude and distribution. This paper introduces the MELISO framework and demonstrates its utility in characterizing and mitigating VMM error propagation using state-of-the-art RRAM device metrics.
-
Low-to-medium resolution analog vector-by-matrix multipliers (VMMs) offer remarkable energy and area efficiency as compared to their digital counterparts. Still, the maximum attainable performance in analog VMMs is often bounded by the overhead of the peripheral circuits. The main contribution of this paper is the design of novel sensing circuitry which improves the energy efficiency and density of analog multipliers. The proposed circuit is based on a translinear Gilbert cell, which is topologically combined with a floating nonlinear resistor and a low-gain amplifier. Several compensation techniques are employed to ensure reliability with respect to process, temperature, and supply voltage variations. As a case study, we consider the implementation of a couple-gate current-mode VMM with embedded split-gate NOR flash memory. Our simulation results show that a 4-bit 100×100 VMM circuit designed in 55 nm CMOS technology achieves the record-breaking performance of 3.63 POps/J.

