We propose extremely dense, energy-efficient mixed-signal vector-by-matrix-multiplication (VMM) circuits based on existing 3D-NAND flash memory blocks, without any need for their modification. Such compatibility is achieved using a time-domain-encoded VMM design. We have performed rigorous simulations of such a circuit, taking into account non-idealities such as drain-induced barrier lowering, capacitive coupling, charge injection, parasitics, process variations, and noise. Our results show, for example, that a 4-bit VMM of 200-element vectors, using commercially available 64-layer gate-all-around macaroni-type 3D-NAND memory blocks designed in the 55-nm technology node, may provide an unprecedented area efficiency of 0.14 µm²/byte and an energy efficiency of ~11 fJ/Op, including the input/output and other peripheral circuitry overheads.
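The idea of time-domain encoding can be illustrated with a purely behavioral sketch. It is a generic idealization and not the circuit described in the abstract: inputs are assumed to be encoded as pulse durations, weights as cell currents, and each output as the charge integrated on a column. The function names, the time window `t_window`, and the unit current `i_unit` are hypothetical values chosen only for illustration, and none of the listed non-idealities are modeled.

```python
# Idealized behavioral model of a time-domain VMM (illustrative assumption,
# not the circuit from the abstract): inputs become pulse durations, weights
# become cell currents, and each output is the charge collected on a column.

def encode_durations(inputs, bits, t_window):
    """Map integer inputs in [0, 2**bits - 1] to pulse durations in seconds."""
    full_scale = 2**bits - 1
    return [x * t_window / full_scale for x in inputs]

def time_domain_vmm(inputs, weights, bits=4, t_window=1e-6, i_unit=1e-9):
    """Return the per-column charge Q_j = sum_i (w_ij * i_unit) * t_i."""
    durations = encode_durations(inputs, bits, t_window)
    n_cols = len(weights[0])
    return [
        sum(weights[i][j] * i_unit * durations[i] for i in range(len(inputs)))
        for j in range(n_cols)
    ]

def decode(charges, bits=4, t_window=1e-6, i_unit=1e-9):
    """Convert the integrated charges back to integer dot products."""
    scale = i_unit * t_window / (2**bits - 1)
    return [round(q / scale) for q in charges]

# Hypothetical 4-element input vector and 4x2 weight matrix, all 4-bit values.
inputs = [3, 15, 0, 7]
weights = [[1, 2], [0, 4], [5, 5], [15, 1]]
charges = time_domain_vmm(inputs, weights)
result = decode(charges)
```

In this idealization the decoded result equals the ordinary integer dot product of the input vector with each weight column; a real circuit would deviate from it through the non-idealities the abstract lists.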
3D-aCortex: an ultra-compact energy-efficient neurocomputing platform based on commercial 3D-NAND flash memories
Abstract: We first propose an ultra-compact energy-efficient time-domain vector-by-matrix multiplier (VMM) based on commercial 3D-NAND flash memory structure. The proposed 3D-VMM uses a novel resistive successive integrate and re-scaling (RSIR) scheme to eliminate the stringent requirement of a bulky load capacitor which otherwise dominates the area- and energy-landscape of the conventional time-domain VMMs. Our rigorous analysis, performed at the 55 nm technology node, shows that RSIR-3D-VMM achieves a record-breaking area efficiency of ∼0.02 μm²/Byte and the energy efficiency of ∼6 fJ/Op for a 500 × 500 4-bit VMM, representing 5× and 1.3× improvements over the previously reported 3D-VMM approach. Moreover, unlike the previous approach, the proposed VMM can be efficiently tailored to work in a smaller current output range. Our second major contribution is the development of 3D-aCortex, a multi-purpose neuromorphic inference processor that utilizes the proposed 3D-VMM block as its core processing unit. Rigorous performance modeling of the 3D-aCortex targeting several state-of-the-art neural network benchmarks has shown that it may provide a record-breaking 30.7 MB mm⁻² storage efficiency, 113.3 TOp/J peak energy efficiency, and 10.66 TOp/s computational throughput. The system-level analysis indicates that the gain in the area-efficiency of RSIR leads to a smaller data transfer delay, which compensates for the reduction in the VMM throughput due to an increased input time window.
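To put the per-byte and per-operation figures in context, the following back-of-the-envelope sketch scales them to the quoted 500 × 500 4-bit VMM. The function `vmm_budget` is hypothetical, and two conventions are assumptions that the abstract does not state: that one multiply-accumulate counts as two operations, and that 4-bit weights are counted as half a byte each.

```python
def vmm_budget(rows, cols, bits, area_um2_per_byte, energy_fj_per_op, ops_per_mac=2):
    """Estimate weight storage, area, and energy for a single VMM pass.

    ops_per_mac=2 (one multiply plus one add) and bits/8 bytes per weight
    are assumed conventions, not values taken from the abstract.
    """
    weight_bytes = rows * cols * bits / 8
    area_um2 = weight_bytes * area_um2_per_byte
    ops = rows * cols * ops_per_mac
    energy_pj = ops * energy_fj_per_op / 1000  # fJ -> pJ
    return weight_bytes, area_um2, energy_pj

# Figures quoted in the abstract: ~0.02 um^2/Byte and ~6 fJ/Op at 500 x 500, 4-bit.
weight_bytes, area_um2, energy_pj = vmm_budget(500, 500, 4, 0.02, 6)
print(f"{weight_bytes:.0f} bytes, {area_um2:.0f} um^2, {energy_pj:.0f} pJ per VMM")
```

Under these assumptions the array stores 125,000 bytes of weights in roughly 2,500 µm² and spends about 3 nJ per full vector-by-matrix product; a different operation-counting convention would change the energy estimate proportionally.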
- Award ID(s): 1740352
- PAR ID: 10311388
- Date Published:
- Journal Name: Neuromorphic Computing and Engineering
- Volume: 1
- Issue: 1
- ISSN: 2634-4386
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Low-to-medium resolution analog vector-by-matrix multipliers (VMMs) offer remarkable energy and area efficiency compared to their digital counterparts. Still, the maximum attainable performance of analog VMMs is often bounded by the overhead of the peripheral circuits. The main contribution of this paper is the design of novel sensing circuitry that improves the energy efficiency and density of analog multipliers. The proposed circuit is based on a translinear Gilbert cell, which is topologically combined with a floating nonlinear resistor and a low-gain amplifier. Several compensation techniques are employed to ensure reliability with respect to process, temperature, and supply voltage variations. As a case study, we consider the implementation of a couple-gate current-mode VMM with embedded split-gate NOR flash memory. Our simulation results show that a 4-bit 100×100 VMM circuit designed in 55 nm CMOS technology achieves a record-breaking performance of 3.63 POps/J.
- Vector-matrix multiplication (VMM) is a core operation in many signal and data processing algorithms. Previous work showed that analog multipliers based on nonvolatile memories have superior energy efficiency compared to their digital counterparts at low-to-medium computing precision. In this paper, we propose an extremely energy-efficient analog-mode VMM circuit with a digital input/output interface and configurable precision. Similar to some previous work, the computation is performed by a gate-coupled circuit utilizing embedded floating-gate (FG) memories. The main novelty of our approach is ultra-low-power sensing circuitry, which is based on a translinear Gilbert cell in topological combination with a floating resistor and a low-gain amplifier. Additionally, the digital-to-analog input conversion is merged with the VMM, while a current-mode algorithmic analog-to-digital circuit is employed at the circuit backend. Such implementations of conversion and sensing allow for circuit operation entirely in the current domain, resulting in high performance and energy efficiency. For example, post-layout simulation results for a 400×400 5-bit VMM circuit designed in a 55 nm process with embedded NOR flash memory show up to 400 MHz operation, 1.68 POps/J energy efficiency, and 39.45 TOps/mm² computing throughput. Moreover, the circuit is robust against process-voltage-temperature variations, in part due to the inclusion of additional FG cells that are utilized for offset compensation.
- Introduction: Parkinson’s disease (PD) is a neurodegenerative disorder affecting millions of patients. Closed-Loop Deep Brain Stimulation (CL-DBS) is a therapy that can alleviate the symptoms of PD. The CL-DBS system consists of an electrode sending electrical stimulation signals to a specific region of the brain and a battery-powered stimulator implanted in the chest. The electrical stimuli in CL-DBS systems need to be adjusted in real time in accordance with the state of PD symptoms. Therefore, fast and precise monitoring of PD symptoms is a critical function for CL-DBS systems. However, current CL-DBS techniques suffer from high computational demands for real-time PD symptom monitoring, which are not feasible for implanted and wearable medical devices. Methods: In this paper, we present an energy-efficient neuromorphic PD symptom detector using memristive three-dimensional integrated circuits (3D-ICs). The excessive oscillation at beta frequencies (13–35 Hz) at the subthalamic nucleus (STN) is used as a biomarker of PD symptoms. Results: Simulation results demonstrate that our neuromorphic PD detector, implemented with an 8-layer spiking Long Short-Term Memory (S-LSTM), excels in recognizing PD symptoms, achieving a training accuracy of 99.74% and a validation accuracy of 99.52% for a 75%–25% data split. Furthermore, we evaluated the improvement of our neuromorphic CL-DBS detector using NeuroSIM. The chip area, latency, energy, and power consumption of our CL-DBS detector were reduced by 47.4%, 66.63%, 65.6%, and 67.5%, respectively, for monolithic 3D-ICs. Similarly, for heterogeneous 3D-ICs, employing memristive synapses to replace traditional Static Random Access Memory (SRAM) resulted in reductions of 44.8%, 64.75%, 65.28%, and 67.7% in chip area, latency, energy, and power usage, respectively. Discussion: This study introduces a novel approach for PD symptom evaluation by directly utilizing spiking signals from neural activities in the time domain. This method significantly reduces the time and energy required for signal conversion compared to traditional frequency-domain approaches. The study pioneers the use of neuromorphic computing and memristors in designing CL-DBS systems, surpassing SRAM-based designs in chip area, latency, and energy efficiency. Lastly, the proposed neuromorphic PD detector demonstrates high resilience to timing variations in brain neural signals, as confirmed by robustness analysis.
- Large language models (LLMs) have achieved high accuracy in diverse NLP and computer vision tasks due to self-attention mechanisms relying on GEMM and GEMV operations. However, scaling LLMs poses significant computational and energy challenges, particularly for traditional von Neumann architectures (CPUs/GPUs), which incur high latency and energy consumption from frequent data movement. These issues are even more pronounced in energy-constrained edge environments. While DRAM-based near-memory architectures offer improved energy efficiency and throughput, their processing elements are limited by strict area, power, and timing constraints. This work introduces CIDAN-3D, a novel Processing-in-Memory (PIM) architecture tailored for LLMs. It features an ultra-low-power Neuron Processing Element (NPE) with high compute density (#Operations/Area), enabling efficient in-situ execution of LLM operations by leveraging high parallelism within DRAM. CIDAN-3D reduces data movement, improves locality, and achieves substantial gains in performance and energy efficiency, showing up to 1.3X higher throughput and 21.9X better energy efficiency for smaller models, and 3X throughput and 7X energy improvement for large decoder-only models, compared to prior near-memory designs. As a result, CIDAN-3D offers a scalable, energy-efficient platform for LLM-driven Gen-AI applications.