
This content will become publicly available on June 29, 2026

Title: FP-SMR: A Fully Digital Floating-Point Processing-in-SAS-MRAM for Session-based Recommender System
With the rapid advancement of DNNs, numerous Processing-in-Memory (PIM) architectures based on various memory technologies (non-volatile (NVM) and volatile memory) have been developed to accelerate AI workloads. Magnetic Random Access Memory (MRAM) is highly promising among NVMs due to its zero standby leakage, fast write/read speeds, CMOS compatibility, and high memory density. However, existing MRAM technologies such as spin-transfer torque MRAM (STT-MRAM) and spin-orbit torque MRAM (SOT-MRAM) have inherent limitations: STT-MRAM requires high write currents, while SOT-MRAM introduces significant area overhead due to additional access transistors. The new STT-assisted-SOT (SAS) MRAM provides an area-efficient alternative by sharing one write access transistor among multiple magnetic tunnel junctions (MTJs). This work presents the first fully digital processing-in-SAS-MRAM system to enable 8-bit floating-point (FP8) neural network inference, with an application to on-device session-based recommender systems. A SAS-MRAM device prototype is fabricated with 4 MTJs sharing the same SOT metal line. The proposed SAS-MRAM-based PIM macro is designed in TSMC 28nm technology. It achieves 15.31 TOPS/W energy efficiency and 269 GOPS performance for FP8 operations at 700 MHz. Compared to state-of-the-art recommender systems on the same popular YooChoose dataset, it demonstrates 86×, 1.8×, and 1.12× higher energy efficiency than GPU, SRAM-PIM, and ReRAM-PIM implementations, respectively.
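As a point of reference for the FP8 arithmetic mentioned above, the sketch below shows how an 8-bit floating-point value can be decoded, assuming the common E4M3 layout (1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7). The abstract does not state which FP8 format FP-SMR uses, so this is an illustrative assumption rather than a description of the paper's datapath.

```python
# Illustrative decoder for an FP8 value in the E4M3 layout
# (1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7).
# The exact FP8 format used by FP-SMR is not stated in the abstract;
# this only sketches how 8-bit floating point works in general.

def decode_fp8_e4m3(byte: int) -> float:
    sign = -1.0 if (byte >> 7) & 0x1 else 1.0
    exp = (byte >> 3) & 0xF            # 4-bit exponent field
    mant = byte & 0x7                  # 3-bit mantissa field
    bias = 7
    if exp == 0:                       # subnormal numbers
        return sign * (mant / 8.0) * 2.0 ** (1 - bias)
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - bias)

if __name__ == "__main__":
    print(decode_fp8_e4m3(0x38))       # exponent 7, mantissa 0 -> 1.0
    print(decode_fp8_e4m3(0xC0))       # sign 1, exponent 8   -> -2.0
```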
Award ID(s):
2314591 2505326 2528723 2528767 2503906 2505209
PAR ID:
10616386
Author(s) / Creator(s):
; ; ; ; ; ; ; ;
Publisher / Repository:
ACM
Date Published:
ISBN:
9798400714962
Page Range / eLocation ID:
341 to 347
Format(s):
Medium: X
Location:
New Orleans LA USA
Sponsoring Org:
National Science Foundation
More Like this
  1. Non-volatile memory (NVM) technologies such as spin-transfer torque magnetic random access memory (STT-MRAM) and spin-orbit torque magnetic random access memory (SOT-MRAM) have significant advantages over conventional SRAM due to their non-volatility, higher cell density, and scalability. While previous work has investigated several architectural implications of NVM for generic applications, in this work we present DeepNVM, a framework to characterize, model, and analyze NVM-based caches in GPU architectures for deep learning (DL) applications by combining technology-specific circuit-level models and the actual memory behavior of various DL workloads. We present both iso-capacity and iso-area performance and energy analyses for systems whose last-level caches rely on conventional SRAM and on the emerging STT-MRAM and SOT-MRAM technologies. In the iso-capacity case, STT-MRAM and SOT-MRAM provide up to 4.2× and 5× energy-delay product (EDP) reduction and 2.4× and 3× area reduction compared to conventional SRAM, respectively. Under iso-area assumptions, STT-MRAM and SOT-MRAM provide 2.3× EDP reduction on average across all workloads when compared to SRAM. Our comprehensive cross-layer framework is demonstrated on STT-/SOT-MRAM technologies and can be used for the characterization, modeling, and analysis of any NVM technology for last-level caches in GPU platforms for deep learning applications.
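The energy-delay product (EDP) cited in the abstract above is simply energy multiplied by delay, and an "EDP reduction" of k means the baseline EDP is k times the improved one. The snippet below is a minimal sketch of that arithmetic with hypothetical placeholder numbers, not values taken from the DeepNVM paper.

```python
# Minimal sketch of the energy-delay product (EDP) metric:
# EDP = energy * delay, and an "EDP reduction" of k means
# EDP_baseline / EDP_new = k. The numbers below are hypothetical
# placeholders, not results from the DeepNVM study.

def edp(energy_nj: float, delay_ns: float) -> float:
    return energy_nj * delay_ns

sram_edp = edp(energy_nj=10.0, delay_ns=4.0)   # hypothetical SRAM cache
sot_edp  = edp(energy_nj=2.5,  delay_ns=3.2)   # hypothetical SOT-MRAM cache

print(f"EDP reduction vs SRAM: {sram_edp / sot_edp:.1f}x")   # 5.0x here
```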
  2. Due to the separate memory and computation units in the traditional Von Neumann architecture, massive data transfer dominates the overall computing system's power and latency, known as the "Memory Wall" issue. Especially with ever-increasing deep-learning-based AI model sizes and computing complexity, it has become the bottleneck for state-of-the-art AI computing systems. To address this challenge, In-Memory Computing (IMC) based neural network accelerators have been widely investigated to support AI computing within memory. However, most of those works focus only on inference; on-device training and continual learning have not been well explored yet. In this work, for the first time, we introduce on-device continual learning with an STT-assisted-SOT (SAS) Magnetic Random Access Memory (MRAM) based IMC system. On the hardware side, we have fabricated a SAS-MRAM device prototype with 4 magnetic tunnel junctions (MTJs, each 100nm × 50nm) sharing a common heavy metal layer, achieving significantly improved memory writing and area efficiency compared to traditional SOT-MRAM. Next, we designed fully digital IMC circuits with our SAS-MRAM to support both neural network inference and on-device learning. To enable efficient on-device continual learning for new task data, we present an 8-bit integer (INT8) based continual learning algorithm that utilizes our SAS-MRAM IMC-supported bit-serial digital in-memory convolution operations to train a small parallel reprogramming network (Rep-Net) while freezing the major backbone model. Extensive studies are presented based on our fabricated SAS-MRAM device prototype, cross-layer device-circuit benchmarking and simulation, and an on-device continual learning system evaluation.
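The bit-serial digital in-memory convolution mentioned in the abstract above decomposes a multi-bit multiply-accumulate into per-bit-plane operations that are combined with shift-adds. The sketch below models that behavior for unsigned INT8 operands; it is a functional illustration only and does not reproduce the paper's circuit-level dataflow.

```python
# Sketch of a bit-serial multiply-accumulate, the style of operation the
# abstract attributes to the SAS-MRAM digital IMC macro. Unsigned 8-bit
# operands are assumed here for simplicity.

def bit_serial_mac(weights, activations, bits=8):
    """Accumulate sum(w*a) by iterating over activation bit planes."""
    acc = 0
    for b in range(bits):                        # one bit plane per cycle
        plane = sum(w * ((a >> b) & 1) for w, a in zip(weights, activations))
        acc += plane << b                        # shift-add partial sums
    return acc

ws = [3, 5, 7]
xs = [10, 20, 30]
assert bit_serial_mac(ws, xs) == sum(w * x for w, x in zip(ws, xs))
print(bit_serial_mac(ws, xs))                    # 340
```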
  3. The emergence of embedded magnetic random-access memory (MRAM) and its integration in mainstream semiconductor manufacturing technology have created an unprecedented opportunity for engineering computing systems with improved performance, energy efficiency, lower cost, and unconventional computing capabilities. While the initial interest in the existing generation of MRAM—which is based on the spin-transfer torque (STT) effect in ferromagnetic tunnel junctions—was driven by its nonvolatile data retention and lower cost of integration compared to embedded Flash (eFlash), the focus of MRAM research and development efforts has increasingly shifted toward alternative write mechanisms (beyond STT) and new materials (beyond ferromagnets) in recent years. This has been driven by the need for better speed-versus-density and speed-versus-endurance trade-offs to make MRAM applicable to a wider range of memory markets, as well as by the potential of MRAM in various unconventional computing architectures that exploit the physics of nanoscale magnets. In this Perspective, we offer an overview of spin–orbit torque (SOT) as one of these beyond-STT write mechanisms for MRAM devices. We discuss, specifically, the progress in developing SOT-MRAM devices with perpendicular magnetization. Starting from basic symmetry considerations, we discuss the requirement for an in-plane bias magnetic field, which has hindered progress in developing practical SOT-MRAM devices. We then discuss several approaches based on structural, magnetic, and chiral symmetry breaking that have been explored to overcome this limitation and realize bias-field-free SOT-MRAM devices with perpendicular magnetization. We also review the corresponding material- and device-level challenges in each case. We then present a perspective on the potential of these devices for computing and security applications beyond their use in the conventional memory hierarchy.
  4. While magnetoresistive random-access memory (MRAM) stands out as a leading candidate for embedded nonvolatile memory and last-level cache applications, its endurance is compromised by substantial self-heating due to the high programming current density. The effect of self-heating on the endurance of the magnetic tunnel junction (MTJ) has primarily been studied in spin-transfer torque (STT)-MRAM. Here, we analyze the transient temperature response of two-terminal spin–orbit torque (SOT)-MRAM with a 1 ns switching current pulse using electro-thermal simulations. We estimate a peak temperature range of 350–450 °C in 40 nm diameter MTJs, underscoring the critical need for thermal management to improve endurance. We suggest several thermal engineering strategies to reduce the peak temperature by up to 120 °C in such devices, which could improve their endurance by at least 1000× at a 0.75 V operating voltage. These results suggest that two-terminal SOT-MRAM could significantly outperform conventional STT-MRAM in terms of endurance, substantially benefiting from thermal engineering. These insights are pivotal for thermal optimization strategies in the development of MRAM technologies.
  5. In this paper, we propose MRIMA, a novel MRAM-based In-Memory Accelerator for non-volatile, flexible, and efficient in-memory computing. MRIMA transforms current Spin Transfer Torque Magnetic Random Access Memory (STT-MRAM) arrays into massively parallel computational units capable of working as both non-volatile memory and in-memory logic. Instead of integrating complex logic units in cost-sensitive memory, MRIMA exploits hardware-friendly bit-line computing methods to implement complete Boolean logic functions between operands within a memory array in a single clock cycle, overcoming the multi-cycle logic issue in contemporary Processing-In-Memory (PIM) platforms. We present practical case studies to demonstrate MRIMA's acceleration of binary-weight and low bit-width Convolutional Neural Networks (CNN) as well as data encryption. Our device-to-architecture co-simulation results on CNN acceleration demonstrate that MRIMA can obtain 1.7× better energy efficiency and 11.2× speed-up compared to ASICs, and 1.8× better energy efficiency and 2.4× speed-up over the best DRAM-based PIM solutions. As an AES in-memory encryption engine, MRIMA shows 77% and 21% lower energy consumption compared to a CMOS ASIC and a recent domain-wall-based design, respectively.
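The bit-line computing that MRIMA exploits can be modeled behaviorally: two rows of the array are activated together and the combined bit-line response is compared against different sense references to realize AND or OR in a single access. The sketch below is a simplified functional model of that idea, not the actual sensing circuitry.

```python
# Behavioral model of bit-line Boolean logic of the kind MRIMA describes:
# two rows of a memory array are activated together and the combined
# bit-line contribution is thresholded to produce AND or OR in one access.
# This is a functional sketch only; the sensing scheme is simplified.

def bitline_logic(row_a, row_b, op="AND"):
    out = []
    for a, b in zip(row_a, row_b):
        combined = a + b                    # both cells drive the bit-line
        if op == "AND":
            out.append(int(combined >= 2))  # high reference: both must be 1
        elif op == "OR":
            out.append(int(combined >= 1))  # low reference: either suffices
        else:
            raise ValueError(op)
    return out

a = [1, 0, 1, 1]
b = [1, 1, 0, 1]
print(bitline_logic(a, b, "AND"))   # [1, 0, 0, 1]
print(bitline_logic(a, b, "OR"))    # [1, 1, 1, 1]
```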