

Title: Processing-in-Memory Designs Based on Emerging Technology for Efficient Machine Learning Acceleration
The unprecedented success of artificial intelligence (AI) has enriched machine learning (ML)-based applications. The availability of big data and compute-intensive algorithms enables versatile, highly accurate ML approaches. However, the data movement and innumerable computations involved burden conventional hardware systems with high power consumption and low performance. Breaking away from traditional hardware design, non-conventional accelerators exploiting emerging technology have gained significant attention, since emerging devices enable processing-in-memory (PIM) designs with dramatic improvements in efficiency. This paper presents a summary of state-of-the-art PIM accelerators over the past decade. PIM accelerators have been implemented for diverse models and advanced algorithmic techniques across neural networks in language processing and image recognition, to expedite both inference and training. We present the implemented designs, methodologies, and results, following their development over the past years. A promising direction for PIM accelerators, vertical stacking for More than Moore, is also discussed.
Award ID(s):
2328805 2112562
PAR ID:
10534702
Publisher / Repository:
ACM
Page Range / eLocation ID:
614-619
Subject(s) / Keyword(s):
Processing-in-Memory, Accelerators, Emerging Technology, Memristor, Deep Learning
Format(s):
Medium: X
Location:
Clearwater, FL
Sponsoring Org:
National Science Foundation
More Like this
  1. In this work, we review two alternative Processing-in-Memory (PIM) accelerators based on Spin-Orbit-Torque Magnetic Random Access Memory (SOT-MRAM) that execute DNA short-read alignment with an optimized and hardware-friendly alignment algorithm. We first discuss reconstructing the existing sequence alignment algorithm, based on the BWT and FM-index, so that it can be fully implemented using PIM functions. We then transform the SOT-MRAM array into a potential computational memory by presenting two different reconfigurable sense amplifiers to accelerate the reconstructed alignment-in-memory algorithm. Cross-layer simulation results show that such PIM platforms achieve nearly ten-fold and two-fold increases in the throughput/power/area metric compared with recent ASIC and processing-in-ReRAM designs, respectively.
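The BWT/FM-index backward search that such accelerators map onto in-memory operations can be sketched in software. The following toy Python implementation is illustrative only: the function names and the rotation-sort BWT construction are our own simplifications, not the paper's hardware algorithm.

```python
# Minimal FM-index backward search over a toy reference.
# Counts exact occurrences of a pattern using only the BWT,
# the rank (Occ) table, and first-column counts (C) -- the same
# primitives a PIM design would map to in-memory operations.

def bwt(text):
    """Burrows-Wheeler transform via full rotation sort (toy scale only)."""
    text += "$"  # sentinel, lexicographically smallest
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def fm_index(text):
    b = bwt(text)
    alphabet = sorted(set(b))
    # C[c]: number of characters in the text strictly smaller than c
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += b.count(ch)
    # occ[c][i]: occurrences of c in b[:i]
    occ = {ch: [0] * (len(b) + 1) for ch in alphabet}
    for i, ch in enumerate(b):
        for a in alphabet:
            occ[a][i + 1] = occ[a][i] + (1 if a == ch else 0)
    return b, c_table, occ

def count_occurrences(text, pattern):
    """Number of exact matches of `pattern` in `text` via backward search."""
    b, c, occ = fm_index(text)
    lo, hi = 0, len(b)  # current suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in c:
            return 0
        lo = c[ch] + occ[ch][lo]
        hi = c[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

For example, `count_occurrences("ACGTACGT", "ACG")` finds both occurrences of `ACG`. Each backward-search step only needs rank queries, which is what makes the algorithm amenable to in-memory bitwise implementations.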
  2. With the prosperous development of deep neural networks (DNNs), numerous Process-In-Memory (PIM) designs have emerged to accelerate DNN models with exceptional throughput and energy efficiency. PIM accelerators based on non-volatile memory (NVM) or volatile memory offer distinct advantages for computational efficiency and performance. NVM-based PIM accelerators, though successful for DNN inference, face limitations in on-device learning due to high write energy, latency, and instability. Conversely, fast volatile memories such as SRAM offer rapid read/write operations for DNN training but suffer from significant leakage currents and large memory footprints. In this paper, for the first time, we present a fully digital sparse PIM scheme in a hybrid NVM-SRAM design that synergistically combines the strengths of NVM and SRAM, tailored for on-device continual learning. Our NVM- and SRAM-based PIM circuit macros support both storage and processing of N:M structured sparsity patterns, significantly improving storage and computing efficiency. Exhaustive experiments demonstrate that our hybrid system effectively reduces area and power consumption while maintaining high accuracy, offering a scalable and versatile solution for on-device continual learning.
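N:M structured sparsity keeps at most N nonzero weights in every group of M consecutive weights, so hardware can store just the N values plus their in-group indices. A minimal Python sketch of pruning to and checking this pattern (our own illustration, not the paper's macro design):

```python
# Hedged sketch of N:M structured sparsity (e.g., 2:4):
# in every group of M consecutive weights, keep the N
# largest-magnitude values and zero the rest.

def prune_n_m(weights, n=2, m=4):
    """Return a copy of `weights` pruned to N:M structured sparsity."""
    assert len(weights) % m == 0, "length must be a multiple of M"
    pruned = []
    for g in range(0, len(weights), m):
        group = weights[g:g + m]
        # in-group indices of the N largest-magnitude entries
        keep = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned

def satisfies_n_m(weights, n=2, m=4):
    """Check the N:M constraint: at most N nonzeros per group of M."""
    return all(sum(1 for v in weights[g:g + m] if v != 0.0) <= n
               for g in range(0, len(weights), m))
```

Because every group obeys the same fixed budget, a PIM macro can allocate identical storage and compute per group, which is what makes the pattern hardware-friendly compared with unstructured sparsity.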
  3. Solving partial differential equations (PDEs) is a cornerstone of scientific research and development. Data-driven machine learning (ML) approaches are emerging to accelerate time-consuming and compute-intensive numerical simulations of PDEs. Although optical systems offer high-throughput and energy-efficient ML hardware, their demonstration for solving PDEs has been limited. Here, we present an optical neural engine (ONE) architecture that combines diffractive optical neural networks for Fourier-space processing with optical crossbar structures for real-space processing to solve time-dependent and time-independent PDEs in diverse disciplines, including the Darcy flow equation, the magnetostatic Poisson equation in demagnetization, the Navier-Stokes equation for incompressible fluids, Maxwell's equations in nanophotonic metasurfaces, and coupled PDEs in a multiphysics system. We numerically and experimentally demonstrate the capability of the ONE architecture. It not only leverages high-performance dual-space processing to outperform traditional PDE solvers and match state-of-the-art ML models, but can also be implemented in optical computing hardware with low energy consumption, highly parallel constant-time processing irrespective of model scale, and real-time reconfigurability for tackling multiple tasks with the same architecture. The demonstrated architecture offers a versatile and powerful platform for large-scale scientific and engineering computations.
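The Fourier-space half of dual-space processing can be illustrated digitally with a toy spectral solver: a periodic 1D Poisson problem u'' = f is solved by dividing Fourier coefficients by -k². This pure-Python sketch is our own illustration of the principle, not the optical implementation:

```python
# Toy spectral Poisson solver: u'' = f on [0, 2*pi) with periodic
# boundary conditions. In Fourier space differentiation becomes
# multiplication by i*k, so the solve is a per-mode division.
import cmath
import math

def dft(x):
    """Naive O(n^2) discrete Fourier transform (toy scale)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * math.pi * k * j / n) for j in range(n))
            for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * j / n) for k in range(n)) / n
            for j in range(n)]

def solve_poisson_periodic(f):
    """Zero-mean solution of u'' = f sampled on a uniform periodic grid."""
    n = len(f)
    F = dft(f)
    U = [0j] * n
    for k in range(1, n):       # skip k = 0: mean is fixed to zero
        kk = k if k <= n // 2 else k - n  # signed wavenumber
        U[k] = -F[k] / (kk * kk)
    return [u.real for u in idft(U)]
```

With f(x) = -sin(x) the solver recovers u(x) = sin(x) on the grid. The per-mode division is trivially parallel, which is the property an optical Fourier-space processor exploits.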
  4. As data-intensive applications increasingly strain conventional computing systems, processing-in-memory (PIM) has emerged as a promising paradigm to alleviate the memory wall by minimizing data transfer between memory and processing units. This review presents recent advances in both stateful and non-stateful logic techniques for PIM, focusing on emerging nonvolatile memory technologies such as resistive random-access memory (RRAM), phase-change memory (PCM), and magnetoresistive random-access memory (MRAM). Both experimentally demonstrated and simulated logic designs are critically examined, highlighting key challenges in reliability and the role of device-level optimization in enabling scalable and commercially viable PIM systems. The review begins with an overview of relevant logic families, memristive device types, and associated reliability metrics. Each logic family is then explored in terms of how it capitalizes on distinct device properties to implement logic techniques. A comparative table of representative device stacks and performance parameters illustrates trade-offs and quality indicators. This comprehensive analysis supports the development of optimized, robust memristive devices for next-generation PIM applications.
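Stateful logic families such as IMPLY and MAGIC compute directly on the resistance states stored in the array: 0 is a high-resistance and 1 a low-resistance device, and the gate result is written into an output device in place. A toy behavioral model, logic-level only with no device physics (our own simplification):

```python
# Behavioral sketch of stateful memristive logic primitives.

def imply(p, q):
    """Material implication p -> q, the IMPLY-family primitive."""
    return 1 if (p == 0 or q == 1) else 0

def magic_nor(a, b):
    """MAGIC NOR: output device preset to 1 is pulled to 0
    unless both inputs are in the high-resistance (0) state."""
    return 1 if (a == 0 and b == 0) else 0

def nand_via_imply(p, q):
    """Classic construction: NAND(p, q) = IMPLY(p, IMPLY(q, 0))."""
    work = 0              # work cell initialized to logic 0
    t = imply(q, work)    # t = NOT q
    return imply(p, t)    # p -> NOT q  ==  NAND(p, q)
```

Since NAND (or NOR) is functionally complete, cascading these in-array steps yields arbitrary logic; the reliability challenges the review examines arise because each step conditionally switches real devices.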
  5. In this paper, for the first time, we propose a high-throughput and energy-efficient processing-in-DRAM-accelerated genome assembler, called PIM-Assembler, based on an optimized and hardware-friendly genome assembly algorithm. PIM-Assembler can assemble large-scale DNA sequence datasets from all-pair overlaps. We first develop the PIM-Assembler platform, which harnesses DRAM as computational memory and transforms it into a fundamental processing unit for genome assembly. PIM-Assembler performs efficient X(N)OR-based operations inside DRAM at low cost on top of commodity DRAM designs (~5% of chip area). PIM-Assembler is then optimized through a correlated data partitioning and mapping methodology that allows local storage and processing of DNA short reads to fully exploit the genome assembly algorithm's parallelism. Simulation results show that PIM-Assembler achieves on average 8.4× and 2.3× higher throughput for bulk bit-XNOR-based comparison operations compared with CPU and recent processing-in-DRAM platforms, respectively. For the comparison/addition-extensive genome assembly application, it reduces execution time and power by ~5× and ~7.5× compared to a GPU.
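The bulk bit-XNOR comparison underlying such designs can be mimicked in software: two reads encoded at 2 bits per base are XNORed in one wide bitwise operation, and matching bases are counted. A hedged Python sketch (the encoding and names are our own illustration, not PIM-Assembler's):

```python
# Sketch of bulk bitwise XNOR comparison of DNA short reads,
# the operation a processing-in-DRAM design performs across
# an entire row in parallel.

BASE_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def encode(read):
    """Pack a DNA read into one integer, 2 bits per base."""
    word = 0
    for base in read:
        word = (word << 2) | BASE_BITS[base]
    return word

def matching_bases(read_a, read_b):
    """Count positions where two equal-length reads agree."""
    assert len(read_a) == len(read_b)
    width = 2 * len(read_a)
    # one wide XNOR over the packed words (masked to `width` bits)
    xnor = ~(encode(read_a) ^ encode(read_b)) & ((1 << width) - 1)
    # a base matches only if BOTH of its bits match (pair == 0b11)
    return sum((xnor >> i) & 0b11 == 0b11 for i in range(0, width, 2))
```

For instance, `matching_bases("ACGT", "ACGA")` reports 3 agreements. In DRAM the XNOR is computed across thousands of bit-lines at once, which is where the reported throughput gains over CPU baselines come from.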