Title: Memory-Based Computing for Energy-Efficient AI: Grand Challenges
The remarkable progress in artificial intelligence (AI) has ushered in a new era characterized by models with billions of parameters, enabling extraordinary capabilities across diverse domains. However, these achievements come at a significant cost in memory and energy consumption. The growing demand for computational resources raises grand challenges for the sustainable development of energy-efficient AI systems. This paper examines the paradigm of memory-based computing as a promising avenue to address these challenges. By capitalizing on the inherent characteristics of memory and its efficient utilization, memory-based computing offers a novel approach to enhancing AI performance while reducing the associated energy costs. Our paper systematically analyzes the multifaceted aspects of this paradigm, highlighting its potential benefits and outlining the challenges it poses. Through an exploration of various methodologies, architectures, and algorithms, we elucidate the interplay between memory utilization, computational efficiency, and AI model complexity. Furthermore, we review the evolving area of hardware and software solutions for memory-based computing, underscoring their implications for achieving energy-efficient AI systems. As AI continues its rapid evolution, the key challenges and insights identified in this paper serve as a foundational guide for researchers striving to navigate the complex field of memory-based computing and its pivotal role in shaping the future of energy-efficient AI.
Page Range / eLocation ID:
1 to 8
Subject(s) / Keyword(s):
Compute-in-memory, energy-efficiency, deep learning, large language models
Medium: X
Dubai, United Arab Emirates
Sponsoring Org:
National Science Foundation
More Like this
  1. By mimicking biological synaptic processes, artificial intelligence (AI) has achieved astounding success in applications such as driving automation, big data analysis, and natural-language processing.[1-4] Because a large quantity of data must be transmitted between the separate memory unit and logic unit, the classical computing system with von Neumann architecture consumes excessive energy and has a significant processing delay.[5] Furthermore, the speed difference between the two units causes extra delay, which is referred to as the memory wall.[6, 7] To keep pace with the rapid growth of AI applications, energy-efficient and high-speed hardware systems need to be secured. The neuromorphic computing system, an in-memory architecture with low power consumption, has been suggested as an alternative to the conventional system. Memristors with analog-type resistive switching behavior are a promising candidate for implementing the neuromorphic computing system, since the devices can modulate their conductance over switching cycles, and this conductance acts as a synaptic weight to process input signals and store information.[8, 9]

    The memristor has sparked tremendous interest due to its simple two-terminal structure, consisting of a top electrode (TE), a bottom electrode (BE), and an intermediate resistive switching (RS) layer. Many oxide materials, including HfO2, Ta2O5, and IGZO, have been studied extensively as the RS layer of memristors. Silicon dioxide (SiO2) features 3D structural conformity with conventional CMOS technology and high wafer-scale homogeneity, which has benefited modern microelectronic devices as dielectric and/or passivation layers. Therefore, the use of SiO2 as a memristor RS layer for neuromorphic computing is expected to be compatible with current Si technology with minimal processing and material-related complexities.

    In this work, we propose a SiO2-based memristor and investigate its switching behavior when metallized with top electrodes of different reduction potentials: pure Cu, pure Ag, and Cu/Ag alloys of varied ratios. Heavily doped p-type silicon was chosen as the BE in order to exclude any effects of BE ions on memristor performance. We previously reported that the selection of the TE is crucial for achieving a high memory window and stable switching performance. Based on that study, which compares the roles of Cu (switching stabilizer) and Ag (large switching window performer) TEs for oxide memristors, we selected the TE materials and their alloys to engineer the SiO2-based memristor characteristics. The Ag TE leads to a larger memory window for the SiO2 memristor, but the device shows relatively large variation and lower reliability. On the other hand, the Cu TE device presents uniform gradual switching behavior, in line with our previous report that Cu can serve as a stabilizer, but with a small on/off ratio.[9] These distinct performances with Cu and Ag metallization lead us to utilize a Cu/Ag alloy as the TE. Various compositions of Cu/Ag were examined to optimize the memristor TEs. With a Cu/Ag alloy TE of optimized ratio, our SiO2-based memristor demonstrates uniform switching behavior and a memory window suitable for analog switching applications. It also shows ideal potentiation and depression synaptic behavior under positive/negative spikes (pulse train).

    In conclusion, SiO2 memristors with different metallizations were established. To tune the properties of the RS layer, the sputtering conditions of the RS layer were varied. To investigate the influence of TE selection on the switching performance of the memristor, we integrated Cu, Ag, and Cu/Ag alloy as TEs and compared the switching characteristics. Our encouraging results clearly demonstrate that SiO2 with Cu/Ag is a promising memristor device with synaptic switching behavior for neuromorphic computing applications.


    This work was supported by the U.S. National Science Foundation (NSF) Award No. ECCS-1931088. S.L. and H.W.S. acknowledge the support from the Improvement of Measurement Standards and Technology for Mechanical Metrology (Grant No. 22011044) by KRISS.


    [1] Young et al., IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55-75, 2018.

    [2] Hadsell et al., Journal of Field Robotics, vol. 26, no. 2, pp. 120-144, 2009.

    [3] Najafabadi et al., Journal of Big Data, vol. 2, no. 1, p. 1, 2015.

    [4] Zhao et al., Applied Physics Reviews, vol. 7, no. 1, 2020.

    [5] Zidan et al., Nature Electronics, vol. 1, no. 1, pp. 22-29, 2018.

    [6] Wulf et al., SIGARCH Comput. Archit. News, vol. 23, no. 1, pp. 20-24, 1995.

    [7] Wilkes, SIGARCH Comput. Archit. News, vol. 23, no. 4, pp. 4-6, 1995.

    [8] Ielmini et al., Nature Electronics, vol. 1, no. 6, pp. 333-343, 2018.

    [9] Chang et al., Nano Letters, vol. 10, no. 4, pp. 1297-1301, 2010.

    [10] Qin et al., Physica Status Solidi (RRL) - Rapid Research Letters, pssr.202200075R1, In press, 2022.

  2. Abstract

    Database peptide search is the primary computational technique for identifying peptides from mass spectrometry (MS) data. Graphics processing unit (GPU) computing is now ubiquitous in the current generation of high-performance computing (HPC) systems, yet its application in the database peptide search domain remains limited. Part of the reason is the use of sub-optimal algorithms in the existing GPU-accelerated methods, resulting in significantly inefficient hardware utilization. In this paper, we design and implement a new-age CPU-GPU HPC framework, called GiCOPS, for efficient and complete GPU acceleration of modern database peptide search algorithms on supercomputers. Our experimentation shows that GiCOPS exhibits between 1.2× and 5× speed improvement over its CPU-only predecessor, HiCOPS, and over 10× improvement over several existing GPU-based database search algorithms for sufficiently large experiment sizes. We further assess and optimize the performance of our framework using the Roofline Model and report near-optimal results for several metrics, including computations per second, occupancy rate, memory workload, branch efficiency, and shared memory performance. Finally, the CPU-GPU methods and optimizations proposed in our work for complex integer- and memory-bound algorithmic pipelines can also be extended to accelerate existing and future peptide identification algorithms. GiCOPS is now integrated with our umbrella HPC framework HiCOPS and is available at:

  3. Abstract

    A neuromorphic computing system may be able to learn and perform a task on its own by interacting with its surroundings. Combining such a chip with complementary metal–oxide–semiconductor (CMOS)‐based processors can potentially solve a variety of problems being faced by today's artificial intelligence (AI) systems. Although various architectures purely based on CMOS are designed to maximize the computing efficiency of AI‐based applications, the most fundamental operations including matrix multiplication and convolution heavily rely on the CMOS‐based multiply–accumulate units which are ultimately limited by the von Neumann bottleneck. Fortunately, many emerging memory devices can naturally perform vector matrix multiplication directly utilizing Ohm's law and Kirchhoff's law when an array of such devices is employed in a cross‐bar architecture. With certain dynamics, these devices can also be used either as synapses or neurons in a neuromorphic computing system. This paper discusses various emerging nanoscale electronic devices that can potentially reshape the computing paradigm in the near future.
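
    The crossbar vector-matrix multiplication mentioned above can be written out as a short worked equation. This is a sketch under idealized assumptions: the symbols $V_i$ (voltage applied to row $i$), $G_{ij}$ (conductance of the device at row $i$, column $j$), and $I_j$ (current collected on column $j$) are introduced here for illustration and do not come from the abstract, and wire resistance and sneak-path currents are neglected.

    ```latex
    % Ohm's law for each device, then Kirchhoff's current law on each column
    I_{ij} = G_{ij}\, V_i,
    \qquad
    I_j = \sum_{i=1}^{N} I_{ij} = \sum_{i=1}^{N} G_{ij}\, V_i
    \quad\Longleftrightarrow\quad
    \mathbf{I} = G^{\top} \mathbf{V}
    ```

    Under these assumptions, the column currents give all the multiply-accumulate results of one vector-matrix product at once, with the stored conductances playing the role of the weights.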

  4. Channel decoders are key computing modules in wired/wireless communication systems. Recently, neural network (NN)-based decoders have shown promising error-correcting performance because of their end-to-end learning capability. However, compared with traditional approaches, the emerging neural belief propagation (NBP) solution suffers from higher storage and computational complexity, limiting its hardware performance. To address this challenge and develop a channel decoder that achieves high decoding performance and hardware performance simultaneously, in this paper we take a first step towards exploring SRAM-based in-memory computing for efficient NBP channel decoding. We first analyze the unique sparsity pattern in NBP processing, and then propose an efficient and fully Digital Sparse In-Memory Matrix vector Multiplier (DSPIMM) computing platform. Extensive experiments demonstrate that our proposed DSPIMM achieves significantly higher energy efficiency and throughput than state-of-the-art counterparts.
  5. Deep neural networks (DNNs) have emerged as the most important and popular artificial intelligence (AI) technique. The growth of model size poses a key energy efficiency challenge for the underlying computing platform. Thus, model compression becomes a crucial problem. However, the current approaches are limited by various drawbacks. Specifically, the network sparsification approach suffers from irregularity, a heuristic nature, and large indexing overhead. On the other hand, the recent structured matrix-based approach (i.e., CIRCNN) is limited by relatively complex arithmetic computation (i.e., FFT), a less flexible compression ratio, and its inability to fully utilize input sparsity. To address these drawbacks, this paper proposes PERMDNN, a novel approach to generate and execute hardware-friendly structured sparse DNN models using permuted diagonal matrices. Compared with the unstructured sparsification approach, PERMDNN eliminates indexing overhead, provides non-heuristic compression effects, and avoids time-consuming retraining. Compared with the circulant structure-imposing approach, PERMDNN enjoys the benefits of higher reduction in computational complexity, flexible compression ratio, simple arithmetic computation, and full utilization of input sparsity. We propose the PERMDNN architecture, a multi-processing element (PE) fully connected (FC) layer-targeted computing engine. The entire architecture is highly scalable and flexible, and hence it can support the needs of different applications with different model configurations. We implement a 32-PE design using CMOS 28nm technology. Compared with EIE, PERMDNN achieves 3.3x-4.8x higher throughput, 5.9x-8.5x better area efficiency, and 2.8x-4.0x better energy efficiency on different workloads. Compared with CIRCNN, PERMDNN achieves 11.51x higher throughput and 3.89x better energy efficiency.
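
To illustrate the permuted diagonal matrices mentioned in item 5, here is a minimal Python sketch of a matrix-vector product over a block permuted-diagonal weight matrix. It is not the PERMDNN implementation. It assumes that each p-by-p block stores only p weights plus one integer offset k, with the nonzero of block row i sitting in block column (i + k) mod p. The function names `perm_diag_matvec`, `to_dense`, and `dense_matvec`, and the data layout, are hypothetical choices made for this example.

```python
def perm_diag_matvec(values, shifts, x, p):
    """Multiply a block permuted-diagonal matrix by the vector x.

    values[r][c] holds the p nonzero weights of block (r, c).
    shifts[r][c] holds that block's column offset k (an assumed layout).
    """
    n_block_rows, n_block_cols = len(values), len(values[0])
    y = [0.0] * (n_block_rows * p)
    for r in range(n_block_rows):
        for c in range(n_block_cols):
            k = shifts[r][c]
            for i in range(p):
                j = (i + k) % p  # column of the single nonzero in row i
                y[r * p + i] += values[r][c][i] * x[c * p + j]
    return y


def to_dense(values, shifts, p):
    """Expand the compact form into a full matrix, for checking only."""
    n_block_rows, n_block_cols = len(values), len(values[0])
    dense = [[0.0] * (n_block_cols * p) for _ in range(n_block_rows * p)]
    for r in range(n_block_rows):
        for c in range(n_block_cols):
            k = shifts[r][c]
            for i in range(p):
                dense[r * p + i][c * p + (i + k) % p] = values[r][c][i]
    return dense


def dense_matvec(matrix, x):
    """Reference dense matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


# One block row, two block columns, block size p = 2.
values = [[[1.0, 2.0], [3.0, 4.0]]]
shifts = [[0, 1]]
x = [1.0, 10.0, 100.0, 1000.0]

y = perm_diag_matvec(values, shifts, x, 2)
print(y)
print(dense_matvec(to_dense(values, shifts, 2), x))
```

In this layout each block costs p multiplications and p stored weights instead of p squared, and the position of every nonzero follows from the offset alone, so no per-weight index is stored.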