Large language models (LLMs) have achieved high accuracy across diverse NLP and computer vision tasks thanks to self-attention mechanisms built on GEMM and GEMV operations. However, scaling LLMs poses significant computational and energy challenges, particularly for traditional von Neumann architectures (CPUs/GPUs), which incur high latency and energy consumption from frequent data movement. These issues are even more pronounced in energy-constrained edge environments. While DRAM-based near-memory architectures offer improved energy efficiency and throughput, their processing elements are limited by strict area, power, and timing constraints. This work introduces CIDAN-3D, a novel Processing-in-Memory (PIM) architecture tailored for LLMs. It features an ultra-low-power Neuron Processing Element (NPE) with high compute density (#Operations/Area), enabling efficient in-situ execution of LLM operations by exploiting the high parallelism available within DRAM. CIDAN-3D reduces data movement, improves locality, and achieves substantial gains in performance and energy efficiency: up to 1.3X higher throughput and 21.9X better energy efficiency for smaller models, and 3X throughput and 7X energy improvement for large decoder-only models compared to prior near-memory designs. As a result, CIDAN-3D offers a scalable, energy-efficient platform for LLM-driven Gen-AI applications.
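To make the GEMM/GEMV framing concrete, here is a minimal NumPy sketch (illustrative only, not the CIDAN-3D NPE datapath) showing how scaled dot-product attention reduces to two matrix products; during autoregressive decoding the query shrinks to a single row, turning each product into a GEMV:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention expressed as two matrix products.

    With a batch of queries both products are GEMMs; during
    autoregressive decoding Q is a single row, so each becomes a GEMV.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # first GEMM/GEMV
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # second GEMM/GEMV

# Toy shapes: 4 queries attending over 8 cached keys/values of width 16.
Q, K, V = np.random.randn(4, 16), np.random.randn(8, 16), np.random.randn(8, 16)
print(attention(Q, K, V).shape)  # -> (4, 16)
```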
A DRAM-based Near-Memory Architecture for Accelerated and Energy-Efficient Execution of Transformers
Transformer-based language models have achieved remarkable accuracy in various NLP tasks, employing self-attention mechanisms based primarily on matrix multiplication. However, their significant size leads to data movement issues, causing latency and energy efficiency challenges in conventional von Neumann systems. To mitigate these issues, several in-memory and near-memory architectures have been proposed. This paper introduces PACT-3D, a near-memory architecture featuring novel computing units integrated with DRAM banks. PACT-3D reduces latency by 1.7× and improves energy efficiency by 18.7× compared to state-of-the-art near-memory architectures.
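As a conceptual sketch of what bank-integrated compute buys (a simplified model under assumed parameters, not PACT-3D's actual computing units), a matrix-vector product can be row-partitioned across DRAM banks so that each bank-side unit touches only locally stored weights:

```python
import numpy as np

def banked_gemv(W, x, n_banks=8):
    """Row-partition W across banks; each bank-side unit computes the
    partial GEMV on its local rows, avoiding off-chip weight movement.
    (n_banks and the row partitioning scheme are illustrative assumptions.)
    """
    rows_per_bank = np.array_split(np.arange(W.shape[0]), n_banks)
    partials = [W[rows] @ x for rows in rows_per_bank]  # per-bank compute
    return np.concatenate(partials)

W, x = np.random.randn(64, 32), np.random.randn(32)
assert np.allclose(banked_gemv(W, x), W @ x)
print("banked result matches monolithic GEMV")
```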
- PAR ID: 10519580
- Publisher / Repository: ACM
- Date Published:
- ISBN: 9798400706059
- Page Range / eLocation ID: 57 to 62
- Subject(s) / Keyword(s): In/Near-memory Processing, LLMs, Transformers, DRAM, Memory Wall, Energy Efficiency
- Format(s): Medium: X
- Location: Clearwater FL USA
- Sponsoring Org: National Science Foundation
More Like this
-
Recent advances in GPU-based manycore accelerators provide the opportunity to efficiently process large-scale graphs on chip. However, real-world graphs have a diverse range of topology and connectivity patterns (e.g., degree distributions) that make the design of input-agnostic hardware architectures a challenge. Network-on-Chip (NoC)-based architectures provide a way to overcome this challenge, as the architectural topology can be used to approximately model the expected traffic patterns that emerge from graph application workloads. In this paper, we first study the mix of long- and short-range traffic patterns generated on-chip by graph workloads, and subsequently use the findings to adapt the design of an optimal NoC-based architecture. In particular, by leveraging emerging three-dimensional (3D) integration technology, we propose the design of a small-world NoC (SWNoC)-enabled manycore GPU architecture, where the placement of the links connecting the streaming multiprocessors (SMs) and the memory controllers (MCs) follows a power-law distribution. The proposed 3D manycore GPU architecture outperforms its traditional planar (2D) counterparts in both performance and energy consumption. Moreover, by adopting a joint performance-thermal optimization strategy, we address the thermal concerns of a 3D design without noticeably compromising the achievable performance. The 3D integration technology is also leveraged to incorporate Near-Data Processing (NDP) to complement the performance benefits introduced by the SWNoC architecture. As graph applications are inherently memory-intensive, off-chip data movement gives rise to latency and energy overheads in the presence of external DRAM. In conventional GPU architectures, as the main memory layer is not integrated with the logic, off-chip data movement negatively impacts overall performance and energy consumption. We demonstrate that NDP significantly reduces the overheads associated with such frequent and irregular memory accesses in graph-based applications. The proposed SWNoC-enabled NDP framework, which integrates 3D memory (like Micron's HMC) with a massive number of GPU cores, achieves 29.5% performance improvement and 30.03% less energy consumption on average compared to a conventional planar mesh-based design with external DRAM.
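As a rough illustration of the power-law link placement behind a small-world NoC (a sketch with an assumed exponent, not the exact SWNoC construction from the paper), link destinations can be sampled so that the probability of a link of length d decays as d^(-alpha), yielding mostly short links plus a few long-range shortcuts:

```python
import math
import random

def sample_link(src, routers, alpha=2.2):
    """Pick the far endpoint of one small-world link.

    Probability of linking src -> r decays as distance**(-alpha),
    so most links are short with occasional long-range shortcuts.
    (alpha = 2.2 is an assumed exponent, not taken from the paper.)
    """
    weights = [0.0 if r == src else math.dist(src, r) ** -alpha for r in routers]
    return random.choices(routers, weights=weights, k=1)[0]

# 4x4x2 grid of router coordinates where SMs and MCs are placed.
routers = [(x, y, z) for x in range(4) for y in range(4) for z in range(2)]
print(sample_link((0, 0, 0), routers))
```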
-
Today’s Deep Neural Network (DNN) inference systems contain hundreds of billions of parameters, resulting in significant latency and energy overheads during inference due to frequent data transfers between compute and memory units. Processing-in-Memory (PiM) has emerged as a viable solution to tackle this problem by avoiding the expensive data movement. PiM approaches based on electrical devices suffer from throughput and energy efficiency issues. In contrast, Optically-addressed Phase Change Memory (OPCM) operates with light and achieves much higher throughput and energy efficiency compared to its electrical counterparts. This paper introduces a system-level design that takes the OPCM programming overhead into consideration, and identifies that the programming cost dominates DNN inference on OPCM-based PiM architectures. We explore the design space of this system and identify the most energy-efficient OPCM array size and batch size. We propose a novel thresholding and reordering technique on the weight blocks to further reduce the programming overhead. Combining these optimizations, our approach achieves up to 65.2x higher throughput than existing photonic accelerators for practical DNN workloads.
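A toy sketch of the thresholding-and-reordering idea follows; the greedy similarity ordering, the L2 difference criterion, and the threshold value are all assumptions for illustration, not the paper's exact algorithm. The intent is to order weight blocks so consecutive blocks are similar, then skip reprogramming the OPCM array whenever the next block differs from the current one by less than the threshold:

```python
import numpy as np

def schedule_blocks(blocks, threshold=2.0):
    """Greedily order weight blocks by similarity, then count how many
    OPCM programming steps survive the thresholding rule.

    Greedy nearest-neighbor ordering, the L2 difference metric, and the
    threshold value are illustrative assumptions.
    """
    order, remaining = [0], set(range(1, len(blocks)))
    while remaining:
        last = blocks[order[-1]]
        nxt = min(remaining, key=lambda i: np.linalg.norm(blocks[i] - last))
        order.append(nxt)
        remaining.remove(nxt)
    programs = 1  # the first block is always programmed
    for prev, cur in zip(order, order[1:]):
        if np.linalg.norm(blocks[cur] - blocks[prev]) > threshold:
            programs += 1  # block differs too much: reprogram the array
    return order, programs

blocks = [np.random.rand(8, 8) for _ in range(16)]
order, programs = schedule_blocks(blocks)
print(f"programmed {programs}/{len(blocks)} blocks after reordering")
```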
-
Graph application workloads are dominated by random memory accesses with poor locality. To tackle the irregular and sparse nature of the computation, ReRAM-based Processing-in-Memory (PIM) architectures have been proposed recently. Most of these ReRAM architecture designs have focused on mapping graph computations onto a set of multiply-and-accumulate (MAC) operations. ReRAMs also offer a key advantage in reducing memory latency between cores and memory by allowing for processing-in-memory (PIM). However, when implemented on a ReRAM-based manycore architecture, graph applications still pose two key challenges: significant storage requirements (particularly due to wasted zero-cell storage) and a significant amount of on-chip traffic. To tackle these two challenges, in this paper we propose the design of a 3D NoC-enabled ReRAM-based manycore architecture. First, our proposed architecture incorporates a novel crossbar-aware node reordering to reduce ReRAM storage requirements. Second, its 3D NoC-enabled design reduces on-chip communication latency. Our architecture outperforms the state-of-the-art in ReRAM-based graph acceleration by up to 5x in performance while consuming up to 10.3x less energy for a range of graph inputs and workloads.
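A minimal sketch of why node reordering saves crossbar storage (degree sorting here is a simple stand-in for the paper's crossbar-aware heuristic): clustering the nonzeros of the adjacency matrix into fewer crossbar tiles means fewer tiles, and fewer wasted zero cells, must be mapped to ReRAM:

```python
import numpy as np

def nonzero_tiles(adj, tile=4):
    """Count crossbar tiles holding at least one nonzero;
    all-zero tiles need no ReRAM storage."""
    n, count = adj.shape[0], 0
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            if adj[i:i + tile, j:j + tile].any():
                count += 1
    return count

def degree_reorder(adj):
    """Sort nodes by descending degree so nonzeros cluster into fewer
    tiles. (A simple stand-in for the crossbar-aware heuristic.)"""
    order = np.argsort(-adj.sum(axis=1))
    return adj[np.ix_(order, order)]

adj = (np.random.rand(16, 16) < 0.15).astype(int)
print(nonzero_tiles(adj), "tiles before,",
      nonzero_tiles(degree_reorder(adj)), "tiles after reordering")
```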
-
Spiking Neural Networks (SNNs) are energy-efficient artificial neural network models that can carry out data-intensive applications. Energy consumption, latency, and memory bottlenecks are some of the major issues that arise in machine learning applications due to their data-demanding nature. Memristor-enabled Computing-In-Memory (CIM) architectures have been able to tackle the memory wall issue, eliminating the energy- and time-consuming movement of data. In this work, we develop a scalable CIM-based SNN architecture with our fabricated two-layer memristor crossbar array. In addition to having an enhanced heat dissipation capability, our memristor exhibits substantial improvements of 10% to 66% in design area, power, and latency compared to state-of-the-art memristors. This design incorporates an inter-spike interval (ISI) encoding scheme, chosen for its high information density, to convert the incoming input signals into spikes. Furthermore, we include a time-to-first-spike (TTFS) based output processing stage, chosen for its energy efficiency, to carry out the final classification. With the combination of ISI, CIM, and TTFS, this network has a competitive inference speed of 2μs/image and can successfully classify handwritten digits with 2.9mW of power and 2.51pJ of energy per spike. The proposed architecture with the ISI encoding scheme achieves ∼10% higher accuracy than other encoding schemes on the MNIST dataset.
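A small sketch of the two encoding stages (the timing constants and the linear intensity-to-interval map are assumptions, not the fabricated design's parameters): ISI encoding maps larger input values to shorter gaps between spikes, and a TTFS readout picks the output neuron that fires first:

```python
import numpy as np

def isi_encode(values, t_min=1.0, t_max=10.0):
    """Inter-spike-interval encoding: each input value maps to the gap
    between the pair of spikes that encodes it, with larger values
    giving shorter gaps (t_min/t_max are assumed timing constants)."""
    v = np.clip(values, 0.0, 1.0)
    return t_max - (t_max - t_min) * v  # interval per input

def ttfs_classify(first_spike_times):
    """Time-to-first-spike readout: the earliest-firing output neuron wins."""
    return int(np.argmin(first_spike_times))

print(isi_encode(np.array([0.0, 0.5, 1.0])))               # -> [10.   5.5  1. ]
print("class:", ttfs_classify(np.array([4.2, 1.7, 3.9])))  # -> class: 1
```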