skip to main content


Title: FlexiDRAM: A Flexible in-DRAM Framework to Enable Parallel General-Purpose Computation
In this paper, we propose a Flexible processing-in-DRAM framework named FlexiDRAM that supports the efficient implementation of complex bulk bitwise operations. This framework is developed on top of a new reconfigurable in-DRAM accelerator that leverages the analog operation of DRAM sub-arrays and elevates it to implement XOR2-MAJ3 operations between operands stored in the same bit-line. FlexiDRAM first generates an efficient XOR-MAJ representation of the desired logic and then appropriately allocates DRAM rows to the operands to execute any in-DRAM computation. We develop ISA and software support required to compute in-DRAM operation. FlexiDRAM transforms current memory architecture to a massively parallel computational unit and can be leveraged to significantly reduce the latency and energy consumption of complex workloads. Our extensive circuit-to-architecture simulation results show that averaged across two well-known deep learning workloads, FlexiDRAM achieves ∼15× energy-saving and 13× speedup over the GPU outperforming recent processing-in-DRAM platforms.  more » « less
Award ID(s):
1710009
NSF-PAR ID:
10351290
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), Boston, MA, USA, August 1-3, 2022.
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In this paper, we propose ReDRAM, as a reconfigurable DRAM-based processing-in-memory (PIM) accelerator, which transforms current DRAM architecture to massively parallel computational units exploiting the high internal bandwidth of modern memory chips. ReDRAM uses the analog operation of DRAM sub-arrays and elevates it to implement a full set of 1- and 2-input bulk bit-wise operations (NOT, (N)AND, (N)OR, and even X(N)OR) between operands stored in the same bit-line, based on a new dual-row activation mechanism with a modest change to peripheral circuits such sense amplifiers. ReDRAM can be leveraged to greatly reduce energy consumption and latency of complex in-DRAM logic computations relying on state-of-the-art mechanisms based on triple-row activation, dual-contact cells, row initialization, NOR style, etc. The extensive circuit-architecture simulations show that ReDRAM achieves on average 54× and 7.1× higher throughput for performing bulk bit-wise operations compared with CPU and GPU, respectively. Besides, ReDRAM outperforms recent processing-in-DRAM platforms with up to 3.7× better performance. 
    more » « less
  2. With the end of Dennard scaling, power constraints have led to increasing compute specialization in the form of differently specialized accelerators integrated at various levels of the general-purpose system hierarchy. The result is that the most common general-purpose computing platform is now a heterogeneous mix of architectures even within a single die. Consequently, mapping application code regions into available execution engines has become a challenge due to different interfaces and increased software complexity. At the same time, the energy costs of data movement have become increasingly dominant relative to computation energy. This has inspired a move towards data-centric systems, where computation is brought to data, in contrast to traditional processing-centric models. However, enabling compute nearer memory entails its own challenges, including the interactions between distance-specialization and compute-specialization. The granularity of any offload to near(er) memory logic would impact the potential data transmission reduction, as smaller offloads will not be able to amortize the transmission costs of invocation and data return, while very large offloads can only be mapped onto logic that can support all of the necessary operations within kernel-scale codes, which exacerbates both area and power constraints. For better energy efficiency, each set of related operations should be mapped onto the execution engine that, among those capable of running the set of operations, best balances the data movement and the degree of compute specialization of that engine for this code. Further, this offload should proceed in a decentralized way that keeps both the data and control movement low for all transitions among engines and transmissions of operands and results. To enable such a decentralized offload model, we propose an architecture interface that enables a common offload model for accelerators across the memory hierarchy and a tool chain to automatically identify (in a distance-aware fashion) and map profitable code regions on specialized execution engines. We evaluate the proposed architecture for a wide range of workloads and show energy reduction compared to an energy-efficient in-order core. We also demonstrate better area efficiency compared to kernel-scale offloads. 
    more » « less
  3. Latest algorithmic development has brought competitive classification accuracy for neural networks despite constraining the network parameters to ternary or binary representations. These findings show significant optimization opportunities to replace computationally-intensive convolution operations (based on multiplication) with more efficient and less complex operations such as addition. In hardware implementation domain, processing-in-memory architecture is becoming a promising solution to alleviate enormous energy-hungry data communication between memory and processing units, bringing considerable improvement for system performance and energy efficiency while running such large networks. In this paper, we review several of our recent works regarding Processing-in-Memory (PIM) accelerator based on Magnetic Random Access Memory computational sub-arrays to accelerate the inference mode of quantized neural networks using digital non-volatile memory rather than using analog crossbar operation. In this way, we investigate the performance of two distinct in-memory addition schemes compared to other digital methods based on processing-in-DRAM/GPU/ASIC design to tackle DNN power and memory wall bottleneck. 
    more » « less
  4. In this work, we propose a Parallel Processing-In-DRAM architecture named P-PIM leveraging the high density of DRAM to enable fast and flexible computation. P-PIM enables bulk bit-wise in-DRAM logic between operands in the same bit-line by elevating the analog operation of the memory sub-array based on a novel dual-row activation mechanism. With this, P-PIM can opportunistically perform a complete and inexpensive in-DRAM RowHammer (RH) self-tracking and mitigation technique to protect the memory unit against such a challenging security vulnerability. Our results show that P-PIM achieves ~72% higher energy efficiency than the fastest charge-sharing-based designs. As for the RH protection, with a worst-case slowdown of ~0.8%, P-PIM archives up to 71% energy-saving over the SRAM/CAM-based frameworks and about 90% saving over DRAM-based frameworks. 
    more » « less
  5. In this paper, we propose MRIMA, as a novel MRAM-based In-Memory Accelerator for non-volatile, flexible, and efficient in-memory computing. MRIMA transforms current Spin Transfer Torque Magnetic Random Access Memory (STT-MRAM) arrays to massively parallel computational units capable of working as both non-volatile memory and in-memory logic. Instead of integrating complex logic units in cost-sensitive memory, MRIMA exploits hardware-friendly bit-line computing methods to implement complete Boolean logic functions between operands within a memory array in a single clock cycle, overcoming the multi-cycle logic issue in contemporary Processing-In-Memory (PIM) platforms. We present practical case studies to demonstrate MRIMA’s acceleration for binary-weight and low bit-width Convolutional Neural Networks (CNN) as well as data encryption. Our device-to-architecture co-simulation results on CNN acceleration demonstrate that MRIMA can obtain 1.7× better energy-efficiency and 11.2× speed-up compared to ASICs, and, 1.8× better energy-efficiency and 2.4× speed-up over the best DRAM-based PIM solutions. As an AES in-memory encryption engine, MRIMA shows 77% and 21% lower energy consumption compared to CMOS-ASIC and recent domain wall-based design, respectively. 
    more » « less