In this paper, for the first time, we propose a high-throughput and energy-efficient Processing-in-DRAM-accelerated genome assembler called PIM-Assembler based on an optimized and hardware-friendly genome assembly algorithm. PIM-Assembler can assemble large-scale DNA sequence dataset from all-pair overlaps. We first develop PIM-Assembler platform that harnesses DRAM as computational memory and transforms it to a fundamental processing unit for genome assembly. PIM-Assembler can perform efficient X(N)OR-based operations inside DRAM incurring low cost on top of commodity DRAM designs (~5% of chip area). PIM-Assembler is then optimized through a correlated data partitioning and mapping methodology that allows local storage and processing of DNA short reads to fully exploit the genome assembly algorithm-level's parallelism. The simulation results show that PIM-Assembler achieves on average 8.4× and 2.3 wise× higher throughput for performing bulk bit-XNOR-based comparison operations compared with CPU and recent processing-in-DRAM platforms, respectively. As for comparison/addition-extensive genome assembly application, it reduces the execution time and power by ~5× and ~ 7.5× compared to GPU.
ReDRAM: A Reconfigurable Processing-in-DRAM Platform for Accelerating Bulk Bit-Wise Operations
In this paper, we propose ReDRAM, as a reconfigurable DRAM-based processing-in-memory (PIM) accelerator, which transforms current DRAM architecture to massively parallel computational units exploiting the high internal bandwidth of modern memory chips. ReDRAM uses the analog operation of DRAM sub-arrays and elevates it to implement a full set of 1- and 2-input bulk bit-wise operations (NOT, (N)AND, (N)OR, and even X(N)OR) between operands stored in the same bit-line, based on a new dual-row activation mechanism with a modest change to peripheral circuits such sense amplifiers. ReDRAM can be leveraged to greatly reduce energy consumption and latency of complex in-DRAM logic computations relying on state-of-the-art mechanisms based on triple-row activation, dual-contact cells, row initialization, NOR style, etc. The extensive circuit-architecture simulations show that ReDRAM achieves on average 54× and 7.1× higher throughput for performing bulk bit-wise operations compared with CPU and GPU, respectively. Besides, ReDRAM outperforms recent processing-in-DRAM platforms with up to 3.7× better performance.
- Publication Date:
- NSF-PAR ID:
- Journal Name:
- 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)
- Page Range or eLocation-ID:
- 1 to 8
- Sponsoring Org:
- National Science Foundation
More Like this
Recently, in-DRAM computing is becoming one promising technique to address the notorious ‘memory-wall’ issue for big data processing. In this work, for the first time, we propose a novel ‘Min/Max-in-memory’ algorithm based on iterative XNOR bit-wise comparison, which supports parallel inmemory searching for minimum and maximum of bulk data stored in DRAM as unsigned & signed integers, fixed-point and floating numbers. We then develop a new processing-in-DRAM architecture, called Max-PIM, that supports complete bit-wise Boolean logic and beyond. Differentiating from prior works, Max-PIM is optimized with one-cycle fast XNOR logicin-DRAM operation and in-memory data transpose, which are heavily used and keys to accelerate the proposed Min/Max-in-memory algorithm efficiently. Extensive experiments of utilizing Max-PIM in big data sorting and graph processing applications show that it could speed up ~ 50X and ~ 1000X than GPU and CPU, while only consuming 10% and 1% energy, respectively. Moreover, comparing with recent representative In-DRAM computing platforms, i.e., Ambit , DRISA , our design could speed up ~ 3X - 10X.
In this paper we propose a Highly Flexible InMemory (HieIM) computing platform using STT MRAM, which can be leveraged to implement Boolean logic functions without sacrificing memory functionality. It could pre-process data within memory to further reduce power hungry long distance communication between memory and processing units as in Von-Neumann computing system. HieIM can implement all the Boolean logic functions (AND/NAND, OR/NOR, XOR/XNOR) between any two cells in the same memory array, thus overcoming the `operand locality' problem in contemporary in-memory computing platform designs. To investigate the performance of HieIM, we test in-memory bulk bit-wise Boolean logic operations using different vector datasets, which shows ~ 8x energy saving and ~ 5x speedup compared to recent DRAM based in-memory computing platform. We further implement an in-memory data encryption engine design based on HieIM as another case study. With AES algorithm, it shows 51.5% and 68.9% lower energy consumption compared to CMOS-ASIC and CMOL based implementations, respectively.
Nowadays, research topics on AI accelerator designs have attracted great interest, where accelerating Deep Neural Network (DNN) using Processing-in-Memory (PIM) platforms is an actively-explored direction with great potential. PIM platforms, which simultaneously aims to address power- and memory-wall bottlenecks, have shown orders of performance enhancement in comparison to the conventional computing platforms with Von-Neumann architecture. As one direction of accelerating DNN in PIM, resistive memory array (aka. crossbar) has drawn great research interest owing to its analog current-mode weighted summation operation which intrinsically matches the dominant Multiplication-and-Accumulation (MAC) operation in DNN, making it one of the most promising candidates. An alternative direction for PIM-based DNN acceleration is through bulk bit-wise logic operations directly performed on the content in digital memories. Thanks to the high fault-tolerant characteristic of DNN, the latest algorithmic progression successfully quantized DNN parameters to low bit-width representations, while maintaining competitive accuracy levels. Such DNN quantization techniques essentially convert MAC operation to much simpler addition/subtraction or comparison operations, which can be performed by bulk bit-wise logic operations in a highly parallel fashion. In this paper, we build a comprehensive evaluation framework to quantitatively compare and analyze aforementioned PIM based analog and digital approaches for DNN acceleration.
In this paper, we propose a Flexible processing-in-DRAM framework named FlexiDRAM that supports the efficient implementation of complex bulk bitwise operations. This framework is developed on top of a new reconfigurable in-DRAM accelerator that leverages the analog operation of DRAM sub-arrays and elevates it to implement XOR2-MAJ3 operations between operands stored in the same bit-line. FlexiDRAM first generates an efficient XOR-MAJ representation of the desired logic and then appropriately allocates DRAM rows to the operands to execute any in-DRAM computation. We develop ISA and software support required to compute in-DRAM operation. FlexiDRAM transforms current memory architecture to a massively parallel computational unit and can be leveraged to significantly reduce the latency and energy consumption of complex workloads. Our extensive circuit-to-architecture simulation results show that averaged across two well-known deep learning workloads, FlexiDRAM achieves ∼15× energy-saving and 13× speedup over the GPU outperforming recent processing-in-DRAM platforms.