skip to main content

Title: EXMA: A Genomics Accelerator for Exact-Matching
Genomics is the foundation of precision medicine, global food security and virus surveillance. Exact-match is one of the most essential operations widely used in almost every step of genomics such as alignment, assembly, annotation, and compression. Modern genomics adopts Ferragina-Manzini Index (FMIndex) augmenting space-efficient Burrows-Wheeler transform (BWT) with additional data structures to permit ultra-fast exact-match operations. However, FM-Index is notorious for its poor spatial locality and random memory access pattern. Prior works create GPU-, FPGA-, ASIC- and even process-in-memory (PIM)based accelerators to boost FM-Index search throughput. Though they achieve the state-of-the-art FM-Index search throughput, the same as all prior conventional accelerators, FM-Index PIMs process only one DNA symbol after each DRAM row activation, thereby suffering from poor memory bandwidth utilization. In this paper, we propose a hardware accelerator, EXMA, to enhance FM-Index search throughput. We first create a novel EXMA table with a multi-task-learning (MTL)-based index to process multiple DNA symbols with each DRAM row activation. We then build an accelerator to search over an EXMA table. We propose 2-stage scheduling to increase the cache hit rate of our accelerator. We introduce dynamic page policy to improve the row buffer hit rate of DRAM main memory. We also present CHAIN more » compression to reduce the data structure size of EXMA tables. Compared to state-of-the-art FM-Index PIMs, EXMA improves search throughput by 4.9 ×, and enhances search throughput per Watt by 4.8×. « less
Award ID(s):
1908992 2105972
Publication Date:
Journal Name:
IEEE International Symposium on High-Performance Computer Architecture
Page Range or eLocation-ID:
399 to 411
Sponsoring Org:
National Science Foundation
More Like this
  1. Genomics is the critical key to enabling precision medicine, ensuring global food security and enforcing wildlife conservation. The massive genomic data produced by various genome sequencing technologies presents a significant challenge for genome analysis. Because of errors from sequencing machines and genetic variations, approximate pattern matching (APM) is a must for practical genome analysis. Recent work proposes FPGA, ASIC and even process-in-memory-based accelerators to boost the APM throughput by accelerating dynamic-programming-based algorithms (e.g., Smith-Waterman). However, existing accelerators lack the efficient hardware acceleration for the exact pattern matching (EPM) that is an even more critical and essential function widely used in almost every step of genome analysis including assembly, alignment, annotation and compression. State-of-the-art genome analysis adopts the FM-Index that augments the space-efficient BWT with additional data structures permitting fast EPM operations. But the FM-Index is notorious for poor spatial locality and massive random memory accesses. In this paper, we propose a ReRAM-based process-in-memory architecture, FindeR, to enhance the FM-Index EPM search throughput in genomic sequences. We build a reliable and energy-efficient Hamming distance unit to accelerate the computing kernel of FM-Index search using commodity ReRAM chips without introducing extra CMOS logic. We further architect a full-fledged FM-Index search pipeline andmore »improve its search throughput by lightweight scheduling on the NVDIMM. We also create a system library for programmers to invoke FindeR to perform EPMs in genome analysis. Compared to state-of-the-art accelerators, FindeR improves the FM-Index search throughput by 83% ~ 30K× and throughput per Watt by 3.5×~42.5K×.« less
  2. Nanopore genome sequencing is the key to enabling personalized medicine, global food security, and virus surveillance. The state-of-the-art base-callers adopt deep neural networks (DNNs) to translate electrical signals generated by nanopore sequencers to digital DNA symbols. A DNN-based base-caller consumes 44.5% of total execution time of a nanopore sequencing pipeline. However, it is difficult to quantize a base-caller and build a power-efficient processing-in-memory (PIM) to run the quantized base-caller. Although conventional network quantization techniques reduce the computing overhead of a base-caller by replacing floating-point multiply-accumulations by cheaper fixed-point operations, it significantly increases the number of systematic errors that cannot be corrected by read votes. The power density of prior nonvolatile memory (NVM)-based PIMs has already exceeded memory thermal tolerance even with active heat sinks, because their power efficiency is severely limited by analog-to-digital converters (ADC). Finally, Connectionist Temporal Classification (CTC) decoding and read voting cost 53.7% of total execution time in a quantized base-caller, and thus became its new bottleneck. In this paper, we propose a novel algorithm/architecture co-designed PIM, Helix, to power-efficiently and accurately accelerate nanopore base-calling. From algorithm perspective, we present systematic error aware training to minimize the number of systematic errors in a quantized base-caller. From architecturemore »perspective, we propose a low-power SOT-MRAM-based ADC array to process analog-to-digital conversion operations and improve power efficiency of prior DNN PIMs. Moreover, we revised a traditional NVM-based dot-product engine to accelerate CTC decoding operations, and create a SOT-MRAM binary comparator array to process read voting. Compared to state-of-the-art PIMs, Helix improves base-calling throughput by 6x, throughput per Watt by 11.9x and per mm2 by 7.5x without degrading base-calling accuracy.« less
  3. In this paper, we propose ReDRAM, as a reconfigurable DRAM-based processing-in-memory (PIM) accelerator, which transforms current DRAM architecture to massively parallel computational units exploiting the high internal bandwidth of modern memory chips. ReDRAM uses the analog operation of DRAM sub-arrays and elevates it to implement a full set of 1- and 2-input bulk bit-wise operations (NOT, (N)AND, (N)OR, and even X(N)OR) between operands stored in the same bit-line, based on a new dual-row activation mechanism with a modest change to peripheral circuits such sense amplifiers. ReDRAM can be leveraged to greatly reduce energy consumption and latency of complex in-DRAM logic computations relying on state-of-the-art mechanisms based on triple-row activation, dual-contact cells, row initialization, NOR style, etc. The extensive circuit-architecture simulations show that ReDRAM achieves on average 54× and 7.1× higher throughput for performing bulk bit-wise operations compared with CPU and GPU, respectively. Besides, ReDRAM outperforms recent processing-in-DRAM platforms with up to 3.7× better performance.
  4. Classified as a complex big data analytics problem, DNA short read alignment serves as a major sequential bottleneck to massive amounts of data generated by next-generation sequencing platforms. With Von-Neumann computing architectures struggling to address such computationally-expensive and memory-intensive task today, Processing-in-Memory (PIM) platforms are gaining growing interests. In this paper, an energy-efficient and parallel PIM accelerator (AlignS) is proposed to execute DNA short read alignment based on an optimized and hardware-friendly alignment algorithm. We first develop AlignS platform that harnesses SOT-MRAM as computational memory and transforms it to a fundamental processing unit for short read alignment. Accordingly, we present a novel, customized, highly parallel read alignment algorithm that only seeks the proposed simple and parallel in-memory operations (i.e. comparisons and additions). AlignS is then optimized through a new correlated data partitioning and mapping methodology that allows local storage and processing of DNA sequence to fully exploit the algorithm-level's parallelism, and to accelerate both exact and inexact matches. The device-to-architecture co-simulation results show that AlignS improves the short read alignment throughput per Watt per mm^2 by ~12X compared to the ASIC accelerator. Compared to recent FM-index-based ReRAM platform, AlignS achieves 1.6X higher throughput per Watt.
  5. Technological advances in long read sequences have greatly facilitated the development of genomics. However, managing and analyzing the raw genomic data that outpaces Moore's Law requires extremely high computational efficiency. On the one hand, existing software solutions can take hundreds of CPU hours to complete human genome alignment. On the other hand, the recently proposed hardware platforms achieve low processing throughput with significant overhead. In this paper, we propose PARC, an Processing-in-Memory architecture for long read pairwise alignment leveraging emerging resistive CAM (content-addressable memory) to accelerate the bottleneck chaining step in DNA alignment. Chaining takes 2-tuple anchors as inputs and identifies a set of correlated anchors as potential alignment candidates. Unlike traditional main memory which organizes relational data structure in a linear address space, PARC stores tuples in two neighboring crossbar arrays with shared row decoder such that column-wise in-memory computational operations and row-wise memory accesses can be performed in-situ in a symmetric crossbar structure. Compared to both software tools and state-of-the-art accelerators, PARC shows significant improvement in alignment throughput and energy efficiency, thanks to the in-site computation capability and optimized data mapping.