PLA-index: A k-mer Index Exploiting Rank Curve Linearity

Abrar, Md Hasin; Medvedev, Paul

doi:10.4230/LIPIcs.WABI.2024.13

Given a sorted list of k-mers S, the rank curve of S is the function mapping a k-mer from the k-mer universe to the location in S where it either first appears or would be inserted. An exciting recent development is the observation that, for certain datasets, the rank curve is predictable and can be exploited to create small search indices. In this paper, we develop a novel search index that first estimates a k-mer’s rank using a piece-wise linear approximation of the rank curve and then does a local search to determine the precise location of the k-mer in the list. We combine ideas from previous approaches and supplement them with an innovative data representation strategy that substantially reduces space usage. Our PLA-index uses an order of magnitude less space than Sapling and less than half the space of the PGM-index, for roughly the same query time. For example, using only 9 MiB of memory, it can narrow down the position of a k-mer in the suffix array of the human genome to within 255 positions. Furthermore, we demonstrate the potential of our approach to impact a variety of downstream applications. First, the PLA-index halves the time of binary search on the suffix array of the human genome. Second, the PLA-index reduces the space of a direct-access lookup table by 76 percent, without increasing the run time. Third, we plug the PLA-index into the state-of-the-art read aligner Strobealign and replace a 2 GiB component with a PLA-index of size 1.5 MiB, without significantly affecting the run time. The software and reproducibility information are freely available at https://github.com/medvedevgroup/pla-index.
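The two-step query described in the abstract (estimate the rank from a piece-wise linear approximation, then search locally) can be illustrated with a minimal Python sketch. This is not the authors' implementation and does not reproduce the compact data representation of the PLA-index. The function names `encode_kmer`, `build_pla`, and `pla_rank`, the 2-bit encoding, the greedy "shrinking cone" segmentation, and the error bound `eps` are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right


def encode_kmer(kmer):
    """Pack a DNA k-mer into an integer, 2 bits per base (A < C < G < T)."""
    code = 0
    for base in kmer:
        code = (code << 2) | "ACGT".index(base)
    return code


def build_pla(keys, eps):
    """Greedily cover the rank curve of `keys` with line segments.

    `keys` is a sorted list of integer-encoded k-mers (duplicates allowed).
    Each segment starts at a point (key, first index of key) and has a slope
    chosen so that every distinct key in the segment is predicted to within
    `eps` positions, up to floating-point rounding.
    """
    # One point per distinct key: (key, index of its first occurrence).
    points = [(k, i) for i, k in enumerate(keys) if i == 0 or keys[i - 1] != k]
    start_keys, start_ranks, slopes = [], [], []
    i = 0
    while i < len(points):
        k0, r0 = points[i]
        lo_slope, hi_slope = 0.0, float("inf")
        j = i + 1
        while j < len(points):
            dk = points[j][0] - k0
            dr = points[j][1] - r0
            new_lo = max(lo_slope, (dr - eps) / dk)
            new_hi = min(hi_slope, (dr + eps) / dk)
            if new_lo > new_hi:  # the cone of feasible slopes is empty
                break
            lo_slope, hi_slope = new_lo, new_hi
            j += 1
        # A single-point segment has no upper slope limit; use slope 0.
        slope = 0.0 if hi_slope == float("inf") else (lo_slope + hi_slope) / 2
        start_keys.append(k0)
        start_ranks.append(r0)
        slopes.append(slope)
        i = j
    return start_keys, start_ranks, slopes


def pla_rank(keys, index, eps, x):
    """Return the rank of x: where it first appears or would be inserted."""
    start_keys, start_ranks, slopes = index
    n = len(keys)
    s = bisect_right(start_keys, x) - 1  # segment whose start key is <= x
    if s < 0:
        return 0  # x is smaller than every key
    pred = int(round(start_ranks[s] + slopes[s] * (x - start_keys[s])))
    pred = max(0, min(n, pred))
    lo, hi = max(0, pred - eps - 1), min(n, pred + eps + 1)
    # Safety net: widen the window if it does not bracket the answer, which
    # can happen for absent keys, duplicate runs, or rounding error.
    step = eps + 1
    while lo > 0 and keys[lo - 1] >= x:
        lo, step = max(0, lo - step), step * 2
    step = eps + 1
    while hi < n and keys[hi] < x:
        hi, step = min(n, hi + step), step * 2
    return bisect_left(keys, x, lo, hi)  # local search inside the window


# Toy usage on a handful of 4-mers (hypothetical data).
kmers = sorted(["ACGT", "ACGT", "AGGA", "CCTA", "GATC", "GATT", "TTAG"])
keys = [encode_kmer(k) for k in kmers]
EPS = 2
index = build_pla(keys, EPS)
print(pla_rank(keys, index, EPS, encode_kmer("GATC")))
```

In this sketch, `eps` plays the role of the "within 255 positions" guarantee mentioned in the abstract: a larger `eps` yields fewer segments and a smaller index, at the cost of a wider local search.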

Award ID(s): 2138585; 1931531

PAR ID: 10616430

Author(s) / Creator(s): Abrar, Md Hasin; Medvedev, Paul

Editor(s): Pissis, Solon P; Sung, Wing-Kin

Publisher / Repository: Schloss Dagstuhl – Leibniz-Zentrum für Informatik

Date Published: 2024-01-01

Volume: 312

ISSN: 1868-8969

ISBN: 978-3-95977-340-9

Page Range / eLocation ID: 13:1-13:18

Subject(s) / Keyword(s): K-mer index; Piece-wise linear approximation; Learned index; Applied computing → Bioinformatics; Applied computing → Computational biology; Theory of computation → Data structures design and analysis

Format(s): Medium: X; Size: 18 pages, 2513915 bytes; Other: application/pdf

Size(s): 18 pages; 2513915 bytes

Right(s): Creative Commons Attribution 4.0 International license; info:eu-repo/semantics/openAccess

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper: https://doi.org/10.4230/LIPIcs.WABI.2024.13
