skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: PLA-index: A k-mer Index Exploiting Rank Curve Linearity
Given a sorted list of k-mers S, the rank curve of S is the function mapping a k-mer from the k-mer universe to the location in S where it either first appears or would be inserted. An exciting recent development is the observation that, for certain datasets, the rank curve is predictable and can be exploited to create small search indices. In this paper, we develop a novel search index that first estimates a k-mer’s rank using a piece-wise linear approximation of the rank curve and then does a local search to determine the precise location of the k-mer in the list. We combine ideas from previous approaches and supplement them with an innovative data representation strategy that substantially reduces space usage. Our PLA-index uses an order of magnitude less space than Sapling and uses less than half the space of the PGM-index, for roughly the same query time. For example, using only 9 MiB of memory, it can narrow down the position of k-mer in the suffix array of the human genome to within 255 positions. Furthermore, we demonstrate the potential of our approach to impact a variety of downstream applications. First, the PLA-index halves the time of binary search on the suffix array of the human genome. Second, the PLA-index reduces the space of a direct-access lookup table by 76 percent, without increasing the run time. Third, we plug the PLA-index into a state-of-the-art read aligner Strobealign and replace a 2 GiB component with a PLA-index of size 1.5 MiB, without significantly effecting runtime. The software and reproducibility information is freely available at https://github.com/medvedevgroup/pla-index.  more » « less
Award ID(s):
2138585 1931531
PAR ID:
10616430
Author(s) / Creator(s):
;
Editor(s):
Pissis, Solon P; Sung, Wing-Kin
Publisher / Repository:
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Date Published:
Volume:
312
ISSN:
1868-8969
ISBN:
978-3-95977-340-9
Page Range / eLocation ID:
13:1-13:18
Subject(s) / Keyword(s):
K-mer index Piece-wise linear approximation Learned index Applied computing → Bioinformatics Applied computing → Computational biology Theory of computation → Data structures design and analysis
Format(s):
Medium: X Size: 18 pages; 2513915 bytes Other: application/pdf
Size(s):
18 pages 2513915 bytes
Right(s):
Creative Commons Attribution 4.0 International license; info:eu-repo/semantics/openAccess
Sponsoring Org:
National Science Foundation
More Like this
  1. In recent years, pangenomes received increasing attention from the scientific community for their ability to incorporate population variation information and alleviate reference genome bias. Maximal Exact Matches (MEMs) and Maximal Unique Matches (MUMs) have proven themselves to be useful in multiple bioinformatic contexts, for example short-read alignment and multiple-genome alignment. However, standard techniques using suffix trees and FM-indexes do not scale to a pangenomic level. Recently, Gagie et al. [JACM 20] introduced the r-index that is a Burrows-Wheeler Transform (BWT)-based index able to handle hundreds of human genomes. Later, Rossi et al. [JCB 22] enabled the computation of MEMs using the r-index, and Boucher et al. [DCC 21] showed how to compute them in a streaming fashion. In this paper, we show how to augment Boucher et al.'s approach to enable the computation of MUMs on the r-index, while preserving the space and time bounds. We add additional O(r) samples of the longest common prefix (LCP) array, where r is the number of equal-letter runs of the BWT, that permits the computation of the second longest match of the pattern suffix with respect to the input text, which in turn allows the computation of candidate MUMs. We implemented a proof-of-concept of our approach, that we call mum-phinder, and tested on real-world datasets. We compared our approach with competing methods that are able to compute MUMs. We observe that our method is up to 8 times smaller, while up to 19 times slower when the dataset is not highly repetitive, while on highly repetitive data, our method is up to 6.5 times slower and uses up to 25 times less memory. 
    more » « less
  2. Belazzougui, Djamal; Ouangraoua, Aïda (Ed.)
    The problem of sequence identification or matching - determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence - is relevant for many important tasks in Computational Biology, such as metagenomics and pan-genome analysis. Due to the complex nature of such analyses and the large scale of the reference collections a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe how recent advancements in associative, order-preserving, k-mer dictionaries can be combined with a compressed inverted index to implement a fast and compact colored de Bruijn graph data structure. This index takes full advantage of the fact that unitigs in the colored de Bruijn graph are monochromatic (all k-mers in a unitig have the same set of references of origin, or "color"), leveraging the order-preserving property of its dictionary. In fact, k-mers are kept in unitig order by the dictionary, thereby allowing for the encoding of the map from k-mers to their inverted lists in as little as 1+o(1) bits per unitig. Hence, one inverted list per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for inverted lists, the index achieves very small space. We implement these methods in a tool called Fulgor. Compared to Themisto, the prior state of the art, Fulgor indexes a heterogeneous collection of 30,691 bacterial genomes in 3.8× less space, a collection of 150,000 Salmonella enterica genomes in approximately 2× less space, is at least twice as fast for color queries, and is 2-6 × faster to construct. 
    more » « less
  3. Mikolaj Bojanczyk; Emanuela Merelli; David P. Woodruff (Ed.)
    Two equal length strings are a parameterized match (p-match) iff there exists a one-to-one function that renames the symbols in one string to those in the other. The Parameterized Suffix Tree (PST) [Baker, STOC' 93] is a fundamental data structure that handles various string matching problems under this setting. The PST of a text T[1,n] over an alphabet Σ of size σ takes O(nlog n) bits of space. It can report any entry in (parameterized) (i) suffix array, (ii) inverse suffix array, and (iii) longest common prefix (LCP) array in O(1) time. Given any pattern P as a query, a position i in T is an occurrence iff T[i,i+|P|-1] and P are a p-match. The PST can count the number of occurrences of P in T in time O(|P|log σ) and then report each occurrence in time proportional to that of accessing a suffix array entry. An important question is, can we obtain a compressed version of PST that takes space close to the text’s size of nlogσ bits and still support all three functionalities mentioned earlier? In SODA' 17, Ganguly et al. answered this question partially by presenting an O(nlogσ) bit index that can support (parameterized) suffix array and inverse suffix array operations in O(log n) time. However, the compression of the (parameterized) LCP array and the possibility of faster suffix array and inverse suffix array queries in compact space were left open. In this work, we obtain a compact representation of the (parameterized) LCP array. With this result, in conjunction with three new (parameterized) suffix array representations, we obtain the first set of PST representations in o(nlog n) bits (when logσ = o(log n)) as follows. Here ε > 0 is an arbitrarily small constant. - Space O(n logσ) bits and query time O(log_σ^ε n); - Space O(n logσlog log_σ n) bits and query time O(log log_σ n); and - Space O(n logσ log^ε_σ n) bits and query time O(1). The first trade-off is an improvement over Ganguly et al.’s result, whereas our third trade-off matches the optimal time performance of Baker’s PST while squeezing the space by a factor roughly log_σ n. We highlight that our trade-offs match the space-and-time bounds of the best-known compressed text indexes for exact pattern matching and further improvement is highly unlikely. 
    more » « less
  4. Abstract MotivationK-mer-based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis where datasets are large. Here, we introduce a hashing-based technique that leverages a kind of bottom-m sketch as well as a k-mer ternary search tree (KTST) to obtain k-mer-based similarity estimates for a range of k values. By truncating k-mers stored in a pre-built KTST with a large k=kmax value, we can simultaneously obtain k-mer-based estimates for all k values up to kmax. This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time and space-efficient. ResultsWe derived the theoretical expression of the bias factor due to truncation. And we showed that the biases are negligible in practice: when using a KTST to estimate the containment index between a RefSeq-based microbial reference database and simulated metagenome data for 10 values of k, the running time was close to 10× faster compared to a classic MinHash approach while using less than one-fifth the space to store the data structure. Availability and implementationA python implementation of this method, CMash, is available at https://github.com/dkoslicki/CMash. The reproduction of all experiments presented herein can be accessed via https://github.com/KoslickiLab/CMASH-reproducibles. Supplementary informationSupplementary data are available at Bioinformatics online. 
    more » « less
  5. Abstract The problem of sequence identification or matching—determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence—is relevant for many important tasks in Computational Biology, such as metagenomics and pangenome analysis. Due to the complex nature of such analyses and the large scale of the reference collections a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe an efficientcolored de Bruijngraph index, arising as the combination of ak-mer dictionary with a compressed inverted index. The proposed index takes full advantage of the fact that unitigs in the colored compacted de Bruijn graph aremonochromatic(i.e., allk-mers in a unitig have the same set of references of origin, orcolor). Specifically, the unitigs are kept in the dictionary in color order, thereby allowing for the encoding of the map fromk-mers to their colors in as little as 1 +o(1) bits per unitig. Hence, one color per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for integer lists, the index achieves very small space. We implement these methods in a tool called , and conduct an extensive experimental analysis to demonstrate the improvement of our tool over previous solutions. For example, compared to —the strongest competitor in terms of index space vs. query time trade-off— requires significantly less space (up to 43% less space for a collection of 150,000Salmonella entericagenomes), is at least twice as fast for color queries, and is 2–6$$\times$$ × faster to construct. 
    more » « less