Efficiency of Learned Indexes on Genome Spectra

Abrar, Md Hasin; Medvedev, Paul; Vinciguerra, Giorgio

doi:10.4230/lipics.esa.2025.18


Data structures on a multiset of genomic k-mers are at the heart of many bioinformatic tools. As genomic datasets grow in scale, the efficiency of these data structures increasingly depends on how well they leverage the inherent patterns in the data. One recent and effective approach is the use of learned indexes that approximate the rank function of a multiset using a piecewise linear function with very few segments. However, theoretical worst-case analysis struggles to predict the practical performance of these indexes. We address this limitation by developing a novel measure of piecewise-linear approximability of the data, called CaPLa (Canonical Piecewise Linear approximability). CaPLa builds on the empirical observation that a power-law model often serves as a reasonable proxy for piecewise-linear approximability, while explicitly accounting for deviations from a true power-law fit. We prove basic properties of CaPLa and present an efficient algorithm to compute it. We then demonstrate that CaPLa can accurately predict space bounds for data structures on real data. Empirically, we analyze over 500 genomes through the lens of CaPLa, revealing that it varies widely across the tree of life and even within individual genomes. Finally, we study the robustness of CaPLa as a measure and the factors that make genomic k-mer multisets different from random ones.
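
To make the central object of the abstract concrete, the following is a minimal sketch of approximating the rank function of a k-mer multiset with a piecewise linear function under a maximum error eps. It is not the paper's CaPLa algorithm, and it is not the optimal segmentation used by production learned indexes: it uses a simple greedy "shrinking cone" heuristic, which guarantees the error bound but may produce more segments than necessary. The 2-bit k-mer encoding and all function names (`encode_kmer`, `kmer_spectrum`, `rank_points`, `shrinking_cone_pla`, `predict`) are illustrative assumptions, not taken from the paper.

```python
from bisect import bisect_right

# Assumed encoding (not from the paper): 2 bits per base, A < C < G < T.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer):
    """Encode a DNA k-mer as an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = value * 4 + BASE_CODE[base]
    return value


def kmer_spectrum(sequence, k):
    """Return the multiset (as a list) of encoded k-mers of a sequence."""
    return [encode_kmer(sequence[i:i + k]) for i in range(len(sequence) - k + 1)]


def rank_points(multiset):
    """Return (key, rank) pairs for the distinct keys of a multiset.

    rank = number of elements strictly smaller than the key, i.e. the
    position of the key's first occurrence in sorted order.
    """
    data = sorted(multiset)
    return [(x, i) for i, x in enumerate(data) if i == 0 or x != data[i - 1]]


def shrinking_cone_pla(points, eps):
    """Greedy piecewise linear approximation with maximum error eps.

    Each segment is anchored at its first point (x0, y0). The interval
    [lo, hi] holds every slope that keeps all points covered so far within
    eps of the line; a new segment starts once the interval becomes empty.
    Returns a list of (x0, y0, slope) triples.
    """
    def pick_slope(lo, hi):
        # A segment covering a single point has an unconstrained slope.
        return 0.0 if lo == float("-inf") else (lo + hi) / 2

    segments = []
    x0, y0 = points[0]
    lo, hi = float("-inf"), float("inf")
    for x, y in points[1:]:
        dx = x - x0  # positive, because keys are distinct and sorted
        new_lo = max(lo, (y - eps - y0) / dx)
        new_hi = min(hi, (y + eps - y0) / dx)
        if new_lo <= new_hi:
            lo, hi = new_lo, new_hi
        else:
            segments.append((x0, y0, pick_slope(lo, hi)))
            x0, y0 = x, y
            lo, hi = float("-inf"), float("inf")
    segments.append((x0, y0, pick_slope(lo, hi)))
    return segments


def predict(segments, key):
    """Approximate the rank of key using the segment that covers it."""
    starts = [s[0] for s in segments]
    i = max(bisect_right(starts, key) - 1, 0)
    x0, y0, slope = segments[i]
    return y0 + slope * (key - x0)


if __name__ == "__main__":
    spectrum = kmer_spectrum("ACGTACGTTTGACCAGGATTACA", 3)
    points = rank_points(spectrum)
    segments = shrinking_cone_pla(points, eps=2)
    print(len(points), "distinct k-mers covered by", len(segments), "segments")
```

A learned index would store only the segments and correct each prediction with a local search over at most about 2 * eps positions, so the number of segments needed at a given eps drives the space usage. How that count behaves as eps varies is the kind of question a measure of piecewise-linear approximability addresses.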

Award ID(s): 2138585; 1931531

PAR ID: 10664668

Author(s) / Creator(s): Abrar, Md Hasin; Medvedev, Paul; Vinciguerra, Giorgio

Editor(s): Benoit, Anne; Kaplan, Haim; Wild, Sebastian; Herman, Grzegorz

Publisher / Repository: Schloss Dagstuhl – Leibniz-Zentrum für Informatik

Date Published: 2025-10-01

Volume: 351

ISSN: 1868-8969

Page Range / eLocation ID: 18:1-18:18

Subject(s) / Keyword(s): Genome spectra; piecewise linear approximation; learned index; k-mers; Applied computing → Bioinformatics; Applied computing → Computational biology; Theory of computation → Data structures design and analysis

Format(s): Medium: X; Size: 18 pages, 1706856 bytes; Other: application/pdf

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text (Accepted Manuscript, Conference Paper): https://doi.org/10.4230/lipics.esa.2025.18
