Title: Solving the Minimal Positional Substring Cover Problem in Sublinear Space
Within the field of haplotype analysis, the Positional Burrows-Wheeler Transform (PBWT) stands out as a key innovation, addressing numerous challenges in genomics. For example, Sanaullah et al. introduced a PBWT-based method for the haplotype threading problem, which asks for a representation of a query haplotype by a minimal set of substrings. To solve this problem with the PBWT data structure, they formulated the Minimal Positional Substring Cover (MPSC) problem and presented a solution for it, along with solutions to several variants: k-MPSC, leftmost MPSC, rightmost MPSC, and length-maximal MPSC. However, each of their solutions requires a full PBWT, which incurs a significant memory requirement. Here, we take advantage of recent results on run-length encoding the PBWT to solve the MPSC problem in sublinear space. Our method shows that k-Set Maximal Exact Matches (k-SMEMs) can be computed in sublinear space via efficient computation of k-Matching Statistics (k-MS). This yields a sublinear-space solution not only for the MPSC problem but for all of its variants proposed by Sanaullah et al. Most importantly, we present experimental results on haplotype panels from the 1000 Genomes Project that demonstrate the practical utility of these theoretical results. Our approach decreases the memory required to solve the MPSC problem by at least two orders of magnitude compared to the method of Sanaullah et al. This efficiency allows us to solve large instances of the problem to which other methods are unable to scale. In summary, μ-PBWT paves the way for in-depth genetic research and analysis at scale. All source code is publicly available at https://github.com/dlcgold/muPBWT/tree/k-smem.
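As a concrete illustration, the following is a minimal, naive sketch of the leftmost MPSC computation over an explicit haplotype panel. It is not the PBWT- or μ-PBWT-based algorithm from the paper; the panel, query, and function names are illustrative assumptions, and the point is only to make explicit the cover that the paper computes in sublinear space.

```python
# Naive sketch of the (leftmost) Minimal Positional Substring Cover:
# cover the query with as few positional substrings as possible, where
# each substring must match some panel haplotype at the same positions.
# This is a brute-force illustration, not the PBWT-based algorithm.

def leftmost_mpsc(panel, query):
    """Return a list of (start, end) intervals (inclusive) covering the query,
    or None if no cover exists."""
    n = len(query)
    assert all(len(h) == n for h in panel)

    def longest_match_end(start):
        # Furthest position e such that query[start..e] equals some panel
        # haplotype at the same positions; start - 1 if nothing matches.
        best = start - 1
        for hap in panel:
            e = start
            while e < n and hap[e] == query[e]:
                e += 1
            best = max(best, e - 1)
        return best

    cover, pos = [], 0
    while pos < n:
        end = longest_match_end(pos)
        if end < pos:            # query[pos] matches no haplotype: no cover
            return None
        cover.append((pos, end))
        pos = end + 1            # greedy jump is optimal: any match covering
                                 # `pos` ends no later than this one
    return cover

panel = ["0101100", "0011010", "1101001"]
print(leftmost_mpsc(panel, "0101010"))   # [(0, 3), (4, 6)]
```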
Award ID(s):
2013998
PAR ID:
10549032
Author(s) / Creator(s):
; ; ; ;
Editor(s):
Inenaga, Shunsuke; Puglisi, Simon J
Publisher / Repository:
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Date Published:
Volume:
296
ISSN:
1868-8969
ISBN:
978-3-95977-326-3
Page Range / eLocation ID:
296-296
Subject(s) / Keyword(s):
Positional Burrows-Wheeler Transform; r-index; minimal position substring cover; set-maximal exact matches; Theory of computation → Data structures design and analysis
Format(s):
Medium: X; Size: 16 pages, 874410 bytes; Other: application/pdf
Size(s):
16 pages; 874410 bytes
Right(s):
Creative Commons Attribution 4.0 International license; info:eu-repo/semantics/openAccess
Sponsoring Org:
National Science Foundation
More Like this
  1. In the unit-cost comparison model, a black box takes as input two items and outputs the result of the comparison. Problems like sorting and searching have been studied in this model, and it has been generalized to include the concept of priced information, where different pairs of items (say database records) have different comparison costs. These comparison costs can be arbitrary (in which case no algorithm can be close to optimal (Charikar et al. STOC 2000)), structured (for example, the comparison cost may depend on the length of the databases (Gupta et al. FOCS 2001)), or stochastic (Angelov et al. LATIN 2008). Motivated by the database setting where the cost depends on the sizes of the items, we consider the problems of sorting and batched predecessor where two non-uniform sets of items A and B are given as input. (1) In the RAM setting, we consider the scenario where both sets have n keys each. The cost to compare two items in A is a, to compare an item of A to an item of B is b, and to compare two items in B is c. We give upper and lower bounds for the case a ≤ b ≤ c, which serves as a warmup for the generalization to the external-memory model. Notice that the case b = 1, a = c = ∞ is the famous "nuts and bolts" problem. (2) In the Disk-Access Model (DAM), where transferring elements between disk and internal memory is the main bottleneck, we consider the scenario where elements in B are larger than elements in A. The larger items take more I/Os to be brought into memory, consume more space in internal memory, and are required in their entirety for comparisons. A key observation is that the complexity of sorting depends heavily on the interleaving of the small and large items in the final sorted order. If all large elements come after all small elements in the final sorted order, sorting each type separately and concatenating is optimal. However, if the set of predecessors of B in A has size k ≪ n, one must solve an associated batched predecessor problem in order to achieve optimality. We first give output-sensitive lower and upper bounds on the batched predecessor problem, and use these to derive bounds on the complexity of sorting in the two models. Our bounds are tight in most cases, and require novel generalizations of the classical lower bound techniques in external memory to accommodate the non-uniformity of keys.
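To make the priced-comparison model above concrete, here is a small sketch that tallies the total price paid by the obvious baseline of sorting A and B separately and then merging them, charging a for A-A comparisons, b for A-B comparisons, and c for B-B comparisons. The inputs and costs are illustrative only; this is an accounting illustration of the model, not the (near-)optimal algorithms analyzed in the paper.

```python
# Priced-comparison model: every A-A comparison costs a, every A-B
# comparison costs b, every B-B comparison costs c. The baseline below
# sorts each set separately and then merges, tracking the total cost.

def merge_sort(items, charge):
    """Merge sort that reports its total comparison cost via charge()."""
    if len(items) <= 1:
        return items, 0
    mid = len(items) // 2
    left, cl = merge_sort(items[:mid], charge)
    right, cr = merge_sort(items[mid:], charge)
    merged, cost = [], cl + cr
    i = j = 0
    while i < len(left) and j < len(right):
        cost += charge()
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged += left[i:] + right[j:]
    return merged, cost

def sort_then_merge(A, B, a, b, c):
    sorted_A, cost_A = merge_sort(A, lambda: a)   # intra-A comparisons cost a
    sorted_B, cost_B = merge_sort(B, lambda: c)   # intra-B comparisons cost c
    out, cross = [], 0
    i = j = 0
    while i < len(sorted_A) and j < len(sorted_B):
        cross += b                                # cross comparisons cost b
        if sorted_A[i] <= sorted_B[j]:
            out.append(sorted_A[i]); i += 1
        else:
            out.append(sorted_B[j]); j += 1
    out += sorted_A[i:] + sorted_B[j:]
    return out, cost_A + cost_B + cross

print(sort_then_merge([5, 1, 9, 3], [4, 8, 2, 7], a=1, b=2, c=5))
```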
  2. Censor-Hillel, Keren; Grandoni, Fabrizio; Ouaknine, Joel; Puppis, Gabriele (Ed.)
    We study the problem of indexing a text T[1..n] to support pattern matching with wildcards. The input of a query is a pattern P[1..m] containing h ∈ [0, k] wildcard (a.k.a. don't care) characters, and the output is the set of occurrences of P in T (i.e., starting positions of substrings of T that match P), where k = o(log n) is fixed at index construction. A classic solution by Cole et al. [STOC 2004] provides an index with space complexity O(n ⋅ (c log n)^k / k!) and query time O(m + 2^h log log n + occ), where c > 1 is a constant and occ denotes the number of occurrences of P in T. We introduce a new data structure that significantly reduces space usage for highly repetitive texts while maintaining efficient query processing. Its space (in words) and query time are O(δ log(n/δ) ⋅ c^k (1 + (log^k(δ log n))/k!)) and O((m + 2^h + occ) log n), respectively. The parameter δ, known as substring complexity, is a recently introduced measure of repetitiveness that serves as a unifying and lower-bounding metric for several popular measures, including the number of phrases in the LZ77 factorization (denoted by z) and the number of runs in the Burrows-Wheeler Transform (denoted by r). Moreover, O(δ log(n/δ)) represents the optimal space required to encode the data in terms of n and δ, helping us see how close our space is to the minimum required. In another trade-off, we match the query time of Cole et al.'s index using O(n + δ log(n/δ) ⋅ (c log δ)^{k+ε}/k!) space, where ε > 0 is an arbitrarily small constant. We also demonstrate how these techniques can be applied to a more general indexing problem, where the query pattern includes k-gaps (a gap can be interpreted as a contiguous sequence of wildcard characters).
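For reference, the query semantics described above can be stated with a naive O(nm) scan. The sketch below (with an illustrative text, pattern, and wildcard symbol) reports every starting position of T where P occurs, treating '?' as a single-character wildcard; the indexes in the abstract answer the same queries far faster after preprocessing, so this is only to pin down the problem.

```python
# Naive wildcard matching: report every position of T where P occurs,
# with '?' matching any single character. O(n*m) time, no index.

def wildcard_occurrences(T, P, wildcard='?'):
    n, m = len(T), len(P)
    occ = []
    for i in range(n - m + 1):
        if all(P[j] == wildcard or P[j] == T[i + j] for j in range(m)):
            occ.append(i)
    return occ

print(wildcard_occurrences("abracadabra", "a?ra"))  # [0, 7]
```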
  3. We present new results on a number of fundamental problems about dynamic geometric data structures: 1) We describe the first fully dynamic data structures with sublinear amortized update time for maintaining (i) the number of vertices or the volume of the convex hull of a 3D point set, (ii) the largest empty circle for a 2D point set, (iii) the Hausdorff distance between two 2D point sets, (iv) the discrete 1-center of a 2D point set, (v) the number of maximal (i.e., skyline) points in a 3D point set. The update times are near n^{11/12} for (i) and (ii), n^{7/8} for (iii) and (iv), and n^{2/3} for (v). Previously, sublinear bounds were known only for restricted "semi-online" settings [Chan, SODA 2002]. 2) We slightly improve previous fully dynamic data structures for answering extreme point queries for the convex hull of a 3D point set and nearest neighbor search for a 2D point set. The query time is O(log^2n), and the amortized update time is O(log^4n) instead of O(log^5n) [Chan, SODA 2006; Kaplan et al., SODA 2017]. 3) We also improve previous fully dynamic data structures for maintaining the bichromatic closest pair between two 2D point sets and the diameter of a 2D point set. The amortized update time is O(log^4n) instead of O(log^7n) [Eppstein 1995; Chan, SODA 2006; Kaplan et al., SODA 2017]. 
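As a point of reference for item (iii) above, the Hausdorff distance between two 2D point sets can be recomputed from scratch in O(|A|·|B|) time as below (illustrative points only); the contribution summarized in the abstract is maintaining such quantities under insertions and deletions with sublinear update time rather than full recomputation.

```python
# Naive static computation of the symmetric Hausdorff distance between
# two 2D point sets. Dynamic data structures aim to maintain this value
# under point updates much faster than recomputing it like this.
from math import dist  # Euclidean distance (Python 3.8+)

def directed_hausdorff(A, B):
    # For each point of A, distance to its nearest point of B; take the max.
    return max(min(dist(a, b) for b in B) for a in A)

def hausdorff(A, B):
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))

A = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
B = [(0.0, 0.0), (2.0, 2.0)]
print(hausdorff(A, B))  # ~2.236 (= sqrt(5), realized by the point (2, 2))
```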
  4. Dodis, Y. (Ed.)
    Memory-hard functions (MHFs) are a useful cryptographic primitive which can be used to design egalitarian proof-of-work puzzles and to protect low-entropy secrets like passwords against brute-force attackers. Intuitively, a memory-hard function is a function whose evaluation costs are dominated by memory costs even if the attacker uses specialized hardware (FPGAs/ASICs), and several cost metrics have been proposed to quantify this intuition. For example, space-time cost looks at the product of running time and the maximum space usage over the entire execution of an algorithm. Alwen and Serbinenko (STOC 2015) observed that the space-time cost of evaluating a function multiple times may not scale linearly in the number of instances being evaluated and introduced the stricter requirement that a memory-hard function have high cumulative memory complexity (CMC), ensuring that an attacker's amortized space-time costs remain large even if the attacker evaluates the function on multiple different inputs in parallel. Alwen et al. (EUROCRYPT 2018) observed that the notion of CMC still gives the attacker undesirable flexibility in selecting space-time tradeoffs: e.g., while the MHF Scrypt has maximal CMC Ω(N^2), an attacker could evaluate the function with constant O(1) memory in time O(N^2). Alwen et al. introduced an even stricter notion of sustained space complexity and designed an MHF with sustained space complexity s = Ω(N/log N) over t = Ω(N) steps, i.e., any algorithm evaluating the function in the parallel random oracle model must have at least t = Ω(N) steps where the memory usage is at least Ω(N/log N). In this work, we use dynamic pebbling games and dynamic graphs to explore tradeoffs between sustained space complexity and cumulative memory complexity for data-dependent memory-hard functions such as Argon2id and Scrypt. We design our own dynamic graph (dMHF) with the property that any dynamic pebbling strategy either (1) has Ω(N) rounds with Ω(N) space, or (2) has CMC Ω(N^{3−ϵ}), substantially larger than N^2. For Argon2id we show that any dynamic pebbling strategy either (1) has Ω(N) rounds with Ω(N^{1−ϵ}) space, or (2) has CMC ω(N^2). We also present a dynamic version of DRSample (Alwen et al. 2017) for which any dynamic pebbling strategy either (1) has Ω(N) rounds with Ω(N/log N) space, or (2) has CMC Ω(N^3/log N).
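The two cost metrics contrasted above can be made concrete on a toy memory-usage trace (memory in blocks at each step of an evaluation): space-time cost multiplies the number of steps by the peak memory, while cumulative memory complexity sums memory over all steps. The sketch below is purely definitional and is not part of the paper's pebbling arguments.

```python
# Definitional illustration of the two MHF cost metrics on a memory trace.

def space_time_cost(trace):
    # (number of steps) x (peak memory over the whole execution)
    return len(trace) * max(trace)

def cumulative_memory_complexity(trace):
    # Sum of memory usage over all steps; amortizes properly over many
    # parallel evaluations, unlike the space-time product.
    return sum(trace)

# A strategy that briefly spikes to high memory but is otherwise cheap:
trace = [1, 1, 64, 1, 1, 1, 1, 1]
print(space_time_cost(trace))               # 8 * 64 = 512
print(cumulative_memory_complexity(trace))  # 71
```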
  5.
    Motivated by the increasing need to understand the distributed algorithmic foundations of large-scale graph computations, we study some fundamental graph problems in a message-passing model for distributed computing where k ≥ 2 machines jointly perform computations on graphs with n nodes (typically, n >> k). The input graph is assumed to be initially randomly partitioned among the k machines, a common implementation in many real-world systems. Communication is point-to-point, and the goal is to minimize the number of communication rounds of the computation. Our main contribution is the General Lower Bound Theorem, a theorem that can be used to show non-trivial lower bounds on the round complexity of distributed large-scale data computations. This result is established via an information-theoretic approach that relates the round complexity to the minimal amount of information required by machines to solve the problem. Our approach is generic, and this theorem can be used in a "cookbook" fashion to show distributed lower bounds for several problems, including non-graph problems. We present two applications by showing (almost) tight lower bounds on the round complexity of two fundamental graph problems, namely, PageRank computation and triangle enumeration. These applications show that our approach can yield lower bounds for problems where the application of communication complexity techniques seems not obvious or gives weak bounds, including and especially under a stochastic partition of the input. We then present distributed algorithms for PageRank and triangle enumeration with a round complexity that (almost) matches the respective lower bounds; these algorithms exhibit a round complexity that scales superlinearly in k, improving significantly over previous results [Klauck et al., SODA 2015]. Specifically, we show the following results: PageRank: We show a lower bound of Ω(n/k^2) rounds and present a distributed algorithm that computes an approximation of the PageRank of all the nodes of a graph in Õ(n/k^2) rounds. Triangle enumeration: We show that there exist graphs with m edges where any distributed algorithm requires Ω(m/k^{5/3}) rounds. This result also implies the first non-trivial lower bound of Ω(n^{1/3}) rounds for the congested clique model, which is tight up to logarithmic factors. We then present a distributed algorithm that enumerates all the triangles of a graph in Õ(m/k^{5/3} + n/k^{4/3}) rounds.
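For orientation, the quantity approximated by the distributed PageRank algorithm above can be computed centrally by plain power iteration, as in the sketch below (illustrative graph and parameters); none of the k-machine, round-complexity aspects of the paper are reflected here.

```python
# Centralized PageRank by power iteration; a reference for the quantity
# that the distributed algorithm above approximates in O~(n/k^2) rounds.

def pagerank(adj, damping=0.85, iters=50):
    """adj: dict node -> list of out-neighbors. Returns node -> score."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            out = adj[u]
            if out:
                share = damping * rank[u] / len(out)
                for v in out:
                    new[v] += share
            else:  # dangling node: spread its rank uniformly
                for v in nodes:
                    new[v] += damping * rank[u] / n
        rank = new
    return rank

adj = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(adj))
```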