An Average-Case Efficient Two-Stage Algorithm for Enumerating All Longest Common Substrings of Minimum Length $k$ Between Genome Pairs

Prosperi, Mattia; Marini, Simone; Boucher, Christina

doi:10.1109/ICHI61247.2024.00020

An extension of the longest common substring (LCS) problem between two texts is the enumeration of all LCSs of a given minimum length k (ALCS-k), along with their positions in each text. In bioinformatics, an efficient solution to ALCS-k for very long texts (genomes or metagenomes) can provide useful insights for discovering genetic signatures responsible for biological mechanisms. The ALCS-k problem has two additional requirements compared to the LCS problem: one is the minimum length k, and the other is that all common strings longer than k must be reported. We present an efficient, two-stage ALCS-k algorithm exploiting the spectrum of text substrings of length k (k-mers). Our approach yields a worst-case time complexity loglinear in the number of k-mers for the first stage, and an average-case time complexity loglinear in the number of common k-mers for the second stage (several orders of magnitude smaller than the total k-mer spectrum). The space complexity is linear in the first stage (disk-based), and on average linear in the second stage (disk- and memory-based). Tests performed on genomes of different organisms (including viruses, bacteria and animal chromosomes) show that run times are consistent with our theoretical estimates; further, comparisons with MUMmer4 show an asymptotic advantage with divergent genomes.
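To make the two-stage idea in the abstract concrete, here is a minimal in-memory sketch. It is not the authors' implementation: it assumes that "all common substrings of minimum length k" means all maximal exact matches of length at least k, it uses hash maps where the abstract's loglinear bounds suggest sorting, and it ignores the disk-based storage entirely. The function names `kmer_positions` and `common_substrings_min_k` are hypothetical.

```python
from collections import defaultdict


def kmer_positions(text, k):
    """Map each k-mer of `text` to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(text) - k + 1):
        index[text[i:i + k]].append(i)
    return index


def common_substrings_min_k(a, b, k):
    """Return sorted (start_a, start_b, length) triples, one per maximal
    common substring of length >= k (illustrative sketch only)."""
    # Stage 1 (assumed form): build both k-mer spectra and keep shared k-mers.
    index_a = kmer_positions(a, k)
    index_b = kmer_positions(b, k)
    shared = index_a.keys() & index_b.keys()

    # Stage 2 (assumed form): extend each shared k-mer occurrence pair.
    results = []
    for kmer in shared:
        for i in index_a[kmer]:
            for j in index_b[kmer]:
                # Skip seeds that can be extended to the left: the same
                # match is reported from its leftmost seed instead.
                if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                    continue
                length = k
                while (i + length < len(a) and j + length < len(b)
                       and a[i + length] == b[j + length]):
                    length += 1
                results.append((i, j, length))
    return sorted(results)


if __name__ == "__main__":
    print(common_substrings_min_k("ACGTACGTGG", "TTACGTACCC", 4))
```

On this toy pair the sketch reports two matches, `ACGTAC` and `TACGT`, with their start positions in each text. Restricting the extension step to shared k-mers is what keeps the second stage small when the two texts are divergent, which matches the intuition given in the abstract.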

Award ID(s): 2013998

PAR ID: 10548611

Author(s) / Creator(s): Prosperi, Mattia; Marini, Simone; Boucher, Christina

Publisher / Repository: IEEE

Date Published: 2024-06-03

ISBN: 979-8-3503-8373-7

Page Range / eLocation ID: 93 to 102

Location: Orlando, FL, USA

Sponsoring Org: National Science Foundation

Conference Paper:
https://doi.org/10.1109/ICHI61247.2024.00020
