Novel Grammar-Based Compression Algorithms for Pangenome Analysis

Dood, Jordan; Cleary, Alan M.

Recent advances in DNA sequencing and assembly have drastically lowered costs and improved quality, making it possible to build collections of genomes that better reflect the variability within a single species. These pangenomes continue to grow in size and scope as new sequences are added, yet they are already challenging to handle without significant computational infrastructure, primarily because of their large data size. Unfortunately, existing compression algorithms do not allow analysis to be performed directly on the compressed data. Furthermore, many common compression paradigms do not take advantage of the high similarity between genomes of the same species, so the compressed size scales with data size rather than with information content. In this work, we propose novel grammar-based compression algorithms designed specifically for pangenome analysis. By leveraging maximal repeats, these algorithms have the potential to enable pangenome analysis at unprecedented scales.
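
To make the notion of a maximal repeat concrete, the following is a minimal Python sketch. It is not the authors' algorithm: it uses a naive quadratic scan, assumes the textbook definition (a substring that occurs at least twice and cannot be extended left or right without losing an occurrence), and the names `maximal_repeats` and `factor_out`, the `$` separator, and the toy sequences are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def maximal_repeats(text, min_len=2):
    """Return {repeat: [start positions]} for every maximal repeat in text.

    Naive illustration only: it enumerates all substrings, so it is far too
    slow for real genomes. A substring is reported when it occurs at least
    twice, is preceded by at least two distinct characters (the string start
    counts as one), and is followed by at least two distinct characters
    (the string end counts as one).
    """
    # substring -> (left contexts, right contexts, start positions)
    contexts = defaultdict(lambda: (set(), set(), []))
    n = len(text)
    for i in range(n):
        for j in range(i + min_len, n + 1):
            left = text[i - 1] if i > 0 else None    # None marks the string start
            right = text[j] if j < n else None       # None marks the string end
            lefts, rights, starts = contexts[text[i:j]]
            lefts.add(left)
            rights.add(right)
            starts.append(i)
    return {
        sub: starts
        for sub, (lefts, rights, starts) in contexts.items()
        if len(starts) >= 2 and len(lefts) >= 2 and len(rights) >= 2
    }


def factor_out(text, repeat, symbol):
    """Replace each non-overlapping occurrence of repeat with a single
    nonterminal symbol and return the new text plus the grammar rule."""
    return text.replace(repeat, symbol), {symbol: repeat}


# Two toy "genomes" joined by a separator that occurs in neither of them.
pangenome = "GATTACA$CATTACG"
repeats = maximal_repeats(pangenome)
compressed, rule = factor_out(pangenome, "ATTAC", "R")
```

On this toy input the shared segment `ATTAC` is one of the maximal repeats found, and factoring it out into a single rule stores it once for both sequences. This is the intuition behind compression that scales with information content rather than raw data size.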

Award ID(s): 2105391

PAR ID: 10429893

Author(s) / Creator(s): Dood, Jordan; Cleary, Alan M.

Date Published: 2023-06-05

Journal Name: Sequencing, Finishing and Analysis in the Future

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Conference Paper:
The DOI is not currently available.
