CSTs for Terabyte-Sized Data

Oliva, Marco; Cenzato, Davide; Rossi, Massimiliano; Liptak, Zsuzsanna; Gagie, Travis; Boucher, Christina

doi:10.1109/DCC52660.2022.00017

Generating pangenomic datasets is becoming increasingly common, but there are still few tools able to handle them and even fewer accessible to non-specialists. Building compressed suffix trees (CSTs) for pangenomic datasets is still a major challenge but could be enormously beneficial to the community. In this paper, we present a scalable method for building CSTs, which we refer to as RePFP-CST. To accomplish this, we show how to build a CST directly from VCF files without decompressing them, and how to prune from the prefix-free parse (PFP) those phrase boundaries whose removal reduces the total size of the dictionary and the parse. We show that these improvements reduce the time and space required to construct the CST, as well as the memory footprint of the finished CST, enabling us to build a CST for a terabyte of DNA for the first time in the literature.

Award ID(s): 2029552

PAR ID: 10340624

Author(s) / Creator(s): Oliva, Marco; Cenzato, Davide; Rossi, Massimiliano; Liptak, Zsuzsanna; Gagie, Travis; Boucher, Christina

Date Published: 2022-03-01

Journal Name: Data Compression Conference (DCC)

Page Range / eLocation ID: 93 to 102

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text: Accepted Manuscript 1.0
Conference Paper: https://doi.org/10.1109/DCC52660.2022.00017
