Pangenome-Informed Language Models for Synthetic Genome Sequence Generation

Huang, Pengzhi; Charton, François; Schmelzle, Jan-Niklas M; Darnell, Shelby S; Prins, Pjotr; Garrison, Erik; Suh, G Edward

doi:10.1101/2024.09.18.612131


Abstract: Language models (LMs) have been extensively used for learning DNA sequence patterns and generating synthetic sequences. In this paper, we present a novel approach for generating synthetic DNA data using pangenomes in combination with LMs. We introduce three pangenome-based tokenization schemes, including two that can be decoupled from private data, while enhancing long DNA sequence generation. Our experimental results demonstrate the superiority of pangenome-based tokenization over classical methods in generating high-utility synthetic DNA sequences, highlighting a promising direction for the public sharing of genomic datasets.

Award ID(s): 2118743

PAR ID: 10630577

Author(s) / Creator(s): Huang, Pengzhi; Charton, François; Schmelzle, Jan-Niklas M; Darnell, Shelby S; Prins, Pjotr; Garrison, Erik; Suh, G Edward

Publisher / Repository: bioRxiv

Date Published: 2024-09-20

Format(s): Medium: X

Institution: bioRxiv

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Posted Content:
https://doi.org/10.1101/2024.09.18.612131
