Accurate Identification of Transcription Regulatory Sequences and Genes in Coronaviruses

Zhang, Chuanyi; Sashittal, Palash; Xiang, Michael; Zhang, Yichi; Kazi, Ayesha; El-Kebir, Mohammed

doi:10.1093/molbev/msac133

Abstract Transcription regulatory sequences (TRSs), which occur upstream of structural and accessory genes as well as the 5' end of a coronavirus genome, play a critical role in discontinuous transcription in coronaviruses. We introduce two problems collectively aimed at identifying these regulatory sequences as well as their associated genes. First, we formulate the TRS Identification problem of identifying TRS sites in a coronavirus genome sequence with prescribed gene locations. We introduce CORSID-A, an algorithm that solves this problem to optimality in polynomial time. We demonstrate that CORSID-A outperforms existing motif-based methods in identifying TRS sites in coronaviruses. Second, we demonstrate for the first time how TRS sites can be leveraged to identify gene locations in the coronavirus genome. To that end, we formulate the TRS and Gene Identification problem of simultaneously identifying TRS sites and gene locations in unannotated coronavirus genomes. We introduce CORSID to solve this problem, which includes a web-based visualization tool to explore the space of near-optimal solutions. We show that CORSID outperforms state-of-the-art gene finding methods in coronavirus genomes. Furthermore, we demonstrate that CORSID enables de novo identification of TRS sites and genes in previously unannotated coronavirus genomes. CORSID is the first method to perform accurate and simultaneous identification of TRS sites and genes in coronavirus genomes without the use of any prior information.
