skip to main content


Title: Accurate Identification of Transcription Regulatory Sequences and Genes in Coronaviruses
Abstract Transcription regulatory sequences (TRSs), which occur upstream of structural and accessory genes as well as the 5’ end of a coronavirus genome, play a critical role in discontinuous transcription in coronaviruses. We introduce two problems collectively aimed at identifying these regulatory sequences as well as their associated genes. First, we formulate the TRS Identification problem of identifying TRS sites in a coronavirus genome sequence with prescribed gene locations. We introduce CORSID-A, an algorithm that solves this problem to optimality in polynomial time. We demonstrate that CORSID-A outperforms existing motif-based methods in identifying TRS sites in coronaviruses. Second, we demonstrate for the first time how TRS sites can be leveraged to identify gene locations in the coronavirus genome. To that end, we formulate the TRS and Gene Identification problem of simultaneously identifying TRS sites and gene locations in unannotated coronavirus genomes. We introduce CORSID to solve this problem, which includes a web-based visualization tool to explore the space of near-optimal solutions. We show that CORSID outperforms stateof-the-art gene finding methods in coronavirus genomes. Furthermore, we demonstrate that CORSID enables de novo identification of TRS sites and genes in previously unannotated coronavirus genomes. CORSID is the first method to perform accurate and simultaneous identification of TRS sites and genes in coronavirus genomes without the use of any prior information.  more » « less
Award ID(s):
2027669 2046488 1850502
NSF-PAR ID:
10333879
Author(s) / Creator(s):
; ; ; ; ;
Editor(s):
Leitner, Thomas
Date Published:
Journal Name:
Molecular Biology and Evolution
ISSN:
0737-4038
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. INTRODUCTION Diverse phenotypes, including large brains relative to body size, group living, and vocal learning ability, have evolved multiple times throughout mammalian history. These shared phenotypes may have arisen repeatedly by means of common mechanisms discernible through genome comparisons. RATIONALE Protein-coding sequence differences have failed to fully explain the evolution of multiple mammalian phenotypes. This suggests that these phenotypes have evolved at least in part through changes in gene expression, meaning that their differences across species may be caused by differences in genome sequence at enhancer regions that control gene expression in specific tissues and cell types. Yet the enhancers involved in phenotype evolution are largely unknown. Sequence conservation–based approaches for identifying such enhancers are limited because enhancer activity can be conserved even when the individual nucleotides within the sequence are poorly conserved. This is due to an overwhelming number of cases where nucleotides turn over at a high rate, but a similar combination of transcription factor binding sites and other sequence features can be maintained across millions of years of evolution, allowing the function of the enhancer to be conserved in a particular cell type or tissue. Experimentally measuring the function of orthologous enhancers across dozens of species is currently infeasible, but new machine learning methods make it possible to make reliable sequence-based predictions of enhancer function across species in specific tissues and cell types. RESULTS To overcome the limits of studying individual nucleotides, we developed the Tissue-Aware Conservation Inference Toolkit (TACIT). Rather than measuring the extent to which individual nucleotides are conserved across a region, TACIT uses machine learning to test whether the function of a given part of the genome is likely to be conserved. More specifically, convolutional neural networks learn the tissue- or cell type–specific regulatory code connecting genome sequence to enhancer activity using candidate enhancers identified from only a few species. This approach allows us to accurately associate differences between species in tissue or cell type–specific enhancer activity with genome sequence differences at enhancer orthologs. We then connect these predictions of enhancer function to phenotypes across hundreds of mammals in a way that accounts for species’ phylogenetic relatedness. We applied TACIT to identify candidate enhancers from motor cortex and parvalbumin neuron open chromatin data that are associated with brain size relative to body size, solitary living, and vocal learning across 222 mammals. Our results include the identification of multiple candidate enhancers associated with brain size relative to body size, several of which are located in linear or three-dimensional proximity to genes whose protein-coding mutations have been implicated in microcephaly or macrocephaly in humans. We also identified candidate enhancers associated with the evolution of solitary living near a gene implicated in separation anxiety and other enhancers associated with the evolution of vocal learning ability. We obtained distinct results for bulk motor cortex and parvalbumin neurons, demonstrating the value in applying TACIT to both bulk tissue and specific minority cell type populations. To facilitate future analyses of our results and applications of TACIT, we released predicted enhancer activity of >400,000 candidate enhancers in each of 222 mammals and their associations with the phenotypes we investigated. CONCLUSION TACIT leverages predicted enhancer activity conservation rather than nucleotide-level conservation to connect genetic sequence differences between species to phenotypes across large numbers of mammals. TACIT can be applied to any phenotype with enhancer activity data available from at least a few species in a relevant tissue or cell type and a whole-genome alignment available across dozens of species with substantial phenotypic variation. Although we developed TACIT for transcriptional enhancers, it could also be applied to genomic regions involved in other components of gene regulation, such as promoters and splicing enhancers and silencers. As the number of sequenced genomes grows, machine learning approaches such as TACIT have the potential to help make sense of how conservation of, or changes in, subtle genome patterns can help explain phenotype evolution. Tissue-Aware Conservation Inference Toolkit (TACIT) associates genetic differences between species with phenotypes. TACIT works by generating open chromatin data from a few species in a tissue related to a phenotype, using the sequences underlying open and closed chromatin regions to train a machine learning model for predicting tissue-specific open chromatin and associating open chromatin predictions across dozens of mammals with the phenotype. [Species silhouettes are from PhyloPic] 
    more » « less
  2. Almost all regulation of gene expression in eukaryotic genomes is mediated by the action of distant non-coding transcriptional enhancers upon proximal gene promoters. Enhancer locations cannot be accurately predicted bioinformatically because of the absence of a defined sequence code, and thus functional assays are required for their direct detection. Here we used a massively parallel reporter assay, Self-Transcribing Active Regulatory Region sequencing (STARR-seq), to generate the first comprehensive genome-wide map of enhancers in Anopheles coluzzii , a major African malaria vector in the Gambiae species complex. The screen was carried out by transfecting reporter libraries created from the genomic DNA of 60 wild A. coluzzii from Burkina Faso into A. coluzzii 4a3A cells, in order to functionally query enhancer activity of the natural population within the homologous cellular context. We report a catalog of 3,288 active genomic enhancers that were significant across three biological replicates, 74% of them located in intergenic and intronic regions. The STARR-seq enhancer screen is chromatin-free and thus detects inherent activity of a comprehensive catalog of enhancers that may be restricted in vivo to specific cell types or developmental stages. Testing of a validation panel of enhancer candidates using manual luciferase assays confirmed enhancer function in 26 of 28 (93%) of the candidates over a wide dynamic range of activity from two to at least 16-fold activity above baseline. The enhancers occupy only 0.7% of the genome, and display distinct composition features. The enhancer compartment is significantly enriched for 15 transcription factor binding site signatures, and displays divergence for specific dinucleotide repeats, as compared to matched non-enhancer genomic controls. The genome-wide catalog of A. coluzzii enhancers is publicly available in a simple searchable graphic format. This enhancer catalogue will be valuable in linking genetic and phenotypic variation, in identifying regulatory elements that could be employed in vector manipulation, and in better targeting of chromosome editing to minimize extraneous regulation influences on the introduced sequences. Importance: Understanding the role of the non-coding regulatory genome in complex disease phenotypes is essential, but even in well-characterized model organisms, identification of regulatory regions within the vast non-coding genome remains a challenge. We used a large-scale assay to generate a genome wide map of transcriptional enhancers. Such a catalogue for the important malaria vector, Anopheles coluzzii , will be an important research tool as the role of non-coding regulatory variation in differential susceptibility to malaria infection is explored and as a public resource for research on this important insect vector of disease. 
    more » « less
  3. Background Transposable element (TE) polymorphisms are important components of population genetic variation. The functional impacts of TEs in gene regulation and generating genetic diversity have been observed in multiple species, but the frequency and magnitude of TE variation is under appreciated. Inexpensive and deep sequencing technology has made it affordable to apply population genetic methods to whole genomes with methods that identify single nucleotide and insertion/deletion polymorphisms. However, identifying TE polymorphisms, particularly transposition events or non-reference insertion sites can be challenging due to the repetitive nature of these sequences, which hamper both the sensitivity and specificity of analysis tools. Methods We have developed the tool RelocaTE2 for identification of TE insertion sites at high sensitivity and specificity. RelocaTE2 searches for known TE sequences in whole genome sequencing reads from second generation sequencing platforms such as Illumina. These sequence reads are used as seeds to pinpoint chromosome locations where TEs have transposed. RelocaTE2 detects target site duplication (TSD) of TE insertions allowing it to report TE polymorphism loci with single base pair precision. Results and Discussion The performance of RelocaTE2 is evaluated using both simulated and real sequence data. RelocaTE2 demonstrate high level of sensitivity and specificity, particularly when the sequence coverage is not shallow. In comparison to other tools tested, RelocaTE2 achieves the best balance between sensitivity and specificity. In particular, RelocaTE2 performs best in prediction of TSDs for TE insertions. Even in highly repetitive regions, such as those tested on rice chromosome 4, RelocaTE2 is able to report up to 95% of simulated TE insertions with less than 0.1% false positive rate using 10-fold genome coverage resequencing data. RelocaTE2 provides a robust solution to identify TE insertion sites and can be incorporated into analysis workflows in support of describing the complete genotype from light coverage genome sequencing. 
    more » « less
  4. Abstract

    Identification of transcription factor binding sites (TFBSs) is essential to understanding of gene regulation. Designing computational models for accurate prediction of TFBSs is crucial because it is not feasible to experimentally assay all transcription factors (TFs) in all sequenced eukaryotic genomes. Although many methods have been proposed for the identification of TFBSs in humans, methods designed for plants are comparatively underdeveloped. Here, we present PlantBind, a method for integrated prediction and interpretation of TFBSs based on DNA sequences and DNA shape profiles. Built on an attention-based multi-label deep learning framework, PlantBind not only simultaneously predicts the potential binding sites of 315 TFs, but also identifies the motifs bound by transcription factors. During the training process, this model revealed a strong similarity among TF family members with respect to target binding sequences. Trans-species prediction performance using four Zea mays TFs demonstrated the suitability of this model for transfer learning. Overall, this study provides an effective solution for identifying plant TFBSs, which will promote greater understanding of transcriptional regulatory mechanisms in plants.

     
    more » « less
  5. INTRODUCTION Transposable elements (TEs), repeat expansions, and repeat-mediated structural rearrangements play key roles in chromosome structure and species evolution, contribute to human genetic variation, and substantially influence human health through copy number variants, structural variants, insertions, deletions, and alterations to gene transcription and splicing. Despite their formative role in genome stability, repetitive regions have been relegated to gaps and collapsed regions in human genome reference GRCh38 owing to the technological limitations during its development. The lack of linear sequence in these regions, particularly in centromeres, resulted in the inability to fully explore the repeat content of the human genome in the context of both local and regional chromosomal environments. RATIONALE Long-read sequencing supported the complete, telomere-to-telomere (T2T) assembly of the pseudo-haploid human cell line CHM13. This resource affords a genome-scale assessment of all human repetitive sequences, including TEs and previously unknown repeats and satellites, both within and outside of gaps and collapsed regions. Additionally, a complete genome enables the opportunity to explore the epigenetic and transcriptional profiles of these elements that are fundamental to our understanding of chromosome structure, function, and evolution. Comparative analyses reveal modes of repeat divergence, evolution, and expansion or contraction with locus-level resolution. RESULTS We implemented a comprehensive repeat annotation workflow using previously known human repeats and de novo repeat modeling followed by manual curation, including assessing overlaps with gene annotations, segmental duplications, tandem repeats, and annotated repeats. Using this method, we developed an updated catalog of human repetitive sequences and refined previous repeat annotations. We discovered 43 previously unknown repeats and repeat variants and characterized 19 complex, composite repetitive structures, which often carry genes, across T2T-CHM13. Using precision nuclear run-on sequencing (PRO-seq) and CpG methylated sites generated from Oxford Nanopore Technologies long-read sequencing data, we assessed RNA polymerase engagement across retroelements genome-wide, revealing correlations between nascent transcription, sequence divergence, CpG density, and methylation. These analyses were extended to evaluate RNA polymerase occupancy for all repeats, including high-density satellite repeats that reside in previously inaccessible centromeric regions of all human chromosomes. Moreover, using both mapping-dependent and mapping-independent approaches across early developmental stages and a complete cell cycle time series, we found that engaged RNA polymerase across satellites is low; in contrast, TE transcription is abundant and serves as a boundary for changes in CpG methylation and centromere substructure. Together, these data reveal the dynamic relationship between transcriptionally active retroelement subclasses and DNA methylation, as well as potential mechanisms for the derivation and evolution of new repeat families and composite elements. Focusing on the emerging T2T-level assembly of the HG002 X chromosome, we reveal that a high level of repeat variation likely exists across the human population, including composite element copy numbers that affect gene copy number. Additionally, we highlight the impact of repeats on the structural diversity of the genome, revealing repeat expansions with extreme copy number differences between humans and primates while also providing high-confidence annotations of retroelement transduction events. CONCLUSION The comprehensive repeat annotations and updated repeat models described herein serve as a resource for expanding the compendium of human genome sequences and reveal the impact of specific repeats on the human genome. In developing this resource, we provide a methodological framework for assessing repeat variation within and between human genomes. The exhaustive assessment of the transcriptional landscape of repeats, at both the genome scale and locally, such as within centromeres, sets the stage for functional studies to disentangle the role transcription plays in the mechanisms essential for genome stability and chromosome segregation. Finally, our work demonstrates the need to increase efforts toward achieving T2T-level assemblies for nonhuman primates and other species to fully understand the complexity and impact of repeat-derived genomic innovations that define primate lineages, including humans. Telomere-to-telomere assembly of CHM13 supports repeat annotations and discoveries. The human reference T2T-CHM13 filled gaps and corrected collapsed regions (triangles) in GRCh38. Combining long read–based methylation calls, PRO-seq, and multilevel computational methods, we provide a compendium of human repeats, define retroelement expression and methylation profiles, and delineate locus-specific sites of nascent transcription genome-wide, including previously inaccessible centromeres. SINE, short interspersed element; SVA, SINE–variable number tandem repeat– Alu ; LINE, long interspersed element; LTR, long terminal repeat; TSS, transcription start site; pA, xxxxxxxxxxxxxxxx. 
    more » « less