Title: Improving bacterial genome assembly using a test of strand orientation
Abstract
Summary: The complexity of genome assembly is due in large part to the presence of repeats. In particular, large reverse-complemented repeats can lead to incorrect inversions of large segments of the genome. To detect and correct such inversions in finished bacterial genomes, we propose a statistical test based on tetranucleotide frequency (TNF), which determines whether two segments from the same genome are of the same or opposite orientation. In most cases, the test neatly partitions the genome into two segments of roughly equal length with seemingly opposite orientations. This corresponds to the segments between the DNA replication origin and terminus, which were previously known to have distinct nucleotide compositions. We show that, in several cases where this balanced partition is not observed, the test identifies a potential inverted misassembly, which is validated by the presence of a reverse-complemented repeat at the boundaries of the inversion. After inverting the sequence between the repeat copies, the balance of the misassembled genome is restored. Our method identifies 31 potential misassemblies in the NCBI database, several of which are further supported by a reassembly of the read data.
Availability and implementation: A GitHub repository is available at https://github.com/gcgreenberg/Oriented-TNF.git.
Supplementary information: Supplementary data are available at Bioinformatics online.
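The idea of comparing tetranucleotide frequencies to decide whether two segments share an orientation can be illustrated with a minimal Python sketch. This is not the authors' statistical test or the code in their repository: the function names (`tnf`, `orientation_score`), the squared Euclidean distance, and the simple "closer profile wins" rule are all assumptions chosen for illustration. The sketch only shows the underlying intuition that a segment's TNF profile should resemble another segment's profile more closely when both are read in the same orientation than when one is reverse-complemented.

```python
from collections import Counter
from itertools import product

# Illustrative sketch only; names and distance measure are assumptions,
# not the method published in the paper.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]  # all 256 tetranucleotides


def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def tnf(seq):
    """Normalized 256-dimensional tetranucleotide frequency vector."""
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in KMERS) or 1  # avoid division by zero
    return [counts[k] / total for k in KMERS]


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def orientation_score(seg_a, seg_b):
    """Positive if seg_b resembles seg_a as given, negative if it
    resembles seg_a more closely after reverse-complementing seg_b."""
    profile_a = tnf(seg_a)
    same = squared_distance(profile_a, tnf(seg_b))
    opposite = squared_distance(profile_a, tnf(reverse_complement(seg_b)))
    return opposite - same
```

A score near zero means the two orientations cannot be told apart by this crude measure, which is one reason a proper statistical test, as proposed in the paper, is needed in practice.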
Planta, Jose; Liang, Yu-Ya; Xin, Haoyang; Chansler, Matthew T; Prather, L Alan; Jiang, Ning; Jiang, Jiming; Childs, Kevin L
(G3 Genes|Genomes|Genetics)
Tribble, C
(Ed.)
Abstract The majority of sequenced genomes in the monocots are from species belonging to Poaceae, which include many commercially important crops. Here, we expand the number of sequenced genomes from the monocots to include the genomes of 4 related cyperids: Carex cristatella and Carex scoparia from Cyperaceae and Juncus effusus and Juncus inflexus from Juncaceae. The high-quality, chromosome-scale genome sequences from these 4 cyperids were assembled by combining whole-genome shotgun sequencing of Nanopore long reads, Illumina short reads, and Hi-C sequencing data. Some members of the Cyperaceae and Juncaceae are known to possess holocentric chromosomes. We examined the repeat landscapes in our sequenced genomes to search for potential repeats associated with centromeres. Several large satellite repeat families, comprising 3.2–9.5% of our sequenced genomes, showed dispersed distribution of large satellite repeat clusters across all Carex chromosomes, with few instances of these repeats clustering in the same chromosomal regions. In contrast, most large Juncus satellite repeats were clustered in a single location on each chromosome, with sporadic instances of large satellite repeats throughout the Juncus genomes. Recognizable transposable elements account for about 20% of each of the 4 genome assemblies, with the Carex genomes containing more DNA transposons than retrotransposons while the converse is true for the Juncus genomes. These genome sequences and annotations will facilitate better comparative analysis within monocots.
Lee, Chaehee; Choi, In‐Su; Cardoso, Domingos; de Lima, Haroldo C.; de Queiroz, Luciano P.; Wojciechowski, Martin F.; Jansen, Robert K.; Ruhlman, Tracey A.
(The Plant Journal)
Summary The plastid genome (plastome), while surprisingly constant in gene order and content across most photosynthetic angiosperms, exhibits variability in several unrelated lineages. During the diversification history of the legume family Fabaceae, plastomes have undergone many rearrangements, including inversions, expansion, contraction and loss of the typical inverted repeat (IR), gene loss and repeat accumulation in both shared and independent events. While legume plastomes have been the subject of study for some time, most work has focused on agricultural species in the IR‐lacking clade (IRLC) and the plant model Medicago truncatula. The subfamily Papilionoideae, which contains virtually all of the agricultural legume species, also comprises most of the plastome variation detected thus far in the family. In this study three non‐papilionoids were included among 34 newly sequenced legume plastomes, along with 33 publicly available sequences, to assess plastome structural evolution in the subfamily. In an effort to examine plastome variation across the subfamily, approximately 20% of the sampling represents the IRLC with the remainder selected to represent the early‐branching papilionoid clades. A number of IR‐related and repeat‐mediated changes were identified and examined in a phylogenetic context. Recombination between direct repeats associated with ycf2 resulted in intraindividual plastome heteroplasmy. Although loss of the IR has not been reported in legumes outside of the IRLC, one genistoid taxon was found to completely lack the typical plastome IR. The role of the IR and non‐IR repeats in the progression of plastome change is discussed.
Hisey, Julia A.; Radchenko, Elina A.; Mandel, Nicholas H.; McGinty, Ryan J.; Matos-Rodrigues, Gabriel; Rastokina, Anastasia; Masnovo, Chiara; Ceschi, Silvia; Hernandez, Alfredo; Nussenzweig, André; et al
(Nucleic Acids Research)
Abstract CANVAS is a recently characterized repeat expansion disease, most commonly caused by homozygous expansions of an intronic (A2G3)n repeat in the RFC1 gene. There are a multitude of repeat motifs found in the human population at this locus, some of which are pathogenic and others benign. In this study, we conducted structure-functional analyses of the pathogenic (A2G3)n and nonpathogenic (A4G)n repeats. We found that the pathogenic, but not the nonpathogenic, repeat presents a potent, orientation-dependent impediment to DNA polymerization in vitro. The pattern of the polymerization blockage is consistent with triplex or quadruplex formation in the presence of magnesium or potassium ions, respectively. Chemical probing of both repeats in vitro reveals triplex H-DNA formation by only the pathogenic repeat. Consistently, bioinformatic analysis of S1-END-seq data from human cell lines shows preferential H-DNA formation genome-wide by (A2G3)n motifs over (A4G)n motifs. Finally, the pathogenic, but not the nonpathogenic, repeat stalls replication fork progression in yeast and human cells. We hypothesize that the CANVAS-causing (A2G3)n repeat represents a challenge to genome stability by folding into alternative DNA structures that stall DNA replication.
Hoyt, Savannah J.; Storer, Jessica M.; Hartley, Gabrielle A.; Grady, Patrick G.; Gershman, Ariel; de Lima, Leonardo G.; Limouse, Charles; Halabian, Reza; Wojenski, Luke; Rodriguez, Matias; et al
(Science)
INTRODUCTION Transposable elements (TEs), repeat expansions, and repeat-mediated structural rearrangements play key roles in chromosome structure and species evolution, contribute to human genetic variation, and substantially influence human health through copy number variants, structural variants, insertions, deletions, and alterations to gene transcription and splicing. Despite their formative role in genome stability, repetitive regions have been relegated to gaps and collapsed regions in human genome reference GRCh38 owing to the technological limitations during its development. The lack of linear sequence in these regions, particularly in centromeres, resulted in the inability to fully explore the repeat content of the human genome in the context of both local and regional chromosomal environments. RATIONALE Long-read sequencing supported the complete, telomere-to-telomere (T2T) assembly of the pseudo-haploid human cell line CHM13. This resource affords a genome-scale assessment of all human repetitive sequences, including TEs and previously unknown repeats and satellites, both within and outside of gaps and collapsed regions. Additionally, a complete genome enables the opportunity to explore the epigenetic and transcriptional profiles of these elements that are fundamental to our understanding of chromosome structure, function, and evolution. Comparative analyses reveal modes of repeat divergence, evolution, and expansion or contraction with locus-level resolution. RESULTS We implemented a comprehensive repeat annotation workflow using previously known human repeats and de novo repeat modeling followed by manual curation, including assessing overlaps with gene annotations, segmental duplications, tandem repeats, and annotated repeats. Using this method, we developed an updated catalog of human repetitive sequences and refined previous repeat annotations. 
We discovered 43 previously unknown repeats and repeat variants and characterized 19 complex, composite repetitive structures, which often carry genes, across T2T-CHM13. Using precision nuclear run-on sequencing (PRO-seq) and CpG methylated sites generated from Oxford Nanopore Technologies long-read sequencing data, we assessed RNA polymerase engagement across retroelements genome-wide, revealing correlations between nascent transcription, sequence divergence, CpG density, and methylation. These analyses were extended to evaluate RNA polymerase occupancy for all repeats, including high-density satellite repeats that reside in previously inaccessible centromeric regions of all human chromosomes. Moreover, using both mapping-dependent and mapping-independent approaches across early developmental stages and a complete cell cycle time series, we found that engaged RNA polymerase across satellites is low; in contrast, TE transcription is abundant and serves as a boundary for changes in CpG methylation and centromere substructure. Together, these data reveal the dynamic relationship between transcriptionally active retroelement subclasses and DNA methylation, as well as potential mechanisms for the derivation and evolution of new repeat families and composite elements. Focusing on the emerging T2T-level assembly of the HG002 X chromosome, we reveal that a high level of repeat variation likely exists across the human population, including composite element copy numbers that affect gene copy number. Additionally, we highlight the impact of repeats on the structural diversity of the genome, revealing repeat expansions with extreme copy number differences between humans and primates while also providing high-confidence annotations of retroelement transduction events. 
CONCLUSION The comprehensive repeat annotations and updated repeat models described herein serve as a resource for expanding the compendium of human genome sequences and reveal the impact of specific repeats on the human genome. In developing this resource, we provide a methodological framework for assessing repeat variation within and between human genomes. The exhaustive assessment of the transcriptional landscape of repeats, at both the genome scale and locally, such as within centromeres, sets the stage for functional studies to disentangle the role transcription plays in the mechanisms essential for genome stability and chromosome segregation. Finally, our work demonstrates the need to increase efforts toward achieving T2T-level assemblies for nonhuman primates and other species to fully understand the complexity and impact of repeat-derived genomic innovations that define primate lineages, including humans. Telomere-to-telomere assembly of CHM13 supports repeat annotations and discoveries. The human reference T2T-CHM13 filled gaps and corrected collapsed regions (triangles) in GRCh38. Combining long read–based methylation calls, PRO-seq, and multilevel computational methods, we provide a compendium of human repeats, define retroelement expression and methylation profiles, and delineate locus-specific sites of nascent transcription genome-wide, including previously inaccessible centromeres. SINE, short interspersed element; SVA, SINE–variable number tandem repeat–Alu; LINE, long interspersed element; LTR, long terminal repeat; TSS, transcription start site; pA, xxxxxxxxxxxxxxxx.
Song, Shuang; Shan, Nayang; Wang, Geng; Yan, Xiting; Liu, Jun S; Hou, Lin
(Bioinformatics)
Schwartz, Russell
(Ed.)
Abstract Motivation Identification and interpretation of non-coding variations that affect disease risk remain a paramount challenge in genome-wide association studies (GWAS) of complex diseases. Experimental efforts have provided comprehensive annotations of functional elements in the human genome. On the other hand, advances in computational biology, especially machine learning approaches, have facilitated accurate predictions of cell-type-specific functional annotations. Integrating functional annotations with GWAS signals has advanced the understanding of disease mechanisms. In previous studies, functional annotations were treated as static of a genomic region, ignoring potential functional differences imposed by different genotypes across individuals. Results We develop a computational approach, Openness Weighted Association Studies (OWAS), to leverage and aggregate predictions of chromosome accessibility in personal genomes for prioritizing GWAS signals. The approach relies on an analytical expression we derived for identifying disease associated genomic segments whose effects in the etiology of complex diseases are evaluated. In extensive simulations and real data analysis, OWAS identifies genes/segments that explain more heritability than existing methods, and has a better replication rate in independent cohorts than GWAS. Moreover, the identified genes/segments show tissue-specific patterns and are enriched in disease relevant pathways. We use rheumatic arthritis and asthma as examples to demonstrate how OWAS can be exploited to provide novel insights on complex diseases. Availability and implementation The R package OWAS that implements our method is available at https://github.com/shuangsong0110/OWAS. Supplementary information Supplementary data are available at Bioinformatics online.
Greenberg, Grant, and Shomorony, Ilan. Improving bacterial genome assembly using a test of strand orientation. Retrieved from https://par.nsf.gov/biblio/10416295. Bioinformatics 38.Supplement_2 Web. doi:10.1093/bioinformatics/btac516.
Greenberg, Grant, & Shomorony, Ilan. Improving bacterial genome assembly using a test of strand orientation. Bioinformatics, 38 (Supplement_2). Retrieved from https://par.nsf.gov/biblio/10416295. https://doi.org/10.1093/bioinformatics/btac516
@article{osti_10416295,
place = {Country unknown/Code not available},
title = {Improving bacterial genome assembly using a test of strand orientation},
url = {https://par.nsf.gov/biblio/10416295},
DOI = {10.1093/bioinformatics/btac516},
abstractNote = {Abstract Summary The complexity of genome assembly is due in large part to the presence of repeats. In particular, large reverse-complemented repeats can lead to incorrect inversions of large segments of the genome. To detect and correct such inversions in finished bacterial genomes, we propose a statistical test based on tetranucleotide frequency (TNF), which determines whether two segments from the same genome are of the same or opposite orientation. In most cases, the test neatly partitions the genome into two segments of roughly equal length with seemingly opposite orientations. This corresponds to the segments between the DNA replication origin and terminus, which were previously known to have distinct nucleotide compositions. We show that, in several cases where this balanced partition is not observed, the test identifies a potential inverted misassembly, which is validated by the presence of a reverse-complemented repeat at the boundaries of the inversion. After inverting the sequence between the repeat, the balance of the misassembled genome is restored. Our method identifies 31 potential misassemblies in the NCBI database, several of which are further supported by a reassembly of the read data. Availability and implementation A github repository is available at https://github.com/gcgreenberg/Oriented-TNF.git. Supplementary information Supplementary data are available at Bioinformatics online.},
journal = {Bioinformatics},
volume = {38},
number = {Supplement_2},
author = {Greenberg, Grant and Shomorony, Ilan},
}