INTRODUCTION Transposable elements (TEs), repeat expansions, and repeat-mediated structural rearrangements play key roles in chromosome structure and species evolution, contribute to human genetic variation, and substantially influence human health through copy number variants, structural variants, insertions, deletions, and alterations to gene transcription and splicing. Despite their formative role in genome stability, repetitive regions have been relegated to gaps and collapsed regions in human genome reference GRCh38 owing to the technological limitations during its development. The lack of linear sequence in these regions, particularly in centromeres, resulted in the inability to fully explore the repeat content of the human genome in the context of both local and regional chromosomal environments.
RATIONALE Long-read sequencing supported the complete, telomere-to-telomere (T2T) assembly of the pseudo-haploid human cell line CHM13. This resource affords a genome-scale assessment of all human repetitive sequences, including TEs and previously unknown repeats and satellites, both within and outside of gaps and collapsed regions. Additionally, a complete genome enables the opportunity to explore the epigenetic and transcriptional profiles of these elements that are fundamental to our understanding of chromosome structure, function, and evolution. Comparative analyses reveal modes of repeat divergence, evolution, and expansion or contraction with locus-level resolution.
RESULTS We implemented a comprehensive repeat annotation workflow using previously known human repeats and de novo repeat modeling followed by manual curation, including assessing overlaps with gene annotations, segmental duplications, tandem repeats, and annotated repeats. Using this method, we developed an updated catalog of human repetitive sequences and refined previous repeat annotations.
We discovered 43 previously unknown repeats and repeat variants and characterized 19 complex, composite repetitive structures, which often carry genes, across T2T-CHM13. Using precision nuclear run-on sequencing (PRO-seq) and CpG methylated sites generated from Oxford Nanopore Technologies long-read sequencing data, we assessed RNA polymerase engagement across retroelements genome-wide, revealing correlations between nascent transcription, sequence divergence, CpG density, and methylation. These analyses were extended to evaluate RNA polymerase occupancy for all repeats, including high-density satellite repeats that reside in previously inaccessible centromeric regions of all human chromosomes. Moreover, using both mapping-dependent and mapping-independent approaches across early developmental stages and a complete cell cycle time series, we found that engaged RNA polymerase across satellites is low; in contrast, TE transcription is abundant and serves as a boundary for changes in CpG methylation and centromere substructure. Together, these data reveal the dynamic relationship between transcriptionally active retroelement subclasses and DNA methylation, as well as potential mechanisms for the derivation and evolution of new repeat families and composite elements. Focusing on the emerging T2T-level assembly of the HG002 X chromosome, we reveal that a high level of repeat variation likely exists across the human population, including composite element copy numbers that affect gene copy number. Additionally, we highlight the impact of repeats on the structural diversity of the genome, revealing repeat expansions with extreme copy number differences between humans and primates while also providing high-confidence annotations of retroelement transduction events. 
CONCLUSION The comprehensive repeat annotations and updated repeat models described herein serve as a resource for expanding the compendium of human genome sequences and reveal the impact of specific repeats on the human genome. In developing this resource, we provide a methodological framework for assessing repeat variation within and between human genomes. The exhaustive assessment of the transcriptional landscape of repeats, at both the genome scale and locally, such as within centromeres, sets the stage for functional studies to disentangle the role transcription plays in the mechanisms essential for genome stability and chromosome segregation. Finally, our work demonstrates the need to increase efforts toward achieving T2T-level assemblies for nonhuman primates and other species to fully understand the complexity and impact of repeat-derived genomic innovations that define primate lineages, including humans.
Telomere-to-telomere assembly of CHM13 supports repeat annotations and discoveries. The human reference T2T-CHM13 filled gaps and corrected collapsed regions (triangles) in GRCh38. Combining long read–based methylation calls, PRO-seq, and multilevel computational methods, we provide a compendium of human repeats, define retroelement expression and methylation profiles, and delineate locus-specific sites of nascent transcription genome-wide, including previously inaccessible centromeres. SINE, short interspersed element; SVA, SINE–variable number tandem repeat–Alu; LINE, long interspersed element; LTR, long terminal repeat; TSS, transcription start site; pA, xxxxxxxxxxxxxxxx.
The complex set of internal repeats in SpTransformer protein sequences result in multiple but limited alternative alignments
The SpTransformer (SpTrf) gene family encodes a set of proteins that function in the sea urchin immune system. The gene sequences have a series of internal repeats in a mosaic pattern that is characteristic of this family. This mosaic pattern necessitates the insertion of large gaps, which has made alignments of the deduced protein sequences computationally difficult, such that only manual alignments have been reported previously. Because manual alignment is too time consuming for evaluating newly available SpTrf sequences, computational approaches were evaluated using the previously reported sequences. Furthermore, the multiple internal repeats make two different manual alignments of the SpTrf sequences feasible, and it is not known whether additional alternative alignments can be identified using different approaches. The bioinformatic program PRANK was used because it was designed to align sequences with large gaps and indels. The results from PRANK show that the alignments of the internal repeats are similar to those done manually, suggesting multiple feasible alignments for some regions. GUIDANCE-based analysis of the alignments identified regions that aligned well and other regions that failed to align. This suggests that computational approaches have limits for aligning the SpTrf sequences that include multiple repeats and that require inserted gaps. Furthermore, it is unlikely that alternative alignments for the full-length SpTrf sequences will be identified.
- Award ID(s): 1855747
- PAR ID: 10407668
- Date Published:
- Journal Name: Frontiers in Immunology
- Volume: 13
- ISSN: 1664-3224
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Rokas, A (Ed.) Abstract: Subtelomeres are dynamic genomic regions shaped by elevated rates of recombination, mutation, and gene birth/death. These processes contribute to formation of lineage-specific gene family expansions that commonly occupy subtelomeres across eukaryotes. Investigating the evolution of subtelomeric gene families is complicated by the presence of repetitive DNA and high sequence similarity among gene family members that prevents accurate assembly from whole genome sequences. Here, we investigated the evolution of the telomere-associated (TLO) gene family in Candida albicans using 189 complete coding sequences retrieved from 23 genetically diverse strains across the species. Tlo genes conformed to the 3 major architectural groups (α/β/γ) previously defined in the genome reference strain but significantly differed in the degree of within-group diversity. One group, Tloβ, was always found at the same chromosome arm with strong sequence similarity among all strains. In contrast, diverse Tloα sequences have proliferated among chromosome arms. Tloγ genes formed 7 primary clades that included each of the previously identified Tloγ genes from the genome reference strain with 3 Tloγ genes always found on the same chromosome arm among strains. Architectural groups displayed regions of high conservation that resolved newly identified functional motifs, providing insight into potential regulatory mechanisms that distinguish groups. Thus, by resolving intraspecies subtelomeric gene variation, it is possible to identify previously unknown gene family complexity that may underpin adaptive functional variation.
- de los Campos, G (Ed.) Abstract: De novo genome assembly is essential for genomic research. High-quality genomes assembled into phased pseudomolecules are challenging to produce and often contain assembly errors because of repeats, heterozygosity, or the chosen assembly strategy. Although algorithms that produce partially phased assemblies exist, haploid draft assemblies that may lack biological information remain favored because they are easier to generate and use. We developed HaploSync, a suite of tools that produces fully phased, chromosome-scale diploid genome assemblies, and performs extensive quality control to limit assembly artifacts. HaploSync scaffolds sequences from a draft diploid assembly into phased pseudomolecules guided by a genetic map and/or the genome of a closely related species. HaploSync generates a report that visualizes the relationships between current and legacy sequences, for both haplotypes, and displays their gene and marker content. This quality control helps the user identify misassemblies and guides HaploSync’s correction of scaffolding errors. Finally, HaploSync fills assembly gaps with unplaced sequences and resolves collapsed homozygous regions. In a series of plant, fungal, and animal kingdom case studies, we demonstrate that HaploSync efficiently increases the assembly contiguity of phased chromosomes, improves completeness by filling gaps, corrects scaffolding, and correctly phases highly heterozygous, complex regions.
- Abstract: Candidatus Poribacteria is a little-known bacterial phylum, previously characterized by partial genomes from a single sponge host, but never isolated in culture. We have reconstructed multiple genome sequences from four different sponge genera and compared them to recently reported, uncharacterized Poribacteria genomes from the open ocean, discovering shared and unique functional characteristics. Two distinct, habitat-linked taxonomic lineages were identified, designated Entoporibacteria (sponge-associated) and Pelagiporibacteria (free-living). These lineages differed in flagellar motility and chemotaxis genes unique to Pelagiporibacteria, and highly expanded families of restriction endonucleases, DNA methylases, transposases, CRISPR repeats, and toxin–antitoxin gene pairs in Entoporibacteria. Both lineages shared pathways for facultative anaerobic metabolism, denitrification, fermentation, organosulfur compound utilization, type IV pili, cellulosomes, and bacterial proteasomes. Unexpectedly, many features characteristic of eukaryotic host association were also shared, including genes encoding the synthesis of eukaryotic-like cell adhesion molecules, extracellular matrix digestive enzymes, phosphoinositol-linked membrane glycolipids, and exopolysaccharide capsules. Complete Poribacteria 16S rRNA gene sequences were found to contain multiple mismatches to “universal” 16S rRNA gene primer sets, substantiating concerns about potential amplification failures in previous studies. A newly designed primer set corrects these mismatches, enabling more accurate assessment of Poribacteria abundance in diverse marine habitats where it may have previously been overlooked.
- Hydrothermal sediments host phylogenetically diverse and physiologically complex microbial communities. Previous studies of microbial community structure in hydrothermal sediments have typically used short-read sequencing approaches. To improve on these approaches, we use LoopSeq, a high-throughput synthetic long-read sequencing method that has yielded promising results in analyses of microbial ecosystems, such as the human gut microbiome. In this study, LoopSeq is used to obtain near-full length (approximately 1,400–1,500 nucleotides) bacterial 16S rRNA gene sequences from hydrothermal sediments in Guaymas Basin. Based on these sequences, high-quality alignments and phylogenetic analyses provided new insights into previously unrecognized taxonomic diversity of sulfur-cycling microorganisms and their distribution along a lateral hydrothermal gradient. Detailed phylogenies for free-living and syntrophic sulfur-cycling bacterial lineages identified well-supported monophyletic clusters that have implications for the taxonomic classification of these groups. In particular, we identify clusters within Candidatus Desulfofervidus that represent unexplored physiological and genomic diversity. In general, LoopSeq-derived 16S rRNA gene sequences aligned consistently with reference sequences in GenBank; however, chimeras were prevalent in sequences affiliated with the thermophilic Candidatus Desulfofervidus and Thermodesulfobacterium, and in smaller numbers within the sulfur-oxidizing family Beggiatoaceae. Our analysis of sediments along a well-documented thermal and geochemical gradient shows how lineages affiliated with different sulfur-cycling taxonomic groups persist throughout surficial hydrothermal sediments in the Guaymas Basin.

