INTRODUCTION Transposable elements (TEs), repeat expansions, and repeat-mediated structural rearrangements play key roles in chromosome structure and species evolution, contribute to human genetic variation, and substantially influence human health through copy number variants, structural variants, insertions, deletions, and alterations to gene transcription and splicing. Despite their formative role in genome stability, repetitive regions have been relegated to gaps and collapsed regions in human genome reference GRCh38 owing to the technological limitations during its development. The lack of linear sequence in these regions, particularly in centromeres, resulted in the inability to fully explore the repeat content of the human genome in the context of both local and regional chromosomal environments. RATIONALE Long-read sequencing supported the complete, telomere-to-telomere (T2T) assembly of the pseudo-haploid human cell line CHM13. This resource affords a genome-scale assessment of all human repetitive sequences, including TEs and previously unknown repeats and satellites, both within and outside of gaps and collapsed regions. Additionally, a complete genome enables the opportunity to explore the epigenetic and transcriptional profiles of these elements that are fundamental to our understanding of chromosome structure, function, and evolution. Comparative analyses reveal modes of repeat divergence, evolution, and expansion or contraction with locus-level resolution. RESULTS We implemented a comprehensive repeat annotation workflow using previously known human repeats and de novo repeat modeling followed by manual curation, including assessing overlaps with gene annotations, segmental duplications, tandem repeats, and annotated repeats. Using this method, we developed an updated catalog of human repetitive sequences and refined previous repeat annotations. We discovered 43 previously unknown repeats and repeat variants and characterized 19 complex, composite repetitive structures, which often carry genes, across T2T-CHM13. Using precision nuclear run-on sequencing (PRO-seq) and CpG methylated sites generated from Oxford Nanopore Technologies long-read sequencing data, we assessed RNA polymerase engagement across retroelements genome-wide, revealing correlations between nascent transcription, sequence divergence, CpG density, and methylation. These analyses were extended to evaluate RNA polymerase occupancy for all repeats, including high-density satellite repeats that reside in previously inaccessible centromeric regions of all human chromosomes. Moreover, using both mapping-dependent and mapping-independent approaches across early developmental stages and a complete cell cycle time series, we found that engaged RNA polymerase across satellites is low; in contrast, TE transcription is abundant and serves as a boundary for changes in CpG methylation and centromere substructure. Together, these data reveal the dynamic relationship between transcriptionally active retroelement subclasses and DNA methylation, as well as potential mechanisms for the derivation and evolution of new repeat families and composite elements. Focusing on the emerging T2T-level assembly of the HG002 X chromosome, we reveal that a high level of repeat variation likely exists across the human population, including composite element copy numbers that affect gene copy number. Additionally, we highlight the impact of repeats on the structural diversity of the genome, revealing repeat expansions with extreme copy number differences between humans and primates while also providing high-confidence annotations of retroelement transduction events. CONCLUSION The comprehensive repeat annotations and updated repeat models described herein serve as a resource for expanding the compendium of human genome sequences and reveal the impact of specific repeats on the human genome. In developing this resource, we provide a methodological framework for assessing repeat variation within and between human genomes. The exhaustive assessment of the transcriptional landscape of repeats, at both the genome scale and locally, such as within centromeres, sets the stage for functional studies to disentangle the role transcription plays in the mechanisms essential for genome stability and chromosome segregation. Finally, our work demonstrates the need to increase efforts toward achieving T2T-level assemblies for nonhuman primates and other species to fully understand the complexity and impact of repeat-derived genomic innovations that define primate lineages, including humans. Telomere-to-telomere assembly of CHM13 supports repeat annotations and discoveries. The human reference T2T-CHM13 filled gaps and corrected collapsed regions (triangles) in GRCh38. Combining long read–based methylation calls, PRO-seq, and multilevel computational methods, we provide a compendium of human repeats, define retroelement expression and methylation profiles, and delineate locus-specific sites of nascent transcription genome-wide, including previously inaccessible centromeres. SINE, short interspersed element; SVA, SINE–variable number tandem repeat– Alu ; LINE, long interspersed element; LTR, long terminal repeat; TSS, transcription start site; pA, xxxxxxxxxxxxxxxx.
more »
« less
Accurate allocation of multimapped reads enables regulatory element analysis at repeats
Transposable elements (TEs) and other repetitive regions have been shown to contain gene regulatory elements, including transcription factor binding sites. However, regulatory elements harbored by repeats have proven difficult to characterize using short-read sequencing assays such as ChIP-seq or ATAC-seq. Most regulatory genomics analysis pipelines discard “multimapped” reads that align equally well to multiple genomic locations. Because multimapped reads arise predominantly from repeats, current analysis pipelines fail to detect a substantial portion of regulatory events that occur in repetitive regions. To address this shortcoming, we developed Allo, a new approach to allocate multimapped reads in an efficient, accurate, and user-friendly manner. Allo combines probabilistic mapping of multimapped reads with a convolutional neural network that recognizes the read distribution features of potential peaks, offering enhanced accuracy in multimapping read assignment. Allo also provides read-level output in the form of a corrected alignment file, making it compatible with existing regulatory genomics analysis pipelines and downstream peak-finders. In a demonstration application on CTCF ChIP-seq data, we show that Allo results in the discovery of thousands of new CTCF peaks. Many of these peaks contain the expected cognate motif and/or serve as TAD boundaries. We additionally apply Allo to a diverse collection of ENCODE ChIP-seq data sets, resulting in multiple previously unidentified interactions between transcription factors and repetitive element families. Finally, we show that Allo may be particularly beneficial in identifying ChIP-seq peaks at centromeres, near segmentally duplicated genes, and in younger TEs, enabling new regulatory analyses in these regions.
more »
« less
- Award ID(s):
- 2045500
- PAR ID:
- 10574733
- Publisher / Repository:
- Cold Spring Harbor Laboratory Press
- Date Published:
- Journal Name:
- Genome Research
- Volume:
- 34
- Issue:
- 6
- ISSN:
- 1088-9051
- Page Range / eLocation ID:
- 937 to 951
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Abstract Background Recent development of bioinformatics tools for Next Generation Sequencing data has facilitated complex analyses and prompted large scale experimental designs for comparative genomics. When combined with the advances in network inference tools, this can lead to powerful methodologies for mining genomics data, allowing development of pipelines that stretch from sequence reads mapping to network inference. However, integrating various methods and tools available over different platforms requires a programmatic framework to fully exploit their analytic capabilities. Integrating multiple genomic analysis tools faces challenges from standardization of input and output formats, normalization of results for performing comparative analyses, to developing intuitive and easy to control scripts and interfaces for the genomic analysis pipeline. Results We describe here NetSeekR, a network analysis R package that includes the capacity to analyze time series of RNA-Seq data, to perform correlation and regulatory network inferences and to use network analysis methods to summarize the results of a comparative genomics study. The software pipeline includes alignment of reads, differential gene expression analysis, correlation network analysis, regulatory network analysis, gene ontology enrichment analysis and network visualization of differentially expressed genes. The implementation provides support for multiple RNA-Seq read mapping methods and allows comparative analysis of the results obtained by different bioinformatics methods. Conclusion Our methodology increases the level of integration of genomics data analysis tools to network inference, facilitating hypothesis building, functional analysis and genomics discovery from large scale NGS data. When combined with network analysis and simulation tools, the pipeline allows for developing systems biology methods using large scale genomics data.more » « less
-
Abstract The genomes of flowering plants consist largely of transposable elements (TEs), some of which modulate gene regulation and function. However, the repetitive nature of TEs and difficulty of mapping individual TEs by short-read-sequencing have hindered our understanding of their regulatory potential. We demonstrate that long-read chromatin fiber sequencing (Fiber-seq) comprehensively identifies accessible chromatin regions (ACRs) and CpG methylation across the maize genome. We uncover stereotypical ACR patterns at young TEs that degenerate with evolutionary age, resulting in TE-enhancers preferentially marked by a novel plant-specific epigenetic feature: simultaneous hyper-CpG methylation and chromatin accessibility. We show that TE ACRs are co-opted as gene promoters and that ACR-containing TEs can facilitate gene amplification. Lastly, we uncover a pervasive epigenetic signature – hypo-5mCpG methylation and diffuse chromatin accessibility – directing TEs to specific loci, including the loci that sparked McClintock’s discovery of TEs.more » « less
-
Abstract Mobile element insertions (MEIs) are repetitive genomic sequences that contribute to genetic variation and can lead to genetic disorders. Targeted and whole-genome approaches using short-read sequencing have been developed to identify reference and non-reference MEIs; however, the read length hampers detection of these elements in complex genomic regions. Here, we pair Cas9-targeted nanopore sequencing with computational methodologies to capture active MEIs in human genomes. We demonstrate parallel enrichment for distinct classes of MEIs, averaging 44% of reads on-targeted signals and exhibiting a 13.4-54x enrichment over whole-genome approaches. We show an individual flow cell can recover most MEIs (97% L1Hs, 93% Alu Yb, 51% Alu Ya, 99% SVA_F, and 65% SVA_E). We identify seventeen non-reference MEIs in GM12878 overlooked by modern, long-read analysis pipelines, primarily in repetitive genomic regions. This work introduces the utility of nanopore sequencing for MEI enrichment and lays the foundation for rapid discovery of elusive, repetitive genetic elements.more » « less
-
Genome-wide nucleosome profiles are predominantly characterized using MNase-seq, which involves extensive MNase digestion and size selection to enrich for mononucleosome-sized fragments. Most available MNase-seq analysis packages assume that nucleosomes uniformly protect 147 bp DNA fragments. However, some nucleosomes with atypical histone or chemical compositions protect shorter lengths of DNA. The rigid assumptions imposed by current nucleosome analysis packages potentially prevent investigators from understanding the regulatory roles played by atypical nucleosomes. To enable the characterization of different nucleosome types from MNase-seq data, we introduce the size-based expectation maximization (SEM) nucleosome-calling package. SEM employs a hierarchical Gaussian mixture model to estimate nucleosome positions and subtypes. Nucleosome subtypes are automatically identified based on the distribution of protected DNA fragments. Benchmark analysis indicates that SEM is on par with existing packages in terms of standard nucleosome-calling accuracy metrics, while uniquely providing the ability to characterize nucleosome subtype identities. Applying SEM to a low-dose MNase-H2B-ChIP-seq data set from mouse embryonic stem cells, we identified three nucleosome types: short-fragment nucleosomes, canonical nucleosomes, and di-nucleosomes. Short-fragment nucleosomes can be divided further into two subtypes based on their chromatin accessibility. Short-fragment nucleosomes in accessible regions exhibit high MNase sensitivity and are enriched at transcription start sites (TSSs) and CTCF peaks, similar to previously reported “fragile nucleosomes.” These SEM-defined accessible short-fragment nucleosomes are found not just in promoters but also in distal regulatory regions. Additional analyses reveal their colocalization with the chromatin remodelers CHD6, CHD8, and EP400. In summary, SEM provides an effective platform for exploration of nonstandard nucleosome subtypes.more » « less
An official website of the United States government

