skip to main content

This content will become publicly available on May 1, 2023

Title: Complete Sequence of a 641-kb Insertion of Mitochondrial DNA in the Arabidopsis thaliana Nuclear Genome
Abstract Intracellular transfers of mitochondrial DNA continue to shape nuclear genomes. Chromosome 2 of the model plant Arabidopsis thaliana contains one of the largest known nuclear insertions of mitochondrial DNA (numts). Estimated at over 600 kb in size, this numt is larger than the entire Arabidopsis mitochondrial genome. The primary Arabidopsis nuclear reference genome contains less than half of the numt because of its structural complexity and repetitiveness. Recent data sets generated with improved long-read sequencing technologies (PacBio HiFi) provide an opportunity to finally determine the accurate sequence and structure of this numt. We performed a de novo assembly using sequencing data from recent initiatives to span the Arabidopsis centromeres, producing a gap-free sequence of the Chromosome 2 numt, which is 641 kb in length and has 99.933% nucleotide sequence identity with the actual mitochondrial genome. The numt assembly is consistent with the repetitive structure previously predicted from fiber-based fluorescent in situ hybridization. Nanopore sequencing data indicate that the numt has high levels of cytosine methylation, helping to explain its biased spectrum of nucleotide sequence divergence and supporting previous inferences that it is transcriptionally inactive. The original numt insertion appears to have involved multiple mitochondrial DNA copies with alternative structures that subsequently more » underwent an additional duplication event within the nuclear genome. This work provides insights into numt evolution, addresses one of the last unresolved regions of the Arabidopsis reference genome, and represents a resource for distinguishing between highly similar numt and mitochondrial sequences in studies of transcription, epigenetic modifications, and de novo mutations. « less
; ; ; ; ;
Slotte, Tanja
Award ID(s):
Publication Date:
Journal Name:
Genome Biology and Evolution
Sponsoring Org:
National Science Foundation
More Like this
  1. INTRODUCTION Transposable elements (TEs), repeat expansions, and repeat-mediated structural rearrangements play key roles in chromosome structure and species evolution, contribute to human genetic variation, and substantially influence human health through copy number variants, structural variants, insertions, deletions, and alterations to gene transcription and splicing. Despite their formative role in genome stability, repetitive regions have been relegated to gaps and collapsed regions in human genome reference GRCh38 owing to the technological limitations during its development. The lack of linear sequence in these regions, particularly in centromeres, resulted in the inability to fully explore the repeat content of the human genome in the context of both local and regional chromosomal environments. RATIONALE Long-read sequencing supported the complete, telomere-to-telomere (T2T) assembly of the pseudo-haploid human cell line CHM13. This resource affords a genome-scale assessment of all human repetitive sequences, including TEs and previously unknown repeats and satellites, both within and outside of gaps and collapsed regions. Additionally, a complete genome enables the opportunity to explore the epigenetic and transcriptional profiles of these elements that are fundamental to our understanding of chromosome structure, function, and evolution. Comparative analyses reveal modes of repeat divergence, evolution, and expansion or contraction with locus-level resolution. RESULTS We implementedmore »a comprehensive repeat annotation workflow using previously known human repeats and de novo repeat modeling followed by manual curation, including assessing overlaps with gene annotations, segmental duplications, tandem repeats, and annotated repeats. Using this method, we developed an updated catalog of human repetitive sequences and refined previous repeat annotations. We discovered 43 previously unknown repeats and repeat variants and characterized 19 complex, composite repetitive structures, which often carry genes, across T2T-CHM13. Using precision nuclear run-on sequencing (PRO-seq) and CpG methylated sites generated from Oxford Nanopore Technologies long-read sequencing data, we assessed RNA polymerase engagement across retroelements genome-wide, revealing correlations between nascent transcription, sequence divergence, CpG density, and methylation. These analyses were extended to evaluate RNA polymerase occupancy for all repeats, including high-density satellite repeats that reside in previously inaccessible centromeric regions of all human chromosomes. Moreover, using both mapping-dependent and mapping-independent approaches across early developmental stages and a complete cell cycle time series, we found that engaged RNA polymerase across satellites is low; in contrast, TE transcription is abundant and serves as a boundary for changes in CpG methylation and centromere substructure. Together, these data reveal the dynamic relationship between transcriptionally active retroelement subclasses and DNA methylation, as well as potential mechanisms for the derivation and evolution of new repeat families and composite elements. Focusing on the emerging T2T-level assembly of the HG002 X chromosome, we reveal that a high level of repeat variation likely exists across the human population, including composite element copy numbers that affect gene copy number. Additionally, we highlight the impact of repeats on the structural diversity of the genome, revealing repeat expansions with extreme copy number differences between humans and primates while also providing high-confidence annotations of retroelement transduction events. CONCLUSION The comprehensive repeat annotations and updated repeat models described herein serve as a resource for expanding the compendium of human genome sequences and reveal the impact of specific repeats on the human genome. In developing this resource, we provide a methodological framework for assessing repeat variation within and between human genomes. The exhaustive assessment of the transcriptional landscape of repeats, at both the genome scale and locally, such as within centromeres, sets the stage for functional studies to disentangle the role transcription plays in the mechanisms essential for genome stability and chromosome segregation. Finally, our work demonstrates the need to increase efforts toward achieving T2T-level assemblies for nonhuman primates and other species to fully understand the complexity and impact of repeat-derived genomic innovations that define primate lineages, including humans. Telomere-to-telomere assembly of CHM13 supports repeat annotations and discoveries. The human reference T2T-CHM13 filled gaps and corrected collapsed regions (triangles) in GRCh38. Combining long read–based methylation calls, PRO-seq, and multilevel computational methods, we provide a compendium of human repeats, define retroelement expression and methylation profiles, and delineate locus-specific sites of nascent transcription genome-wide, including previously inaccessible centromeres. SINE, short interspersed element; SVA, SINE–variable number tandem repeat– Alu ; LINE, long interspersed element; LTR, long terminal repeat; TSS, transcription start site; pA, xxxxxxxxxxxxxxxx.« less
  2. Background

    Metagenomics has transformed our understanding of microbial diversity across ecosystems, with recent advances enablingde novoassembly of genomes from metagenomes. These metagenome-assembled genomes are critical to provide ecological, evolutionary, and metabolic context for all the microbes and viruses yet to be cultivated. Metagenomes can now be generated from nanogram to subnanogram amounts of DNA. However, these libraries require several rounds of PCR amplification before sequencing, and recent data suggest these typically yield smaller and more fragmented assemblies than regular metagenomes.


    Here we evaluatede novoassembly methods of 169 PCR-amplified metagenomes, including 25 for which an unamplified counterpart is available, to optimize specific assembly approaches for PCR-amplified libraries. We first evaluated coverage bias by mapping reads from PCR-amplified metagenomes onto reference contigs obtained from unamplified metagenomes of the same samples. Then, we compared different assembly pipelines in terms of assembly size (number of bp in contigs ≥ 10 kb) and error rates to evaluate which are the best suited for PCR-amplified metagenomes.


    Read mapping analyses revealed that the depth of coverage within individual genomes is significantly more uneven in PCR-amplified datasets versus unamplified metagenomes, with regions of high depth of coverage enriched in short inserts. This enrichment scales with the number of PCRmore »cycles performed, and is presumably due to preferential amplification of short inserts. Standard assembly pipelines are confounded by this type of coverage unevenness, so we evaluated other assembly options to mitigate these issues. We found that a pipeline combining read deduplication and an assembly algorithm originally designed to recover genomes from libraries generated after whole genome amplification (single-cell SPAdes) frequently improved assembly of contigs ≥10 kb by 10 to 100-fold for low input metagenomes.


    PCR-amplified metagenomes have enabled scientists to explore communities traditionally challenging to describe, including some with extremely low biomass or from which DNA is particularly difficult to extract. Here we show that a modified assembly pipeline can lead to an improvedde novogenome assembly from PCR-amplified datasets, and enables a better genome recovery from low input metagenomes.

    « less
  3. The blue crab, Callinectes sapidus (Rathbun, 1896) is an economically, culturally, and ecologically important species found across the temperate and tropical North and South American Atlantic coast. A reference genome will enable research for this high-value species. Initial assembly combined 200× coverage Illumina paired-end reads, a 60× 8 kb mate-paired library, and 50× PacBio data using the MaSuRCA assembler resulting in a 985 Mb assembly with a scaffold N50 of 153 kb. Dovetail Chicago and HiC sequencing with the 3d DNA assembler and Juicebox assembly tools were then used for chromosome scaffolding. The 50 largest scaffolds span 810 Mb are 1.5–37 Mb long and have a repeat content of 36%. The 190 Mb unplaced sequence is in 3921 sequences over 10 kb with a repeat content of 68%. The final assembly N50 is 18.9 Mb for scaffolds and 9317 bases for contigs. Of arthropod BUSCO, ∼88% (888/1013) were complete and single copies. Using 309 million RNAseq read pairs from 12 different tissues and developmental stages, 25,249 protein-coding genes were predicted. Between C. sapidus and Portunus trituberculatus genomes, 41 of 50 large scaffolds had high nucleotide identity and protein-coding synteny, but 9 scaffolds in both assemblies were not clear matches. The protein-coding genes included 9423 one-to-one putative orthologs, ofmore »which 7165 were syntenic between the two crab species. Overall, the two crab genome assemblies show strong similarities at the nucleotide, protein, and chromosome level and verify the blue crab genome as an excellent reference for this important seafood species.« less
  4. Sekelsky, J (Ed.)
    Abstract Although plant mitochondrial genomes typically show low rates of sequence evolution, levels of divergence in certain angiosperm lineages suggest anomalously high mitochondrial mutation rates. However, de novo mutations have never been directly analyzed in such lineages. Recent advances in high-fidelity DNA sequencing technologies have enabled detection of mitochondrial mutations when still present at low heteroplasmic frequencies. To date, these approaches have only been performed on a single plant species (Arabidopsis thaliana). Here, we apply a high-fidelity technique (Duplex Sequencing) to multiple angiosperms from the genus Silene, which exhibits extreme heterogeneity in rates of mitochondrial sequence evolution among close relatives. Consistent with phylogenetic evidence, we found that Silene latifolia maintains low mitochondrial variant frequencies that are comparable with previous measurements in Arabidopsis. Silene noctiflora also exhibited low variant frequencies despite high levels of historical sequence divergence, which supports other lines of evidence that this species has reverted to lower mitochondrial mutation rates after a past episode of acceleration. In contrast, S. conica showed much higher variant frequencies in mitochondrial (but not in plastid) DNA, consistent with an ongoing bout of elevated mitochondrial mutation rates. Moreover, we found an altered mutational spectrum in S. conica heavily biased towards AT→GC transitions. Wemore »also observed an unusually low number of mitochondrial genome copies per cell in S. conica, potentially pointing to reduced opportunities for homologous recombination to accurately repair mismatches in this species. Overall, these results suggest that historical fluctuations in mutation rates are driving extreme variation in rates of plant mitochondrial sequence evolution.« less
  5. Abstract Background

    The Aldabra giant tortoise (Aldabrachelys gigantea) is one of only two giant tortoise species left in the world. The species is endemic to Aldabra Atoll in Seychelles and is listed as Vulnerable on the International Union for Conservation of Nature Red List (v2.3) due to its limited distribution and threats posed by climate change. Genomic resources for A. gigantea are lacking, hampering conservation efforts for both wild and ex situpopulations. A high-quality genome would also open avenues to investigate the genetic basis of the species’ exceptionally long life span.


    We produced the first chromosome-level de novo genome assembly of A. gigantea using PacBio High-Fidelity sequencing and high-throughput chromosome conformation capture. We produced a 2.37-Gbp assembly with a scaffold N50 of 148.6 Mbp and a resolution into 26 chromosomes. RNA sequencing–assisted gene model prediction identified 23,953 protein-coding genes and 1.1 Gbp of repetitive sequences. Synteny analyses among turtle genomes revealed high levels of chromosomal collinearity even among distantly related taxa. To assess the utility of the high-quality assembly for species conservation, we performed a low-coverage resequencing of 30 individuals from wild populations and two zoo individuals. Our genome-wide population structure analyses detected genetic population structure in the wild and identifiedmore »the most likely origin of the zoo-housed individuals. We further identified putatively deleterious mutations to be monitored.


    We establish a high-quality chromosome-level reference genome for A. gigantea and one of the most complete turtle genomes available. We show that low-coverage whole-genome resequencing, for which alignment to the reference genome is a necessity, is a powerful tool to assess the population structure of the wild population and reveal the geographic origins of ex situ individuals relevant for genetic diversity management and rewilding efforts.

    « less