Summary The plastid genome (plastome), while surprisingly constant in gene order and content across most photosynthetic angiosperms, exhibits variability in several unrelated lineages. During the diversification history of the legume family Fabaceae, plastomes have undergone many rearrangements, including inversions, expansion, contraction and loss of the typical inverted repeat (IR), gene loss and repeat accumulation in both shared and independent events. While legume plastomes have been the subject of study for some time, most work has focused on agricultural species in the IR‐lacking clade (IRLC) and the plant model Medicago truncatula. The subfamily Papilionoideae, which contains virtually all of the agricultural legume species, also comprises most of the plastome variation detected thus far in the family. In this study, three non‐papilionoids were included among 34 newly sequenced legume plastomes, along with 33 publicly available sequences, to assess plastome structural evolution in the subfamily. In an effort to examine plastome variation across the subfamily, approximately 20% of the sampling represents the IRLC, with the remainder selected to represent the early‐branching papilionoid clades. A number of IR‐related and repeat‐mediated changes were identified and examined in a phylogenetic context. Recombination between direct repeats associated with ycf2 resulted in intraindividual plastome heteroplasmy. Although loss of the IR has not been reported in legumes outside of the IRLC, one genistoid taxon was found to completely lack the typical plastome IR. The role of the IR and non‐IR repeats in the progression of plastome change is discussed.
Plastid Genome Assembly Using Long‐read data
Abstract Although plastid genome (plastome) structure is highly conserved across most seed plants, investigations during the past two decades have revealed several disparately related lineages that experienced substantial rearrangements. Most plastomes contain a large inverted repeat, two single‐copy regions, and a few dispersed repeats; however, the plastomes of some taxa harbour long repeat sequences (>300 bp). These long repeats make it challenging to assemble complete plastomes using short‐read data, leading to misassemblies and consensus sequences with spurious rearrangements. Single‐molecule, long‐read sequencing has the potential to overcome these challenges, yet there is no consensus on the most effective method for accurately assembling plastomes using long‐read data. We generated a pipeline, plastid Genome Assembly Using Long‐read data (ptGAUL), to address the problem of plastome assembly using long‐read data from Oxford Nanopore Technologies (ONT) or Pacific Biosciences platforms. We demonstrated the efficacy of the ptGAUL pipeline using 16 published long‐read data sets. We showed that ptGAUL quickly produces accurate and unbiased assemblies using only ~50× coverage of plastome data. Additionally, we deployed ptGAUL to assemble four new Juncus (Juncaceae) plastomes using ONT long reads. Our results revealed many long repeats and rearrangements in Juncus plastomes compared with basal lineages of Poales. The ptGAUL pipeline is available on GitHub: https://github.com/Bean061/ptgaul.
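To make the ~50× coverage figure concrete, the following is a minimal sketch of how one might estimate the sequence data needed for a target depth and randomly subsample reads to reach it. It is not taken from ptGAUL; the function names, the 160 kb plastome size, and the fixed read lengths are illustrative assumptions.

```python
import random


def bases_for_coverage(genome_size_bp, target_coverage):
    """Total read bases needed so that mean depth equals target_coverage."""
    return genome_size_bp * target_coverage


def subsample_reads(read_lengths, genome_size_bp, target_coverage, seed=0):
    """Pick reads at random until their summed length reaches the target.

    Returns (indices of chosen reads, total bases chosen). If the input
    holds too little data, every read is returned.
    """
    needed = bases_for_coverage(genome_size_bp, target_coverage)
    order = list(range(len(read_lengths)))
    random.Random(seed).shuffle(order)  # reproducible random order
    chosen, total = [], 0
    for i in order:
        if total >= needed:
            break
        chosen.append(i)
        total += read_lengths[i]
    return chosen, total


if __name__ == "__main__":
    # Assumed example: a 160 kb plastome and 2,000 reads of 10 kb each.
    plastome_size = 160_000
    reads = [10_000] * 2000
    picked, total = subsample_reads(reads, plastome_size, 50)
    print(len(picked), "reads,", total, "bases,",
          total / plastome_size, "x mean depth")
```

Under these assumptions, 50× depth of a 160 kb plastome corresponds to 8 Mb of plastid-derived read data, which a single long-read run typically exceeds many times over.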
- Award ID(s): 2034929
- PAR ID: 10420741
- Publisher / Repository: Wiley-Blackwell
- Date Published:
- Journal Name: Molecular Ecology Resources
- Volume: 23
- Issue: 6
- ISSN: 1755-098X
- Format(s): Medium: X Size: p. 1442-1457
- Size(s): p. 1442-1457
- Sponsoring Org: National Science Foundation
More Like this
-
Tribble, C (Ed.) Abstract The majority of sequenced genomes in the monocots are from species belonging to Poaceae, which include many commercially important crops. Here, we expand the number of sequenced genomes from the monocots to include the genomes of 4 related cyperids: Carex cristatella and Carex scoparia from Cyperaceae and Juncus effusus and Juncus inflexus from Juncaceae. The high-quality, chromosome-scale genome sequences from these 4 cyperids were assembled by combining whole-genome shotgun sequencing of Nanopore long reads, Illumina short reads, and Hi-C sequencing data. Some members of the Cyperaceae and Juncaceae are known to possess holocentric chromosomes. We examined the repeat landscapes in our sequenced genomes to search for potential repeats associated with centromeres. Several large satellite repeat families, comprising 3.2–9.5% of our sequenced genomes, showed dispersed distribution of large satellite repeat clusters across all Carex chromosomes, with few instances of these repeats clustering in the same chromosomal regions. In contrast, most large Juncus satellite repeats were clustered in a single location on each chromosome, with sporadic instances of large satellite repeats throughout the Juncus genomes. Recognizable transposable elements account for about 20% of each of the 4 genome assemblies, with the Carex genomes containing more DNA transposons than retrotransposons while the converse is true for the Juncus genomes. These genome sequences and annotations will facilitate better comparative analysis within monocots.
-
Telomeres consist of a highly conserved, simple tandem telomeric repeat motif (TRM): (TTAGG)n in arthropods, (TTAGGG)n in vertebrates, and (TTTAGGG)n in most plants. TRMs can be detected from chromosome-level assemblies, which typically require long-read sequencing data. To take advantage of short-read data, we developed an ultra-fast Telomeric Repeats Identification Pipeline and evaluated its performance on 91 species. With proven accuracy, we applied the Telomeric Repeats Identification Pipeline to 129 insect species, using 7 Tbp of short-read sequences. We confirmed (TTAGG)n as the TRM in 19 orders, suggesting it is the ancestral form in insects. Systematic profiling in Hymenopterans revealed a diverse range of TRMs, including the canonical 5-bp TTAGG (bees, ants, and basal sawflies), three independent losses of tandem repeat form TRM (Ichneumonoids, hunting wasps, and gall-forming wasps), and most interestingly, a common 8-bp (TTATTGGG)n in Chalcid wasps with two 9-bp variants in the miniature wasp (TTACTTGGG) and fig wasps (TTATTGGGG). Our results identified extraordinary evolutionary fluidity of Hymenopteran TRMs, and rapid evolution of TRM and repeat abundance at all evolutionary scales, providing novel insights into telomere evolution.
-
Abstract Metagenomic read classification is a fundamental task in computational biology, yet it remains challenging due to the scale, diversity, and complexity of sequencing datasets. We propose a novel, run-length compressed index based on the move structure that enables efficient multi-class metagenomic classification in O(r) space, where r is the number of character runs in the BWT of the reference text. Our method identifies all super-maximal exact matches (SMEMs) of length at least L between a read and the reference dataset and associates each SMEM with one class identifier using a sampled tag array. A consensus algorithm then compacts these SMEMs with their class identifier into a single classification per read. We are the first to perform run-length compressed read classification based on full SMEMs instead of semi-SMEMs. We evaluate our approach on both long and short reads in two conceptually distinct datasets: a large bacterial pan-genome with few metagenomic classes and a smaller 16S rRNA gene database spanning thousands of genera or classes. Our method consistently outperforms SPUMONI 2 in accuracy and runtime while maintaining the same asymptotic memory complexity of O(r). Compared to Cliffy, we demonstrate better memory efficiency while achieving superior accuracy on the simpler dataset and comparable performance on the more complex one. Overall, our implementation carefully balances accuracy, runtime, and memory usage, offering a versatile solution for metagenomic classification across diverse datasets. The open-source C++11 implementation is available at https://github.com/biointec/tagger under the AGPL-3.0 license.
-
High-throughput short-read sequencing has taken on a central role in research and diagnostics. Hundreds of different assays take advantage of Illumina short-read sequencers, the predominant short-read sequencing technology available today. Although other short-read sequencing technologies exist, the ubiquity of Illumina sequencers in sequencing core facilities and the high capital costs of these technologies have limited their adoption. Among a new generation of sequencing technologies, Oxford Nanopore Technologies (ONT) holds a unique position because the ONT MinION, an error-prone long-read sequencer, is associated with little to no capital cost. Here we show that we can make short-read Illumina libraries compatible with the ONT MinION by using the rolling circle to concatemeric consensus (R2C2) method to circularize and amplify the short library molecules. This results in longer DNA molecules containing tandem repeats of the original short library molecules. This longer DNA is ideally suited for the ONT MinION, and after sequencing, the tandem repeats in the resulting raw reads can be converted into high-accuracy consensus reads with similar error rates to that of the Illumina MiSeq. We highlight this capability by producing and benchmarking RNA-seq, ChIP-seq, and regular and target-enriched Tn5 libraries. We also explore the use of this approach for rapid evaluation of sequencing library metrics by implementing a real-time analysis workflow.
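One of the related abstracts above measures index size as O(r), where r is the number of character runs in the BWT of the reference text. As a toy illustration of that quantity only, the sketch below builds the BWT by sorting all rotations and counts its runs. This naive construction is an assumption chosen for clarity; it is not the move-structure index the abstract describes and does not scale to real references.

```python
from itertools import groupby


def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (toy-sized inputs only)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s):
    """Number of maximal runs of identical characters (the r in O(r))."""
    return sum(1 for _ in groupby(s))


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed, count_runs(transformed))
```

Repetitive references, such as pan-genomes of closely related strains, tend to yield a BWT with few long runs, which is why space proportional to r can be far smaller than space proportional to the text length.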
