Abstract Metagenomic read classification is a fundamental task in computational biology, yet it remains challenging due to the scale, diversity, and complexity of sequencing datasets. We propose a novel, run-length compressed index based on the move structure that enables efficient multi-class metagenomic classification in O(r) space, where r is the number of character runs in the BWT of the reference text. Our method identifies all super-maximal exact matches (SMEMs) of length at least L between a read and the reference dataset and associates each SMEM with one class identifier using a sampled tag array. A consensus algorithm then compacts these SMEMs with their class identifiers into a single classification per read. We are the first to perform run-length compressed read classification based on full SMEMs instead of semi-SMEMs. We evaluate our approach on both long and short reads in two conceptually distinct datasets: a large bacterial pan-genome with few metagenomic classes and a smaller 16S rRNA gene database spanning thousands of genera or classes. Our method consistently outperforms SPUMONI 2 in accuracy and runtime while maintaining the same asymptotic memory complexity of O(r). Compared to Cliffy, we demonstrate better memory efficiency while achieving superior accuracy on the simpler dataset and comparable performance on the more complex one. Overall, our implementation carefully balances accuracy, runtime, and memory usage, offering a versatile solution for metagenomic classification across diverse datasets. The open-source C++11 implementation is available at https://github.com/biointec/tagger under the AGPL-3.0 license.
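To make the consensus step concrete, here is a minimal sketch of one plausible compaction rule: every SMEM of length at least L contributes its class identifier weighted by its match length, and the read receives the class with the largest total weight. The TaggedSMEM struct and consensusClass function are illustrative assumptions, not the tagger implementation, and ties are resolved arbitrarily.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Hypothetical SMEM record: a half-open match interval on the read plus
    // the class identifier looked up in the sampled tag array.
    struct TaggedSMEM {
        uint32_t begin, end; // [begin, end) on the read
        uint32_t classID;    // class tag associated with the SMEM
    };

    // Length-weighted majority vote over all SMEMs of length >= L.
    // Returns -1 when no SMEM passes the filter (read left unclassified).
    int64_t consensusClass(const std::vector<TaggedSMEM>& smems, uint32_t L) {
        std::unordered_map<uint32_t, uint64_t> weight;
        for (const TaggedSMEM& s : smems) {
            uint32_t len = s.end - s.begin;
            if (len >= L)
                weight[s.classID] += len;
        }
        int64_t best = -1;
        uint64_t bestWeight = 0;
        for (const auto& kv : weight) {
            if (kv.second > bestWeight) {
                bestWeight = kv.second;
                best = static_cast<int64_t>(kv.first);
            }
        }
        return best;
    }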
MetaCC allows scalable and integrative analyses of both long-read and short-read metagenomic Hi-C data
Abstract Metagenomic Hi-C (metaHi-C) can identify contig-to-contig relationships with respect to their proximity within the same physical cell. Shotgun libraries in metaHi-C experiments can be constructed by next-generation sequencing (short-read metaHi-C) or more recent third-generation sequencing (long-read metaHi-C). However, all existing metaHi-C analysis methods are developed and benchmarked on short-read metaHi-C datasets, and there is much room for improvement in terms of more scalable and stable analyses, especially for long-read metaHi-C data. Here we report MetaCC, an efficient and integrative framework for analyzing both short-read and long-read metaHi-C datasets. MetaCC outperforms existing methods on normalization and binning. In particular, the MetaCC normalization module, named NormCC, is more than 3000 times faster than the current state-of-the-art method HiCzin on a complex wastewater dataset. When applied to one sheep gut long-read metaHi-C dataset, the MetaCC binning module can retrieve 709 high-quality genomes with the largest species diversity using a single sample, including an expansion of five uncultured members from the order Erysipelotrichales, and is the only binner that can recover the genome of one important species, Bacteroides vulgatus. Further plasmid analyses reveal that MetaCC binning is able to capture multi-copy plasmids.
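For intuition about what metaHi-C normalization must correct, the sketch below divides a raw contig-to-contig Hi-C contact count by a simple per-contig bias term built from contig length and shotgun coverage. This is a generic, illustrative bias model under assumed names (Contig, normalizedContacts), not NormCC's actual statistical method.

    #include <cmath>

    // Hypothetical contig record.
    struct Contig {
        double length;   // contig length in bp
        double coverage; // shotgun coverage depth
    };

    // Generic explicit normalization: divide the raw contact count between
    // two contigs by the geometric mean of their per-contig bias terms.
    // A multiplicative length-times-coverage bias is assumed here purely
    // for illustration; real models fit these effects from the data.
    double normalizedContacts(double rawContacts,
                              const Contig& a, const Contig& b) {
        double biasA = a.length * a.coverage;
        double biasB = b.length * b.coverage;
        return rawContacts / std::sqrt(biasA * biasB);
    }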
- Award ID(s): 2125142
- PAR ID: 10478145
- Publisher / Repository: Nature Communications
- Date Published:
- Journal Name: Nature Communications
- Volume: 14
- Issue: 1
- ISSN: 2041-1723
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Fiston-Lavier, Anna-Sophie (Ed.) Abstract Motivation: Haplotagging is a method for linked-read sequencing, which leverages the cost-effectiveness and throughput of short-read sequencing while retaining part of the long-range haplotype information captured by long-read sequencing. Despite its utility and advantages over similar methods, existing linked-read analytical pipelines are incompatible with haplotagging data. Results: We describe Harpy, a modular and user-friendly software pipeline for processing all stages of haplotagged linked-read data, from raw sequence data to phased genotypes and structural variant detection. Availability and implementation: https://github.com/pdimens/harpy
- Abstract Sample preservation often impedes efforts to generate high-quality reference genomes or pangenomes for Earth's more than 2 million plant and animal species due to nucleotide degradation. Here we compare the impacts of storage methods, including solution type, temperature, and time, on DNA quality and Oxford Nanopore long-read sequencing quality in 9 fish and 4 plant species. We show 95% ethanol largely protects against degradation for fish blood (22 °C, ≤6 weeks) and plant tissue (4 °C, ≤3 weeks). From this furthest storage timepoint, we assemble high-quality reference genomes of 3 fish and 2 plant species with contiguity (contig N50) and completeness (BUSCO) that achieve the Vertebrate Genome Project benchmarking standards. For epigenetic applications, we also report methylation frequency compared to a liquid nitrogen control. The results presented here remove the necessity for cryogenic storage in many long-read applications and provide a framework for future studies focused on sampling in remote locations, which may represent a large portion of the future sequencing of novel organisms.
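Contig N50, the contiguity metric cited above, is the length of the shortest contig in the set of largest contigs that together span at least half of the total assembly size. A minimal sketch of the computation (the function name is illustrative):

    #include <algorithm>
    #include <cstdint>
    #include <functional>
    #include <vector>

    // Contig N50: sort lengths in descending order and walk the cumulative
    // sum; N50 is the length at which the sum first reaches half the total.
    uint64_t contigN50(std::vector<uint64_t> lengths) {
        std::sort(lengths.begin(), lengths.end(), std::greater<uint64_t>());
        uint64_t total = 0;
        for (uint64_t len : lengths) total += len;
        uint64_t cumulative = 0;
        for (uint64_t len : lengths) {
            cumulative += len;
            if (2 * cumulative >= total)
                return len;
        }
        return 0; // empty assembly
    }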
- Abstract Many applications in molecular ecology require the ability to match specific DNA sequences from single- or mixed-species samples with a diagnostic reference library. Widely used methods for DNA barcoding and metabarcoding employ PCR and amplicon sequencing to identify taxa based on target sequences, but the target-specific enrichment capabilities of CRISPR-Cas systems may offer advantages in some applications. We identified 54,837 CRISPR-Cas guide RNAs that may be useful for enriching chloroplast DNA across phylogenetically diverse plant species. We tested a subset of 17 guide RNAs in vitro to enrich plant DNA strands ranging in size from diagnostic DNA barcodes of 1,428 bp to entire chloroplast genomes of 121,284 bp. We used an Oxford Nanopore sequencer to evaluate sequencing success based on both single- and mixed-species samples, which yielded mean chloroplast sequence lengths of 2,530–11,367 bp, depending on the experiment. In comparison to mixed-species experiments, single-species experiments yielded more on-target sequence reads and greater mean pairwise identity between contigs and the plant species' reference genomes. Nevertheless, these mixed-species experiments yielded sufficient data to provide a ≥48-fold increase in sequence length and better estimates of relative abundance for a commercially prepared mixture of plant species compared to DNA metabarcoding based on the chloroplast trnL-P6 marker. Prior work developed CRISPR-based enrichment protocols for long-read sequencing, and our experiments pioneer their use for plant DNA barcoding and chloroplast assemblies, which may have advantages over workflows that require PCR and short-read sequencing. Future work would benefit from continuing to develop in vitro and in silico methods for CRISPR-based analyses of mixed-species samples, especially when the appropriate reference genomes for contig assembly cannot be known a priori.
- Abstract Background: Third-generation single molecule sequencing technologies can sequence long reads, which is advancing the frontiers of genomics research. However, their high error rates prohibit accurate and efficient downstream analysis. This difficulty has motivated the development of many long read error correction tools, which tackle this problem through sampling redundancy and/or leveraging accurate short reads of the same biological samples. Existing studies to assess these tools use simulated data sets, and are not sufficiently comprehensive in the range of software covered or diversity of evaluation measures used. Results: In this paper, we present a categorization and review of long read error correction methods, and provide a comprehensive evaluation of the corresponding long read error correction tools. Leveraging recent real sequencing data, we establish benchmark data sets and set up evaluation criteria for a comparative assessment which includes quality of error correction as well as run-time and memory usage. We study how trimming and long read sequencing depth affect error correction in terms of length distribution and genome coverage post-correction, and the impact of error correction performance on an important application of long reads, genome assembly. We provide guidelines for practitioners for choosing among the available error correction tools and identify directions for future research. Conclusions: Despite the high error rate of long reads, the state-of-the-art correction tools can achieve high correction quality. When short reads are available, the best hybrid methods outperform non-hybrid methods in terms of correction quality and computing resource usage. When choosing tools for use, practitioners are advised to be careful with a few correction tools that discard reads, and to check the effect of error correction tools on downstream analysis. Our evaluation code is available as open-source at https://github.com/haowenz/LRECE .
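Sequencing depth, one of the factors the study varies, reduces to a simple ratio: total read bases divided by genome length. A minimal sketch with an assumed function name:

    #include <cstdint>
    #include <vector>

    // Mean sequencing depth: total read bases divided by genome length.
    // For example, 10 Gbp of reads over a 1 Gbp genome gives 10x depth.
    double meanDepth(const std::vector<uint64_t>& readLengths,
                     uint64_t genomeLength) {
        uint64_t totalBases = 0;
        for (uint64_t len : readLengths) totalBases += len;
        return static_cast<double>(totalBases)
             / static_cast<double>(genomeLength);
    }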