-
Abstract Motivation: Polymerase chain reaction (PCR) enables rapid, cost-effective diagnostics but requires prior identification of genomic regions that allow sensitive and specific detection of target microbial groups, herein referred to as microbial signature sequences. We introduce Seqwin, an open-source framework designed to automate microbial genome signature discovery. Tens of thousands of microbial genomes are now available for a single species, limiting the application of existing manual and automated approaches for identifying signatures. Modern approaches that are capable of leveraging all available microbial genomes will ensure sensitive and accurate DNA signature identification and enable robust pathogen detection for clinical, environmental, and public health applications. Results: Seqwin builds weighted pan-genome minimizer graphs and uses a traversal algorithm to identify signature sequences that occur frequently in target genomes but remain rare in non-targets. Unlike earlier tools that depend on strict presence or absence of sequences, Seqwin accommodates natural sequence variation and scales to very large genome collections. When applied to genomes from C. difficile, M. tuberculosis, and S. enterica, Seqwin recovered more high-quality signatures than alternative methods with lower computational burden. Seqwin’s analysis of nearly 15,000 S. enterica genomes yielded over 200 candidate signatures in 5 minutes. Seqwin provides an open-source solution for the long-standing need for scalable microbial signature discovery and diagnostic assay design. Availability and Implementation: Seqwin is freely available for academic use (https://github.com/treangenlab/Seqwin) and can be installed via Bioconda. Benchmarking datasets, outputs, and scripts are available on Zenodo (https://doi.org/10.5281/zenodo.19176444). Contact: treangen@rice.edu, xw66@rice.edu. Supplementary Materials: Provided as separate PDF and data files.
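The core idea of a signature (common in targets, rare in non-targets, with some tolerance for variation) can be illustrated with a minimal sketch. This is not Seqwin's algorithm: it uses plain k-mers instead of a weighted pan-genome minimizer graph, and the function names and the two thresholds (`min_target_frac`, `max_non_target_frac`) are assumptions chosen for illustration only.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_signatures(targets, non_targets, k,
                         min_target_frac=0.9, max_non_target_frac=0.1):
    """Score k-mers by the fraction of target and non-target genomes containing them.

    Hypothetical helper: keeps k-mers present in at least `min_target_frac`
    of targets and at most `max_non_target_frac` of non-targets, so a k-mer
    does not need to be strictly present in every target genome.
    """
    target_sets = [kmers(g, k) for g in targets]
    non_target_sets = [kmers(g, k) for g in non_targets]
    candidates = set().union(*target_sets)

    kept = {}
    for kmer in candidates:
        t_frac = sum(kmer in s for s in target_sets) / len(target_sets)
        if non_target_sets:
            n_frac = sum(kmer in s for s in non_target_sets) / len(non_target_sets)
        else:
            n_frac = 0.0
        if t_frac >= min_target_frac and n_frac <= max_non_target_frac:
            kept[kmer] = (t_frac, n_frac)
    return kept
```

Lowering `min_target_frac` below 1.0 is what lets a candidate survive when a few target genomes carry a variant, which is the contrast with strict presence/absence approaches drawn in the abstract.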
-
ABSTRACT Motivation: Strain-level microbiome profiling has revealed key insights into microbial community composition and strain dynamics. However, accurate strain-level analysis remains challenging due to limited linkage information, ambiguous read mapping, and complicating factors such as genome similarity, sequencing depth, and community complexity. These challenges are especially pronounced for short-read metagenomic data when estimating the relative abundances of multiple strains, a task critical for genotype-phenotype association studies. Results: To address this gap, we present Strainify, which enables accurate strain-level abundance estimation from short-read metagenomes with as little as 1% genome coverage. Specifically, Strainify combines (1) identification of informative variants via core genome alignment, (2) filtering of confounding variants via a window-based test, and (3) maximum likelihood estimation of strain abundances. A Shannon entropy-weighted version of the model further improves robustness in noisy, low-coverage settings by downweighting sites with low information content. Across simulated communities of varying complexity, Strainify consistently outperformed existing approaches. On mock community sequencing data, Strainify’s estimates aligned more closely with reference abundances. When applied to a longitudinal gut microbiome dataset, Strainify successfully recapitulated the reported temporal dynamics of Bacteroides ovatus strain groups, demonstrating its ability to recover biologically meaningful patterns from real-world metagenomes. Together, these results establish Strainify as a robust and versatile solution for accurate strain-level abundance estimation in short-read, low-coverage microbiome studies. Availability: The Strainify code and results are available at: https://github.com/treangenlab/Strainify
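Step (3) above, maximum likelihood estimation of strain abundances, can be sketched as follows. This is a generic expectation-maximization fit of mixture proportions, not Strainify's implementation: the biallelic genotype matrix, the fixed per-base `error` rate, and the function name are assumptions, and the variant filtering and Shannon entropy weighting described in the abstract are omitted.

```python
def estimate_abundances(genotypes, counts, error=0.01, iters=200):
    """Estimate strain proportions from allele counts at variant sites.

    genotypes[s][j] is the allele (0 = reference, 1 = alternate) that
    strain s carries at site j; counts[j] is (ref_reads, alt_reads).
    Each read is modelled as coming from strain s with probability a[s]
    and showing that strain's allele with probability 1 - error.
    """
    n = len(genotypes)
    a = [1.0 / n] * n                       # start from uniform abundances
    total_reads = sum(ref + alt for ref, alt in counts)

    for _ in range(iters):
        expected = [0.0] * n                # expected reads assigned to each strain
        for j, (ref, alt) in enumerate(counts):
            for allele, n_reads in ((0, ref), (1, alt)):
                if n_reads == 0:
                    continue
                lik = [a[s] * ((1 - error) if genotypes[s][j] == allele else error)
                       for s in range(n)]
                norm = sum(lik)
                for s in range(n):
                    expected[s] += n_reads * lik[s] / norm
        a = [x / total_reads for x in expected]
    return a
```

With two strains that differ at two sites and reads split roughly 70/30, the estimate settles near 0.7 and 0.3; sites where all strains share an allele contribute no information about the proportions.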
-
Schwartz, Russell (Ed.) Abstract Motivation: Since 2016, the number of microbial species with available reference genomes in NCBI has more than tripled. Multiple genome alignment, the process of identifying nucleotides across multiple genomes that share a common ancestor, is used as the input to numerous downstream comparative analysis methods. Parsnp is one of the few multiple genome alignment methods able to scale to the current era of genomic data; however, there has been no major release since its initial release in 2014. Results: To address this gap, we developed Parsnp v2, which significantly improves on its original release. Parsnp v2 provides users with more control over executions of the program, allowing Parsnp to be better tailored for different use-cases. We introduce a partitioning option to Parsnp, which allows the input to be broken up into multiple parallel alignment processes which are then combined into a final alignment. The partitioning option can reduce memory usage by over 4× and reduce runtime by over 2×, all while maintaining a precise core-genome alignment. The partitioning workflow is also less susceptible to complications caused by assembly artifacts and minor variation, as alignment anchors only need to be conserved within their partition and not across the entire input set. We highlight the performance on datasets involving thousands of bacterial and viral genomes. Availability and implementation: Parsnp v2 is available at https://github.com/marbl/parsnp.
-
Robinson, Peter (Ed.) Abstract Motivation: The Jaccard similarity on k-mer sets has been shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates. Results: To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by use of a rolling minhash with multiple sampled k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications. Availability and implementation: MashMap3 is available at https://github.com/marbl/MashMap.
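The relationship between minimizers and minmers can be illustrated with a naive sketch based only on the description above: per window of `w` consecutive k-mers, keep the `s` smallest hashes, so that `s = 1` reduces to ordinary minimizer winnowing. This is not MashMap's implementation (which uses an efficient rolling scheme); the hash function, function names, and parameters are assumptions for illustration.

```python
import hashlib
import heapq


def kmer_hashes(seq, k):
    """Hash every k-mer with a deterministic 64-bit hash (illustrative choice)."""
    return [
        int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(), "big"
        )
        for i in range(len(seq) - k + 1)
    ]


def minmers(seq, k, w, s):
    """Union, over all windows of w k-mers, of the s smallest distinct hashes.

    Naive O(n * w) version for clarity; s = 1 gives plain minimizers.
    """
    hashes = kmer_hashes(seq, k)
    picked = set()
    for start in range(len(hashes) - w + 1):
        window = set(hashes[start:start + w])
        picked.update(heapq.nsmallest(s, window))
    return picked


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```

Comparing `jaccard(set(kmer_hashes(x, k)), set(kmer_hashes(y, k)))` with the same quantity computed on `minmers(x, k, w, s)` and `minmers(y, k, w, s)` for different `s` gives a quick feel for how the sampled sets approximate the full k-mer sets.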
-
Abstract Motivation: The study of bacterial genome dynamics is vital for understanding the mechanisms underlying microbial adaptation, growth, and their impact on host phenotype. Structural variants (SVs), genomic alterations of 50 base pairs or more, play a pivotal role in driving evolutionary processes and maintaining genomic heterogeneity within bacterial populations. While SV detection in isolate genomes is relatively straightforward, metagenomes present broader challenges due to the absence of clear reference genomes and the presence of mixed strains. In response, our proposed method, rhea, forgoes reference genomes and metagenome-assembled genomes (MAGs) by combining all metagenomic samples in a series (time or other metric) into a single co-assembly graph. The log fold change in graph coverage between successive samples is then calculated to call SVs that are thriving or declining. Results: We show that rhea outperforms existing methods for SV and horizontal gene transfer (HGT) detection in two simulated mock metagenomes, particularly as the simulated reads diverge from reference genomes and strain diversity increases. We additionally demonstrate use cases for rhea on series metagenomic data of environmental and fermented food microbiomes to detect specific sequence alterations between successive time and temperature samples, suggesting host advantage. Our approach leverages previous work in assembly graph structural and coverage patterns to provide versatility in studying SVs across diverse and poorly characterized microbial communities for more comprehensive insights into microbial gene flux. Availability and implementation: rhea is open source and available at: https://github.com/treangenlab/rhea.
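The coverage comparison described above can be sketched in a few lines. This illustrates only the log fold change calculation between successive samples, not rhea's graph construction or SV calling; the base-2 logarithm, the pseudocount, the threshold, and the node names are all assumptions.

```python
import math


def log_fold_changes(coverage, pseudocount=1.0):
    """Log2 fold change in coverage between each pair of successive samples.

    coverage maps a graph node ID to its coverage in each sample, in series
    order. The pseudocount (an assumed value) avoids division by zero.
    """
    return {
        node: [
            math.log2((nxt + pseudocount) / (prev + pseudocount))
            for prev, nxt in zip(series, series[1:])
        ]
        for node, series in coverage.items()
    }


def classify(lfc, threshold=1.0):
    """Label a change as thriving, declining, or stable (threshold is assumed)."""
    if lfc >= threshold:
        return "thriving"
    if lfc <= -threshold:
        return "declining"
    return "stable"
```

For a node whose coverage goes from 10 to 40 between two samples, the change is log2(41/11), which is above 1 and would be labelled thriving under these assumed settings.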
-
Abstract 16S rRNA targeted amplicon sequencing is an established standard for elucidating microbial community composition. While high‐throughput short‐read sequencing can capture only a portion of the 16S rRNA gene due to its limited read length, third generation sequencing can read the 16S rRNA gene in its entirety and thus provide more precise taxonomic classification. Here, we present a protocol for generating full‐length 16S rRNA sequences with Oxford Nanopore Technologies (ONT) and a microbial community profile with Emu. We select Emu for analyzing ONT sequences as it leverages information from the entire community to overcome errors due to incomplete reference databases and hardware limitations to ultimately obtain species‐level resolution. This pipeline provides a low‐cost solution for characterizing microbiome composition by exploiting real‐time, long‐read ONT sequencing and tailored software for accurate characterization of microbial communities. © 2024 Wiley Periodicals LLC. Basic Protocol: Microbial community profiling with Emu; Support Protocol 1: Full‐length 16S rRNA microbial sequences with Oxford Nanopore Technologies sequencing platform; Support Protocol 2: Building a custom reference database for Emu
-
As viral sequencing datasets continue to grow, traditional alignment-based variant calling pipelines are becoming computationally prohibitive. To address these challenges, we developed bronko, an ultrafast alignment-free framework for detecting viral variation directly from sequencing data. The novel computational approach implemented in bronko allows scaling to massive viral sequencing datasets and has three key components: i) a locality-sensitive bucketing function to rapidly identify single-nucleotide polymorphisms (SNPs) relative to reference(s), ii) a direct k-mer count pseudo-mapping approach that approximates a pileup without alignment, and iii) a streaming-based sliding window outlier test to estimate baseline noise across the genome and precisely differentiate real minor variants from noise. Together, these components yield near-linear computational complexity with respect to sequencing depth, enabling bronko to process thousands of viral samples rapidly on modest hardware. Our results are threefold: 1) on simulated amplicon sequencing, bronko recovers variants with higher precision and comparable recall to existing tools while running one to three orders of magnitude faster; 2) bronko generates sequence alignments directly from sequencing data, with SNP content similar to that of whole-genome alignment while also running in a fraction of the time; and 3) applying bronko to longitudinal sequencing data from chronically infected SARS-CoV-2 patients revealed consistent patterns of intrahost diversification and adaptive mutations over time. Altogether, these results demonstrate bronko's potential as a scalable tool for large-scale viral genomic analyses, overcoming longstanding computational barriers for intrahost and interhost characterization of viral variation.
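Component ii) above, approximating a pileup from k-mer counts without alignment, can be illustrated with a simplified sketch. This is one plausible reading of the idea and not bronko's implementation: it assumes an odd `k`, looks only at k-mers whose flanks match the reference exactly, ignores reverse complements and sequencing errors in the flanks, and all function names are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def pseudo_pileup(reference, reads, k):
    """Approximate per-position base counts from k-mer counts alone.

    For each reference position, take the k-mer centred on it and look up
    the counts of the four k-mers that share its flanks but differ in the
    middle base. Positions within k // 2 of either end are skipped.
    """
    if k % 2 == 0:
        raise ValueError("k must be odd so that the k-mer has a middle base")
    half = k // 2
    counts = count_kmers(reads, k)
    pileup = {}
    for pos in range(half, len(reference) - half):
        left = reference[pos - half:pos]
        right = reference[pos + 1:pos + half + 1]
        pileup[pos] = {base: counts[left + base + right] for base in "ACGT"}
    return pileup
```

In this toy model a minor variant shows up as a non-zero count for a non-reference base at one position; separating such counts from noise is the job of the outlier test in component iii), which is not sketched here.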
