-
Abstract

Motivation: Bait enrichment is a protocol that is becoming increasingly ubiquitous, as it has been shown to successfully amplify regions of interest in metagenomic samples. In this method, a set of synthetic probes (‘baits’) is designed, manufactured, and applied to fragmented metagenomic DNA. The probes bind to the fragmented DNA, and any unbound DNA is rinsed away, leaving the bound fragments to be amplified for sequencing. Metsky et al. demonstrated that bait enrichment is capable of detecting a large number of human viral pathogens within metagenomic samples.

Results: We formalize the problem of designing baits by defining the Minimum Bait Cover problem, show that the problem is NP-hard even under very restrictive assumptions, and design an efficient heuristic that takes advantage of succinct data structures. We refer to our method as Syotti. The running time of Syotti scales linearly in practice, and it runs at least an order of magnitude faster than state-of-the-art methods, including the method of Metsky et al. At the same time, our method produces bait sets that are smaller than those produced by the competing methods, while also leaving fewer positions uncovered. Lastly, we show that Syotti requires only 25 min to design baits for a dataset comprising 3 billion nucleotides from 1000 related bacterial substrains, whereas the method of Metsky et al. shows clearly super-linear running time and fails to process even a 17% subset of the data in 72 h.

Availability and implementation: https://github.com/jnalanko/syotti

Supplementary information: Supplementary data are available at Bioinformatics online.
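The abstract defines the Minimum Bait Cover problem but does not spell out the heuristic itself. As a point of reference, here is a minimal Python sketch of one natural greedy strategy for this kind of problem: scan left to right, and whenever a position is still uncovered, emit the bait starting there and mark everything that bait covers. The function names and parameter values (bait_len=120, max_mismatch=40) are illustrative assumptions, and the brute-force occurrence scan stands in for index-based queries, so this should not be read as Syotti's actual implementation.

```python
def hamming_le(a, b, d):
    """True iff equal-length strings a and b differ in at most d positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > d:
                return False
    return True

def greedy_bait_cover(seqs, bait_len=120, max_mismatch=40):
    """Greedy sketch: scan each sequence left to right; whenever a position
    is uncovered, emit the bait starting there (shifted left at sequence
    ends so it fits) and mark every position that bait covers anywhere in
    the input. Assumes every sequence is at least bait_len long."""
    baits = []
    covered = [bytearray(len(s)) for s in seqs]  # 0 = uncovered
    for si, s in enumerate(seqs):
        for i in range(len(s)):
            if covered[si][i]:
                continue
            start = min(i, len(s) - bait_len)
            bait = s[start:start + bait_len]
            baits.append(bait)
            # Brute-force scan for approximate occurrences of the bait;
            # a succinct-index-based method would answer these queries
            # without touching every window.
            for ti, t in enumerate(seqs):
                for j in range(len(t) - bait_len + 1):
                    if hamming_le(bait, t[j:j + bait_len], max_mismatch):
                        covered[ti][j:j + bait_len] = b"\x01" * bait_len
    return baits

if __name__ == "__main__":
    seqs = ["ACGTACGTACGTACGTACGT", "ACGTACGAACGTACGT"]
    print(greedy_bait_cover(seqs, bait_len=8, max_mismatch=1))  # one bait suffices
```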
-
Abstract

Background: Metagenomic data can be used to profile high-importance genes within microbiomes. However, current metagenomic workflows produce data that suffer from low sensitivity and an inability to accurately reconstruct partial or full genomes, particularly those in low abundance. These limitations preclude colocalization analysis, i.e., characterizing the genomic context of genes and functions within a metagenomic sample. Genomic context is especially crucial for functions associated with horizontal gene transfer (HGT) via mobile genetic elements (MGEs), for example antimicrobial resistance (AMR). To overcome this limitation of metagenomics, we present a method for comprehensive and accurate reconstruction of antimicrobial resistance genes (ARGs) and MGEs from metagenomic DNA, termed target-enriched long-read sequencing (TELSeq).

Results: Using technical replicates of diverse sample types, we compared TELSeq performance to that of non-enriched PacBio and short-read Illumina sequencing. TELSeq achieved much higher ARG recovery (>1,000-fold) and sensitivity than the other methods across diverse metagenomes, revealing an extensive resistome profile comprising many low-abundance ARGs, including some of public health importance. Using the long reads generated by TELSeq, we identified numerous MGEs and cargo genes flanking the low-abundance ARGs, indicating that these ARGs could be transferred across bacterial taxa via HGT.

Conclusions: TELSeq can provide a nuanced view of the genomic context of microbial resistomes and thus has wide-ranging applications in public, animal, and human health, as well as in environmental surveillance and monitoring of AMR. This technique therefore represents a fundamental advancement for microbiome research and application.
-
Abstract

Antimicrobial resistance (AMR) is considered a critical threat to public health, and genomic/metagenomic investigations featuring high-throughput analysis of sequence data are increasingly common and important. We previously introduced MEGARes, a comprehensive AMR database with an acyclic hierarchical annotation structure that facilitates high-throughput computational analysis, as well as AMR++, a customized bioinformatic pipeline specifically designed to use MEGARes in high-throughput analysis for characterizing AMR genes (ARGs) in metagenomic sequence data. Here, we present MEGARes v3.0, a comprehensive database of published ARG sequences for antimicrobial drugs, biocides, and metals, and AMR++ v3.0, an update to our customized bioinformatic pipeline for high-throughput analysis of metagenomic data (available at MEGLab.org). Database annotations have been expanded to include the specific genomic locations of single-nucleotide polymorphisms (SNPs) and insertions and/or deletions (indels) when these are required by specific ARGs for resistance expression, and the updated AMR++ pipeline uses this information to check for the presence of resistance-conferring genetic variants in metagenomic sequenced reads. This new information encompasses 337 ARGs whose resistance-conferring variants could not previously be confirmed in such a manner. In MEGARes v3.0, the nodes of the acyclic hierarchical ontology include 4 antimicrobial compound types, 59 resistance classes, 233 mechanisms, and 1448 gene groups that classify the 8733 accessions.
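The variant check is only sketched at a high level in the abstract. To make the idea concrete, here is a hedged Python illustration of checking whether an aligned read carries a resistance-conferring SNP at a known position in a reference gene. The record layout (ResistanceSNP) and all values are hypothetical, not the MEGARes schema or the AMR++ implementation.

```python
from dataclasses import dataclass

@dataclass
class ResistanceSNP:
    gene: str      # reference gene identifier (hypothetical layout)
    position: int  # 0-based position of the SNP within the gene
    ref_base: str  # wild-type (susceptible) base
    alt_base: str  # resistance-conferring base

def read_confers_resistance(snp, read_seq, read_start):
    """read_start is the 0-based gene coordinate where this (ungapped)
    alignment of read_seq begins. Count the read as a resistant hit only
    if it spans the SNP site and shows the resistant allele."""
    offset = snp.position - read_start
    if not 0 <= offset < len(read_seq):
        return False  # read does not cover the variant site
    return read_seq[offset] == snp.alt_base

# Illustrative values only: a read covering position 248 with the alt base.
snp = ResistanceSNP(gene="GYRA-like", position=248, ref_base="C", alt_base="T")
print(read_confers_resistance(snp, "ACGTTGGCA", read_start=244))  # True
```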
-
In recent years, pangenomes have received increasing attention from the scientific community for their ability to incorporate population variation information and to alleviate reference genome bias. Maximal Exact Matches (MEMs) and Maximal Unique Matches (MUMs) have proven useful in multiple bioinformatic contexts, for example short-read alignment and multiple-genome alignment. However, standard techniques using suffix trees and FM-indexes do not scale to a pangenomic level. Recently, Gagie et al. [JACM 20] introduced the r-index, a Burrows-Wheeler Transform (BWT)-based index able to handle hundreds of human genomes. Later, Rossi et al. [JCB 22] enabled the computation of MEMs using the r-index, and Boucher et al. [DCC 21] showed how to compute them in a streaming fashion. In this paper, we show how to augment Boucher et al.'s approach to enable the computation of MUMs on the r-index while preserving its space and time bounds. We add O(r) additional samples of the longest common prefix (LCP) array, where r is the number of equal-letter runs of the BWT; these samples permit the computation of the second-longest match of each pattern suffix with respect to the input text, which in turn allows the computation of candidate MUMs. We implemented a proof of concept of our approach, which we call mum-phinder, and tested it on real-world datasets against competing methods that are able to compute MUMs. On data that are not highly repetitive, our method is up to 8 times smaller in memory while being up to 19 times slower; on highly repetitive data, it uses up to 25 times less memory while being up to 6.5 times slower.
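The key algorithmic step described above, using the second-longest match of each pattern suffix to detect uniqueness in the text, can be illustrated without any index at all. Below is a brute-force Python sketch: for each pattern position it computes the longest and second-longest match lengths against the text, and flags a candidate MUM when the longest strictly exceeds the runner-up, meaning the match occurs exactly once in the text. This quadratic scan is only an illustration of the definition; mum-phinder answers the same queries via the r-index and the O(r) LCP samples, and full MUMs additionally require uniqueness in the pattern and left-maximality.

```python
def lce(a, i, b, j):
    """Length of the longest common extension of a[i:] and b[j:]."""
    k = 0
    while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
        k += 1
    return k

def candidate_mums(text, pattern, min_len=2):
    """For each pattern position, compute the longest and second-longest
    match lengths in the text. A match strictly longer than the runner-up
    occurs exactly once in the text, so it is a candidate MUM (uniqueness
    in the pattern and left-maximality still need separate checks)."""
    candidates = []
    for i in range(len(pattern)):
        lens = sorted((lce(pattern, i, text, j) for j in range(len(text))),
                      reverse=True)
        best = lens[0]
        second = lens[1] if len(lens) > 1 else 0
        if best >= min_len and best > second:
            candidates.append((i, pattern[i:i + best]))
    return candidates

# Candidates include non-left-maximal suffixes of "issip"; only
# (0, "issip") would survive the remaining MUM checks.
print(candidate_mums("mississippi", "issip"))
```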
