NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).


tMHG-Finder: Tree-Guided Maximal Homologous Group Finder for Bacterial Genomes

https://doi.org/10.1007/978-3-031-94928-9_6

Yin, Yongze; Kille, Bryce; Ogilvie, Huw A; Treangen, Todd J; Nakhleh, Luay (September 2025, Springer Nature Switzerland)

Free, publicly-accessible full text available September 1, 2026
Graph-based self-supervised learning for repeat detection in metagenomic assembly

https://doi.org/10.1101/gr.279136.124

Azizpour, Ali; Balaji, Advait; Treangen, Todd J; Segarra, Santiago (July 2024, Genome research)

Repetitive DNA (repeats) poses significant challenges for accurate and efficient genome assembly and sequence alignment. This is particularly true for metagenomic data, where genome dynamics such as horizontal gene transfer, gene duplication, and gene loss/gain complicate accurate genome assembly from metagenomic communities. Detecting repeats is a crucial first step in overcoming these challenges. To address this issue, we propose GraSSRep, a novel approach that leverages the assembly graph's structure through graph neural networks (GNNs) within a self-supervised learning framework to classify DNA sequences into repetitive and non-repetitive categories. Specifically, we frame this problem as a node classification task within a metagenomic assembly graph. In a self-supervised fashion, we rely on a high-precision (but low-recall) heuristic to generate pseudo-labels for a small proportion of the nodes. We then use those pseudo-labels to train a GNN embedding and a random forest classifier to propagate the labels to the remaining nodes. In this way, GraSSRep combines sequencing features with predefined and learned graph features to achieve state-of-the-art performance in repeat detection. We evaluate our method using simulated and synthetic metagenomic datasets. The results on the simulated data highlight GraSSRep's robustness to repeat attributes, demonstrating its effectiveness in handling the complexity of repeated sequences. Additionally, our experiments with synthetic metagenomic datasets reveal that incorporating the graph structure and the GNN enhances our detection performance. Finally, in comparative analyses, GraSSRep outperforms existing repeat detection tools with respect to precision and recall.
Full Text Available
Rapid whole genome characterization of antimicrobial-resistant pathogens using long-read sequencing to identify potential healthcare transmission

https://doi.org/10.1017/ice.2024.202

Wu, Chin-Ting; Shropshire, William C; Bhatti, Micah M; Cantu, Sherry; Glover, Israel K; Anand, Selvalakshmi Selvaraj; Liu, Xiaojun; Kalia, Awdhesh; Treangen, Todd J; Chemaly, Roy F; et al (February 2025, Infection Control & Hospital Epidemiology)

Abstract Objective: Whole genome sequencing (WGS) can help identify transmission of pathogens causing healthcare-associated infections (HAIs). However, the current gold standard of short-read, Illumina-based WGS is labor and time intensive. Given recent improvements in long-read Oxford Nanopore Technologies (ONT) sequencing, we sought to establish a low resource approach providing accurate WGS-pathogen comparison within a time frame allowing for infection prevention and control (IPC) interventions. Methods: WGS was prospectively performed on pathogens at increased risk of potential healthcare transmission using the ONT MinION sequencer with R10.4.1 flow cells and Dorado basecaller. Potential transmission was assessed via Ridom SeqSphere+ for core genome multilocus sequence typing and MINTyper for reference-based core genome single nucleotide polymorphisms using previously published cutoff values. The accuracy of our ONT pipeline was determined relative to Illumina. Results: Over a six-month period, 242 bacterial isolates from 216 patients were sequenced by a single operator. Compared to the Illumina gold standard, our ONT pipeline achieved a mean identity score of Q60 for assembled genomes, even with a coverage rate as low as 40×. The mean time from initiating DNA extraction to complete analysis was 2 days (IQR 2–3.25 days). We identified five potential transmission clusters comprising 21 isolates (8.7% of sequenced strains). Integrating ONT with epidemiological data, >70% (15/21) of putative transmission cluster isolates originated from patients with potential healthcare transmission links. Conclusions: Via a stand-alone ONT pipeline, we detected potentially transmitted HAI pathogens rapidly and accurately, aligning closely with epidemiological data. Our low-resource method has the potential to assist in IPC efforts.
Free, publicly-accessible full text available February 1, 2026
A class of benzofuranoindoline-bearing heptacyclic fungal RiPPs with anticancer activities

https://doi.org/10.1038/s41589-025-01946-9

Nie, Qiuyue; Zhao, Fanglong; Yu, Xuerong; Madhusudhanan, Mithun C; Chang, Caleb; Li, Siting; Chowdhury, Sandipan Roy; Kille, Bryce; Xu, Andy; Sharkey, Rory; et al (June 2025, Nature Chemical Biology)

Free, publicly-accessible full text available June 23, 2026
Parsnp 2.0: scalable core-genome alignment for massive microbial datasets

https://doi.org/10.1093/bioinformatics/btae311

Kille, Bryce; Nute, Michael G; Huang, Victor; Kim, Eddie; Phillippy, Adam M; Treangen, Todd J (May 2024, Bioinformatics)
Schwartz, Russell (Ed.)
Abstract Motivation: Since 2016, the number of microbial species with available reference genomes in NCBI has more than tripled. Multiple genome alignment, the process of identifying nucleotides across multiple genomes which share a common ancestor, is used as the input to numerous downstream comparative analysis methods. Parsnp is one of the few multiple genome alignment methods able to scale to the current era of genomic data; however, there has been no major release since its initial release in 2014. Results: To address this gap, we developed Parsnp v2, which significantly improves on its original release. Parsnp v2 provides users with more control over executions of the program, allowing Parsnp to be better tailored for different use-cases. We introduce a partitioning option to Parsnp, which allows the input to be broken up into multiple parallel alignment processes which are then combined into a final alignment. The partitioning option can reduce memory usage by over 4× and reduce runtime by over 2×, all while maintaining a precise core-genome alignment. The partitioning workflow is also less susceptible to complications caused by assembly artifacts and minor variation, as alignment anchors only need to be conserved within their partition and not across the entire input set. We highlight the performance on datasets involving thousands of bacterial and viral genomes. Availability and implementation: Parsnp v2 is available at https://github.com/marbl/parsnp.
Full Text Available
Unveiling microbial diversity: harnessing long-read sequencing technology

https://doi.org/10.1038/s41592-024-02262-1

Agustinho, Daniel P.; Fu, Yilei; Menon, Vipin K.; Metcalf, Ginger A.; Treangen, Todd J.; Sedlazeck, Fritz J. (April 2024, Nature Methods)

Full Text Available
KombOver: Efficient k-core and K-truss based characterization of perturbations within the human gut microbiome

https://doi.org/10.1142/9789811286421_0039

Sapoval, Nicolae; Tanevski, Marko; Treangen, Todd J. (December 2023, Pacific Symposium on Biocomputing 2024)

The microbes present in the human gastrointestinal tract are regularly linked to human health and disease outcomes. Thanks to technological and methodological advances in recent years, metagenomic sequencing data, and computational methods designed to analyze metagenomic data, have contributed to improved understanding of the link between the human gut microbiome and disease. However, while numerous methods have been recently developed to extract quantitative and qualitative results from host-associated microbiome data, improved computational tools are still needed to track microbiome dynamics with short-read sequencing data. Previously we have proposed KOMB as a de novo tool for identifying copy number variations in metagenomes for characterizing microbial genome dynamics in response to perturbations. In this work, we present KombOver (KO), which includes four key contributions with respect to our previous work: (i) it scales to large microbiome study cohorts, (ii) it includes both k-core and K-truss based analysis, (iii) we provide the foundation of a theoretical understanding of the relation between various graph-based metagenome representations, and (iv) we provide an improved user experience with easier-to-run code and more descriptive outputs/results. To highlight the aforementioned benefits, we applied KO to nearly 1000 human microbiome samples, requiring less than 10 minutes and 10 GB RAM per sample to process these data. Furthermore, we highlight how graph-based approaches such as k-core and K-truss can be informative for pinpointing microbial community dynamics within a myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) cohort. KO is open source and available for download/use at: https://github.com/treangenlab/komb
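For readers unfamiliar with the k-core analysis named in this abstract, the following is a minimal sketch of generic k-core decomposition by repeatedly peeling the lowest-degree node. It is not KombOver's implementation; the function name `core_numbers` and the toy edge list are hypothetical, and the graph is assumed to be undirected and simple.

```python
from collections import defaultdict


def core_numbers(edges):
    """Return the core number of every node in an undirected simple graph.

    Hypothetical helper for illustration only; not taken from KombOver.
    """
    adjacency = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adjacency[u].add(v)
            adjacency[v].add(u)

    degree = {node: len(neighbors) for node, neighbors in adjacency.items()}
    remaining = set(adjacency)
    core = {}
    k = 0
    while remaining:
        # Peel the node with the smallest remaining degree.
        node = min(remaining, key=lambda n: degree[n])
        # The core number never decreases as peeling proceeds.
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for neighbor in adjacency[node]:
            if neighbor in remaining:
                degree[neighbor] -= 1
    return core


# Toy graph (assumed data): a triangle a-b-c with a pendant node d attached to a.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("a", "d")]
toy_core = core_numbers(toy_edges)
```

In this toy graph the triangle nodes belong to the 2-core while the pendant node only reaches the 1-core. The linear scan for the minimum keeps the sketch short; a bucket-based queue would be the usual choice for graphs of realistic size.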
Full Text Available
Microbial Community Profiling Protocol with Full‐length 16S rRNA Sequences and Emu

https://doi.org/10.1002/cpz1.978

Curry, Kristen D.; Soriano, Sirena; Nute, Michael G.; Villapol, Sonia; Dilthey, Alexander; Treangen, Todd J. (March 2024, Current Protocols)

Abstract 16S rRNA targeted amplicon sequencing is an established standard for elucidating microbial community composition. While high‐throughput short‐read sequencing can elicit only a portion of the 16S rRNA gene due to its limited read length, third generation sequencing can read the 16S rRNA gene in its entirety and thus provide more precise taxonomic classification. Here, we present a protocol for generating full‐length 16S rRNA sequences with Oxford Nanopore Technologies (ONT) and a microbial community profile with Emu. We select Emu for analyzing ONT sequences as it leverages information from the entire community to overcome errors due to incomplete reference databases and hardware limitations to ultimately obtain species‐level resolution. This pipeline provides a low‐cost solution for characterizing microbiome composition by exploiting real‐time, long‐read ONT sequencing and tailored software for accurate characterization of microbial communities. © 2024 Wiley Periodicals LLC. Basic Protocol: Microbial community profiling with Emu. Support Protocol 1: Full‐length 16S rRNA microbial sequences with Oxford Nanopore Technologies sequencing platform. Support Protocol 2: Building a custom reference database for Emu.
Full Text Available
PreK-12 school and citywide wastewater monitoring of the enteric viruses astrovirus, rotavirus, and sapovirus

https://doi.org/10.1016/j.scitotenv.2024.172683

Wolken, Madeline; Wang, Michael; Schedler, Julia; Campos, Roberto H.; Ensor, Katherine; Hopkins, Loren; Treangen, Todd; Stadler, Lauren B. (April 2024, Science of The Total Environment)

Full Text Available
Leveraging Large Language Models for Predicting Microbial Virulence from Protein Structure and Sequence

https://doi.org/10.1145/3584371.3612953

Quintana, Felix; Treangen, Todd; Kavraki, Lydia (September 2023, BCB '23: Proceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics)

In the aftermath of COVID-19, screening for pathogens has never been a more relevant problem. However, computational screening for pathogens is challenging due to a variety of factors, including (i) the complexity and role of the host, (ii) virulence factor divergence and dynamics, and (iii) population and community-level dynamics. Considering a potential pathogen's molecular interactions, specifically individual proteins and protein interactions, can help pinpoint the potential of a protein of a given microbe to cause disease. However, existing tools for pathogen screening rely on existing annotations (KEGG, GO, etc.), making the assessment of novel and unannotated proteins more challenging. Here, we present an LLM-inspired approach that considers protein sequence and structure to predict protein virulence. We present a two-stage model incorporating evolutionary features captured from the DistilProtBert language model and protein structure in a graph convolutional network. Our model performs better than sequence alone for virulence function when high-quality structures are present, thus representing a path forward for virulence prediction of novel and unannotated proteins.
Full Text Available
