NSF PAR Search | NSF Public Access Repository


Quantifying microbial interactions based on compositional data using an iterative approach for solving generalized Lotka-Volterra equations

https://doi.org/10.1371/journal.pcbi.1013691

Huang, Yue; Tang, Tianqi; Dai, Xiaowu; Sun, Fengzhu (November 2025, PLOS Computational Biology)
Zhu, Shanfeng (Ed.)
Understanding microbial interactions is fundamental for exploring population dynamics, particularly in microbial communities where interactions affect stability and host health. Generalized Lotka-Volterra (gLV) models have been widely used to investigate system dynamics but depend on absolute abundance data, which are often unavailable in microbiome studies. To address this limitation, we introduce an iterative Lotka-Volterra (iLV) model, a novel framework tailored for compositional data that leverages relative abundances and iterative refinements for parameter estimation. The iLV model features two key innovations: an adaptation of the gLV framework to compositional constraints and an iterative optimization strategy combining linear approximations with nonlinear refinements to enhance parameter estimation accuracy. Using simulations and real-world datasets, we demonstrate that iLV surpasses existing methodologies, such as the compositional LV (cLV) and the generalized LV (gLV) models, in recovering interaction coefficients and predicting species trajectories under varying noise levels and temporal resolutions. Applications to the lynx-hare predator-prey, Stylonychia pustula-P. caudatum mixed culture, and cheese microbial systems revealed consistency between predicted and observed relative abundances, showcasing its accuracy and robustness. In summary, the iLV model bridges theoretical gLV models and practical compositional data analysis, offering a robust framework to infer microbial interactions and predict community dynamics using relative abundance data, with significant potential for advancing microbial research.
Free, publicly-accessible full text available November 7, 2026
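
The abstract above contrasts absolute-abundance gLV dynamics with the relative abundances that microbiome studies usually provide. The following is a minimal sketch of that distinction only, not the authors' iLV method: it integrates the standard gLV equations dx_i/dt = x_i (r_i + sum_j A_ij x_j) and then normalizes each time point to a composition. The function names, the RK4 integrator, and the two-species parameter values are all illustrative assumptions.

```python
import numpy as np

def glv_rhs(x, r, A):
    """Standard gLV right-hand side: dx_i/dt = x_i * (r_i + sum_j A_ij * x_j)."""
    return x * (r + A @ x)

def simulate_glv(x0, r, A, dt=0.01, steps=2000):
    """Integrate gLV with classical RK4; returns an array of shape (steps + 1, n)."""
    x = np.asarray(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(steps):
        k1 = glv_rhs(x, r, A)
        k2 = glv_rhs(x + 0.5 * dt * k1, r, A)
        k3 = glv_rhs(x + 0.5 * dt * k2, r, A)
        k4 = glv_rhs(x + dt * k3, r, A)
        x = x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        traj.append(x.copy())
    return np.array(traj)

def to_relative(traj):
    """Convert absolute abundances to compositions (each row sums to 1)."""
    return traj / traj.sum(axis=1, keepdims=True)

# Hypothetical two-species competitive system (values chosen for illustration).
r = np.array([1.0, 0.8])            # intrinsic growth rates
A = np.array([[-1.0, -0.3],
              [-0.2, -1.0]])        # interaction coefficients
absolute = simulate_glv([0.1, 0.2], r, A)
relative = to_relative(absolute)
```

Normalization discards the total community size at each time point, which is why parameters fitted directly to `relative` with an unmodified gLV model are not identifiable in the same way as those fitted to `absolute`.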
Bidirectional subsethood of shared marker profiles enables accurate virus classification

https://doi.org/10.1186/s40168-025-02159-x

Riccardi, Christopher; Wang, Yuqiu; Yooseph, Shibu; Sun, Fengzhu (July 2025, Microbiome)
Imputing Metagenomic Hi-C Contacts Facilitates the Integrative Contig Binning Through Constrained Random Walk with Restart

https://doi.org/10.1089/cmb.2024.0663

Du, Yuxuan; Zuo, Wenxuan; Sun, Fengzhu (October 2024, Journal of Computational Biology)

Full Text Available
A comparative analysis of mutual information methods for pairwise relationship detection in metagenomic data

https://doi.org/10.1186/s12859-024-05883-7

Francis, Dallace; Sun, Fengzhu (August 2024, BMC Bioinformatics)
Analysis of metagenomic data

https://doi.org/10.1038/s43586-024-00376-6

Liu, Shaopeng; Rodriguez, Judith S; Munteanu, Viorel; Ronkowski, Cynthia; Sharma, Nitesh Kumar; Alser, Mohammed; Andreace, Francesco; Blekhman, Ran; Błaszczyk, Dagmara; Chikhi, Rayan; et al (December 2025, Nature Reviews Methods Primers)

Metagenomics has revolutionized our understanding of microbial communities, offering unprecedented insights into their genetic and functional diversity across Earth’s diverse ecosystems. Beyond their roles as environmental constituents, microbiomes act as symbionts, profoundly influencing the health and function of their host organisms. Given the inherent complexity of these communities and the diverse environments where they reside, the components of a metagenomics study must be carefully tailored to yield accurate results that are representative of the populations of interest. This Primer examines the methodological advancements and current practices that have shaped the field, from initial stages of sample collection and DNA extraction to the advanced bioinformatics tools employed for data analysis, with a particular focus on the profound impact of next-generation sequencing on the scale and accuracy of metagenomics studies. We critically assess the challenges and limitations inherent in metagenomics experimentation, available technologies and computational analysis methods. Beyond technical methodologies, we explore the application of metagenomics across various domains, including human health, agriculture and environmental monitoring. Looking ahead, we advocate for the development of more robust computational frameworks and enhanced interdisciplinary collaborations. This Primer serves as a comprehensive guide for advancing the precision and applicability of metagenomic studies, positioning them to address the complexities of microbial ecology and their broader implications for human health and environmental sustainability.
Free, publicly-accessible full text available December 1, 2026
ImputeCC enhances integrative Hi-C-based metagenomic binning through constrained random-walk-based imputation

Du, Yuxuan; Zuo, Wenxuan; Sun, Fengzhu (May 2024, Springer Nature)
Ma, Jian (Ed.)
Metagenomic Hi-C (metaHi-C) enables the recognition of relationships between contigs in terms of their physical proximity within the same cell, facilitating the reconstruction of high-quality metagenome-assembled genomes (MAGs) from complex microbial communities. However, current Hi-C-based contig binning methods solely depend on Hi-C interactions between contigs to group them, ignoring invaluable biological information, including the presence of single-copy marker genes. Here, we introduce ImputeCC, an integrative contig binning tool tailored for metaHi-C datasets. ImputeCC integrates Hi-C interactions with the inherent discriminative power of single-copy marker genes, initially clustering them as preliminary bins, and develops a new constrained random walk with restart (CRWR) algorithm to improve Hi-C connectivity among these contigs. Extensive evaluations on mock and real metaHi-C datasets from diverse environments, including the human gut, wastewater, cow rumen, and sheep gut, demonstrate that ImputeCC consistently outperforms other Hi-C-based contig binning tools. ImputeCC's genus-level analysis of the sheep gut microbiota further reveals its ability and potential to recover essential species from dominant genera such as Bacteroides, detect previously unrecognized genera, and shed light on the characteristics and functional roles of genera such as Alistipes within the sheep gut ecosystem. Availability: ImputeCC is implemented in Python and available at https://github.com/dyxstat/ImputeCC. The Supplementary Information is available at https://doi.org/10.5281/zenodo.10776604.
Full Text Available
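
The ImputeCC abstract above mentions a constrained random walk with restart (CRWR). The abstract does not specify the constraint, so the sketch below shows only the generic, unconstrained random walk with restart on a small contact matrix, to illustrate how such a walk can assign nonzero scores to contig pairs with no direct Hi-C contact. The function name, the restart probability, and the toy matrix are assumptions and are not taken from the ImputeCC code.

```python
import numpy as np

def random_walk_with_restart(W, seed, restart=0.5, tol=1e-8, max_iter=1000):
    """Plain RWR: iterate p <- (1 - restart) * P p + restart * e until convergence.

    W is a nonnegative symmetric contact matrix; P is its column-normalized
    version; e is the indicator vector of the seed node.
    """
    W = np.asarray(W, dtype=float)
    col_sums = W.sum(axis=0)
    col_sums[col_sums == 0] = 1.0       # avoid division by zero for isolated nodes
    P = W / col_sums                    # each nonzero column now sums to 1
    e = np.zeros(W.shape[0])
    e[seed] = 1.0
    p = e.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * (P @ p) + restart * e
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Toy contact matrix for four hypothetical contigs arranged in a chain:
# contigs 0 and 3 share no direct contact.
W = np.array([[0, 5, 0, 0],
              [5, 0, 3, 0],
              [0, 3, 0, 4],
              [0, 0, 4, 0]], dtype=float)
scores = random_walk_with_restart(W, seed=0)
```

Here `scores[3]` is positive even though `W[0, 3]` is zero, which is the sense in which a random walk can impute indirect connectivity.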
Batch normalization followed by merging is powerful for phenotype prediction integrating multiple heterogeneous studies

https://doi.org/10.1371/journal.pcbi.1010608

Gao, Yilin; Sun, Fengzhu (October 2023, PLOS Computational Biology)
Roy, Sushmita (Ed.)
Heterogeneity in different genomic studies compromises the performance of machine learning models in cross-study phenotype predictions. Overcoming heterogeneity when incorporating different studies in terms of phenotype prediction is a challenging and critical step for developing machine learning algorithms with reproducible prediction performance on independent datasets. We investigated the best approaches to integrate different studies of the same type of omics data under a variety of different heterogeneities. We developed a comprehensive workflow to simulate a variety of different types of heterogeneity and evaluate the performances of different integration methods together with batch normalization by using ComBat. We also demonstrated the results through realistic applications on six colorectal cancer (CRC) metagenomic studies and six tuberculosis (TB) gene expression studies, respectively. We showed that heterogeneity in different genomic studies can markedly negatively impact the machine learning classifier's reproducibility. ComBat normalization improved the prediction performance of the machine learning classifier when heterogeneous populations were present, and could successfully remove batch effects within the same population. We also showed that the machine learning classifier's prediction accuracy can be markedly decreased as the underlying disease model became more different between training and test populations. Comparing different merging and integration methods, we found that merging and integration methods can outperform each other in different scenarios. In the realistic applications, we observed that the prediction accuracy improved when applying ComBat normalization with merging or integration methods in both CRC and TB studies. We illustrated that batch normalization is essential for mitigating both population differences of different studies and batch effects. We also showed that both the merging strategy and integration methods can achieve good performances when combined with batch normalization. In addition, we explored the potential of boosting phenotype prediction performance by rank aggregation methods and showed that rank aggregation methods performed similarly to other ensemble learning approaches.
Full Text Available
DeepMicroClass sorts metagenomic contigs into prokaryotes, eukaryotes and viruses

https://doi.org/10.1093/nargab/lqae044

Hou, Shengwei; Tang, Tianqi; Cheng, Siliangyu; Liu, Yuanhao; Xia, Tian; Chen, Ting; Fuhrman, Jed A.; Sun, Fengzhu (May 2024, NAR Genomics and Bioinformatics)

Sequence classification facilitates a fundamental understanding of the structure of microbial communities. Binary metagenomic sequence classifiers are insufficient because environmental metagenomes are typically derived from multiple sequence sources. Here we introduce a deep-learning-based sequence classifier, DeepMicroClass, that classifies metagenomic contigs into five sequence classes, i.e. viruses infecting prokaryotic or eukaryotic hosts, eukaryotic or prokaryotic chromosomes, and prokaryotic plasmids. DeepMicroClass achieved high performance for all sequence classes at various tested sequence lengths ranging from 500 bp to 100 kbps. By benchmarking on a synthetic dataset with variable sequence class composition, we showed that DeepMicroClass obtained better performance for eukaryotic, plasmid and viral contig classification than other state-of-the-art predictors. DeepMicroClass achieved comparable performance on viral sequence classification with geNomad and VirSorter2 when benchmarked on the CAMI II marine dataset. Using a coastal daily time-series metagenomic dataset as a case study, we showed that microbial eukaryotes and prokaryotic viruses are integral to microbial communities. By analyzing monthly metagenomes collected at HOT and BATS, we found relatively higher viral read proportions in the subsurface layer in late summer, consistent with the seasonal viral infection patterns prevalent in these areas. We expect DeepMicroClass will promote metagenomic studies of under-appreciated sequence types.
MetaCC allows scalable and integrative analyses of both long-read and short-read metagenomic Hi-C data

https://doi.org/10.1038/s41467-023-41209-6

Du, Yuxuan; Sun, Fengzhu (October 2023, Nature Communications)

Metagenomic Hi-C (metaHi-C) can identify contig-to-contig relationships with respect to their proximity within the same physical cell. Shotgun libraries in metaHi-C experiments can be constructed by next-generation sequencing (short-read metaHi-C) or more recent third-generation sequencing (long-read metaHi-C). However, all existing metaHi-C analysis methods are developed and benchmarked on short-read metaHi-C datasets and there exists much room for improvement in terms of more scalable and stable analyses, especially for long-read metaHi-C data. Here we report MetaCC, an efficient and integrative framework for analyzing both short-read and long-read metaHi-C datasets. MetaCC outperforms existing methods on normalization and binning. In particular, the MetaCC normalization module, named NormCC, is more than 3000 times faster than the current state-of-the-art method HiCzin on a complex wastewater dataset. When applied to one sheep gut long-read metaHi-C dataset, the MetaCC binning module can retrieve 709 high-quality genomes with the largest species diversity using one single sample, including an expansion of five uncultured members from the order Erysipelotrichales, and is the only binner that can recover the genome of one important species, Bacteroides vulgatus. Further plasmid analyses reveal that MetaCC binning is able to capture multi-copy plasmids.
HiCBin: binning metagenomic contigs and recovering metagenome-assembled genomes using Hi-C contact maps

https://doi.org/10.1186/s13059-022-02626-w

Du, Yuxuan; Sun, Fengzhu (December 2022, Genome Biology)

Recovering high-quality metagenome-assembled genomes (MAGs) from complex microbial ecosystems remains challenging. Recently, high-throughput chromosome conformation capture (Hi-C) has been applied to simultaneously study multiple genomes in natural microbial communities. We develop HiCBin, a novel open-source pipeline, to resolve high-quality MAGs utilizing Hi-C contact maps. HiCBin employs the HiCzin normalization method and the Leiden clustering algorithm and includes the spurious contact detection into binning pipelines for the first time. HiCBin is validated on one synthetic and two real metagenomic samples and is shown to outperform the existing Hi-C-based binning methods. HiCBin is available at https://github.com/dyxstat/HiCBin.
Full Text Available
