skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: ImputeCC enhances integrative Hi-C-based metagenomic binning through constrained random-walk-based imputation
Metagenomic Hi-C (metaHi-C) enables the recognition of relationships between contigs in terms of their physical proximity within the same cell, facilitating the reconstruction of high-quality metagenomeassembled genomes (MAGs) from complex microbial communities. However, current Hi-C-based contig binning methods solely depend on Hi-C interactions between contigs to group them, ignoring invaluable biological information, including the presence of single-copy marker genes. Here, we introduce ImputeCC, an integrative contig binning tool tailored for metaHi-C datasets. ImputeCC integrates Hi-C interactions with the inherent discriminative power of single-copy marker genes, initially clustering them as preliminary bins, and develops a new constrained random walk with restart (CRWR) algorithm to improve Hi-C connectivity among these contigs. Extensive evaluations on mock and real metaHi-C datasets from diverse environments, including the human gut, wastewater, cow rumen, and sheep gut, demonstrate that ImputeCC consistently outperforms other Hi-C-based contig binning tools. ImputeCC’s genuslevel analysis of the sheep gut microbiota further reveals its ability and potential to recover essential species from dominant genera such as Bacteroides, detect previously unrecognized genera, and shed light on the characteristics and functional roles of genera such as Alistipes within the sheep gut ecosystem. Availability: ImputeCC is implemented in Python and available at https://github.com/dyxstat/ImputeCC. The Supplementary Information is available at https://doi.org/10.5281/zenodo.10776604.  more » « less
Award ID(s):
2125142
PAR ID:
10558941
Author(s) / Creator(s):
; ;
Editor(s):
Ma, Jian
Publisher / Repository:
Springer Nature
Date Published:
ISBN:
978-1071639887
Page Range / eLocation ID:
99-114
Subject(s) / Keyword(s):
Metagenomic Hi-C · Integrative Contig Binning · MetaHi-C Contact Map Imputation · Constrained Random Walk With Restart
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Metagenomic Hi-C (metaHi-C) can identify contig-to-contig relationships with respect to their proximity within the same physical cell. Shotgun libraries in metaHi-C experiments can be constructed by next-generation sequencing (short-read metaHi-C) or more recent third-generation sequencing (long-read metaHi-C). However, all existing metaHi-C analysis methods are developed and benchmarked on short-read metaHi-C datasets and there exists much room for improvement in terms of more scalable and stable analyses, especially for long-read metaHi-C data. Here we report MetaCC, an efficient and integrative framework for analyzing both short-read and long-read metaHi-C datasets. MetaCC outperforms existing methods on normalization and binning. In particular, the MetaCC normalization module, named NormCC, is more than 3000 times faster than the current state-of-the-art method HiCzin on a complex wastewater dataset. When applied to one sheep gut long-read metaHi-C dataset, MetaCC binning module can retrieve 709 high-quality genomes with the largest species diversity using one single sample, including an expansion of five uncultured members from the orderErysipelotrichales, and is the only binner that can recover the genome of one important speciesBacteroides vulgatus. Further plasmid analyses reveal that MetaCC binning is able to capture multi-copy plasmids. 
    more » « less
  2. Abstract MotivationMetagenomic binning aims to retrieve microbial genomes directly from ecosystems by clustering metagenomic contigs assembled from short reads into draft genomic bins. Traditional shotgun-based binning methods depend on the contigs’ composition and abundance profiles and are impaired by the paucity of enough samples to construct reliable co-abundance profiles. When applied to a single sample, shotgun-based binning methods struggle to distinguish closely related species only using composition information. As an alternative binning approach, Hi-C-based binning employs metagenomic Hi-C technique to measure the proximity contacts between metagenomic fragments. However, spurious inter-species Hi-C contacts inevitably generated by incorrect ligations of DNA fragments between species link the contigs from varying genomes, weakening the purity of final draft genomic bins. Therefore, it is imperative to develop a binning pipeline to overcome the shortcomings of both types of binning methods on a single sample. ResultsWe develop HiFine, a novel binning pipeline to refine the binning results of metagenomic contigs by integrating both Hi-C-based and shotgun-based binning tools. HiFine designs a strategy of fragmentation for the original bin sets derived from the Hi-C-based and shotgun-based binning methods, which considerably increases the purity of initial bins, followed by merging fragmented bins and recruiting unbinned contigs. We demonstrate that HiFine significantly improves the existing binning results of both types of binning methods and achieves better performance in constructing species genomes on publicly available datasets. To the best of our knowledge, HiFine is the first pipeline to integrate different types of tools for the binning of metagenomic contigs. Availability and implementationHiFine is available at https://github.com/dyxstat/HiFine. Supplementary informationSupplementary data are available at Bioinformatics online. 
    more » « less
  3. Abstract The introduction of high-throughput chromosome conformation capture (Hi-C) into metagenomics enables reconstructing high-quality metagenome-assembled genomes (MAGs) from microbial communities. Despite recent advances in recovering eukaryotic, bacterial, and archaeal genomes using Hi-C contact maps, few of Hi-C-based methods are designed to retrieve viral genomes. Here we introduce ViralCC, a publicly available tool to recover complete viral genomes and detect virus-host pairs using Hi-C data. Compared to other Hi-C-based methods, ViralCC leverages the virus-host proximity structure as a complementary information source for the Hi-C interactions. Using mock and real metagenomic Hi-C datasets from several different microbial ecosystems, including the human gut, cow fecal, and wastewater, we demonstrate that ViralCC outperforms existing Hi-C-based binning methods as well as state-of-the-art tools specifically dedicated to metagenomic viral binning. ViralCC can also reveal the taxonomic structure of viruses and virus-host pairs in microbial communities. When applied to a real wastewater metagenomic Hi-C dataset, ViralCC constructs a phage-host network, which is further validated using CRISPR spacer analyses. ViralCC is an open-source pipeline available athttps://github.com/dyxstat/ViralCC. 
    more » « less
  4. Abstract BackgroundExploring metagenomic contigs and “binning” them into metagenome-assembled genomes (MAGs) are essential for the delineation of functional and evolutionary guilds within microbial communities. Despite the advances in automated binning algorithms, their capabilities in recovering MAGs with accuracy and biological relevance are so far limited. Researchers often find that human involvement is necessary to achieve representative binning results. This manual process however is expertise demanding and labor intensive, and it deserves to be supported by software infrastructure. ResultsWe present BinaRena, a comprehensive and versatile graphic interface dedicated to aiding human operators to explore metagenome assemblies via customizable visualization and to associate contigs with bins. Contigs are rendered as an interactive scatter plot based on various data types, including sequence metrics, coverage profiles, taxonomic assignments, and functional annotations. Various contig-level operations are permitted, such as selection, masking, highlighting, focusing, and searching. Binning plans can be conveniently edited, inspected, and compared visually or using metrics including silhouette coefficient and adjusted Rand index. Completeness and contamination of user-selected contigs can be calculated in real time.In demonstration of BinaRena’s usability, we show that it facilitated biological pattern discovery, hypothesis generation, and bin refinement in a complex tropical peatland metagenome. It enabled isolation of pathogenic genomes within closely related populations from the gut microbiota of diarrheal human subjects. It significantly improved overall binning quality after curating results of automated binners using a simulated marine dataset. ConclusionsBinaRena is an installation-free, dependency-free, client-end web application that operates directly in any modern web browser, facilitating ease of deployment and accessibility for researchers of all skill levels. The program is hosted athttps://github.com/qiyunlab/binarena, together with documentation, tutorials, example data, and a live demo. It effectively supports human researchers in intuitive interpretation and fine tuning of metagenomic data. 
    more » « less
  5. Wheat, Christopher (Ed.)
    Abstract The blackstripe livebearer Poeciliopsis prolifica is a live-bearing fish belonging to the family Poeciliidae with high level of postfertilization maternal investment (matrotrophy). This viviparous matrotrophic species has evolved a structure similarly to the mammalian placenta. Placentas have independently evolved multiple times in Poeciliidae from nonplacental ancestors, which provide an opportunity to study the placental evolution. However, there is a lack of high-quality reference genomes for the placental species in Poeciliidae. In this study, we present a 674 Mb assembly of P. prolifica in 504 contigs with excellent continuity (contig N50 7.7 Mb) and completeness (97.2% Benchmarking Universal Single-Copy Orthologs [BUSCO] completeness score, including 92.6% single-copy and 4.6% duplicated BUSCO score). A total of 27,227 protein-coding genes were annotated from the merged datasets based on bioinformatic prediction, RNA sequencing and homology evidence. Phylogenomic analyses revealed that P. prolifica diverged from the guppy (Poecilia reticulata) ∼19 Ma. Our research provides the necessary resources and the genomic toolkit for investigating the genetic underpinning of placentation. 
    more » « less