skip to main content


The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 11:00 PM ET on Thursday, June 13 until 2:00 AM ET on Friday, June 14 due to maintenance. We apologize for the inconvenience.

Title: ViralCC retrieves complete viral genomes and virus-host pairs from metagenomic Hi-C data

The introduction of high-throughput chromosome conformation capture (Hi-C) into metagenomics enables reconstructing high-quality metagenome-assembled genomes (MAGs) from microbial communities. Despite recent advances in recovering eukaryotic, bacterial, and archaeal genomes using Hi-C contact maps, few of Hi-C-based methods are designed to retrieve viral genomes. Here we introduce ViralCC, a publicly available tool to recover complete viral genomes and detect virus-host pairs using Hi-C data. Compared to other Hi-C-based methods, ViralCC leverages the virus-host proximity structure as a complementary information source for the Hi-C interactions. Using mock and real metagenomic Hi-C datasets from several different microbial ecosystems, including the human gut, cow fecal, and wastewater, we demonstrate that ViralCC outperforms existing Hi-C-based binning methods as well as state-of-the-art tools specifically dedicated to metagenomic viral binning. ViralCC can also reveal the taxonomic structure of viruses and virus-host pairs in microbial communities. When applied to a real wastewater metagenomic Hi-C dataset, ViralCC constructs a phage-host network, which is further validated using CRISPR spacer analyses. ViralCC is an open-source pipeline available at

more » « less
Award ID(s):
Author(s) / Creator(s):
; ;
Publisher / Repository:
Nature Publishing Group
Date Published:
Journal Name:
Nature Communications
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Recovering high-quality metagenome-assembled genomes (MAGs) from complex microbial ecosystems remains challenging. Recently, high-throughput chromosome conformation capture (Hi-C) has been applied to simultaneously study multiple genomes in natural microbial communities. We develop HiCBin, a novel open-source pipeline, to resolve high-quality MAGs utilizing Hi-C contact maps. HiCBin employs the HiCzin normalization method and the Leiden clustering algorithm and includes the spurious contact detection into binning pipelines for the first time. HiCBin is validated on one synthetic and two real metagenomic samples and is shown to outperform the existing Hi-C-based binning methods. HiCBin is available at . 
    more » « less
  2. Abstract Motivation

    Metagenomic binning aims to retrieve microbial genomes directly from ecosystems by clustering metagenomic contigs assembled from short reads into draft genomic bins. Traditional shotgun-based binning methods depend on the contigs’ composition and abundance profiles and are impaired by the paucity of enough samples to construct reliable co-abundance profiles. When applied to a single sample, shotgun-based binning methods struggle to distinguish closely related species only using composition information. As an alternative binning approach, Hi-C-based binning employs metagenomic Hi-C technique to measure the proximity contacts between metagenomic fragments. However, spurious inter-species Hi-C contacts inevitably generated by incorrect ligations of DNA fragments between species link the contigs from varying genomes, weakening the purity of final draft genomic bins. Therefore, it is imperative to develop a binning pipeline to overcome the shortcomings of both types of binning methods on a single sample.


    We develop HiFine, a novel binning pipeline to refine the binning results of metagenomic contigs by integrating both Hi-C-based and shotgun-based binning tools. HiFine designs a strategy of fragmentation for the original bin sets derived from the Hi-C-based and shotgun-based binning methods, which considerably increases the purity of initial bins, followed by merging fragmented bins and recruiting unbinned contigs. We demonstrate that HiFine significantly improves the existing binning results of both types of binning methods and achieves better performance in constructing species genomes on publicly available datasets. To the best of our knowledge, HiFine is the first pipeline to integrate different types of tools for the binning of metagenomic contigs.

    Availability and implementation

    HiFine is available at

    Supplementary information

    Supplementary data are available at Bioinformatics online.

    more » « less
  3. Abstract Background

    Exploring metagenomic contigs and “binning” them into metagenome-assembled genomes (MAGs) are essential for the delineation of functional and evolutionary guilds within microbial communities. Despite the advances in automated binning algorithms, their capabilities in recovering MAGs with accuracy and biological relevance are so far limited. Researchers often find that human involvement is necessary to achieve representative binning results. This manual process however is expertise demanding and labor intensive, and it deserves to be supported by software infrastructure.


    We present BinaRena, a comprehensive and versatile graphic interface dedicated to aiding human operators to explore metagenome assemblies via customizable visualization and to associate contigs with bins. Contigs are rendered as an interactive scatter plot based on various data types, including sequence metrics, coverage profiles, taxonomic assignments, and functional annotations. Various contig-level operations are permitted, such as selection, masking, highlighting, focusing, and searching. Binning plans can be conveniently edited, inspected, and compared visually or using metrics including silhouette coefficient and adjusted Rand index. Completeness and contamination of user-selected contigs can be calculated in real time.

    In demonstration of BinaRena’s usability, we show that it facilitated biological pattern discovery, hypothesis generation, and bin refinement in a complex tropical peatland metagenome. It enabled isolation of pathogenic genomes within closely related populations from the gut microbiota of diarrheal human subjects. It significantly improved overall binning quality after curating results of automated binners using a simulated marine dataset.


    BinaRena is an installation-free, dependency-free, client-end web application that operates directly in any modern web browser, facilitating ease of deployment and accessibility for researchers of all skill levels. The program is hosted at, together with documentation, tutorials, example data, and a live demo. It effectively supports human researchers in intuitive interpretation and fine tuning of metagenomic data.

    more » « less
  4. Abstract Background

    Advances in microbiome science are being driven in large part due to our ability to study and infer microbial ecology from genomes reconstructed from mixed microbial communities using metagenomics and single-cell genomics. Such omics-based techniques allow us to read genomic blueprints of microorganisms, decipher their functional capacities and activities, and reconstruct their roles in biogeochemical processes. Currently available tools for analyses of genomic data can annotate and depict metabolic functions to some extent; however, no standardized approaches are currently available for the comprehensive characterization of metabolic predictions, metabolite exchanges, microbial interactions, and microbial contributions to biogeochemical cycling.


    We present METABOLIC (METabolic And BiogeOchemistry anaLyses In miCrobes), a scalable software to advance microbial ecology and biogeochemistry studies using genomes at the resolution of individual organisms and/or microbial communities. The genome-scale workflow includes annotation of microbial genomes, motif validation of biochemically validated conserved protein residues, metabolic pathway analyses, and calculation of contributions to individual biogeochemical transformations and cycles. The community-scale workflow supplements genome-scale analyses with determination of genome abundance in the microbiome, potential microbial metabolic handoffs and metabolite exchange, reconstruction of functional networks, and determination of microbial contributions to biogeochemical cycles. METABOLIC can take input genomes from isolates, metagenome-assembled genomes, or single-cell genomes. Results are presented in the form of tables for metabolism and a variety of visualizations including biogeochemical cycling potential, representation of sequential metabolic transformations, community-scale microbial functional networks using a newly defined metric “MW-score” (metabolic weight score), and metabolic Sankey diagrams. METABOLIC takes ~ 3 h with 40 CPU threads to process ~ 100 genomes and corresponding metagenomic reads within which the most compute-demanding part of hmmsearch takes ~ 45 min, while it takes ~ 5 h to complete hmmsearch for ~ 3600 genomes. Tests of accuracy, robustness, and consistency suggest METABOLIC provides better performance compared to other software and online servers. To highlight the utility and versatility of METABOLIC, we demonstrate its capabilities on diverse metagenomic datasets from the marine subsurface, terrestrial subsurface, meadow soil, deep sea, freshwater lakes, wastewater, and the human gut.


    METABOLIC enables the consistent and reproducible study of microbial community ecology and biogeochemistry using a foundation of genome-informed microbial metabolism, and will advance the integration of uncultivated organisms into metabolic and biogeochemical models. METABOLIC is written in Perl and R and is freely available under GPLv3 at

    more » « less
  5. Abstract Motivation

    Phage–host associations play important roles in microbial communities. But in natural communities, as opposed to culture-based lab studies where phages are discovered and characterized metagenomically, their hosts are generally not known. Several programs have been developed for predicting which phage infects which host based on various sequence similarity measures or machine learning approaches. These are often based on whole viral and host genomes, but in metagenomics-based studies, we rarely have whole genomes but rather must rely on contigs that are sometimes as short as hundreds of bp long. Therefore, we need programs that predict hosts of phage contigs on the basis of these short contigs. Although most existing programs can be applied to metagenomic datasets for these predictions, their accuracies are generally low. Here, we develop ContigNet, a convolutional neural network-based model capable of predicting phage–host matches based on relatively short contigs, and compare it to previously published VirHostMatcher (VHM) and WIsH.


    On the validation set, ContigNet achieves 72–85% area under the receiver operating characteristic curve (AUROC) scores, compared to the maximum of 68% by VHM or WIsH for contigs of lengths between 200 bps to 50 kbps. We also apply the model to the Metagenomic Gut Virus (MGV) catalogue, a dataset containing a wide range of draft genomes from metagenomic samples and achieve 60–70% AUROC scores compared to that of VHM and WIsH of 52%. Surprisingly, ContigNet can also be used to predict plasmid-host contig associations with high accuracy, indicating a similar genetic exchange between mobile genetic elements and their hosts.

    Availability and implementation

    The source code of ContigNet and related datasets can be downloaded from

    more » « less