
Title: Phage–bacterial contig association prediction with a convolutional neural network
Abstract

Motivation

Phage–host associations play important roles in microbial communities. In natural communities, however, phages are discovered and characterized metagenomically rather than through culture-based lab studies, so their hosts are generally not known. Several programs have been developed for predicting which phage infects which host, based on various sequence similarity measures or machine learning approaches. These are often based on whole viral and host genomes, but metagenomics-based studies rarely yield whole genomes and must instead rely on contigs that are sometimes only a few hundred bp long. Therefore, we need programs that predict the hosts of phage contigs on the basis of these short contigs. Although most existing programs can be applied to metagenomic datasets for these predictions, their accuracies are generally low. Here, we develop ContigNet, a convolutional neural network-based model capable of predicting phage–host matches based on relatively short contigs, and compare it to the previously published VirHostMatcher (VHM) and WIsH.


Results

On the validation set, ContigNet achieves area under the receiver operating characteristic curve (AUROC) scores of 72–85%, compared to a maximum of 68% for VHM or WIsH, for contig lengths between 200 bp and 50 kbp. We also apply the model to the Metagenomic Gut Virus (MGV) catalogue, a dataset containing a wide range of draft genomes from metagenomic samples, and achieve AUROC scores of 60–70%, compared to 52% for VHM and WIsH. Surprisingly, ContigNet can also be used to predict plasmid–host contig associations with high accuracy, indicating a similar genetic exchange between mobile genetic elements and their hosts.
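For readers unfamiliar with the metric, the following is a minimal sketch of how an AUROC score can be computed for phage–host pair predictions. It is not taken from the ContigNet code base: the function name `auroc`, the toy labels, and the toy scores are all hypothetical and chosen only for illustration. The sketch uses the pairwise-ranking definition of AUROC, the probability that a randomly chosen true pair receives a higher score than a randomly chosen false pair.

```python
def auroc(labels, scores):
    """Return AUROC as the fraction of (positive, negative) pairs in which
    the positive example has the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: label 1 = true phage-host pair, 0 = non-matching pair.
# Scores stand in for a model's predicted association strength.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]

toy_auroc = auroc(labels, scores)
print(f"AUROC = {toy_auroc:.3f} ({100 * toy_auroc:.1f}%)")
```

An AUROC of 50% corresponds to random ranking and 100% to perfect separation of matching from non-matching pairs, which is the scale on which the percentages above are reported. The quadratic pairwise loop is chosen for clarity; on large datasets a rank-based implementation would be preferable.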

Availability and implementation

The source code of ContigNet and related datasets can be downloaded from

Publisher / Repository:
Oxford University Press
Medium: X Size: p. i45-i52
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Although long-read sequencing improves the contiguity of assembled viral genomes compared to short-read methods, assembling complex viral communities remains an open problem. We describe the viralFlye tool for identification and analysis of metagenome-assembled viruses in long-read assemblies. We show that it significantly improves viral assemblies and demonstrate that long reads yield a much larger array of predicted virus-host associations than short-read assemblies. We demonstrate that the identification of novel CRISPR arrays in bacterial genomes from a newly assembled metagenomic sample provides information for predicting novel hosts for novel viruses.

  2. Abstract

    Motivation

    Metagenomic binning aims to retrieve microbial genomes directly from ecosystems by clustering metagenomic contigs assembled from short reads into draft genomic bins. Traditional shotgun-based binning methods depend on the contigs’ composition and abundance profiles and are impaired when too few samples are available to construct reliable co-abundance profiles. When applied to a single sample, shotgun-based binning methods struggle to distinguish closely related species using only composition information. As an alternative binning approach, Hi-C-based binning employs the metagenomic Hi-C technique to measure the proximity contacts between metagenomic fragments. However, spurious inter-species Hi-C contacts, inevitably generated by incorrect ligations of DNA fragments between species, link contigs from different genomes and weaken the purity of the final draft genomic bins. Therefore, it is imperative to develop a binning pipeline that overcomes the shortcomings of both types of binning methods on a single sample.


    Results

    We develop HiFine, a novel binning pipeline that refines the binning results of metagenomic contigs by integrating both Hi-C-based and shotgun-based binning tools. HiFine first fragments the original bin sets derived from the Hi-C-based and shotgun-based binning methods, which considerably increases the purity of the initial bins, and then merges the fragmented bins and recruits unbinned contigs. We demonstrate that HiFine significantly improves the existing binning results of both types of binning methods and achieves better performance in constructing species genomes on publicly available datasets. To the best of our knowledge, HiFine is the first pipeline to integrate different types of tools for the binning of metagenomic contigs.

    Availability and implementation

    HiFine is available at

    Supplementary information

    Supplementary data are available at Bioinformatics online.

  3. Abstract

    Bacteriophages are obligate parasites of bacteria characterized by the breadth of hosts that they can infect. This “host range” depends on the genotypes and morphologies of the phage and the bacterial host, but also on the environment in which they are interacting. Understanding phage host range is critical to predicting the impacts of these parasites in their natural host communities and their utility as therapeutic agents, but is also key to predicting how phages evolve and in doing so drive evolutionary change in their host populations, including through movement of genes among unrelated bacterial genomes. Here, we explore the drivers of phage infection and host range from the molecular underpinnings of the phage–host interaction to the ecological context in which they occur. We further evaluate the importance of intrinsic, transient, and environmental drivers shaping phage infection and replication, and discuss how each influences host range over evolutionary time. The host range of phages has great consequences in phage-based application strategies, as well as natural community dynamics, and we therefore highlight both recent developments and key open questions in the field as phage-based therapeutics come back into focus.

  4. ABSTRACT

    Bacteriophages are the most abundant and diverse biological entities on the planet, and new phage genomes are being discovered at a rapid pace. As more phage genomes are published, new methods are needed for placing these genomes in an ecological and evolutionary context. Phages are difficult to study by phylogenetic methods, because they exchange genes regularly, and no single gene is conserved across all phages. Here, we demonstrate how gene-level networks can provide a high-resolution view of phage genetic diversity and offer a novel perspective on virus ecology. We focus our analyses on virus host range and show how network topology corresponds to host relatedness, how to find groups of genes with the strongest host-specific signatures, and how this perspective can complement phage host prediction tools. We discuss extensions of gene network analysis to predicting the emergence of phages on new hosts, as well as applications to features of phage biology beyond host range.

    IMPORTANCE

    Bacteriophages (phages) are viruses that infect bacteria, and they are critical drivers of bacterial evolution and community structure. It is generally difficult to study phages by using tree-based methods, because gene exchange is common, and no single gene is shared among all phages. Instead, networks offer a means to compare phages while placing them in a broader ecological and evolutionary context. In this work, we build a network that summarizes gene sharing across phages and test how a key constraint on phage ecology, host range, corresponds to the structure of the network. We find that the network reflects the relatedness among phage hosts, and phages with genes that are closer in the network are likelier to infect similar hosts. This approach can also be used to identify genes that affect host range, and we discuss possible extensions to analyze other aspects of viral ecology.
  5. Background

    The recent development of metagenomic sequencing makes it possible to massively sequence microbial genomes including viral genomes without the need for laboratory culture. Existing reference‐based and gene homology‐based methods are not efficient in identifying unknown viruses or short viral sequences from metagenomic data.


    Here we developed a reference‐free and alignment‐free machine learning method, DeepVirFinder, for identifying viral sequences in metagenomic data using deep learning.


    Trained on sequences from viral RefSeq discovered before May 2015 and evaluated on those discovered after that date, DeepVirFinder outperformed the state‐of‐the‐art method VirFinder at all contig lengths, achieving AUROC scores of 0.93, 0.95, 0.97, and 0.98 for 300, 500, 1000, and 3000 bp sequences, respectively. Enlarging the training data with millions of additional purified viral sequences from metavirome samples further improved the accuracy of identifying under‐represented virus groups. Applying DeepVirFinder to real human gut metagenomic samples, we identified 51,138 viral sequences belonging to 175 bins in patients with colorectal carcinoma (CRC). Ten bins were found to be associated with cancer status, suggesting that viruses may play important roles in CRC.


    Powered by deep learning and high throughput sequencing metagenomic data, DeepVirFinder significantly improved the accuracy of viral identification and will assist the study of viruses in the era of metagenomics.
