Abstract Motivation: Metagenomic binning aims to retrieve microbial genomes directly from ecosystems by clustering metagenomic contigs assembled from short reads into draft genomic bins. Traditional shotgun-based binning methods depend on the contigs' composition and abundance profiles and are impaired when too few samples are available to construct reliable co-abundance profiles. When applied to a single sample, shotgun-based binning methods struggle to distinguish closely related species using composition information alone. As an alternative binning approach, Hi-C-based binning employs the metagenomic Hi-C technique to measure the proximity contacts between metagenomic fragments. However, spurious inter-species Hi-C contacts, inevitably generated by incorrect ligations of DNA fragments between species, link contigs from different genomes and weaken the purity of the final draft genomic bins. Therefore, it is imperative to develop a binning pipeline that overcomes the shortcomings of both types of binning methods on a single sample. Results: We develop HiFine, a novel binning pipeline to refine the binning results of metagenomic contigs by integrating both Hi-C-based and shotgun-based binning tools. HiFine designs a fragmentation strategy for the original bin sets derived from the Hi-C-based and shotgun-based binning methods, which considerably increases the purity of the initial bins, followed by merging fragmented bins and recruiting unbinned contigs. We demonstrate that HiFine significantly improves the existing binning results of both types of binning methods and achieves better performance in constructing species genomes on publicly available datasets. To the best of our knowledge, HiFine is the first pipeline to integrate different types of tools for the binning of metagenomic contigs. Availability and implementation: HiFine is available at https://github.com/dyxstat/HiFine. Supplementary information: Supplementary data are available at Bioinformatics online.
Phage–bacterial contig association prediction with a convolutional neural network
Abstract Motivation: Phage–host associations play important roles in microbial communities. In natural communities, however, where phages are discovered and characterized metagenomically rather than through culture-based lab studies, their hosts are generally not known. Several programs have been developed for predicting which phage infects which host based on various sequence similarity measures or machine learning approaches. These are often based on whole viral and host genomes, but in metagenomics-based studies we rarely have whole genomes and must instead rely on contigs that are sometimes only a few hundred bp long. Therefore, we need programs that predict the hosts of phage contigs on the basis of these short contigs. Although most existing programs can be applied to metagenomic datasets for these predictions, their accuracies are generally low. Here, we develop ContigNet, a convolutional neural network-based model capable of predicting phage–host matches based on relatively short contigs, and compare it to the previously published VirHostMatcher (VHM) and WIsH. Results: On the validation set, ContigNet achieves area under the receiver operating characteristic curve (AUROC) scores of 72–85%, compared with a maximum of 68% for VHM or WIsH, for contig lengths between 200 bp and 50 kbp. We also apply the model to the Metagenomic Gut Virus (MGV) catalogue, a dataset containing a wide range of draft genomes from metagenomic samples, and achieve AUROC scores of 60–70%, compared with 52% for VHM and WIsH. Surprisingly, ContigNet can also be used to predict plasmid–host contig associations with high accuracy, indicating a similar genetic exchange between mobile genetic elements and their hosts. Availability and implementation: The source code of ContigNet and related datasets can be downloaded from https://github.com/tianqitang1/ContigNet.
- Award ID(s): 2125142
- PAR ID: 10368273
- Publisher / Repository: Oxford University Press
- Date Published:
- Journal Name: Bioinformatics
- Volume: 38
- Issue: Supplement_1
- ISSN: 1367-4803
- Format(s): Medium: X; Size: p. i45-i52
- Size(s): p. i45-i52
- Sponsoring Org: National Science Foundation
More Like this
- Abstract Bacteriophages are obligate parasites of bacteria characterized by the breadth of hosts that they can infect. This "host range" depends on the genotypes and morphologies of the phage and the bacterial host, but also on the environment in which they are interacting. Understanding phage host range is critical to predicting the impacts of these parasites in their natural host communities and their utility as therapeutic agents, but is also key to predicting how phages evolve and in doing so drive evolutionary change in their host populations, including through movement of genes among unrelated bacterial genomes. Here, we explore the drivers of phage infection and host range, from the molecular underpinnings of the phage–host interaction to the ecological context in which they occur. We further evaluate the importance of intrinsic, transient, and environmental drivers shaping phage infection and replication, and discuss how each influences host range over evolutionary time. The host range of phages has major consequences for phage-based application strategies, as well as for natural community dynamics, and we therefore highlight both recent developments and key open questions in the field as phage-based therapeutics come back into focus.
- ABSTRACT Bacteriophages are the most abundant and diverse biological entities on the planet, and new phage genomes are being discovered at a rapid pace. As more phage genomes are published, new methods are needed for placing these genomes in an ecological and evolutionary context. Phages are difficult to study by phylogenetic methods, because they exchange genes regularly, and no single gene is conserved across all phages. Here, we demonstrate how gene-level networks can provide a high-resolution view of phage genetic diversity and offer a novel perspective on virus ecology. We focus our analyses on virus host range and show how network topology corresponds to host relatedness, how to find groups of genes with the strongest host-specific signatures, and how this perspective can complement phage host prediction tools. We discuss extensions of gene network analysis to predicting the emergence of phages on new hosts, as well as applications to features of phage biology beyond host range. IMPORTANCE Bacteriophages (phages) are viruses that infect bacteria, and they are critical drivers of bacterial evolution and community structure. It is generally difficult to study phages by using tree-based methods, because gene exchange is common, and no single gene is shared among all phages. Instead, networks offer a means to compare phages while placing them in a broader ecological and evolutionary context. In this work, we build a network that summarizes gene sharing across phages and test how a key constraint on phage ecology, host range, corresponds to the structure of the network. We find that the network reflects the relatedness among phage hosts, and phages with genes that are closer in the network are likelier to infect similar hosts. This approach can also be used to identify genes that affect host range, and we discuss possible extensions to analyze other aspects of viral ecology.
- Abstract Background: Seagrasses are globally distributed marine flowering plants that play foundational roles in coastal environments as ecosystem engineers. While research efforts have explored various aspects of seagrass-associated microbial communities, including describing the diversity of bacteria, fungi and microbial eukaryotes, little is known about viral diversity in these communities. Results: To begin to address this, we leveraged metagenomic sequencing data to generate a catalog of bacterial metagenome-assembled genomes (MAGs) and phage genomes from the leaves of the seagrass, Zostera marina. We expanded the robustness of this viral catalog by incorporating publicly available metagenomic data from seagrass ecosystems. The final MAG set represents 85 high-quality draft and 62 medium-quality draft bacterial genomes, while the viral catalog represents 354 medium-quality, high-quality, and complete viral genomes. Predicted auxiliary metabolic genes in the final viral catalog had putative annotations largely related to carbon utilization, suggesting a possible role for phage in carbon cycling in seagrass ecosystems. Conclusions: These genomic resources provide initial insight into bacterial-viral interactions in seagrass meadows and are a foundation on which to further explore these critical interkingdom interactions. These catalogs highlight a possible role for viruses in carbon cycling in seagrass beds, which may have important implications for blue carbon management and climate change mitigation.
- Abstract The introduction of high-throughput chromosome conformation capture (Hi-C) into metagenomics enables reconstructing high-quality metagenome-assembled genomes (MAGs) from microbial communities. Despite recent advances in recovering eukaryotic, bacterial, and archaeal genomes using Hi-C contact maps, few Hi-C-based methods are designed to retrieve viral genomes. Here we introduce ViralCC, a publicly available tool to recover complete viral genomes and detect virus-host pairs using Hi-C data. Compared to other Hi-C-based methods, ViralCC leverages the virus-host proximity structure as a complementary information source for the Hi-C interactions. Using mock and real metagenomic Hi-C datasets from several different microbial ecosystems, including the human gut, cow feces, and wastewater, we demonstrate that ViralCC outperforms existing Hi-C-based binning methods as well as state-of-the-art tools specifically dedicated to metagenomic viral binning. ViralCC can also reveal the taxonomic structure of viruses and virus-host pairs in microbial communities. When applied to a real wastewater metagenomic Hi-C dataset, ViralCC constructs a phage-host network, which is further validated using CRISPR spacer analyses. ViralCC is an open-source pipeline available at https://github.com/dyxstat/ViralCC.
