skip to main content

Title: virMine: automated detection of viral sequences from complex metagenomic samples
Metagenomics has enabled sequencing of viral communities from a myriad of different environments. Viral metagenomic studies routinely uncover sequences with no recognizable homology to known coding regions or genomes. Nevertheless, complete viral genomes have been constructed directly from complex community metagenomes, often through tedious manual curation. To address this, we developed the software tool virMine to identify viral genomes from raw reads representative of viral or mixed (viral and bacterial) communities. virMine automates sequence read quality control, assembly, and annotation. Researchers can easily refine their search for a specific study system and/or feature(s) of interest. In contrast to other viral genome detection tools that often rely on the recognition of viral signature sequences, virMine is not restricted by the insufficient representation of viral diversity in public data repositories. Rather, viral genomes are identified through an iterative approach, first omitting non-viral sequences. Thus, both relatives of previously characterized viruses and novel species can be detected, including both eukaryotic viruses and bacteriophages. Here we present virMine and its analysis of synthetic communities as well as metagenomic data sets from three distinctly different environments: the gut microbiota, the urinary microbiota, and freshwater viromes. Several new viral genomes were identified and annotated, thus contributing to our understanding of viral genetic diversity in these three environments.  more » « less
Award ID(s):
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Page Range / eLocation ID:
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Viruses of the phylum Nucleocytoviricota are ubiquitous in ocean waters and play important roles in shaping the dynamics of marine ecosystems. In this study, we leveraged the bioGEOTRACES metagenomic dataset collected across the Atlantic and Pacific Oceans to investigate the biogeography of these viruses in marine environments. We identified 330 viral genomes, including 212 in the order Imitervirales and 54 in the order Algavirales. We found that most viruses appeared to be prevalent in shallow waters (<150 m), and that viruses of the Mesomimiviridae (Imitervirales) and Prasinoviridae (Algavirales) are by far the most abundant and diverse groups in our survey. Five mesomimiviruses and one prasinovirus are particularly widespread in oligotrophic waters; annotation of these genomes revealed common stress response systems, photosynthesis-associated genes, and oxidative stress modulation genes that may be key to their broad distribution in the pelagic ocean. We identified a latitudinal pattern in viral diversity in one cruise that traversed the North and South Atlantic Ocean, with viral diversity peaking at high latitudes of the northern hemisphere. Community analyses revealed three distinct Nucleocytoviricota communities across latitudes, categorized by latitudinal distance towards the equator. Our results contribute to the understanding of the biogeography of these viruses in marine systems.

    more » « less
  2. Rappe, Michael S. (Ed.)
    ABSTRACT For the abundant marine Alphaproteobacterium Pelagibacter (SAR11), and other bacteria, phages are powerful forces of mortality. However, little is known about the most abundant Pelagiphages in nature, such as the widespread HTVC023P-type, which is currently represented by two cultured phages. Using viral metagenomic data sets and fluorescence-activated cell sorting, we recovered 80 complete, undescribed Podoviridae genomes that form 10 phylogenomically distinct clades (herein, named Clades I to X) related to the HTVC023P-type. These expanded the HTVC023P-type pan-genome by 15-fold and revealed 41 previously unknown auxiliary metabolic genes (AMGs) in this viral lineage. Numerous instances of partner-AMGs (colocated and involved in related functions) were observed, including partners in nucleotide metabolism, DNA hypermodification, and Curli biogenesis. The Type VIII secretion system (T8SS) responsible for Curli biogenesis was identified in nine genomes and expanded the repertoire of T8SS proteins reported thus far in viruses. Additionally, the identified T8SS gene cluster contained an iron-dependent regulator (FecR), as well as a histidine kinase and adenylate cyclase that can be implicated in T8SS function but are not within T8SS operons in bacteria. While T8SS are lacking in known Pelagibacter , they contribute to aggregation and biofilm formation in other bacteria. Phylogenetic reconstructions of partner-AMGs indicate derivation from cellular lineages with a more recent transfer between viral families. For example, homologs of all T8SS genes are present in syntenic regions of distant Myoviridae Pelagiphages, and they appear to have alphaproteobacterial origins with a later transfer between viral families. The results point to an unprecedented multipartner-AMG transfer between marine Myoviridae and Podoviridae. Together with the expansion of known metabolic functions, our studies provide new prospects for understanding the ecology and evolution of marine phages and their hosts. IMPORTANCE One of the most abundant and diverse marine bacterial groups is Pelagibacter . Phages have roles in shaping Pelagibacter ecology; however, several Pelagiphage lineages are represented by only a few genomes. This paucity of data from even the most widespread lineages has imposed limits on the understanding of the diversity of Pelagiphages and their impacts on hosts. Here, we report 80 complete genomes, assembled directly from environmental data, which are from undescribed Pelagiphages and render new insights into the manipulation of host metabolism during infection. Notably, the viruses have functionally related partner genes that appear to be transferred between distant viruses, including a suite that encode a secretion system which both brings a new functional capability to the host and is abundant in phages across the ocean. Together, these functions have important implications for phage evolution and for how Pelagiphage infection influences host biology in manners extending beyond canonical viral lysis and mortality. 
    more » « less
  3. null (Ed.)
    Giant viruses are widespread in the biosphere and play important roles in biogeochemical cycling and host genome evolution. Also known as nucleo-cytoplasmic large DNA viruses (NCLDVs), these eukaryotic viruses harbor the largest and most complex viral genomes known. Studies have shown that NCLDVs are frequently abundant in metagenomic datasets, and that sequences derived from these viruses can also be found endogenized in diverse eukaryotic genomes. The accurate detection of sequences derived from NCLDVs is therefore of great importance, but this task is challenging owing to both the high level of sequence divergence between NCLDV families and the extraordinarily high diversity of genes encoded in their genomes, including some encoding for metabolic or translation-related functions that are typically found only in cellular lineages. Here, we present ViralRecall, a bioinformatic tool for the identification of NCLDV signatures in ‘omic data. This tool leverages a library of giant virus orthologous groups (GVOGs) to identify sequences that bear signatures of NCLDVs. We demonstrate that this tool can effectively identify NCLDV sequences with high sensitivity and specificity. Moreover, we show that it can be useful both for removing contaminating sequences in metagenome-assembled viral genomes as well as the identification of eukaryotic genomic loci that derived from NCLDV. ViralRecall is written in Python 3.5 and is freely available on GitHub: 
    more » « less
  4. Abstract Background Microbes and their viruses are hidden engines driving Earth’s ecosystems from the oceans and soils to humans and bioreactors. Though gene marker approaches can now be complemented by genome-resolved studies of inter-(macrodiversity) and intra-(microdiversity) population variation, analytical tools to do so remain scattered or under-developed. Results Here, we introduce MetaPop, an open-source bioinformatic pipeline that provides a single interface to analyze and visualize microbial and viral community metagenomes at both the macro - and microdiversity levels. Macrodiversity estimates include population abundances and α- and β-diversity. Microdiversity calculations include identification of single nucleotide polymorphisms, novel codon-constrained linkage of SNPs, nucleotide diversity ( π and θ ), and selective pressures (pN/pS and Tajima’s D ) within and fixation indices ( F ST ) between populations. MetaPop will also identify genes with distinct codon usage. Following rigorous validation, we applied MetaPop to the gut viromes of autistic children that underwent fecal microbiota transfers and their neurotypical peers. The macrodiversity results confirmed our prior findings for viral populations (microbial shotgun metagenomes were not available) that diversity did not significantly differ between autistic and neurotypical children. However, by also quantifying microdiversity, MetaPop revealed lower average viral nucleotide diversity ( π ) in autistic children. Analysis of the percentage of genomes detected under positive selection was also lower among autistic children, suggesting that higher viral π in neurotypical children may be beneficial because it allows populations to better “bet hedge” in changing environments. Further, comparisons of microdiversity pre- and post-FMT in autistic children revealed that the delivery FMT method (oral versus rectal) may influence viral activity and engraftment of microdiverse viral populations, with children who received their FMT rectally having higher microdiversity post-FMT. Overall, these results show that analyses at the macro level alone can miss important biological differences. Conclusions These findings suggest that standardized population and genetic variation analyses will be invaluable for maximizing biological inference, and MetaPop provides a convenient tool package to explore the dual impact of macro - and microdiversity across microbial communities. 
    more » « less
  5. Background

    The recent development of metagenomic sequencing makes it possible to massively sequence microbial genomes including viral genomes without the need for laboratory culture. Existing reference‐based and gene homology‐based methods are not efficient in identifying unknown viruses or short viral sequences from metagenomic data.


    Here we developed a reference‐free and alignment‐free machine learning method, DeepVirFinder, for identifying viral sequences in metagenomic data using deep learning.


    Trained based on sequences from viral RefSeq discovered before May 2015, and evaluated on those discovered after that date, DeepVirFinder outperformed the state‐of‐the‐art method VirFinder at all contig lengths, achieving AUROC 0.93, 0.95, 0.97, and 0.98 for 300, 500, 1000, and 3000 bp sequences respectively. Enlarging the training data with additional millions of purified viral sequences from metavirome samples further improved the accuracy for identifying virus groups that are under‐represented. Applying DeepVirFinder to real human gut metagenomic samples, we identified 51,138 viral sequences belonging to 175 bins in patients with colorectal carcinoma (CRC). Ten bins were found associated with the cancer status, suggesting viruses may play important roles in CRC.


    Powered by deep learning and high throughput sequencing metagenomic data, DeepVirFinder significantly improved the accuracy of viral identification and will assist the study of viruses in the era of metagenomics.

    more » « less