Title: Bidirectional subsethood of shared marker profiles enables accurate virus classification
Background. Due to the impact of viral metagenomic sequencing, the official virus taxonomy is updated several times a year, with labels sometimes renamed substantially across releases. While this helps reveal new aspects of the classification of viruses, existing bioinformatic methods for classification struggle to stay in sync with this ever-improving resource. Results. We developed a new computer program, named Virgo, that correctly predicts virus families from metagenomic data with an F1 score above 0.9 using a novel viral sequence similarity metric proposed in this work. Moreover, it ensures compatibility with any version of the official taxonomy of viruses. Conclusions. Virgo is designed to easily incorporate newer releases of the official taxonomy, thus representing a valuable resource for the virology community while raising awareness of the need to develop computational methods that evolve alongside manually curated resources.
Award ID(s):
2125142
PAR ID:
10654603
Publisher / Repository:
Springer Nature
Journal Name:
Microbiome
Volume:
13
Issue:
1
ISSN:
2049-2618
Subject(s) / Keyword(s):
Classification and taxonomy, Virology, Software, Metagenomics, Bidirectional subsethood, ICTV
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Viruses of the phylum Nucleocytoviricota, often referred to as “giant viruses,” are prevalent in various environments around the globe and play significant roles in shaping eukaryotic diversity and activities in global ecosystems. Given the extensive phylogenetic diversity within this viral group and the highly complex composition of their genomes, taxonomic classification of giant viruses, particularly incomplete metagenome-assembled genomes (MAGs), can present a considerable challenge. Here we developed TIGTOG (Taxonomic Information of Giant viruses using Trademark Orthologous Groups), a machine learning-based approach to predict the taxonomic classification of novel giant virus MAGs based on profiles of protein family content. We applied a random forest algorithm to a training set of 1531 quality-checked, phylogenetically diverse Nucleocytoviricota genomes using pre-selected sets of giant virus orthologous groups (GVOGs). The classification models were predictive of viral taxonomic assignments with a cross-validation accuracy of 99.6% at the order level and 97.3% at the family level. We found that no individual GVOGs or genome features significantly influenced the algorithm’s performance or the models’ predictions, indicating that classification predictions were based on a comprehensive genomic signature, which reduced the necessity of a fixed set of marker genes for taxonomic assignment. Our classification models were validated with an independent test set of 823 giant virus genomes with varied genomic completeness and taxonomy and demonstrated an accuracy of 98.6% and 95.9% at the order and family level, respectively. Our results indicate that protein family profiles can be used to accurately classify large DNA viruses at different taxonomic levels and provide a fast and accurate method for the classification of giant viruses. This approach could easily be adapted to other viral groups.
  2. metaviralSPAdes: Assembly of Viruses From Metagenomic Data. Abstract Motivation: Although the set of currently known viruses has been steadily expanding, only a tiny fraction of the Earth's virome has been sequenced so far. Shotgun metagenomic sequencing provides an opportunity to reveal novel viruses but faces the computational challenge of identifying viral genomes that are often difficult to detect in metagenomic assemblies. Results: We describe metaviralSPAdes, a tool for identifying viral genomes in metagenomic assembly graphs that is based on analyzing variations in the coverage depth between viruses and bacterial chromosomes. We benchmarked metaviralSPAdes on diverse metagenomic datasets, verified our predictions using a set of virus-specific Hidden Markov Models, and demonstrated that it improves on the state-of-the-art viral identification pipelines. Availability: metaviralSPAdes includes viralAssembly, viralVerify, and viralComplete modules that are available as standalone packages: https://github.com/ablab/spades/tree/metaviral_publication, https://github.com/ablab/viralVerify/ and https://github.com/ablab/viralComplete/. Supplementary information: Supplementary data are available at Bioinformatics online.
  3. Predicting and simplifying which pathogens may spill over from animals to humans is a major priority in infectious disease biology. Many efforts to determine which viruses are at risk of spillover use a subset of viral traits to find trait-based associations with spillover. We adapt a new method, phylofactorization, to identify not traits but lineages of viruses at risk of spilling over. Phylofactorization is used to partition the International Committee on Taxonomy of Viruses viral taxonomy based on non-human host range of viruses and whether there exists evidence the viruses have infected humans. We identify clades on a range of taxonomic levels with high or low propensities to spill over, thereby simplifying the classification of zoonotic potential of mammalian viruses. Phylofactorization by whether a virus is zoonotic yields many disjoint clades of viruses containing few to no representatives that have spilled over to humans. Phylofactorization by non-human host breadth yields several clades with significantly higher host breadth. We connect the phylogenetic factors above with life-histories of clades, revisit trait-based analyses, and illustrate how cladistic coarse-graining of zoonotic potential can refine trait-based analyses by illuminating clade-specific determinants of spillover risk.
  4. Abstract The introduction of high-throughput chromosome conformation capture (Hi-C) into metagenomics enables reconstructing high-quality metagenome-assembled genomes (MAGs) from microbial communities. Despite recent advances in recovering eukaryotic, bacterial, and archaeal genomes using Hi-C contact maps, few Hi-C-based methods are designed to retrieve viral genomes. Here we introduce ViralCC, a publicly available tool to recover complete viral genomes and detect virus-host pairs using Hi-C data. Compared to other Hi-C-based methods, ViralCC leverages the virus-host proximity structure as a complementary information source for the Hi-C interactions. Using mock and real metagenomic Hi-C datasets from several different microbial ecosystems, including the human gut, cow fecal, and wastewater, we demonstrate that ViralCC outperforms existing Hi-C-based binning methods as well as state-of-the-art tools specifically dedicated to metagenomic viral binning. ViralCC can also reveal the taxonomic structure of viruses and virus-host pairs in microbial communities. When applied to a real wastewater metagenomic Hi-C dataset, ViralCC constructs a phage-host network, which is further validated using CRISPR spacer analyses. ViralCC is an open-source pipeline available at https://github.com/dyxstat/ViralCC.
  5. Abstract Our current knowledge of host–virus interactions in biofilms is limited to computational predictions based on laboratory experiments with a small number of cultured bacteria. However, natural biofilms are diverse and chiefly composed of uncultured bacteria and archaea with no viral infection patterns and lifestyle predictions described to date. Herein, we predict the first DNA sequence-based host–virus interactions in a natural biofilm. Using single-cell genomics and metagenomics applied to a hot spring mat of the Cone Pool in Mono County, California, we provide insights into virus–host range, lifestyle and distribution across different mat layers. Thirty-four out of 130 single cells contained at least one viral contig (26%), which, together with the metagenome-assembled genomes, resulted in detection of 59 viruses linked to 34 host species. Analysis of single-cell amplification kinetics revealed a lack of active viral replication on the single-cell level. These findings were further supported by mapping metagenomic reads from different mat layers to the obtained host–virus pairs, which indicated a low copy number of viral genomes compared to their hosts. Lastly, the metagenomic data revealed high layer specificity of viruses, suggesting limited diffusion to other mat layers. Taken together, these observations indicate that in low mobility environments with high microbial abundance, lysogeny is the predominant viral lifestyle, in line with the previously proposed “Piggyback-the-Winner” theory.