This content will become publicly available on December 1, 2022

Title: VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses
Abstract Background Viruses are a significant player in many biosphere and human ecosystems, but most signals remain “hidden” in metagenomic/metatranscriptomic sequence datasets due to the lack of universal gene markers, database representatives, and insufficiently advanced identification tools. Results Here, we introduce VirSorter2, a DNA and RNA virus identification tool that leverages genome-informed database advances across a collection of customized automatic classifiers to improve the accuracy and range of virus sequence detection. When benchmarked against genomes from both isolated and uncultivated viruses, VirSorter2 uniquely performed consistently with high accuracy (F1-score > 0.8) across viral diversity, while all other tools under-detected viruses outside of the group most represented in reference databases (i.e., those in the order Caudovirales). Among the tools evaluated, VirSorter2 was also uniquely able to minimize errors associated with atypical cellular sequences including eukaryotic genomes and plasmids. Finally, as the virosphere exploration unravels novel viral sequences, VirSorter2’s modular design makes it inherently able to expand to new types of viruses via the design of new classifiers to maintain maximal sensitivity and specificity. Conclusion With multi-classifier and modular design, VirSorter2 demonstrates higher overall accuracy across major viral groups and will advance our knowledge of virus evolution, diversity, and virus-microbe interaction in various ecosystems. Source code of VirSorter2 is freely available ( ), and VirSorter2 is also available both on bioconda and as an iVirus app on CyVerse ( ).
Award ID(s):
1829831 1759874
Sponsoring Org:
National Science Foundation
More Like this
  1. Background Viruses influence global patterns of microbial diversity and nutrient cycles. Though viral metagenomics (viromics), specifically targeting dsDNA viruses, has been critical for revealing viral roles across diverse ecosystems, its analyses differ in many ways from those used for microbes. To date, viromics benchmarking has covered read pre-processing, assembly, relative abundance, read mapping thresholds and diversity estimation, but other steps would benefit from benchmarking and standardization. Here we use in silico-generated datasets and an extensive literature survey to evaluate and highlight how dataset composition (i.e., viromes vs bulk metagenomes) and assembly fragmentation impact (i) viral contig identification tool, (ii) virus taxonomic classification, and (iii) identification and curation of auxiliary metabolic genes (AMGs). Results The in silico benchmarking of five commonly used virus identification tools shows that gene-content-based tools consistently performed well for long (≥3 kbp) contigs, while k-mer- and blast-based tools were uniquely able to detect viruses from short (≤3 kbp) contigs. Notably, however, the performance increase of k-mer- and blast-based tools for short contigs was obtained at the cost of increased false positives (sometimes up to ∼5% for virome and ∼75% bulk samples), particularly when eukaryotic or mobile genetic element sequences were included in the test datasets. For viral classification, variously sized genome fragments were assessed using gene-sharing network analytics to quantify drop-offs in taxonomic assignments, which revealed correct assignations ranging from ∼95% (whole genomes) down to ∼80% (3 kbp sized genome fragments). A similar trend was also observed for other viral classification tools such as VPF-class, ViPTree and VIRIDIC, suggesting that caution is warranted when classifying short genome fragments and not full genomes.
Finally, we highlight how fragmented assemblies can lead to erroneous identification of AMGs and outline a best-practices workflow to curate candidate AMGs in viral genomes assembled from metagenomes. Conclusion Together, these benchmarking experiments and annotation guidelines should aid researchers seeking to best detect, classify, and characterize the myriad viruses ‘hidden’ in diverse sequence datasets.
  2. Giant viruses are widespread in the biosphere and play important roles in biogeochemical cycling and host genome evolution. Also known as nucleo-cytoplasmic large DNA viruses (NCLDVs), these eukaryotic viruses harbor the largest and most complex viral genomes known. Studies have shown that NCLDVs are frequently abundant in metagenomic datasets, and that sequences derived from these viruses can also be found endogenized in diverse eukaryotic genomes. The accurate detection of sequences derived from NCLDVs is therefore of great importance, but this task is challenging owing to both the high level of sequence divergence between NCLDV families and the extraordinarily high diversity of genes encoded in their genomes, including some encoding for metabolic or translation-related functions that are typically found only in cellular lineages. Here, we present ViralRecall, a bioinformatic tool for the identification of NCLDV signatures in ‘omic data. This tool leverages a library of giant virus orthologous groups (GVOGs) to identify sequences that bear signatures of NCLDVs. We demonstrate that this tool can effectively identify NCLDV sequences with high sensitivity and specificity. Moreover, we show that it can be useful both for removing contaminating sequences in metagenome-assembled viral genomes as well as the identification of eukaryotic genomic loci that derived from NCLDV. ViralRecall is written in Python 3.5 and is freely available on GitHub:
  3. Wayne, Marta (Ed.)
    Abstract The Ichneumonoidea (Ichneumonidae and Braconidae) is an incredibly diverse superfamily of parasitoid wasps that includes species that produce virus-like entities in their reproductive tracts to promote successful parasitism of host insects. Research on these entities has traditionally focused upon two viral genera Bracovirus (in Braconidae) and Ichnovirus (in Ichneumonidae). These viruses are produced using genes known collectively as endogenous viral elements (EVEs) that represent historical, now heritable viral integration events in wasp genomes. Here, new genome sequence assemblies for 11 species and 6 publicly available genomes from the Ichneumonoidea were screened with the goal of identifying novel EVEs and characterizing the breadth of species in lineages with known EVEs. Exhaustive similarity searches combined with the identification of ancient core genes revealed sequences from both known and novel EVEs. One species harbored a novel, independently derived EVE related to a divergent large double-stranded DNA (dsDNA) virus that manipulates behavior in other hymenopteran species. Although bracovirus or ichnovirus EVEs were identified as expected in three species, the absence of ichnoviruses in several species suggests that they are independently derived and present in two younger, less widespread lineages than previously thought. Overall, this study presents a novel bioinformatic approach for EVE discovery in genomes and shows that three divergent virus families (nudiviruses, the ancestors of ichnoviruses, and Leptopilina boulardi Filamentous Virus-like viruses) are recurrently acquired as EVEs in parasitoid wasps. Virus acquisition in the parasitoid wasps is a common process that has occurred in many more than two lineages from a diverse range of arthropod-infecting dsDNA viruses.
  4. Hatfull, Graham F. (Ed.)
    ABSTRACT Bacteria and bacteriophages (phages) have evolved potent defense and counterdefense mechanisms that allowed their survival and greatest abundance on Earth. CRISPR (clustered regularly interspaced short palindromic repeat)-Cas (CRISPR-associated) is a bacterial defense system that inactivates the invading phage genome by introducing double-strand breaks at targeted sequences. While the mechanisms of CRISPR defense have been extensively investigated, the counterdefense mechanisms employed by phages are poorly understood. Here, we report a novel counterdefense mechanism by which phage T4 restores the genomes broken by CRISPR cleavages. Catalyzed by the phage-encoded recombinase UvsX, this mechanism pairs very short stretches of sequence identity (minihomology sites), as few as 3 or 4 nucleotides in the flanking regions of the cleaved site, allowing replication, repair, and stitching of genomic fragments. Consequently, a series of deletions are created at the targeted site, making the progeny genomes completely resistant to CRISPR attack. Our results demonstrate that this is a general mechanism operating against both type II (Cas9) and type V (Cas12a) CRISPR-Cas systems. These studies uncovered a new type of counterdefense mechanism evolved by T4 phage where subtle functional tuning of preexisting DNA metabolism leads to profound impact on phage survival. IMPORTANCE Bacteriophages (phages) are viruses that infect bacteria and use them as replication factories to assemble progeny phages. Bacteria have evolved powerful defense mechanisms to destroy the invading phages by severing their genomes soon after entry into cells. We discovered a counterdefense mechanism evolved by phage T4 to stitch back the broken genomes and restore viral infection. In this process, a small amount of genetic material is deleted or another mutation is introduced, making the phage resistant to future bacterial attack.
The mutant virus might also gain survival advantages against other restriction conditions or DNA damaging events. Thus, bacterial attack not only triggers counterdefenses but also provides opportunities to generate more fit phages. Such defense and counterdefense mechanisms over the millennia led to the extraordinary diversity and the greatest abundance of bacteriophages on Earth. Understanding these mechanisms will open new avenues for engineering recombinant phages for biomedical applications.
  5. Metagenomics has enabled sequencing of viral communities from a myriad of different environments. Viral metagenomic studies routinely uncover sequences with no recognizable homology to known coding regions or genomes. Nevertheless, complete viral genomes have been constructed directly from complex community metagenomes, often through tedious manual curation. To address this, we developed the software tool virMine to identify viral genomes from raw reads representative of viral or mixed (viral and bacterial) communities. virMine automates sequence read quality control, assembly, and annotation. Researchers can easily refine their search for a specific study system and/or feature(s) of interest. In contrast to other viral genome detection tools that often rely on the recognition of viral signature sequences, virMine is not restricted by the insufficient representation of viral diversity in public data repositories. Rather, viral genomes are identified through an iterative approach, first omitting non-viral sequences. Thus, both relatives of previously characterized viruses and novel species can be detected, including both eukaryotic viruses and bacteriophages. Here we present virMine and its analysis of synthetic communities as well as metagenomic data sets from three distinctly different environments: the gut microbiota, the urinary microbiota, and freshwater viromes. Several new viral genomes were identified and annotated, thus contributing to our understanding of viral genetic diversity in these three environments.