Title: ViralRecall—A Flexible Command-Line Tool for the Detection of Giant Virus Signatures in ‘Omic Data
Giant viruses are widespread in the biosphere and play important roles in biogeochemical cycling and host genome evolution. Also known as nucleo-cytoplasmic large DNA viruses (NCLDVs), these eukaryotic viruses harbor the largest and most complex viral genomes known. Studies have shown that NCLDVs are frequently abundant in metagenomic datasets, and that sequences derived from these viruses can also be found endogenized in diverse eukaryotic genomes. The accurate detection of sequences derived from NCLDVs is therefore of great importance, but this task is challenging owing to both the high level of sequence divergence between NCLDV families and the extraordinarily high diversity of genes encoded in their genomes, including some encoding metabolic or translation-related functions that are typically found only in cellular lineages. Here, we present ViralRecall, a bioinformatic tool for the identification of NCLDV signatures in ‘omic data. This tool leverages a library of giant virus orthologous groups (GVOGs) to identify sequences that bear signatures of NCLDVs. We demonstrate that this tool can effectively identify NCLDV sequences with high sensitivity and specificity. Moreover, we show that it can be useful both for removing contaminating sequences in metagenome-assembled viral genomes and for identifying eukaryotic genomic loci derived from NCLDVs. ViralRecall is written in Python 3.5 and is freely available on GitHub: https://github.com/faylward/viralrecall.
Award ID(s):
1918271
PAR ID:
10233045
Journal Name:
Viruses
Volume:
13
Issue:
2
ISSN:
1999-4915
Page Range / eLocation ID:
150
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Metagenomics has enabled sequencing of viral communities from a myriad of different environments. Viral metagenomic studies routinely uncover sequences with no recognizable homology to known coding regions or genomes. Nevertheless, complete viral genomes have been constructed directly from complex community metagenomes, often through tedious manual curation. To address this, we developed the software tool virMine to identify viral genomes from raw reads representative of viral or mixed (viral and bacterial) communities. virMine automates sequence read quality control, assembly, and annotation. Researchers can easily refine their search for a specific study system and/or feature(s) of interest. In contrast to other viral genome detection tools that often rely on the recognition of viral signature sequences, virMine is not restricted by the insufficient representation of viral diversity in public data repositories. Rather, viral genomes are identified through an iterative approach, first omitting non-viral sequences. Thus, both relatives of previously characterized viruses and novel species can be detected, including both eukaryotic viruses and bacteriophages. Here we present virMine and its analysis of synthetic communities as well as metagenomic data sets from three distinctly different environments: the gut microbiota, the urinary microbiota, and freshwater viromes. Several new viral genomes were identified and annotated, thus contributing to our understanding of viral genetic diversity in these three environments. 
  2. Dinoflagellates from the family Symbiodiniaceae are phototrophic marine protists that engage in symbiosis with diverse hosts. Their large and distinct genomes are characterized by pervasive gene duplication and large-scale retroposition events. However, little is known about the role and scale of horizontal gene transfer (HGT) in the evolution of this algal family. In other dinoflagellates, high levels of HGTs have been observed, linked to major genomic transitions, such as the appearance of a viral-acquired nucleoprotein that originated via HGT from a large DNA algal virus. Previous work showed that Symbiodiniaceae from different hosts are actively infected by viral groups, such as giant DNA viruses and ssRNA viruses, that may play an important role in coral health. Latent viral infections may also occur, whereby viruses could persist in the cytoplasm or integrate into the host genome as a provirus. This hypothesis received experimental support; however, the cellular localization of putative latent viruses and their taxonomic affiliation are still unknown. In addition, despite the finding of viral sequences in some genomes of Symbiodiniaceae, viral origin, taxonomic breadth, and metabolic potential have not been explored. To address these questions, we searched for putative viral-derived proteins in thirteen Symbiodiniaceae genomes. We found fifty-nine candidate viral-derived HGTs that gave rise to twelve phylogenies across ten genomes. We also describe the taxonomic affiliation of these virus-related sequences, their structure, and their genomic context. These results lead us to propose a model to explain the origin and fate of Symbiodiniaceae viral acquisitions.
  3. Background: Viruses are a significant player in many biosphere and human ecosystems, but most signals remain “hidden” in metagenomic/metatranscriptomic sequence datasets due to the lack of universal gene markers and database representatives and to insufficiently advanced identification tools. Results: Here, we introduce VirSorter2, a DNA and RNA virus identification tool that leverages genome-informed database advances across a collection of customized automatic classifiers to improve the accuracy and range of virus sequence detection. When benchmarked against genomes from both isolated and uncultivated viruses, VirSorter2 uniquely performed consistently with high accuracy (F1-score > 0.8) across viral diversity, while all other tools under-detected viruses outside of the group most represented in reference databases (i.e., those in the order Caudovirales). Among the tools evaluated, VirSorter2 was also uniquely able to minimize errors associated with atypical cellular sequences including eukaryotic genomes and plasmids. Finally, as the virosphere exploration unravels novel viral sequences, VirSorter2’s modular design makes it inherently able to expand to new types of viruses via the design of new classifiers to maintain maximal sensitivity and specificity. Conclusion: With multi-classifier and modular design, VirSorter2 demonstrates higher overall accuracy across major viral groups and will advance our knowledge of virus evolution, diversity, and virus-microbe interaction in various ecosystems. Source code of VirSorter2 is freely available (https://bitbucket.org/MAVERICLab/virsorter2), and VirSorter2 is also available both on bioconda and as an iVirus app on CyVerse (https://de.cyverse.org/de).
  4. Background: Viruses influence global patterns of microbial diversity and nutrient cycles. Though viral metagenomics (viromics), specifically targeting dsDNA viruses, has been critical for revealing viral roles across diverse ecosystems, its analyses differ in many ways from those used for microbes. To date, viromics benchmarking has covered read pre-processing, assembly, relative abundance, read mapping thresholds and diversity estimation, but other steps would benefit from benchmarking and standardization. Here we use in silico-generated datasets and an extensive literature survey to evaluate and highlight how dataset composition (i.e., viromes vs bulk metagenomes) and assembly fragmentation impact (i) viral contig identification tools, (ii) virus taxonomic classification, and (iii) identification and curation of auxiliary metabolic genes (AMGs). Results: The in silico benchmarking of five commonly used virus identification tools shows that gene-content-based tools consistently performed well for long (≥3 kbp) contigs, while k-mer- and blast-based tools were uniquely able to detect viruses from short (≤3 kbp) contigs. Notably, however, the performance increase of k-mer- and blast-based tools for short contigs was obtained at the cost of increased false positives (sometimes up to ∼5% for virome and ∼75% for bulk samples), particularly when eukaryotic or mobile genetic element sequences were included in the test datasets. For viral classification, variously sized genome fragments were assessed using gene-sharing network analytics to quantify drop-offs in taxonomic assignments, which revealed correct assignations ranging from ∼95% (whole genomes) down to ∼80% (3 kbp sized genome fragments). A similar trend was also observed for other viral classification tools such as VPF-class, ViPTree and VIRIDIC, suggesting that caution is warranted when classifying short genome fragments rather than full genomes. Finally, we highlight how fragmented assemblies can lead to erroneous identification of AMGs and outline a best-practices workflow to curate candidate AMGs in viral genomes assembled from metagenomes. Conclusion: Together, these benchmarking experiments and annotation guidelines should aid researchers seeking to best detect, classify, and characterize the myriad viruses ‘hidden’ in diverse sequence datasets.
  5. Bordenstein, Seth (Ed.)
    Abstract: Viruses belonging to the Nucleocytoviricota phylum are globally distributed and include members with notably large genomes and complex functional repertoires. Recent studies have shown that these viruses are particularly diverse and abundant in marine systems, but the magnitude of actively replicating Nucleocytoviricota present in ocean habitats remains unclear. In this study, we compiled a curated database of 2,431 Nucleocytoviricota genomes and used it to examine the gene expression of these viruses in a 2.5-day metatranscriptomic time-series from surface waters of the California Current. We identified 145 viral genomes with high levels of gene expression, including 90 Imitervirales and 49 Algavirales viruses. In addition to recovering high expression of core genes involved in information processing that are commonly expressed during viral infection, we also identified transcripts of diverse viral metabolic genes from pathways such as glycolysis, the TCA cycle, and the pentose phosphate pathway, suggesting that virus-mediated reprogramming of central carbon metabolism is common in oceanic surface waters. Surprisingly, we also identified viral transcripts with homology to actin, myosin, and kinesin domains, suggesting that viruses may use these gene products to manipulate host cytoskeletal dynamics during infection. We performed phylogenetic analysis on the virus-encoded myosin and kinesin proteins, which demonstrated that most belong to deep-branching viral clades, but that others appear to have been acquired from eukaryotes more recently. Our results highlight a remarkable diversity of active Nucleocytoviricota in a coastal marine system and underscore the complex functional repertoires expressed by these viruses during infection. Importance: The discovery of giant viruses has transformed our understanding of viral complexity. Although viruses have traditionally been viewed as filterable infectious agents that lack metabolism, giant viruses can reach sizes rivalling cellular lineages and possess genomes encoding central metabolic processes. Recent studies have shown that giant viruses are widespread in aquatic systems, but the activity of these viruses and the extent to which they reprogram host physiology in situ remains unclear. Here, we show that numerous giant viruses consistently express central metabolic enzymes in a coastal marine system, including components of glycolysis, the TCA cycle, and other pathways involved in nutrient homeostasis. Moreover, we found expression of several viral-encoded actin, myosin, and kinesin genes, indicating viral manipulation of the host cytoskeleton during infection. Our study reveals a high activity of giant viruses in a coastal marine system and indicates they are a diverse and underappreciated component of microbial diversity in the ocean.