Metagenomics has enabled sequencing of viral communities from a myriad of different environments. Viral metagenomic studies routinely uncover sequences with no recognizable homology to known coding regions or genomes. Nevertheless, complete viral genomes have been constructed directly from complex community metagenomes, often through tedious manual curation. To address this, we developed the software tool virMine to identify viral genomes from raw reads representative of viral or mixed (viral and bacterial) communities. virMine automates sequence read quality control, assembly, and annotation. Researchers can easily refine their search for a specific study system and/or feature(s) of interest. In contrast to other viral genome detection tools that often rely on the recognition of viral signature sequences, virMine is not restricted by the insufficient representation of viral diversity in public data repositories. Rather, viral genomes are identified through an iterative approach, first omitting non-viral sequences. Thus, both relatives of previously characterized viruses and novel species can be detected, including both eukaryotic viruses and bacteriophages. Here we present virMine and its analysis of synthetic communities as well as metagenomic data sets from three distinctly different environments: the gut microbiota, the urinary microbiota, and freshwater viromes. Several new viral genomes were identified and annotated, thus contributing to our understanding of viral genetic diversity in these three environments.
more »
« less
DeepMicroClass sorts metagenomic contigs into prokaryotes, eukaryotes and viruses
Abstract Sequence classification facilitates a fundamental understanding of the structure of microbial communities. Binary metagenomic sequence classifiers are insufficient because environmental metagenomes are typically derived from multiple sequence sources. Here we introduce a deep-learning based sequence classifier, DeepMicroClass, that classifies metagenomic contigs into five sequence classes, i.e. viruses infecting prokaryotic or eukaryotic hosts, eukaryotic or prokaryotic chromosomes, and prokaryotic plasmids. DeepMicroClass achieved high performance for all sequence classes at various tested sequence lengths ranging from 500 bp to 100 kbps. By benchmarking on a synthetic dataset with variable sequence class composition, we showed that DeepMicroClass obtained better performance for eukaryotic, plasmid and viral contig classification than other state-of-the-art predictors. DeepMicroClass achieved comparable performance on viral sequence classification with geNomad and VirSorter2 when benchmarked on the CAMI II marine dataset. Using a coastal daily time-series metagenomic dataset as a case study, we showed that microbial eukaryotes and prokaryotic viruses are integral to microbial communities. By analyzing monthly metagenomes collected at HOT and BATS, we found relatively higher viral read proportions in the subsurface layer in late summer, consistent with the seasonal viral infection patterns prevalent in these areas. We expect DeepMicroClass will promote metagenomic studies of under-appreciated sequence types.
more »
« less
- Award ID(s):
- 2125142
- PAR ID:
- 10558933
- Publisher / Repository:
- NAR
- Date Published:
- Journal Name:
- NAR Genomics and Bioinformatics
- Volume:
- 6
- Issue:
- 2
- ISSN:
- 2631-9268
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Abstract BackgroundViruses, the majority of which are uncultivated, are among the most abundant biological entities on Earth. From altering microbial physiology to driving community dynamics, viruses are fundamental members of microbiomes. While the number of studies leveraging viral metagenomics (viromics) for studying uncultivated viruses is growing, standards for viromics research are lacking. Viromics can utilize computational discovery of viruses from total metagenomes of all community members (hereafter metagenomes) or use physical separation of virus-specific fractions (hereafter viromes). However, differences in the recovery and interpretation of viruses from metagenomes and viromes obtained from the same samples remain understudied. ResultsHere, we compare viral communities from paired viromes and metagenomes obtained from 60 diverse samples across human gut, soil, freshwater, and marine ecosystems. Overall, viral communities obtained from viromes had greater species richness and total viral genome abundances than those obtained from metagenomes, although there were some exceptions. Despite this, metagenomes still contained many viral genomes not detected in viromes. We also found notable differences in the predicted lytic state of viruses detected in viromes vs metagenomes at the time of sequencing. Other forms of variation observed include genome presence/absence, genome quality, and encoded protein content between viromes and metagenomes, but the magnitude of these differences varied by environment. ConclusionsOverall, our results show that the choice of method can lead to differing interpretations of viral community ecology. We suggest that the choice of whether to target a metagenome or virome to study viral communities should be dependent on the environmental context and ecological questions being asked. However, our overall recommendation to researchers investigating viral ecology and evolution is to pair both approaches to maximize their respective benefits.more » « less
-
Abstract The introduction of high-throughput chromosome conformation capture (Hi-C) into metagenomics enables reconstructing high-quality metagenome-assembled genomes (MAGs) from microbial communities. Despite recent advances in recovering eukaryotic, bacterial, and archaeal genomes using Hi-C contact maps, few of Hi-C-based methods are designed to retrieve viral genomes. Here we introduce ViralCC, a publicly available tool to recover complete viral genomes and detect virus-host pairs using Hi-C data. Compared to other Hi-C-based methods, ViralCC leverages the virus-host proximity structure as a complementary information source for the Hi-C interactions. Using mock and real metagenomic Hi-C datasets from several different microbial ecosystems, including the human gut, cow fecal, and wastewater, we demonstrate that ViralCC outperforms existing Hi-C-based binning methods as well as state-of-the-art tools specifically dedicated to metagenomic viral binning. ViralCC can also reveal the taxonomic structure of viruses and virus-host pairs in microbial communities. When applied to a real wastewater metagenomic Hi-C dataset, ViralCC constructs a phage-host network, which is further validated using CRISPR spacer analyses. ViralCC is an open-source pipeline available athttps://github.com/dyxstat/ViralCC.more » « less
-
Fraser, Claire M. (Ed.)ABSTRACT Metagenomics is a powerful method for interpreting the ecological roles and physiological capabilities of mixed microbial communities. Yet, many tools for processing metagenomic data are neither designed to consider eukaryotes nor are they built for an increasing amount of sequence data. EukHeist is an automated pipeline to retrieve eukaryotic and prokaryotic metagenome-assembled genomes (MAGs) from large-scale metagenomic sequence data sets. We developed the EukHeist workflow to specifically process large amounts of both metagenomic and/or metatranscriptomic sequence data in an automated and reproducible fashion. Here, we applied EukHeist to the large-size fraction data (0.8–2,000 µm) from Tara Oceans to recover both eukaryotic and prokaryotic MAGs, which we refer to as TOPAZ (Tara Oceans Particle-Associated MAGs). The TOPAZ MAGs consisted of >900 environmentally relevant eukaryotic MAGs and >4,000 bacterial and archaeal MAGs. The bacterial and archaeal TOPAZ MAGs expand upon the phylogenetic diversity of likely particle- and host-associated taxa. We use these MAGs to demonstrate an approach to infer the putative trophic mode of the recovered eukaryotic MAGs. We also identify ecological cohorts of co-occurring MAGs, which are driven by specific environmental factors and putative host-microbe associations. These data together add to a number of growing resources of environmentally relevant eukaryotic genomic information. Complementary and expanded databases of MAGs, such as those provided through scalable pipelines like EukHeist, stand to advance our understanding of eukaryotic diversity through increased coverage of genomic representatives across the tree of life. IMPORTANCESingle-celled eukaryotes play ecologically significant roles in the marine environment, yet fundamental questions about their biodiversity, ecological function, and interactions remain. Environmental sequencing enables researchers to document naturally occurring protistan communities, without culturing bias, yet metagenomic and metatranscriptomic sequencing approaches cannot separate individual species from communities. To more completely capture the genomic content of mixed protistan populations, we can create bins of sequences that represent the same organism (metagenome-assembled genomes [MAGs]). We developed the EukHeist pipeline, which automates the binning of population-level eukaryotic and prokaryotic genomes from metagenomic reads. We show exciting insight into what protistan communities are present and their trophic roles in the ocean. Scalable computational tools, like EukHeist, may accelerate the identification of meaningful genetic signatures from large data sets and complement researchers’ efforts to leverage MAG databases for addressing ecological questions, resolving evolutionary relationships, and discovering potentially novel biodiversity.more » « less
-
Abstract Lichen thalli are formed through the symbiotic association of a filamentous fungus and photosynthetic green alga and/or cyanobacterium. Recent studies have revealed lichens also host highly diverse communities of secondary fungal and bacterial symbionts, yet few studies have examined the viral component within these complex symbioses. Here, we describe viral biodiversity and functions in cyanolichens collected from across North America and Europe. As current machine-learning viral-detection tools are not trained on complex eukaryotic metagenomes, we first developed efficient methods to remove eukaryotic reads prior to viral detection and a custom pipeline to validate viral contigs predicted with three machine-learning methods. Our resulting high-quality viral data illustrate that every cyanolichen thallus contains diverse viruses that are distinct from viruses in other terrestrial ecosystems. In addition to cyanobacteria, predicted viral hosts include other lichen-associated bacterial lineages and algae, although a large fraction of viral contigs had no host prediction. Functional annotation of cyanolichen viral sequences predicts numerous viral-encoded auxiliary metabolic genes (AMGs) involved in amino acid, nucleotide, and carbohydrate metabolism, including AMGs for secondary metabolism (antibiotics and antimicrobials) and fatty acid biosynthesis. Overall, the diversity of cyanolichen AMGs suggests that viruses may alter microbial interactions within these complex symbiotic assemblages.more » « less
An official website of the United States government

