Title: efam: an expanded, metaproteome-supported HMM profile database of viral protein families
Abstract:
Motivation: Viruses infect, reprogram, and kill microbes, leading to profound ecosystem consequences, from elemental cycling in oceans and soils to microbiome-modulated diseases in plants and animals. Although metagenomic datasets are increasingly available, identifying viruses in them is challenging due to poor representation and annotation of viral sequences in databases.
Results: Here we establish efam, an expanded collection of Hidden Markov Model (HMM) profiles that represent viral protein families conservatively identified from the Global Ocean Virome 2.0 dataset. This resulted in 240,311 HMM profiles, each with at least 2 protein sequences, making efam >7-fold larger than the next largest, pan-ecosystem viral HMM profile database. Adjusting the criteria for viral contig confidence from “conservative” to “eXtremely Conservative” resulted in 37,841 HMM profiles in our efam-XC database. To assess the value of this resource, we integrated efam-XC into VirSorter viral discovery software to discover viruses from less-studied, ecologically distinct oxygen minimum zone (OMZ) marine habitats. This expanded database led to an increase in viruses recovered from every tested OMZ virome by ∼24% on average (up to ∼42%) and especially improved the recovery of often-missed shorter contigs (<5 kb). Additionally, to help elucidate lesser-known viral protein functions, we annotated the profiles using multiple databases from the DRAM pipeline and virion-associated metaproteomic data, which doubled the number of annotations obtainable by standard, single-database annotation approaches. Together, these marine resources (efam and efam-XC) are provided as searchable, compressed HMM databases that will be updated bi-annually to help maximize viral sequence discovery and study from any ecosystem.
Availability: The resources are available on the iVirus platform (doi.org/10.25739/9vze-4143).
Supplementary information: Supplementary data are available at Bioinformatics online.
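As a rough illustration of how a searchable HMM profile database of this kind is commonly queried, the sketch below parses the tabular output that HMMER's hmmsearch writes with --tblout and counts, per contig, the proteins that hit any profile at or below an E-value cutoff. It is not the VirSorter integration described in the abstract. The protein naming convention (contig name plus "_" and a gene index, as Prodigal produces), the 1e-5 cutoff, the profile names, and the sample rows are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical excerpt of an hmmsearch --tblout file (HMMER3); all rows are made up.
SAMPLE_TBLOUT = """\
# target name  accession  query name  accession  E-value  score  bias  E-value  score  bias  exp reg clu ov env dom rep inc description
contigA_1 - profile_0001 - 1.2e-30 105.3 0.1 2.0e-30 104.6 0.1 1.0 1 0 0 1 1 1 1 -
contigA_2 - profile_0042 - 3.0e-08 32.1 0.0 4.1e-08 31.7 0.0 1.0 1 0 0 1 1 1 1 -
contigB_1 - profile_0007 - 0.5 8.2 0.0 0.7 7.8 0.0 1.0 1 0 0 1 1 1 1 -
"""


def count_hits_per_contig(tblout_text, max_evalue=1e-5):
    """Count distinct proteins per contig whose full-sequence E-value is <= max_evalue."""
    proteins = defaultdict(set)
    for line in tblout_text.splitlines():
        # Skip blank lines and the '#' comment/header lines of the tblout format.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        protein_id = fields[0]        # column 1: target (protein) name
        evalue = float(fields[4])     # column 5: full-sequence E-value
        if evalue <= max_evalue:
            # Assumed naming: "<contig>_<gene index>"; strip the trailing index.
            contig = protein_id.rsplit("_", 1)[0]
            proteins[contig].add(protein_id)
    return {contig: len(ids) for contig, ids in proteins.items()}


hits = count_hits_per_contig(SAMPLE_TBLOUT)
print(hits)
```

Per-contig counts like these are only one possible input to a viral-detection heuristic; a real pipeline would combine them with other evidence.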
Award ID(s):
1829831 1759874
NSF-PAR ID:
10256497
Editor(s):
Robinson, Peter
Journal Name:
Bioinformatics
ISSN:
1367-4803
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Pickett, Brett E. ; Jurado, Kellie (Ed.)
    ABSTRACT Data that catalogue viral diversity on Earth have been fragmented across sources, disciplines, formats, and various degrees of open sharing, posing challenges for research on macroecology, evolution, and public health. Here, we solve this problem by establishing a dynamically maintained database of vertebrate-virus associations, called The Global Virome in One Network (VIRION). The VIRION database has been assembled through both reconciliation of static data sets and integration of dynamically updated databases. These data sources are all harmonized against one taxonomic backbone, including metadata on host and virus taxonomic validity and higher classification; additional metadata on sampling methodology and evidence strength are also available in a harmonized format. In total, the VIRION database is the largest open-source, open-access database of its kind, with roughly half a million unique records that include 9,521 resolved virus “species” (of which 1,661 are ICTV ratified), 3,692 resolved vertebrate host species, and 23,147 unique interactions between taxonomically valid organisms. Together, these data cover roughly a quarter of mammal diversity, a 10th of bird diversity, and ∼6% of the estimated total diversity of vertebrates, and a much larger proportion of their virome than any previous database. We show how these data can be used to test hypotheses about microbiology, ecology, and evolution and make suggestions for best practices that address the unique mix of evidence that coexists in these data.
    IMPORTANCE Animals and their viruses are connected by a sprawling, tangled network of species interactions. Data on the host-virus network are available from several sources, which use different naming conventions and often report metadata in different levels of detail. VIRION is a new database that combines several of these existing data sources, reconciles taxonomy to a single consistent backbone, and reports metadata in a format designed by and for virologists. Researchers can use VIRION to easily answer questions like “Can any fish viruses infect humans?” or “Which bats host coronaviruses?” or to build more advanced predictive models, making it an unprecedented step toward a full inventory of the global virome.
  2. Raina, Jean-Baptiste (Ed.)
    ABSTRACT Nutrient availability can significantly influence microbial genomic and proteomic streamlining, for example, by selecting for lower nitrogen to carbon ratios. Oligotrophic open ocean microbes have streamlined genomic nitrogen requirements relative to those of their counterparts in nutrient-rich coastal waters. However, steep gradients in nutrient availability occur at meter-level, and even micron-level, spatial scales. It is unclear whether such gradients also structure genomic and proteomic stoichiometry. Focusing on the eastern tropical North Pacific oxygen minimum zone (OMZ), we use comparative metagenomics to examine how nitrogen availability shapes microbial and viral genome properties along the vertical gradient across the OMZ and between two size fractions, distinguishing free-living microbes versus particle-associated microbes. We find a substantial increase in the nitrogen content of encoded proteins in particle-associated over free-living bacteria and archaea across nitrogen availability regimes over depth. Within each size fraction, we find that bacterial and viral genomic nitrogen tends to increase with increasing nitrate concentrations with depth. In contrast to cellular genes, the nitrogen content of virus proteins does not differ between size fractions. We identified arginine as a key amino acid in the modulation of the C:N ratios of core genes for bacteria, archaea, and viruses. Functional analysis reveals that particle-associated bacterial metagenomes are enriched for genes that are involved in arginine metabolism and organic nitrogen compound catabolism. Our results are consistent with nitrogen streamlining in both cellular and viral genomes on spatial scales of meters to microns. These effects are similar in magnitude to those previously reported across scales of thousands of kilometers. 
    IMPORTANCE The genomes of marine microbes can be shaped by nutrient cycles, with ocean-scale gradients in nitrogen availability being known to influence microbial amino acid usage. It is unclear, however, how genomic properties are shaped by nutrient changes over much smaller spatial scales, for example, along the vertical transition into oxygen minimum zones (OMZs) or from the exterior to the interior of detrital particles. Here, we measure protein nitrogen usage by marine bacteria, archaea, and viruses by using metagenomes from the nitracline of the eastern tropical North Pacific OMZ, including both particle-associated and nonassociated biomass. Our results show higher genomic and proteomic nitrogen content in particle-associated microbes and at depths with higher nitrogen availability for cellular and viral genomes. This discovery suggests that stoichiometry influences microbial and viral evolution across multiple scales, including the micrometer to millimeter scale associated with particle-associated versus free-living lifestyles.
  3. Abstract

    The COVID-19 pandemic, caused by the coronavirus SARS-CoV-2, has resulted in the loss of millions of lives and severe global economic consequences. Every time SARS-CoV-2 replicates, the viruses acquire new mutations in their genomes. Mutations in SARS-CoV-2 genomes have led to increased transmissibility, severe disease outcomes, evasion of the immune response, changes in clinical manifestations, and reduced efficacy of vaccines or treatments. To date, multiple resources provide lists of detected mutations without key functional annotations. There is a lack of research examining the relationship between mutations and various factors such as disease severity, pathogenicity, patient age, patient gender, cross-species transmission, viral immune escape, immune response level, viral transmission capability, viral evolution, host adaptability, viral protein structure, viral protein function, viral protein stability, and concurrent mutations. A deep understanding of the relationship between mutation sites and these factors is crucial for advancing our knowledge of SARS-CoV-2 and for developing effective responses. To fill this gap, we built COV2Var, a function annotation database of SARS-CoV-2 genetic variation, available at http://biomedbdc.wchscu.cn/COV2Var/. COV2Var aims to identify common mutations in SARS-CoV-2 variants and assess their effects, providing a valuable resource for intensive functional annotations of common mutations among SARS-CoV-2 variants.

     
  4. Abstract

    The parasitoid wasp Venturia canescens is an important biological control agent of stored-product moth pests and serves as a model to study the function and evolution of domesticated endogenous viruses (DEVs). The DEVs discovered in V. canescens are known as virus-like particles (VcVLPs), which are produced using nudivirus-derived components and incorporate wasp-derived virulence proteins instead of packaged nucleic acids. Previous studies of virus-derived components in the V. canescens genome identified 53 nudivirus-like genes organized in six gene clusters and several viral pseudogenes, but how VcVLP genes are organized among wasp chromosomes following their integration in the ancestral wasp genome is largely unknown. Here, we present a chromosome-scale genome of V. canescens consisting of 11 chromosomes and 56 unplaced small scaffolds. The genome size is 290.8 Mbp with an N50 scaffold size of 24.99 Mbp. A high-quality gene set including 11,831 protein-coding genes was produced using RNA-Seq data as well as publicly available peptide sequences from related Hymenoptera. A manual annotation of genes of viral origin produced 61 intact and 19 pseudogenized nudivirus-derived genes. The genome assembly revealed that two previously identified clusters were joined into a single cluster and that a total of 5 gene clusters comprising 60 intact nudivirus-derived genes were located on three chromosomes. In contrast, pseudogenes are dispersed among 8 chromosomes, with only 4 pseudogenes associated with nudivirus gene clusters. The architecture of genes encoding VcVLP components suggests that they originate from a recent virus acquisition and that there is a link between the processes of dispersal and pseudogenization. This high-quality genome assembly and annotation represents the first chromosome-scale assembly for parasitoid wasps associated with VLPs, and is publicly available in the National Center for Biotechnology Information Genome and RefSeq databases, providing a valuable resource for future studies of DEVs in parasitoid wasps.

     
  5. INTRODUCTION One of the central applications of the human reference genome has been to serve as a baseline for comparison in nearly all human genomic studies. Unfortunately, many difficult regions of the reference genome have remained unresolved for decades and are affected by collapsed duplications, missing sequences, and other issues. Relative to the current human reference genome, GRCh38, the Telomere-to-Telomere CHM13 (T2T-CHM13) genome closes all remaining gaps, adds nearly 200 million base pairs (Mbp) of sequence, corrects thousands of structural errors, and unlocks the most complex regions of the human genome for scientific inquiry.
    RATIONALE We demonstrate how the T2T-CHM13 reference genome universally improves read mapping and variant identification in a globally diverse cohort. This cohort includes all 3202 samples from the expanded 1000 Genomes Project (1KGP), sequenced with short reads, as well as 17 globally diverse samples sequenced with long reads. By applying state-of-the-art methods for calling single-nucleotide variants (SNVs) and structural variants (SVs), we document the strengths and limitations of T2T-CHM13 relative to its predecessors and highlight its promise for revealing new biological insights within technically challenging regions of the genome.
    RESULTS Across the 1KGP samples, we found more than 1 million additional high-quality variants genome-wide using T2T-CHM13 than with GRCh38. Within previously unresolved regions of the genome, we identified hundreds of thousands of variants per sample, a promising opportunity for evolutionary and biomedical discovery. T2T-CHM13 improves the Mendelian concordance rate among trios and eliminates tens of thousands of spurious SNVs per sample, including a reduction of false positives in 269 challenging, medically relevant genes by up to a factor of 12. These corrections are in large part due to improvements to 70 protein-coding genes in >9 Mbp of inaccurate sequence caused by falsely collapsed or duplicated regions in GRCh38. Using the T2T-CHM13 genome also yields a more comprehensive view of SVs genome-wide, with a greatly improved balance of insertions and deletions. Finally, by providing numerous resources for T2T-CHM13 (including 1KGP genotypes, accessibility masks, and prominent annotation databases), our work will facilitate the transition to T2T-CHM13 from the current reference genome.
    CONCLUSION The vast improvements in variant discovery across samples of diverse ancestries position T2T-CHM13 to succeed as the next prevailing reference for human genetics. T2T-CHM13 thus offers a model for the construction and study of high-quality reference genomes from globally diverse individuals, such as is now being pursued through collaboration with the Human Pangenome Reference Consortium. As a foundation, our work underscores the benefits of an accurate and complete reference genome for revealing diversity across human populations.
    Genomic features and resources available for T2T-CHM13. Comparisons to GRCh38 reveal broad improvements in SNVs, indels, and SVs discovered across diverse human populations by means of short-read (1KGP) and long-read sequencing (LRS). These improvements are due to resolution of complex genomic loci (nonsyntenic and previously unresolved), duplication errors, and discordant haplotypes, including those in medically relevant genes.