Predicting the hosts of newly discovered viruses is important for pandemic surveillance of infectious diseases. We investigated the use of alignment-based methods, alignment-free methods, and a support vector machine using mononucleotide frequency and dinucleotide bias to predict the hosts of viruses, and applied these approaches to three datasets: rabies virus, coronavirus, and influenza A virus. For coronavirus, we used the spike gene sequences, while for rabies and influenza A viruses, we used the more conserved nucleoprotein gene sequences. We compared the three methods under different scenarios and showed that their performance is highly correlated with the variability of sequences and sample size. For conserved genes like the nucleoprotein gene, longer
- PAR ID: 10154096
- Publisher / Repository: Nature Publishing Group
- Date Published:
- Journal Name: Scientific Reports
- Volume: 8
- Issue: 1
- ISSN: 2045-2322
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
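The abstract above names mononucleotide frequency and dinucleotide bias as the input features for the support vector machine, but does not define them. The following is a minimal sketch under a stated assumption: dinucleotide bias is taken to be the commonly used odds ratio, the observed dinucleotide frequency divided by the product of the two mononucleotide frequencies. The function names are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def mononucleotide_frequency(seq):
    """Return the relative frequency of A, C, G and T, ignoring other symbols."""
    seq = seq.upper()
    counts = Counter(b for b in seq if b in BASES)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T characters")
    return {b: counts[b] / total for b in BASES}


def dinucleotide_bias(seq):
    """Return observed/expected ratios for all 16 dinucleotides.

    Assumption: bias(xy) = f(xy) / (f(x) * f(y)), a common odds-ratio
    definition; the paper's exact formula may differ.
    """
    seq = seq.upper()
    mono = mononucleotide_frequency(seq)
    # Count overlapping pairs, skipping any pair that contains an ambiguous base.
    pairs = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in BASES and seq[i + 1] in BASES
    )
    total = sum(pairs.values())
    bias = {}
    for x, y in product(BASES, repeat=2):
        expected = mono[x] * mono[y]
        observed = pairs[x + y] / total if total else 0.0
        # If a base is absent, the expected frequency is zero; report 0.0.
        bias[x + y] = observed / expected if expected > 0 else 0.0
    return bias


def feature_vector(seq):
    """Concatenate 4 mononucleotide frequencies and 16 dinucleotide biases."""
    mono = mononucleotide_frequency(seq)
    bias = dinucleotide_bias(seq)
    return [mono[b] for b in BASES] + [bias[x + y] for x, y in product(BASES, repeat=2)]
```

Each sequence then yields a 20-dimensional vector that could be passed, together with host labels, to any standard support vector machine implementation; the choice of kernel and parameters is not stated in the abstract.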
More Like this
-
Abstract Dinoflagellates from the family Symbiodiniaceae are phototrophic marine protists that engage in symbiosis with diverse hosts. Their large and distinct genomes are characterized by pervasive gene duplication and large-scale retroposition events. However, little is known about the role and scale of horizontal gene transfer (HGT) in the evolution of this algal family. In other dinoflagellates, high levels of HGTs have been observed, linked to major genomic transitions, such as the appearance of a viral-acquired nucleoprotein that originated via HGT from a large DNA algal virus. Previous work showed that Symbiodiniaceae from different hosts are actively infected by viral groups, such as giant DNA viruses and ssRNA viruses, that may play an important role in coral health. Latent viral infections may also occur, whereby viruses could persist in the cytoplasm or integrate into the host genome as a provirus. This hypothesis received experimental support; however, the cellular localization of putative latent viruses and their taxonomic affiliation are still unknown. In addition, despite the finding of viral sequences in some genomes of Symbiodiniaceae, viral origin, taxonomic breadth, and metabolic potential have not been explored. To address these questions, we searched for putative viral-derived proteins in thirteen Symbiodiniaceae genomes. We found fifty-nine candidate viral-derived HGTs that gave rise to twelve phylogenies across ten genomes. We also describe the taxonomic affiliation of these virus-related sequences, their structure, and their genomic context. These results lead us to propose a model to explain the origin and fate of Symbiodiniaceae viral acquisitions.
-
Abstract Establishing the host range for novel viruses remains a challenge. Here, we address the challenge of identifying non-human animal coronaviruses that may infect humans by creating an artificial neural network model that learns from spike protein sequences of alpha and beta coronaviruses and their binding annotation to their host receptor. The proposed method produces a human-Binding Potential (h-BiP) score that distinguishes, with high accuracy, the binding potential among coronaviruses. Three viruses, previously unknown to bind human receptors, were identified: Bat coronavirus BtCoV/133/2005 and Pipistrellus abramus bat coronavirus HKU5-related (both MERS related viruses), and Rhinolophus affinis coronavirus isolate LYRa3 (a SARS related virus). We further analyze the binding properties of BtCoV/133/2005 and LYRa3 using molecular dynamics. To test whether this model can be used for surveillance of novel coronaviruses, we re-trained the model on a set that excludes SARS-CoV-2 and all viral sequences released after SARS-CoV-2 was published. The results predict the binding of SARS-CoV-2 with a human receptor, indicating that machine learning methods are an excellent tool for the prediction of host expansion events.
-
Chloroviruses (family Phycodnaviridae) are dsDNA viruses found throughout the world’s inland waters. The open reading frames in the genomes of 41 sequenced chloroviruses (330 ± 40 kbp each) representing three virus types were analyzed for evidence of evolutionarily conserved local genomic “contexts”, the organization of biological information into units of a scale larger than a gene. Despite a general loss of synteny between virus types, we informatically detected a highly conserved genomic context defined by groups of three or more genes that we have termed “gene gangs”. Unlike previously described local genomic contexts, the definition of gene gangs requires only that member genes be consistently co-localized and are not constrained by strand, regulatory sites, or intervening sequences (and therefore represent a new type of conserved structural genomic element). An analysis of functional annotations and transcriptomic data suggests that some of the gene gangs may organize genes involved in specific biochemical processes, but that this organization does not involve their coordinated expression.
-
Rapid identification of newly emerging or circulating viruses is an important first step toward managing the public health response to potential outbreaks. A portable virus capture device, coupled with label-free Raman spectroscopy, holds the promise of fast detection by rapidly obtaining the Raman signature of a virus followed by a machine learning (ML) approach applied to recognize the virus based on its Raman spectrum, which is used as a fingerprint. We present such an ML approach for analyzing Raman spectra of human and avian viruses. A convolutional neural network (CNN) classifier specifically designed for spectral data achieves very high accuracy for a variety of virus type or subtype identification tasks. In particular, it achieves 99% accuracy for classifying influenza virus type A versus type B, 96% accuracy for classifying four subtypes of influenza A, 95% accuracy for differentiating enveloped and nonenveloped viruses, and 99% accuracy for differentiating avian coronavirus (infectious bronchitis virus [IBV]) from other avian viruses. Furthermore, interpretation of neural net responses in the trained CNN model using a full-gradient algorithm highlights Raman spectral ranges that are most important to virus identification. By correlating ML-selected salient Raman ranges with the signature ranges of known biomolecules and chemical functional groups—for example, amide, amino acid, and carboxylic acid—we verify that our ML model effectively recognizes the Raman signatures of proteins, lipids, and other vital functional groups present in different viruses and uses a weighted combination of these signatures to identify viruses.
-
Background The recent development of metagenomic sequencing makes it possible to massively sequence microbial genomes including viral genomes without the need for laboratory culture. Existing reference‐based and gene homology‐based methods are not efficient in identifying unknown viruses or short viral sequences from metagenomic data.
Methods Here we developed a reference‐free and alignment‐free machine learning method, DeepVirFinder, for identifying viral sequences in metagenomic data using deep learning.
Results Trained based on sequences from viral RefSeq discovered before May 2015, and evaluated on those discovered after that date, DeepVirFinder outperformed the state‐of‐the‐art method VirFinder at all contig lengths, achieving AUROC 0.93, 0.95, 0.97, and 0.98 for 300, 500, 1000, and 3000 bp sequences respectively. Enlarging the training data with additional millions of purified viral sequences from metavirome samples further improved the accuracy for identifying virus groups that are under‐represented. Applying DeepVirFinder to real human gut metagenomic samples, we identified 51,138 viral sequences belonging to 175 bins in patients with colorectal carcinoma (CRC). Ten bins were found associated with the cancer status, suggesting viruses may play important roles in CRC.
Conclusions Powered by deep learning and high throughput sequencing metagenomic data, DeepVirFinder significantly improved the accuracy of viral identification and will assist the study of viruses in the era of metagenomics.