This content will become publicly available on January 16, 2026

Title: A Novel Pipeline for Virus Integration Sites Detection in Tumor Genomes Using Deep Learning
Cancer is one of the leading causes of death worldwide. Pathogenic viruses are estimated to be responsible for 15% of all human cancers globally and pose significant threats to public health. Viruses integrate their genetic material into the host genome, increasing the risk of cancer-promoting changes in it. To understand the molecular mechanisms of virus-mediated cancers, it is crucial to identify viral insertion sites in cancer genomes. However, this effort is hindered by the rapidly increasing volume of tumor sequencing data, along with the challenges of accurate data analysis caused by high viral mutation rates and the difficulty of aligning short reads to the reference genome. Thus, it is crucial to develop an efficient method for virus integration site detection in tumor genomes. This paper proposes a novel pipeline to identify viral integration sites leveraging deep Convolutional Neural Networks (CNNs). Our contributions are twofold: (i) we propose and integrate three novel matrix generation methods into the pipeline, developed after aligning the host and viral genomes with their respective reference genomes; (ii) we employ one-hot encoded images with reduced computational complexity to represent viral integration sites and harness the capabilities of deep CNNs for detection. The paper illustrates our proposed approach and presents experiments conducted using both synthetic and real sequencing data. Our preliminary experimental results are promising, showcasing the effectiveness of the proposed methods in detecting viral integration sites.
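The abstract does not describe the three matrix generation methods, so they are not reproduced here. As a generic illustration of what a one-hot encoded sequence matrix looks like, the following minimal sketch may help. The function name `one_hot_encode`, the A/C/G/T column order, and the all-zero treatment of ambiguous bases are assumptions for illustration, not details taken from the paper.

```python
# Generic one-hot encoding of a DNA sequence (illustrative sketch only;
# not the matrix generation methods proposed in the paper).

BASES = "ACGT"  # assumed column order


def one_hot_encode(seq):
    """Return one row per base, with a single 1 in the column of that base.

    Bases outside A/C/G/T (for example N) become all-zero rows; this is an
    assumption made for the sketch.
    """
    index = {base: col for col, base in enumerate(BASES)}
    matrix = []
    for base in seq.upper():
        row = [0] * len(BASES)
        col = index.get(base)
        if col is not None:
            row[col] = 1
        matrix.append(row)
    return matrix


encoded = one_hot_encode("ACGTN")
for row in encoded:
    print(row)
```

A matrix of this shape (sequence length by four) can be treated as a single-channel image and passed to a convolutional network, which is the general idea behind using one-hot encoded images as CNN input.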
Award ID(s):
2334391
PAR ID:
10570735
Author(s) / Creator(s):
;
Publisher / Repository:
IEEE
Date Published:
ISBN:
979-8-3503-6248-0
Page Range / eLocation ID:
4946 to 4951
Subject(s) / Keyword(s):
CNN, sequencing, NGS, matrix
Format(s):
Medium: X
Location:
Washington, DC, USA
Sponsoring Org:
National Science Foundation
More Like this
  1. Scarpino, Samuel V (Ed.)
    Viruses of microbes are ubiquitous biological entities that reprogram their hosts’ metabolisms during infection in order to produce viral progeny, impacting the ecology and evolution of microbiomes with broad implications for human and environmental health. Advances in genome sequencing have led to the discovery of millions of novel viruses and an appreciation for the great diversity of viruses on Earth. Yet, with knowledge of only “who is there?” we fall short in our ability to infer the impacts of viruses on microbes at population, community, and ecosystem scales. To do this, we need a more explicit understanding of “who do they infect?” Here, we developed a novel machine learning (ML) model, Virus-Host Interaction Predictor (VHIP), to predict virus-host interactions (infection/non-infection) from input virus and host genomes. This ML model was trained and tested on a high-value, manually curated set of 8849 virus-host pairs and their corresponding sequence data. The resulting dataset, ‘Virus Host Range network’ (VHRnet), is core to VHIP functionality. Each data point that underlies the VHIP training and testing represents a lab-tested virus-host pair in VHRnet, from which meaningful signals of viral adaptation to host were computed from genomic sequences. VHIP departs from existing virus-host prediction models in its ability to predict multiple interactions rather than predicting a single most likely host or host clade. As a result, VHIP is able to infer the complexity of virus-host networks in natural systems. VHIP has an 87.8% accuracy rate at predicting interactions between virus-host pairs at the species level and can be applied to novel viral and host population genomes reconstructed from metagenomic datasets.
  2. Roossinck, Marilyn J. (Ed.)
    Insects are known to host a wide variety of beneficial microbes that are fundamental to many aspects of their biology and have substantially shaped their evolution. Notably, parasitoid wasps have repeatedly evolved beneficial associations with viruses that enable developing wasps to survive as parasites that feed from other insects. Ongoing genomic sequencing efforts have revealed that most of these virus-derived entities are fully integrated into the genomes of parasitoid wasp lineages, representing endogenous viral elements (EVEs) that retain the ability to produce virus or virus-like particles within wasp reproductive tissues. All documented parasitoid EVEs have undergone similar genomic rearrangements compared to their viral ancestors, characterized by viral genes scattered across wasp genomes and specific viral gene losses. The recurrent presence of viral endogenization and genomic reorganization in beneficial virus systems identified to date suggests that these features are crucial to forming heritable alliances between parasitoid wasps and viruses. Here, our genomic characterization of a mutualistic poxvirus associated with the wasp Diachasmimorpha longicaudata, known as Diachasmimorpha longicaudata entomopoxvirus (DlEPV), has uncovered the first instance of beneficial virus evolution that does not conform to the genomic architecture shared by parasitoid EVEs with which it displays evolutionary convergence. Rather, DlEPV retains the exogenous viral genome of its poxvirus ancestor and the majority of conserved poxvirus core genes. Additional comparative analyses indicate that DlEPV is related to a fly pathogen and contains a novel gene expansion that may be adaptive to its symbiotic role. Finally, differential expression analysis during virus replication in wasps and fly hosts demonstrates a unique mechanism of functional partitioning that allows DlEPV to persist within and provide benefit to its parasitoid wasp host.
  3. Background: Insects are an important reservoir of viral biodiversity, but the vast majority of viruses associated with insects have not been discovered. Recent studies have employed high-throughput RNA sequencing, which has led to rapid advances in our understanding of insect viral diversity. However, insect genomes frequently contain transcribed endogenous viral elements (EVEs) with significant homology to exogenous viruses, complicating the use of RNAseq for viral discovery. Methods: In this study, we used a multi-pronged sequencing approach to study the virome of an important agricultural pest and prolific vector of plant pathogens, the potato aphid Macrosiphum euphorbiae. We first used rRNA-depleted RNAseq to characterize the microbes found in individual insects. We then used PCR screening to measure the frequency of two heritable viruses in a local aphid population. Lastly, we generated a quality draft genome assembly for M. euphorbiae using Illumina-corrected Nanopore sequencing to identify transcriptionally active EVEs in the host genome. Results: We found reads from two insect-specific viruses (a Flavivirus and an Ambidensovirus) in our RNAseq data, as well as a parasitoid virus (Bracovirus), a plant pathogenic virus (Tombusvirus), and two phages (Acinetobacter and APSE). However, our genome assembly showed that part of the ‘virome’ of this insect can be attributed to EVEs in the host genome. Conclusion: Our work shows that EVEs have led to the misidentification of aphid viruses from RNAseq data, and we argue that this is a widespread challenge for the study of viral diversity in insects.
  4. Metagenomics is a powerful tool for characterising viruses, with broad applications across diverse disciplines, from understanding the ecology and evolutionary history of viruses to identifying causative agents of emerging outbreaks with unknown aetiology. Additionally, metagenomic data contains valuable information about the amount of virus present within samples. However, we have yet to leverage metagenomics to assess viral load, which is a key epidemiological parameter. To effectively use sequencing outputs to inform transmission, we need to understand the relationship between read depth and viral load across a diverse set of viruses. Here, using target enrichment sequencing, we investigated the detection and recovery of virus genomes by spiking known concentrations of DNA and RNA viruses into wild rodent faecal samples. In total, 15 experimental replicates were sequenced with target enrichment sequencing and compared to shotgun sequencing of the same background samples. Target enriched sequencing recovered all spike-in viruses at every concentration (10², 10³, and 10⁵ ± 1 log genome copies) and showed a log-linear relationship between spike-in concentration and mean read depth. Background viruses (including Kobuvirus and Cardiovirus) were recovered consistently across all biological and technical replicates, but genome coverage was variable between virus genera and likely reflected the composition of the target enrichment probe panel. Overall, our study highlights the strengths and weaknesses of using commercially available panels to quantify and characterise wildlife viromes, and underscores the importance of probe panel design for accurately interpreting coverage and read depth. To advance the use of metagenomics for understanding virus transmission, further research will be needed to elucidate how sequencing strategy (e.g. library depth, pooling), virome composition, and probe design influence viral read counts and genome coverage.
  5. The advancement of high throughput sequencing has greatly facilitated the exploration of viruses that infect marine hosts. For example, a number of putative virus genomes belonging to the Totiviridae family have been described in crustacean hosts. However, there has been no characterization of the most newly discovered putative viruses beyond description of their genomes. In this study, two novel double-stranded RNA (dsRNA) virus genomes were discovered in the Atlantic blue crab (Callinectes sapidus) and further investigated. Sequencing of both virus genomes revealed that they each encode RNA dependent RNA polymerase proteins (RdRps) with similarities to toti-like viruses. The viruses were tentatively named Callinectes sapidus toti-like virus 1 (CsTLV1) and Callinectes sapidus toti-like virus 2 (CsTLV2). Both genomes have typical elements required for −1 ribosomal frameshifting, which may induce the expression of an encoded ORF1–ORF2 (gag-pol) fusion protein. Phylogenetic analyses of CsTLV1 and CsTLV2 RdRp amino acid sequences suggested that they are members of two new genera in the family Totiviridae. The CsTLV1 and CsTLV2 genomes were detected in muscle, gill, and hepatopancreas of blue crabs by real-time reverse transcription quantitative PCR (RT-qPCR). The presence of ~40 nm totivirus-like viral particles in all three tissues was verified by transmission electron microscopy, and pathology associated with CsTLV1 and CsTLV2 infections was observed by histology. PCR assays showed the prevalence and geographic range of these viruses to be restricted to the northeast United States sites sampled. The two virus genomes co-occurred in almost all cases, with the CsTLV2 genome being found on its own in 8.5% of cases, and the CsTLV1 genome not yet found on its own. To our knowledge, this is the first report of toti-like viruses in C. sapidus. The information reported here provides the knowledge and tools to investigate transmission and potential pathogenicity of these viruses.