
Title: Computational Methods to Study Human Transcript Variants in COVID-19 Infected Lung Cancer Cells
Microbes and viruses are known to alter host transcriptomes during infection. In light of the challenges posed by the COVID-19 pandemic, a deeper understanding of the disease at the transcriptome level is needed. However, research on transcriptome reprogramming by post-transcriptional regulation remains very limited. In this study, computational methods developed by our lab were applied to RNA-seq data to detect transcript variants (i.e., alternative splicing (AS) and alternative polyadenylation (APA) events). The RNA-seq data were obtained from a publicly available source and consist of mock-treated and SARS-CoV-2 infected (COVID-19) lung alveolar (A549) cells. The analysis shows that more AS events are found in SARS-CoV-2 infected cells than in mock-treated cells, whereas fewer APA events are detected in SARS-CoV-2 infected cells. Combining conventional differential gene expression analysis with transcript variant analysis revealed that most of the genes with transcript variants are not differentially expressed. This indicates that no strong correlation exists between differential gene expression and the AS/APA events in the mock-treated or SARS-CoV-2 infected samples. These genes with transcript variants can serve as another layer of molecular signatures for COVID-19 studies. In addition, the transcript variants are enriched in important biological pathways that were not detected in studies that focused only on differential gene expression analysis. These pathways may therefore point to new molecular mechanisms of SARS-CoV-2 pathogenesis.
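The comparison between differentially expressed genes and genes with transcript variants can be illustrated with simple set operations. The following is a minimal sketch, not the authors' pipeline: the function name `summarize_overlap` and the gene identifiers are hypothetical, and the inputs are assumed to be plain lists of gene names produced by separate upstream analyses.

```python
def summarize_overlap(variant_genes, de_genes):
    """Compare genes carrying AS/APA events with differentially expressed genes.

    Both arguments are iterables of gene identifiers (hypothetical inputs).
    Returns the shared genes, the variant-only genes, and the fraction of
    variant genes that are NOT differentially expressed.
    """
    variant = set(variant_genes)
    de = set(de_genes)
    both = variant & de            # genes flagged by both analyses
    variant_only = variant - de    # variant genes without an expression change
    fraction = len(variant_only) / len(variant) if variant else 0.0
    return {
        "both": sorted(both),
        "variant_only": sorted(variant_only),
        "fraction_variant_not_de": fraction,
    }


if __name__ == "__main__":
    # Placeholder gene names, for illustration only.
    result = summarize_overlap(
        ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
        ["GENE_B", "GENE_E"],
    )
    print(result)
```

A high value of `fraction_variant_not_de` would correspond to the situation described in the abstract, where most genes with transcript variants are not differentially expressed.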
Award ID(s):
1755761 1908495 2003749
Journal Name:
International Journal of Molecular Sciences
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Alternatively spliced genes produce multiple spliced isoforms, called transcript variants. In differential alternative splicing, transcript variant abundance differs across sample types. Differential alternative splicing is common in animal systems and influences cellular development in many processes, but its extent and significance are not as well known in plants. To investigate differential alternative splicing in plants, we examined RNA‐Seq data from rice seedlings. The data included three biological replicates per sample type, approximately 30 million sequence alignments per replicate, and four sample types: roots and shoots treated with exogenous cytokinin delivered hydroponically or a mock treatment. Cytokinin treatment triggered expression changes in thousands of genes but had negligible effect on splicing patterns. However, many genes were differentially spliced between mock‐treated roots and shoots, indicating that our methods were sufficiently sensitive to detect differential splicing between data sets. Quantitative fragment analysis of reverse transcriptase‐PCR products made from newly prepared rice samples confirmed 9 of 10 differential splicing events between rice roots and shoots. Differential alternative splicing typically changed the relative abundance of splice variants that co‐occurred in a data set. Analysis of a similar (but less deeply sequenced) RNA‐Seq data set from Arabidopsis showed the same pattern. In both the Arabidopsis and rice RNA‐Seq data sets, most genes annotated as alternatively spliced had small minor variant frequencies. Of splicing choices with abundant support for minor forms, most alternative splicing events were located within the protein‐coding sequence and maintained the annotated reading frame. 
A tool for visualizing protein annotations in the context of genomic sequence (ProtAnnot) together with a genome browser (Integrated Genome Browser) were used to visualize and assess effects of differential splicing on gene function. In general, differentially spliced regions coincided with conserved protein domains, indicating that differential alternative splicing is likely to affect protein function between root and shoot tissue in rice.

  2. The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the etiological agent responsible for coronavirus disease 2019 (COVID-19), has affected the lives of billions and killed millions of infected people. This virus has been demonstrated to have different outcomes among individuals, with some of them presenting a mild infection, while others present severe symptoms or even death. The identification of the molecular states related to the severity of a COVID-19 infection has become of the utmost importance to understanding the differences in critical immune response. In this study, we computationally processed a set of publicly available single-cell RNA-Seq (scRNA-Seq) data of 12 Bronchoalveolar Lavage Fluid (BALF) samples diagnosed as having a mild, severe, or no infection, and generated a high-quality dataset that consists of 63,734 cells, each with 23,916 genes. We extended the cell-type and sub-type composition identification and our analysis showed significant differences in cell-type composition in mild and severe groups compared to the normal. Importantly, inflammatory responses were dramatically elevated in the severe group, which was evidenced by the significant increase in macrophages, from 10.56% in the normal group to 20.97% in the mild group and 34.15% in the severe group. As an indicator of immune defense, populations of T cells accounted for 24.76% in the mild group and decreased to 7.35% in the severe group. To verify these findings, we developed several artificial neural networks (ANNs) and graph convolutional neural network (GCNN) models. We showed that the GCNN models reach a prediction accuracy of the infection of 91.16% using data from subtypes of macrophages. Overall, our study indicates significant differences in the gene expression profiles of inflammatory response and immune cells of severely infected patients. 
  3. Abstract Background

    The eukaryotic genome is capable of producing multiple isoforms from a gene by alternative polyadenylation (APA) during pre-mRNA processing. APA in the 3′-untranslated region (3′-UTR) of mRNA produces transcripts with shorter or longer 3′-UTRs. Often, the 3′-UTR serves as a binding platform for microRNAs and RNA-binding proteins, which affect the fate of the mRNA transcript. Thus, 3′-UTR APA is known to modulate translation and provides a means to regulate gene expression at the post-transcriptional level. Current bioinformatics pipelines have limited capability in profiling 3′-UTR APA events due to incomplete annotations and low analytical resolution: widely available bioinformatics pipelines do not reference actionable polyadenylation (cleavage) sites but infer 3′-UTR APA from RNA-seq read coverage alone, causing false positive identifications. To overcome these limitations, we developed APA-Scan, a robust program that identifies 3′-UTR APA events and visualizes the RNA-seq short-read coverage with gene annotations.


    APA-Scan uses either predicted or experimentally validated actionable polyadenylation signals as a reference for polyadenylation sites and calculates the quantity of long and short 3′-UTR transcripts in the RNA-seq data. APA-Scan works in three major steps: (i) calculate the read coverage of the 3′-UTR regions of genes; (ii) identify the potential APA sites and evaluate the significance of the events between two biological conditions; (iii) generate a graphical representation of user-specified events with 3′-UTR annotation and read coverage on the 3′-UTR regions. APA-Scan is implemented in Python3. Source code and a comprehensive user’s manual are freely available at
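Steps (i) and (ii) above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not APA-Scan's actual code: the function names `isoform_abundance` and `chi2_2x2` are hypothetical, coverage is assumed to be a per-base depth list along one 3′-UTR, the candidate cleavage site is assumed to be given, and a Pearson chi-squared test on a 2x2 table is assumed as the significance test. Treating mean depths as counts is a simplification made for brevity.

```python
import math


def isoform_abundance(coverage, site):
    """Estimate (short, long) 3'-UTR isoform abundance from per-base depth.

    coverage: per-base read depths along the 3'-UTR, 5' to 3' (assumed input).
    site: index of a candidate cleavage site, with 0 < site < len(coverage).
    Reads downstream of the site can only come from the long isoform, so the
    short isoform is estimated as the upstream excess over downstream depth.
    """
    upstream = sum(coverage[:site]) / site
    downstream = sum(coverage[site:]) / (len(coverage) - site)
    return max(upstream - downstream, 0.0), downstream


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (1 degree of freedom) on the table [[a, b], [c, d]].

    Rows are the two conditions; columns are short and long isoform abundance.
    Returns (statistic, p_value).
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # degenerate table: no evidence of a difference
    stat = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom, the chi-squared survival function is erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))


if __name__ == "__main__":
    # Made-up depth profiles for one gene in two conditions.
    short_a, long_a = isoform_abundance([40, 40, 40, 40, 10, 10], site=4)
    short_b, long_b = isoform_abundance([40, 40, 40, 40, 35, 35], site=4)
    print(chi2_2x2(short_a, long_a, short_b, long_b))
```

A small p-value here would flag a shift in the short-to-long 3′-UTR ratio between the two conditions; a real analysis would also need replicate handling and multiple-testing correction, which this sketch omits.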


    APA-Scan was applied to both simulated and real RNA-seq datasets and compared with two widely used baselines, DaPars and APAtrap. In simulation, APA-Scan significantly improved the accuracy of 3′-UTR APA identification compared to both baselines. The performance of APA-Scan was also validated by 3′-end-seq data and qPCR on mouse embryonic fibroblast cells. The experiments confirm that APA-Scan can detect unannotated 3′-UTR APA events and improve genome annotation.


    APA-Scan is a comprehensive computational pipeline to detect transcriptome-wide 3′-UTR APA events. The pipeline integrates information from both RNA-seq and 3′-end-seq data and can efficiently identify significant events with high-resolution short-read coverage plots.

  4. Swanson, Michele S. (Ed.)
    ABSTRACT Wastewater surveillance (WS), when coupled with advanced molecular techniques, offers near real-time monitoring of community-wide transmission of SARS-CoV-2 and allows assessing and mitigating COVID-19 outbreaks, by evaluating the total microbial assemblage in a community. Composite wastewater samples (24 h) were collected weekly from a manhole between December 2020 and November 2021 in Maryland, USA. RT-qPCR results showed concentrations of SARS-CoV-2 RNA recovered from wastewater samples reflected incidence of COVID-19 cases. When a drastic increase in COVID-19 was detected in February 2021, samples were selected for microbiome analysis (DNA metagenomics, RNA metatranscriptomics, and targeted SARS-CoV-2 sequencing). Targeted SARS-CoV-2 sequencing allowed for detection of important genetic mutations, such as spike: K417N, D614G, P681H, T716I, S982A, and D1118H, commonly associated with increased cell entry and reinfection. Microbiome analysis (DNA and RNA) provided important insight with respect to human health-related factors, including detection of pathogens and their virulence/antibiotic resistance genes. Specific microbial species comprising the wastewater microbiome correlated with incidence of SARS-CoV-2 RNA, suggesting potential association with SARS-CoV-2 infection. Climatic conditions, namely, temperature, were related to incidence of COVID-19 and detection of SARS-CoV-2 in wastewater, having been monitored as part of an environmental risk score assessment carried out in this study. In summary, the wastewater microbiome provides useful public health information, and hence, a valuable tool to proactively detect and characterize pathogenic agents circulating in a community. In effect, metagenomics of wastewater can serve as an early warning system for communicable diseases, by providing a larger source of information for health departments and public officials. 
IMPORTANCE Traditionally, testing for COVID-19 is done by detecting SARS-CoV-2 in samples collected from nasal swabs and/or saliva. However, SARS-CoV-2 can also be detected in feces of infected individuals. Therefore, wastewater samples can be used to test all individuals of a community contributing to the sewage collection system, i.e., the infrastructure, such as gravity pipes, manholes, tanks, lift stations, control structures, and force mains, that collects used water from residential and commercial sources and conveys the flow to a wastewater treatment plant. Here, we profile community wastewater collected from a manhole, detect presence of SARS-CoV-2, identify genetic mutations of SARS-CoV-2, and perform COVID-19 risk score assessment of the study area. Using metagenomics analysis, we also detect other microorganisms (bacteria, fungi, protists, and viruses) present in the samples. Results show that by analyzing all microorganisms present in wastewater, pathogens circulating in a community can provide an early warning for contagious diseases. 
  5. Abstract Background Accurate and comprehensive annotation of transcript sequences is essential for transcript quantification and differential gene and transcript expression analysis. Single-molecule long-read sequencing technologies provide improved integrity of transcript structures including alternative splicing, and transcription start and polyadenylation sites. However, accuracy is significantly affected by sequencing errors, mRNA degradation, or incomplete cDNA synthesis. Results We present a new and comprehensive Arabidopsis thaliana Reference Transcript Dataset 3 (AtRTD3). AtRTD3 contains over 169,000 transcripts—twice that of the best current Arabidopsis transcriptome and including over 1500 novel genes. Seventy-eight percent of transcripts are from Iso-seq with accurately defined splice junctions and transcription start and end sites. We develop novel methods to determine splice junctions and transcription start and end sites accurately. Mismatch profiles around splice junctions provide a powerful feature to distinguish correct splice junctions and remove false splice junctions. Stratified approaches identify high-confidence transcription start and end sites and remove fragmentary transcripts due to degradation. AtRTD3 is a major improvement over existing transcriptomes as demonstrated by analysis of an Arabidopsis cold response RNA-seq time-series. AtRTD3 provides higher resolution of transcript expression profiling and identifies cold-induced differential transcription start and polyadenylation site usage. Conclusions AtRTD3 is the most comprehensive Arabidopsis transcriptome currently. It improves the precision of differential gene and transcript expression, differential alternative splicing, and transcription start/end site usage analysis from RNA-seq data. The novel methods for identifying accurate splice junctions and transcription start/end sites are widely applicable and will improve single-molecule sequencing analysis from any species. 