NBBt-test: a versatile method for differential analysis of multiple types of RNA-seq data
Abstract Rapid development of transcriptome sequencing technologies has produced a data revolution and new approaches to studying transcriptomic regulation, such as alternative splicing, alternative polyadenylation, and CRISPR knockout screening, in addition to conventional gene expression. A full characterization of the transcriptional landscape of different groups of cells or tissues holds enormous potential for both basic science and clinical applications. Although many methods have been developed for differential gene expression analysis, they are all geared towards a particular type of sequencing data and fail to perform well when applied to other types of transcriptomic data. To fill this gap, we offer a negative beta binomial t-test (NBBt-test). NBBt-test provides multiple functions to perform differential analyses of alternative splicing, polyadenylation, CRISPR knockout screening, and gene expression datasets. Both real and large-scale simulation data show superior performance of NBBt-test, with higher efficiency and lower type I error rate and FDR in identifying differential isoforms, differentially expressed genes, and differential CRISPR knockout screening genes across different sample sizes, when compared against currently popular statistical methods. An R package implementing NBBt-test is available for download from CRAN (https://CRAN.R-project.org/package=NBBttest).
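The abstract compares methods by type I error rate and FDR on simulated data. As a minimal sketch of how those two quantities are commonly computed in a simulation benchmark (this is not the NBBttest package API; the function name, arguments, and toy p-values are hypothetical and chosen only for illustration):

```python
def benchmark_error_rates(p_values, is_null, alpha=0.05):
    """Empirical type I error rate and false discovery proportion.

    p_values: one p-value per simulated gene or isoform (hypothetical input)
    is_null:  True where the simulation contains no true difference
    alpha:    p-value cutoff used to call a feature differential
    """
    # A feature is "called" when its p-value is at or below the cutoff.
    called = [p <= alpha for p in p_values]
    # False positives are called features that were simulated as null.
    false_pos = sum(c and n for c, n in zip(called, is_null))
    n_null = sum(is_null)
    n_called = sum(called)
    # Type I error rate: fraction of null features that were called.
    type1 = false_pos / n_null if n_null else 0.0
    # False discovery proportion: fraction of calls that are null.
    fdp = false_pos / n_called if n_called else 0.0
    return type1, fdp


# Toy data, invented for illustration only.
toy_p = [0.001, 0.03, 0.20, 0.04, 0.60, 0.01]
toy_null = [False, False, False, True, True, True]
type1, fdp = benchmark_error_rates(toy_p, toy_null)
print(type1, fdp)
```

The second value is the false discovery proportion of a single simulated dataset; averaging it over many simulation replicates gives an estimate of the FDR.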
- Award ID(s): 1557417
- PAR ID: 10350945
- Date Published:
- Journal Name: Scientific Reports
- Volume: 12
- Issue: 1
- ISSN: 2045-2322
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Background Accurate and comprehensive annotation of transcript sequences is essential for transcript quantification and differential gene and transcript expression analysis. Single-molecule long-read sequencing technologies provide improved integrity of transcript structures including alternative splicing, and transcription start and polyadenylation sites. However, accuracy is significantly affected by sequencing errors, mRNA degradation, or incomplete cDNA synthesis. Results We present a new and comprehensive Arabidopsis thaliana Reference Transcript Dataset 3 (AtRTD3). AtRTD3 contains over 169,000 transcripts, twice as many as the best current Arabidopsis transcriptome, including over 1500 novel genes. Seventy-eight percent of transcripts are from Iso-seq with accurately defined splice junctions and transcription start and end sites. We develop novel methods to determine splice junctions and transcription start and end sites accurately. Mismatch profiles around splice junctions provide a powerful feature to distinguish correct splice junctions and remove false splice junctions. Stratified approaches identify high-confidence transcription start and end sites and remove fragmentary transcripts due to degradation. AtRTD3 is a major improvement over existing transcriptomes as demonstrated by analysis of an Arabidopsis cold response RNA-seq time-series. AtRTD3 provides higher resolution of transcript expression profiling and identifies cold-induced differential transcription start and polyadenylation site usage. Conclusions AtRTD3 is currently the most comprehensive Arabidopsis transcriptome. It improves the precision of differential gene and transcript expression, differential alternative splicing, and transcription start/end site usage analysis from RNA-seq data. The novel methods for identifying accurate splice junctions and transcription start/end sites are widely applicable and will improve single-molecule sequencing analysis from any species.
-
Regulation of gene expression is a critical link between genotype and phenotype explaining substantial heritable variation within species. However, we are only beginning to understand the ways that specific gene regulatory mechanisms contribute to adaptive divergence of populations. In plants, the post-transcriptional regulatory mechanism of alternative splicing (AS) plays an important role in both development and abiotic stress response, making it a compelling potential target of natural selection. AS allows organisms to generate multiple different transcripts/proteins from a single gene and thus may provide a source of evolutionary novelty. Here, we examine whether variation in alternative splicing and gene expression levels might contribute to adaptation and incipient speciation of dune-adapted prairie sunflowers in Great Sand Dunes National Park, Colorado, USA. We conducted a common garden experiment to assess transcriptomic variation among ecotypes and analyzed differential expression, differential splicing, and gene coexpression. We show that individual genes are strongly differentiated for both transcript level and alternative isoform proportions, even when grown in a common environment, and that gene coexpression networks are disrupted between ecotypes. Furthermore, we examined how genome-wide patterns of sequence divergence correspond to divergence in transcript levels and isoform proportions and find evidence for both cis and trans-regulation. Together, our results emphasize that alternative splicing has been an underappreciated mechanism providing source material for natural selection at short evolutionary time scales.
-
Understanding the relationship between mutations and their genomic and phenotypic consequences has been a longstanding goal of evolutionary biology. However, few studies have investigated the impact of mutations on gene expression and alternative splicing on the genome-wide scale. In this study, we aim to bridge this knowledge gap by utilizing whole-genome sequencing data and RNA sequencing data from 16 obligately parthenogenetic Daphnia mutant lines to investigate the effects of ethyl methanesulfonate-induced mutations on gene expression and alternative splicing. Using rigorous analyses of mutations, expression changes and alternative splicing, we show that trans-effects are the major contributor to the variance in gene expression and alternative splicing between the wild-type and mutant lines, whereas cis mutations affect only a limited number of genes and do not always alter gene expression. Moreover, we show that there is a significant association between differentially expressed genes and exonic mutations, indicating that exonic mutations are an important driver of altered gene expression.
-
Differential polyadenylation sites (PAs) critically regulate gene expression, but their cell type–specific usage and spatial distribution in the brain have not been systematically characterized. Here, we present Infernape, which infers and quantifies PA usage from single-cell and spatial transcriptomic data and show its application in the mouse brain. Infernape uncovers alternative intronic PAs and 3′-UTR lengthening during cortical neurogenesis. Progenitor–neuron comparisons in the excitatory and inhibitory neuron lineages show overlapping PA changes in embryonic brains, suggesting that the neural proliferation–differentiation axis plays a prominent role. In the adult mouse brain, we uncover cell type–specific PAs and visualize such events using spatial transcriptomic data. Over two dozen neurodevelopmental disorder–associated genes such as Csnk2a1 and Mecp2 show differential PAs during brain development. This study presents Infernape to identify PAs from scRNA-seq and spatial data, and highlights the role of alternative PAs in neuronal gene regulation.