Title: A junction coverage compatibility score to quantify the reliability of transcript abundance estimates and annotation catalogs
Most methods for statistical analysis of RNA-seq data take a matrix of abundance estimates for some type of genomic features as their input, and consequently the quality of any obtained results is directly dependent on the quality of these abundances. Here, we present the junction coverage compatibility score, which provides a way to evaluate the reliability of transcript-level abundance estimates and the accuracy of transcript annotation catalogs. It works by comparing the observed number of reads spanning each annotated splice junction in a genomic region to the predicted number of junction-spanning reads, inferred from the estimated transcript abundances and the genomic coordinates of the corresponding annotated transcripts. We show that although most genes show good agreement between the observed and predicted junction coverages, there is a small set of genes that do not. Genes with poor agreement are found regardless of the method used to estimate transcript abundances, and the corresponding transcript abundances should be treated with care in any downstream analyses.
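The comparison described in the abstract can be illustrated with a small Python sketch. This is a simplified illustration, not the published definition of the score. It assumes that the predicted coverage of a junction is the summed abundance of all transcripts containing it, that predictions are rescaled to the observed total, and that the score is the normalized absolute difference. The function name `junction_mismatch_score` and the input layout are hypothetical.

```python
def junction_mismatch_score(transcript_junctions, abundances, observed):
    """Compare observed and predicted junction coverage for one gene.

    transcript_junctions: dict mapping transcript ID -> list of junction IDs
    abundances: dict mapping transcript ID -> estimated abundance
    observed: dict mapping junction ID -> observed junction-spanning read count

    Returns a value >= 0 (0 means perfect agreement), or None when either
    the observed or the predicted coverage sums to zero.
    NOTE: simplified, assumed formula for illustration only.
    """
    # Predicted coverage: sum the abundance of every transcript that
    # contains the junction (assumption: no bias or mappability weights).
    predicted = {}
    for tx, junctions in transcript_junctions.items():
        for j in junctions:
            predicted[j] = predicted.get(j, 0.0) + abundances.get(tx, 0.0)

    # Score over the union of annotated and observed junctions.
    all_junctions = set(predicted) | set(observed)
    total_obs = sum(observed.get(j, 0.0) for j in all_junctions)
    total_pred = sum(predicted.get(j, 0.0) for j in all_junctions)
    if total_obs == 0 or total_pred == 0:
        return None

    # Rescale predictions so both profiles have the same total, then
    # measure how differently the reads are distributed across junctions.
    scale = total_obs / total_pred
    diff = sum(
        abs(predicted.get(j, 0.0) * scale - observed.get(j, 0.0))
        for j in all_junctions
    )
    return diff / total_obs


# Hypothetical gene with two transcripts sharing junction "j1".
tx_junctions = {"t1": ["j1", "j2"], "t2": ["j1", "j3"]}
tx_abundance = {"t1": 10.0, "t2": 30.0}
score = junction_mismatch_score(
    tx_junctions, tx_abundance, {"j1": 40, "j2": 30, "j3": 10}
)
```

Under these assumptions, a gene whose observed junction counts are proportional to the predicted ones scores 0, and larger values flag genes whose abundance estimates or annotation deserve a closer look.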
Journal Name: Life Science Alliance
Sponsoring Org: National Science Foundation
More Like this
  1. Mitochondrial and plastid functions depend on coordinated expression of proteins encoded by genomic compartments that have radical differences in copy number of organellar and nuclear genomes. In polyploids, doubling of the nuclear genome may add challenges to maintaining balanced expression of proteins involved in cytonuclear interactions. Here, we use ribo-depleted RNA sequencing (RNA-seq) to analyze transcript abundance for nuclear and organellar genomes in leaf tissue from four different polyploid angiosperms and their close diploid relatives. We find that even though plastid genomes contain <1% of the number of genes in the nuclear genome, they generate the majority (69.9 to 82.3%) of messenger RNA (mRNA) transcripts in the cell. Mitochondrial genes are responsible for a much smaller percentage (1.3 to 3.7%) of the leaf mRNA pool but still produce much higher transcript abundances per gene than the nuclear genome. Nuclear genes encoding proteins that functionally interact with mitochondrial or plastid gene products exhibit mRNA expression levels that are consistently more than 10-fold lower than their organellar counterparts, indicating an extreme cytonuclear imbalance at the RNA level despite the predominance of equimolar interactions at the protein level. Nevertheless, interacting nuclear and organellar genes show strongly correlated transcript abundances across functional categories, suggesting that the observed mRNA stoichiometric imbalance does not preclude coordination of cytonuclear expression. Finally, we show that nuclear genome doubling does not alter the cytonuclear expression ratios observed in diploid relatives in consistent or systematic ways, indicating that successful polyploid plants are able to compensate for cytonuclear perturbations associated with nuclear genome doubling.
  2. Li, Min (Ed.)
    Experimental single-cell approaches are becoming widely used for many purposes, including investigation of the dynamic behaviour of developing biological systems. Consequently, a large number of computational methods for extracting dynamic information from such data have been developed. One example is RNA velocity analysis, in which spliced and unspliced RNA abundances are jointly modeled in order to infer a ‘direction of change’ and thereby a future state for each cell in the gene expression space. Naturally, the accuracy and interpretability of the inferred RNA velocities depend crucially on the correctness of the estimated abundances. Here, we systematically compare five widely used quantification tools, in total yielding thirteen different quantification approaches, in terms of their estimates of spliced and unspliced RNA abundances in five experimental droplet scRNA-seq data sets. We show that there are substantial differences between the quantifications obtained from different tools, and identify typical genes for which such discrepancies are observed. We further show that these abundance differences propagate to the downstream analysis, and can have a large effect on estimated velocities as well as the biological interpretation. Our results highlight that abundance quantification is a crucial aspect of the RNA velocity analysis workflow, and that both the definition of the genomic features of interest and the quantification algorithm itself require careful consideration. 
  3. Jewel wasps in the genus Nasonia are parasitoids with haplodiploid sex determination and rapid development, and they are easy to culture in the laboratory. They are excellent models for insect genetics, genomics, epigenetics, development, and evolution. Nasonia vitripennis (Nv) and N. giraulti (Ng) are closely related species that can be intercrossed, particularly after removal of the intracellular bacterium Wolbachia, making them a powerful tool to map and positionally clone morphological, behavioral, expression and methylation phenotypes. The Nv reference genome was assembled using Sanger, PacBio and Nanopore approaches and annotated with extensive RNA-seq data. In contrast, the Ng genome is only available through low-coverage resequencing. Therefore, a de novo Ng assembly is urgently needed to advance this system. In this study, we report a high-quality Ng assembly using 10X Genomics linked-reads with 670X sequencing depth. The current assembly has a genome size of 259,040,977 bp in 3,160 scaffolds with 38.05% G-C and a 98.6% BUSCO completeness score. 97% of the RNA reads are perfectly aligned to the genome, indicating high quality in contiguity and completeness. A total of 14,777 genes are annotated in the Ng genome, and 72% of the annotated genes have a one-to-one ortholog in the Nv genome. We report 5 million Ng-Nv SNPs, which will facilitate mapping and population genomic studies in Nasonia. In addition, 42 Ng-specific genes were identified by comparison with the Nv genome and annotation. This is the first de novo assembly for this important species in the Nasonia model system, providing a useful new genomic toolkit.
  4. Background: There has been rapid development of probabilistic models and inference methods for transcript abundance estimation from RNA-seq data. These models aim to accurately estimate transcript-level abundances, to account for different biases in the measurement process, and even to assess uncertainty in resulting estimates that can be propagated to subsequent analyses. The assumed accuracy of the estimates inferred by such methods underpins gene expression-based analysis routinely carried out in the lab. Although hyperparameter selection is known to affect the distributions of inferred abundances (e.g. producing smooth versus sparse estimates), strategies for performing model selection in experimental data have been addressed informally at best. Results: We derive perplexity for evaluating abundance estimates on fragment sets directly. We adapt perplexity from the analogous metric used to evaluate language and topic models and extend the metric to carefully account for corner cases unique to RNA-seq. In experimental data, estimates with the best perplexity also best correlate with qPCR measurements. In simulated data, perplexity is well behaved and concordant with genome-wide measurements against ground truth and differential expression analysis. Furthermore, we demonstrate theoretically and experimentally that perplexity can be computed for arbitrary transcript abundance estimation models. Conclusions: Alongside the derivation and implementation of perplexity for transcript abundance estimation, our study is the first to make possible model selection for transcript abundance estimation on experimental data in the absence of ground truth.
  5. Candida glabrata is an opportunistic pathogen in humans, responsible for approximately 20% of disseminated candidiasis. Candida glabrata's ability to adhere to host tissue is mediated by GPI‐anchored cell wall proteins (GPI‐CWPs); the corresponding genes contain long tandem repeat regions. These repeat regions resulted in assembly errors in the reference genome. Here, we performed a de novo assembly of the C. glabrata type strain CBS138 using long single‐molecule real‐time reads, with short read sequences (Illumina) for refinement, and constructed telomere‐to‐telomere assemblies of all 13 chromosomes. Our assembly has excellent agreement overall with the current reference genome, but we made substantial corrections within tandem repeat regions. Specifically, we removed 62 genes of which 45 were scrambled due to misassembly in the reference. We annotated 31 novel ORFs of which 24 ORFs are GPI‐CWPs. In addition, we corrected the tandem repeat structure of an additional 21 genes. Our corrections to the genome were substantial, with the length of new genes and tandem repeat corrections amounting to approximately 3.8% of the ORFeome length. As most corrections were within the coding regions of GPI‐CWP genes, our genome assembly establishes a high‐quality reference set of genes and repeat structures for the functional analysis of these cell surface proteins.