Title: A junction coverage compatibility score to quantify the reliability of transcript abundance estimates and annotation catalogs
Most methods for statistical analysis of RNA-seq data take a matrix of abundance estimates for some type of genomic features as their input, and consequently the quality of any obtained results is directly dependent on the quality of these abundances. Here, we present the junction coverage compatibility score, which provides a way to evaluate the reliability of transcript-level abundance estimates and the accuracy of transcript annotation catalogs. It works by comparing the observed number of reads spanning each annotated splice junction in a genomic region to the predicted number of junction-spanning reads, inferred from the estimated transcript abundances and the genomic coordinates of the corresponding annotated transcripts. We show that although most genes show good agreement between the observed and predicted junction coverages, there is a small set of genes that do not. Genes with poor agreement are found regardless of the method used to estimate transcript abundances, and the corresponding transcript abundances should be treated with care in any downstream analyses.
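The comparison described in the abstract can be illustrated with a minimal sketch. This is not the published definition of the score: the function name `junction_compatibility`, its inputs, and the choice to rescale predictions to the observed total and normalize the absolute differences by that total are all assumptions made for illustration. The abstract only states that observed and predicted junction-spanning read counts are compared.

```python
from typing import Dict, Iterable


def junction_compatibility(
    transcript_junctions: Dict[str, Iterable[str]],
    transcript_coverage: Dict[str, float],
    observed: Dict[str, float],
) -> float:
    """Hypothetical mismatch score for one gene; 0 means perfect agreement.

    transcript_junctions: junction IDs contained in each annotated transcript.
    transcript_coverage: junction-spanning reads each transcript is expected
        to contribute per junction, derived from its estimated abundance.
    observed: observed junction-spanning read count per annotated junction.
    """
    # Predicted coverage of a junction is the sum over all transcripts
    # that contain that junction.
    predicted = {junction: 0.0 for junction in observed}
    for tx, junctions in transcript_junctions.items():
        for junction in junctions:
            if junction in predicted:
                predicted[junction] += transcript_coverage.get(tx, 0.0)

    total_obs = sum(observed.values())
    total_pred = sum(predicted.values())
    if total_obs == 0:
        raise ValueError("no observed junction-spanning reads for this gene")

    # Assumption: rescale predictions so both profiles share the same total,
    # so the score reflects the relative distribution across junctions.
    scale = total_obs / total_pred if total_pred > 0 else 0.0
    mismatch = sum(abs(predicted[j] * scale - observed[j]) for j in observed)
    return mismatch / total_obs


# Two transcripts sharing junction j1 (toy numbers, not real data).
tx_junctions = {"txA": ["j1", "j2"], "txB": ["j1", "j3"]}
tx_coverage = {"txA": 30.0, "txB": 10.0}
print(junction_compatibility(tx_junctions, tx_coverage,
                             {"j1": 40.0, "j2": 30.0, "j3": 10.0}))
```

Under these assumptions, a gene whose observed junction counts match the abundance-derived prediction scores 0, and the score grows as reads shift toward junctions that the estimated abundances do not explain.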
Award ID(s):
1750472
NSF-PAR ID:
10083992
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
Life Science Alliance
Volume:
2
Issue:
1
ISSN:
2575-1077
Page Range / eLocation ID:
e201800175
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Mitochondrial and plastid functions depend on coordinated expression of proteins encoded by genomic compartments that have radical differences in copy number of organellar and nuclear genomes. In polyploids, doubling of the nuclear genome may add challenges to maintaining balanced expression of proteins involved in cytonuclear interactions. Here, we use ribo-depleted RNA sequencing (RNA-seq) to analyze transcript abundance for nuclear and organellar genomes in leaf tissue from four different polyploid angiosperms and their close diploid relatives. We find that even though plastid genomes contain <1% of the number of genes in the nuclear genome, they generate the majority (69.9 to 82.3%) of messenger RNA (mRNA) transcripts in the cell. Mitochondrial genes are responsible for a much smaller percentage (1.3 to 3.7%) of the leaf mRNA pool but still produce much higher transcript abundances per gene compared to the nuclear genome. Nuclear genes encoding proteins that functionally interact with mitochondrial or plastid gene products exhibit mRNA expression levels that are consistently more than 10-fold lower than their organellar counterparts, indicating an extreme cytonuclear imbalance at the RNA level despite the predominance of equimolar interactions at the protein level. Nevertheless, interacting nuclear and organellar genes show strongly correlated transcript abundances across functional categories, suggesting that the observed mRNA stoichiometric imbalance does not preclude coordination of cytonuclear expression. Finally, we show that nuclear genome doubling does not alter the cytonuclear expression ratios observed in diploid relatives in consistent or systematic ways, indicating that successful polyploid plants are able to compensate for cytonuclear perturbations associated with nuclear genome doubling.
  2. Li, Min (Ed.)
    Experimental single-cell approaches are becoming widely used for many purposes, including investigation of the dynamic behaviour of developing biological systems. Consequently, a large number of computational methods for extracting dynamic information from such data have been developed. One example is RNA velocity analysis, in which spliced and unspliced RNA abundances are jointly modeled in order to infer a ‘direction of change’ and thereby a future state for each cell in the gene expression space. Naturally, the accuracy and interpretability of the inferred RNA velocities depend crucially on the correctness of the estimated abundances. Here, we systematically compare five widely used quantification tools, in total yielding thirteen different quantification approaches, in terms of their estimates of spliced and unspliced RNA abundances in five experimental droplet scRNA-seq data sets. We show that there are substantial differences between the quantifications obtained from different tools, and identify typical genes for which such discrepancies are observed. We further show that these abundance differences propagate to the downstream analysis, and can have a large effect on estimated velocities as well as the biological interpretation. Our results highlight that abundance quantification is a crucial aspect of the RNA velocity analysis workflow, and that both the definition of the genomic features of interest and the quantification algorithm itself require careful consideration. 
  3. Abstract

    Candida glabrata is an opportunistic pathogen in humans, responsible for approximately 20% of disseminated candidiasis. Candida glabrata's ability to adhere to host tissue is mediated by GPI‐anchored cell wall proteins (GPI‐CWPs); the corresponding genes contain long tandem repeat regions. These repeat regions resulted in assembly errors in the reference genome. Here, we performed a de novo assembly of the C. glabrata type strain CBS138 using long single‐molecule real‐time reads, with short read sequences (Illumina) for refinement, and constructed telomere‐to‐telomere assemblies of all 13 chromosomes. Our assembly has excellent agreement overall with the current reference genome, but we made substantial corrections within tandem repeat regions. Specifically, we removed 62 genes of which 45 were scrambled due to misassembly in the reference. We annotated 31 novel ORFs of which 24 ORFs are GPI‐CWPs. In addition, we corrected the tandem repeat structure of an additional 21 genes. Our corrections to the genome were substantial, with the length of new genes and tandem repeat corrections amounting to approximately 3.8% of the ORFeome length. As most corrections were within the coding regions of GPI‐CWP genes, our genome assembly establishes a high‐quality reference set of genes and repeat structures for the functional analysis of these cell surface proteins.
  4. Abstract

    Motivation: Droplet-based single-cell RNA-seq (dscRNA-seq) data are being generated at an unprecedented pace, and the accurate estimation of gene-level abundances for each cell is a crucial first step in most dscRNA-seq analyses. When pre-processing the raw dscRNA-seq data to generate a count matrix, care must be taken to account for the potentially large number of multi-mapping locations per read. The sparsity of dscRNA-seq data and the strong 3’ sampling bias make it difficult to disambiguate cases where there is no uniquely mapping read to any of the candidate target genes.

    Results: We introduce a Bayesian framework for information sharing across cells within a sample, or across multiple modalities of data using the same sample, to improve gene quantification estimates for dscRNA-seq data. We use an anchor-based approach to connect cells with similar gene-expression patterns, and learn informative, empirical priors which we provide to alevin’s gene multi-mapping resolution algorithm. This improves the quantification estimates for genes with no uniquely mapping reads (i.e. when there is no unique intra-cellular information). We show our new model improves the per cell gene-level estimates and provides a principled framework for information sharing across multiple modalities. We test our method on a combination of simulated and real datasets under various setups.

    Availability and implementation: The information sharing model is included in alevin and is implemented in C++14. It is available as open-source software, under GPL v3, at https://github.com/COMBINE-lab/salmon as of version 1.1.0.
  5. Abstract

    Alternatively spliced genes produce multiple spliced isoforms, called transcript variants. In differential alternative splicing, transcript variant abundance differs across sample types. Differential alternative splicing is common in animal systems and influences cellular development in many processes, but its extent and significance are not as well known in plants. To investigate differential alternative splicing in plants, we examined RNA‐Seq data from rice seedlings. The data included three biological replicates per sample type, approximately 30 million sequence alignments per replicate, and four sample types: roots and shoots treated with exogenous cytokinin delivered hydroponically or a mock treatment. Cytokinin treatment triggered expression changes in thousands of genes but had negligible effect on splicing patterns. However, many genes were differentially spliced between mock‐treated roots and shoots, indicating that our methods were sufficiently sensitive to detect differential splicing between data sets. Quantitative fragment analysis of reverse transcriptase‐PCR products made from newly prepared rice samples confirmed 9 of 10 differential splicing events between rice roots and shoots. Differential alternative splicing typically changed the relative abundance of splice variants that co‐occurred in a data set. Analysis of a similar (but less deeply sequenced) RNA‐Seq data set from Arabidopsis showed the same pattern. In both the Arabidopsis and rice RNA‐Seq data sets, most genes annotated as alternatively spliced had small minor variant frequencies. Of splicing choices with abundant support for minor forms, most alternative splicing events were located within the protein‐coding sequence and maintained the annotated reading frame.
    A tool for visualizing protein annotations in the context of genomic sequence (ProtAnnot) together with a genome browser (Integrated Genome Browser) were used to visualize and assess effects of differential splicing on gene function. In general, differentially spliced regions coincided with conserved protein domains, indicating that differential alternative splicing is likely to affect protein function between root and shoot tissue in rice.