Title: CHESS 3: an improved, comprehensive catalog of human genes and transcripts based on large-scale expression data, phylogenetic analysis, and protein structure

CHESS 3 represents an improved human gene catalog based on nearly 10,000 RNA-seq experiments across 54 body sites. It significantly improves current genome annotation by integrating the latest reference data and algorithms, machine learning techniques for noise filtering, and new protein structure prediction methods. CHESS 3 contains 41,356 genes, including 19,839 protein-coding genes and 158,377 transcripts, with 14,863 protein-coding transcripts not in other catalogs. It includes all MANE transcripts and at least one transcript for most RefSeq and GENCODE genes. On the CHM13 human genome, the CHESS 3 catalog contains an additional 129 protein-coding genes. CHESS 3 is available at

Publisher / Repository:
Springer Science + Business Media
Journal Name:
Genome Biology
Sponsoring Org:
National Science Foundation
More Like this
  1. Summary

    Rice is an important cereal crop, being a staple food for over half of the world's population, and sexual reproduction resulting in grain formation underpins global food security. However, despite considerable research efforts, many of the genes, especially long intergenic non‐coding RNA (lincRNA) genes, involved in sexual reproduction in rice remain uncharacterized. With an increasing number of public resources becoming available, information from different sources can be combined to perform gene functional annotation. We report the development of MCRiceRepGP, a machine learning framework which integrates heterogeneous evidence and employs multicriteria decision analysis and machine learning to predict coding and lincRNA genes involved in sexual reproduction in rice. The rice genome was reannotated using deep‐sequencing transcriptomic data from reproduction‐associated tissue/cell types identifying previously unannotated putative protein‐coding genes and lincRNAs. MCRiceRepGP was used for genome‐wide discovery of sexual reproduction associated coding and lincRNA genes. The protein‐coding and lincRNA genes identified have distinct expression profiles, with a large proportion of lincRNAs reaching maximum expression levels in the sperm cells. Some of the genes are potentially linked to male‐ and female‐specific fertility and heat stress tolerance during the reproductive stage. MCRiceRepGP can be used in combination with other genome‐wide studies, such as genome‐wide association studies, giving greater confidence that the genes identified are associated with the biological process of interest. As more data, especially about mutant plant phenotypes, become available, the power of MCRiceRepGP will grow, providing researchers with a tool to identify candidate genes for future experiments. MCRiceRepGP is available as a web application (

  2. Abstract

    As a model organism for studies of cell and environmental biology, the free‐living and cosmopolitan ciliate Euplotes vannus shows intriguing features like dual genome architecture (i.e., separate germline and somatic nuclei in each cell/organism), “gene‐sized” chromosomes, stop codon reassignment, programmed ribosomal frameshifting (PRF) and strong resistance to environmental stressors. However, the molecular mechanisms that account for these remarkable traits remain largely unknown. Here we report a combined analysis of de novo assembled high‐quality macronuclear (MAC; i.e., somatic) and partial micronuclear (MIC; i.e., germline) genome sequences for E. vannus, and transcriptome profiling data under varying conditions. The results demonstrate that: (a) the MAC genome contains more than 25,000 complete “gene‐sized” nanochromosomes (~85 Mb haploid genome size) with the N50 ~2.7 kb; (b) although there is a high frequency of frameshifting at stop codons UAA and UAG, we did not observe impaired transcript abundance as a result of PRF in this species as has been reported for other euplotids; (c) the sequence motif 5′‐TA‐3′ is conserved at nearly all internally‐eliminated sequence (IES) boundaries in the MIC genome, and chromosome breakage sites (CBSs) are duplicated and retained in the MAC genome; (d) by profiling the weighted correlation network of genes in the MAC under different environmental stressors, including nutrient scarcity, extreme temperature, salinity and the presence of ammonia, we identified gene clusters that respond to these external physical or chemical stimulations, and (e) we observed a dramatic increase in HSP70 gene transcription under salinity and chemical stresses but surprisingly, not under temperature changes; we link this temperature‐resistance to the evolved loss of temperature stress‐sensitive elements in regulatory regions. Together with the genome resources generated in this study, which are available online at the Euplotes vannus Genome Database (, these data provide molecular evidence for understanding the unique biology of highly adaptable microorganisms.
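
The N50 quoted in point (a) is a standard assembly statistic: the length L such that sequences of length L or longer account for at least half of the total assembled bases. The sketch below is illustrative only; the function name `n50` and the example lengths are assumptions, not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and their lengths are
    accumulated; N50 is the length at which the running total first
    reaches half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical chromosome lengths in base pairs (not data from the study).
example_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
example_n50 = n50(example_lengths)
```

For an assembly made of many short nanochromosomes, an N50 close to the typical chromosome length is expected, since no single long sequence dominates the total.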

  3. Abstract Background

    The pan-genome of a species is the union of the genes and non-coding sequences present in all individuals (cultivars, accessions, or strains) within that species.
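
The definition above can be illustrated as a plain set union. This is a minimal sketch assuming each individual is summarized by a set of sequence identifiers; the accession names, identifiers, and the function `pan_genome` are hypothetical and are not part of PGV.

```python
def pan_genome(features_by_individual):
    """Union of the features (genes or non-coding sequences) seen in any individual."""
    pan = set()
    for features in features_by_individual.values():
        pan |= set(features)
    return pan


# Hypothetical accessions and feature identifiers, for illustration only.
accessions = {
    "accession_A": {"g1", "g2", "g3"},
    "accession_B": {"g2", "g3", "g4"},
    "accession_C": {"g1", "g5"},
}
pan = pan_genome(accessions)
```

A feature present in only one accession (here `g4` or `g5`) still belongs to the pan-genome, which is what distinguishes it from a single reference genome.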


    Here we introduce PGV, a reference-agnostic representation of the pan-genome of a species based on the notion of consensus ordering. Our experimental results demonstrate that PGV enables an intuitive, effective and interactive visualization of a pan-genome by providing a genome browser that can elucidate complex structural genomic variations.


    The PGV software can be installed via conda or downloaded from The companion PGV browser at http://pgv.cs.ucr.edu can be tested using example bed tracks available from the GitHub page.

  4. Abstract Background

    The eukaryotic genome is capable of producing multiple isoforms from a gene by alternative polyadenylation (APA) during pre-mRNA processing. APA in the 3′-untranslated region (3′-UTR) of mRNA produces transcripts with shorter or longer 3′-UTR. Often, 3′-UTR serves as a binding platform for microRNAs and RNA-binding proteins, which affect the fate of the mRNA transcript. Thus, 3′-UTR APA is known to modulate translation and provides a means to regulate gene expression at the post-transcriptional level. Current bioinformatics pipelines have limited capability in profiling 3′-UTR APA events due to incomplete annotations and a low-resolution analyzing power: widely available bioinformatics pipelines do not reference actionable polyadenylation (cleavage) sites but simulate 3′-UTR APA only using RNA-seq read coverage, causing false positive identifications. To overcome these limitations, we developed APA-Scan, a robust program that identifies 3′-UTR APA events and visualizes the RNA-seq short-read coverage with gene annotations.


    APA-Scan utilizes either predicted or experimentally validated actionable polyadenylation signals as a reference for polyadenylation sites and calculates the quantity of long and short 3′-UTR transcripts in the RNA-seq data. APA-Scan works in three major steps: (i) calculate the read coverage of the 3′-UTR regions of genes; (ii) identify the potential APA sites and evaluate the significance of the events between two biological conditions; (iii) graphically represent user-specified events with 3′-UTR annotation and read coverage on the 3′-UTR regions. APA-Scan is implemented in Python3. Source code and a comprehensive user’s manual are freely available at
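
Steps (i) and (ii) can be sketched in a few lines of Python. This is not APA-Scan's code: the function names are hypothetical, the per-base coverage lists are invented, and the use of a 2×2 chi-squared statistic as the significance measure is an assumption made for illustration, since the abstract does not name the test.

```python
from statistics import mean


def region_means(coverage, site):
    """Step (i): mean per-base coverage upstream and downstream of a candidate site.

    `coverage` is the per-base read depth across one 3'-UTR and `site` is the
    index of a candidate cleavage site. Upstream bases are shared by short and
    long isoforms; downstream bases are covered only by the long isoform.
    """
    if not 0 < site < len(coverage):
        raise ValueError("site must fall strictly inside the 3'-UTR")
    return mean(coverage[:site]), mean(coverage[site:])


def chi_square_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def apa_statistic(cov_cond1, cov_cond2, site):
    """Step (ii), assumed form: compare upstream/downstream coverage between conditions."""
    up1, down1 = region_means(cov_cond1, site)
    up2, down2 = region_means(cov_cond2, site)
    return chi_square_2x2(up1, down1, up2, down2)


# Invented coverage for a 100-base 3'-UTR with a candidate site at position 50.
control = [20] * 50 + [18] * 50   # long isoform dominant
treated = [20] * 50 + [4] * 50    # coverage drops after the site: 3'-UTR shortening
stat = apa_statistic(control, treated, 50)
```

With one degree of freedom, a statistic above 3.84 corresponds to p < 0.05, so the drop in downstream coverage in `treated` would be flagged as a candidate shortening event under these assumptions. Anchoring the comparison at a known site, rather than inferring a breakpoint from coverage alone, reflects the design choice the abstract describes.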


    APA-Scan was applied to both simulated and real RNA-seq datasets and compared with two widely used baselines, DaPars and APAtrap. In simulation, APA-Scan significantly improved the accuracy of 3′-UTR APA identification compared to the other baselines. The performance of APA-Scan was also validated by 3′-end-seq data and qPCR on mouse embryonic fibroblast cells. The experiments confirm that APA-Scan can detect unannotated 3′-UTR APA events and improve genome annotation.


    APA-Scan is a comprehensive computational pipeline to detect transcriptome-wide 3′-UTR APA events. The pipeline integrates both RNA-seq and 3′-end-seq data and can efficiently identify significant events with high-resolution short-read coverage plots.

  5. Abstract

    A phylogenetic analysis of selected oestroid taxa based on 66 morphological traits and sequences from three nuclear protein‐coding genes (CAD, MAC, MCS) resolved the composition and phylogenetic position of the former subfamily Polleniinae of the Calliphoridae – here resurrected at family rank as Polleniidae Brauer & Bergenstamm, 1889 stat. rev. Six species are transferred from the family Rhinophoridae to the Polleniidae: the Palaearctic genus Alvamaja Rognes, along with its single species Alvamaja chlorometallica Rognes, and five Afrotropical species comprising the carinata‐group formerly in the genus Phyto Robineau‐Desvoidy but here assigned to genus Morinia Robineau‐Desvoidy, i.e. M. carinata (Pape, 1987) comb.n., M. lactineala (Pape, 1997) comb.n., M. longirostris (Crosskey, 1977) comb.n., M. royi (Pape, 1997) comb.n. and M. stuckenbergi (Crosskey, 1977) comb.n. The Polleniidae are monophyletic and, in agreement with most recent phylogenetic reconstructions, sister to the Tachinidae. The female of A. chlorometallica and a new species of Morinia of the carinata‐group (Morinia tsitsikamma sp.n. from South Africa) are described.

    This published work has been registered in ZooBank,‐DEE4‐4B0C‐88EA‐35FDE298EBC5.
