This content will become publicly available on May 19, 2025

Title: The landscape of gene loss and missense variation across the mammalian tree informs on gene essentiality
ABSTRACT Background

The degree of gene and sequence preservation across species provides valuable insights into the relative necessity of genes from the perspective of natural selection. Here, we developed novel interspecies metrics across 462 mammalian species, GISMO (Gene identity score of mammalian orthologs) and GISMO-mis (GISMO-missense), to quantify gene loss traversing millions of years of evolution. GISMO is a measure of gene loss across mammals weighted by evolutionary distance relative to humans, whereas GISMO-mis quantifies the ratio of missense to synonymous variants across mammalian species for a given gene.

Rationale

Despite large sample sizes, current human constraint metrics are still not well calibrated for short genes. Traversing over 100 million years of evolution across hundreds of mammals can identify the most essential genes and improve gene-disease association. Beyond human genetics, these metrics provide measures of gene constraint to further enable mammalian genetics research.

Results

Our analyses showed that both metrics are strongly correlated with measures of human gene constraint for loss-of-function, missense, and copy number dosage derived from upwards of a million human samples, which highlights the power of interspecies constraint. Importantly, neither GISMO nor GISMO-mis is strongly correlated with coding sequence length. Therefore, both metrics can identify novel constrained genes that were too small for existing human constraint metrics to capture. We also found that GISMO scores capture rare variant association signals across a range of phenotypes associated with decreased fecundity, such as schizophrenia, autism, and neurodevelopmental disorders. Moreover, common variant heritability of disease traits is highly enriched in the most constrained deciles of both metrics, further underscoring the biological relevance of these metrics in identifying functionally important genes. We further showed that both scores have the lowest duplication and deletion rates in the most constrained deciles for copy number variants in the UK Biobank, suggesting that they may be important metrics for dosage sensitivity. We additionally demonstrate that GISMO can improve prioritization of recessive disorder genes and captures homozygous selection.

Conclusions

Overall, we demonstrate that the most constrained genes for gene loss and missense variation capture the largest fraction of heritability, and that GISMO can help prioritize recessive disorder genes and identify the most conserved genes across the mammalian tree.

 
Award ID(s):
2022007
NSF-PAR ID:
10539112
Publisher / Repository:
bioRxiv
Format(s):
Medium: X
Institution:
bioRxiv
Sponsoring Org:
National Science Foundation
More Like this
  1. Introduction

    Thousands of genetic variants have been associated with human diseases and traits through genome-wide association studies (GWASs). Translating these discoveries into improved therapeutics requires discerning which variants among hundreds of candidates are causally related to disease risk. To date, only a handful of causal variants have been confirmed. Here, we leverage 100 million years of mammalian evolution to address this major challenge.

    Rationale

    We compared genomes from hundreds of mammals and identified bases with unusually few variants (evolutionarily constrained). Constraint is a measure of functional importance that is agnostic to cell type or developmental stage. It can be applied to investigate any heritable disease or trait and is complementary to resources using cell type– and time point–specific functional assays like Encyclopedia of DNA Elements (ENCODE) and Genotype-Tissue Expression (GTEx).

    Results

    Using constraint calculated across placental mammals, 3.3% of bases in the human genome are significantly constrained, including 57.6% of coding bases. Most constrained bases (80.7%) are noncoding. Common variants (allele frequency ≥ 5%) and low-frequency variants (0.5% ≤ allele frequency < 5%) are depleted for constrained bases (1.85 versus 3.26% expected by chance, P < 2.2 × 10⁻³⁰⁸). Pathogenic ClinVar variants are more constrained than benign variants (P < 2.2 × 10⁻¹⁶). The most constrained common variants are more enriched for disease single-nucleotide polymorphism (SNP)–heritability in 63 independent GWASs. The enrichment of SNP-heritability in constrained regions is greater (7.8-fold) than previously reported in mammals and is even higher in primates (11.1-fold). It exceeds the enrichment of SNP-heritability in nonsynonymous coding variants (7.2-fold) and fine-mapped expression quantitative trait loci (eQTL)–SNPs (4.8-fold). The enrichment peaks near constrained bases, with a log-linear decrease of SNP-heritability enrichment as a function of the distance to a constrained base. Zoonomia constraint scores improve functionally informed fine-mapping. Variants at sites constrained in mammals and primates have greater posterior inclusion probabilities and higher per-SNP contributions. In addition, using both constraint and functional annotations improves polygenic risk score accuracy across a range of traits. Finally, incorporating constraint information into the analysis of noncoding somatic variants in medulloblastomas identifies new candidate driver genes.

    Conclusion

    Genome-wide measures of evolutionary constraint can help discern which variants are functionally important. This information may accelerate the translation of genomic discoveries into the biological, clinical, and therapeutic knowledge that is required to understand and treat human disease.

    Figure: Using evolutionary constraint in genomic studies of human diseases. (A) Constraint was calculated across 240 mammal species, including 43 primates (teal line). (B) Pathogenic ClinVar variants (N = 73,885) are more constrained across mammals than benign variants (N = 231,642; P < 2.2 × 10⁻¹⁶). (C) More-constrained bases are more enriched for trait-associated variants (63 GWASs). (D) Enrichment of heritability is higher in constrained regions than in functional annotations (left), even in a joint model with 106 annotations (right). (E) Fine-mapping (PolyFun) using a model that includes constraint scores identifies an experimentally validated association at rs1421085. Error bars represent 95% confidence intervals. BMI, body mass index; LF, low frequency; PIP, posterior inclusion probability.
  2. Abstract Background

    Mammalian gonadal sex is determined by the presence or absence of a Y chromosome and the subsequent production of sex hormones contributes to secondary sexual differentiation. However, sex chromosome-linked genes encoding dosage-sensitive transcription and epigenetic factors are expressed well before gonad formation and have the potential to establish sex-biased expression that persists beyond the appearance of gonadal hormones. Here, we apply a comparative bioinformatics analysis on a pair of published single-cell datasets from mouse and human during very early embryogenesis—from two-cell to pre-implantation stages—to characterize sex-specific signals and to assess the degree of conservation among early acting sex-specific genes and pathways.

    Results

    Clustering and regression analyses of gene expression across samples reveal that sex initially plays a significant role in overall gene expression patterns at the earliest stages of embryogenesis, which may be a byproduct of signals from male and female gametes during fertilization. Although these transcriptional sex effects rapidly diminish, sex-biased genes appear to form sex-specific protein–protein interaction networks across pre-implantation stages in both mammals, providing evidence that sex-biased expression of epigenetic enzymes may establish sex-specific patterns that persist beyond pre-implantation. Non-negative matrix factorization (NMF) on male and female transcriptomes generated clusters of genes with similar expression patterns across sex and developmental stages, including post-fertilization, epigenetic, and pre-implantation ontologies conserved between mouse and human. While the fraction of sex-differentially expressed genes (sexDEGs) in early embryonic stages is similar and functional ontologies are conserved, the genes involved are generally different in mouse and human.

    Conclusions

    This comparative study uncovers sex-specific signals in mouse and human embryos that arise much earlier than expected and pre-date hormonal signaling from the gonads. These early signals have diverged with respect to orthologs yet are conserved in terms of function, with important implications for the use of genetic models for sex-specific disease.

     
  3. Premise

    One evolutionary path from hermaphroditism to dioecy is via a gynodioecious intermediate. The evolution of dioecy may also coincide with the formation of sex chromosomes that possess sex‐determining loci that are physically linked in a region of suppressed recombination. Dioecious papaya (Carica papaya) has an XY chromosome system, where the presence of a Y chromosome determines maleness. However, in cultivation, papaya is gynodioecious, due to the conversion of the male Y chromosome to a hermaphroditic Yh chromosome during its domestication.

    Methods

    We investigated gene expression linked to the X, Y, and Yh chromosomes at different floral developmental stages to identify differentially expressed genes that may be involved in the sexual transition of males to hermaphrodites.

    Results

    We identified 309 sex‐biased genes found on the sex chromosomes, most of which are found in the pseudoautosomal regions. Female (XX) expression in the sex‐determining region was almost double that of X‐linked expression in males (XY) and hermaphrodites (XYh), which rules out dosage compensation for most sex‐linked genes, although an analysis of hemizygous X‐linked loci found evidence of partial dosage compensation. Furthermore, we identified a candidate gene associated with sex determination and the transition to hermaphroditism, a homolog of the MADS‐box protein SHORT VEGETATIVE PHASE.

    Conclusions

    We identified a pattern of partial dosage compensation for hemizygous genes located in the papaya sex‐determining region. Furthermore, we propose that loss‐of‐expression of the Y‐linked SHORT VEGETATIVE PHASE homolog facilitated the transition from males to hermaphrodites in papaya.

     
  4. Abstract Background

    The increase in DNA copy number in Down syndrome (DS; caused by trisomy 21) has led to the DNA dosage hypothesis, which posits that the level of gene expression is proportional to the gene’s DNA copy number. Yet many reports have suggested that a proportion of chromosome 21 genes are dosage compensated back towards typical expression levels (1.0×). In contrast, other reports suggest that dosage compensation is not a common mechanism of gene regulation in trisomy 21, providing support to the DNA dosage hypothesis.

    Results

    In our work, we use both simulated and real data to dissect the elements of differential expression analysis that can lead to the appearance of dosage compensation, even when compensation is demonstrably absent. Using lymphoblastoid cell lines derived from a family with an individual with Down syndrome, we demonstrate that dosage compensation is nearly absent at both nascent transcription (GRO-seq) and steady-state RNA (RNA-seq) levels. Furthermore, we link the limited apparent dosage compensation to expected allelic variation in transcription levels.

    Conclusions

    Transcription dosage compensation does not occur in Down syndrome. Simulated data containing no dosage compensation can appear to have dosage compensation when analyzed via standard methods. Moreover, some chromosome 21 genes that appear to be dosage compensated are consistent with allele specific expression.

     
  5. Abstract

    Sex chromosomes frequently differ from the autosomes in the frequencies of genes with sexually dimorphic or tissue-specific expression. Multiple hypotheses have been put forth to explain the unique gene content of the X chromosome, including selection against male-beneficial X-linked alleles, expression limits imposed by the haploid dosage of the X in males, and interference by the dosage compensation complex on expression in males. Here, we investigate these hypotheses by examining differential gene expression in Drosophila melanogaster following several treatments that have widespread transcriptomic effects: bacterial infection, viral infection, and abiotic stress. We found that genes that are induced (upregulated) by these biotic and abiotic treatments are frequently under-represented on the X chromosome, but so are those that are repressed (downregulated) following treatment. We further show that whether a gene is bound by the dosage compensation complex in males can largely explain the paucity of both up- and downregulated genes on the X chromosome. Specifically, genes that are bound by the dosage compensation complex, or close to a dosage compensation complex high-affinity site, are unlikely to be up- or downregulated after treatment. This relationship, however, could partially be explained by a correlation between differential expression and breadth of expression across tissues. Nonetheless, our results suggest that dosage compensation complex binding, or the associated chromatin modifications, inhibit both up- and downregulation of X chromosome gene expression within specific contexts, including tissue-specific expression. We propose multiple possible mechanisms of action for the effect, including a role of Males absent on the first, a component of the dosage compensation complex, as a dampener of gene expression variance in both males and females. This effect could explain why the Drosophila X chromosome is depauperate in genes with tissue-specific or induced expression, while the mammalian X has an excess of genes with tissue-specific expression.

     