This content will become publicly available on June 17, 2026

Title: Generative prediction of causal gene sets responsible for complex traits
The relationship between genotype and phenotype remains an outstanding question for organism-level traits because these traits are generally complex. The challenge arises from complex traits being determined by a combination of multiple genes (or loci), which leads to an explosion of possible genotype–phenotype mappings. The primary techniques to resolve these mappings are genome/transcriptome-wide association studies, which are limited by their lack of causal inference and statistical power. Here, we develop an approach that combines transcriptional data endowed with causal information and a generative machine learning model designed to strengthen statistical power. Our implementation of the approach, dubbed transcriptome-wide conditional variational autoencoder (TWAVE), includes a variational autoencoder trained on human transcriptional data, which is incorporated into an optimization framework. Given a trait phenotype, TWAVE generates expression profiles, which we dimensionally reduce by identifying independently varying generalized pathways (eigengenes). We then conduct constrained optimization to find causal gene sets that are the gene perturbations whose measured transcriptomic responses best explain trait phenotype differences. By considering several complex traits, we show that the approach identifies causal genes that cannot be detected by the primary existing techniques. Moreover, the approach identifies complex diseases caused by distinct sets of genes, meaning that the disease is polygenic and exhibits distinct subtypes driven by different genotype–phenotype mappings. We suggest that the approach will enable the design of tailored experiments to identify multigenic targets to address complex diseases.
Award ID(s):
2235451
PAR ID:
10609111
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Proceedings of the National Academy of Sciences
Date Published:
Journal Name:
Proceedings of the National Academy of Sciences of the United States of America
Volume:
122
Issue:
24
ISSN:
0027-8424
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Leslie, Christina S. (Ed.)
    Gene regulatory network inference is essential to uncover complex relationships among gene pathways and inform downstream experiments, ultimately enabling regulatory network re-engineering. Network inference from transcriptional time-series data requires accurate, interpretable, and efficient determination of causal relationships among thousands of genes. Here, we develop Bootstrap Elastic net regression from Time Series (BETS), a statistical framework based on Granger causality for the recovery of a directed gene network from transcriptional time-series data. BETS uses elastic net regression and stability selection from bootstrapped samples to infer causal relationships among genes. BETS is highly parallelized, enabling efficient analysis of large transcriptional data sets. We show competitive accuracy on a community benchmark, the DREAM4 100-gene network inference challenge, where BETS is one of the fastest among methods of similar performance and additionally infers whether causal effects are activating or inhibitory. We apply BETS to transcriptional time-series data of differentially-expressed genes from A549 cells exposed to glucocorticoids over a period of 12 hours. We identify a network of 2768 genes and 31,945 directed edges (FDR ≤ 0.2). We validate inferred causal network edges using two external data sources: Overexpression experiments on the same glucocorticoid system, and genetic variants associated with inferred edges in primary lung tissue in the Genotype-Tissue Expression (GTEx) v6 project. BETS is available as an open source software package at https://github.com/lujonathanh/BETS . 
  2. Many remarkable phenotypes have repeatedly occurred across vast evolutionary distances. When convergent traits emerge on the tree of life, they are sometimes driven by the same underlying gene families, while other times, many different gene families are involved. Conversely, a gene family may be repeatedly recruited for a single trait or many different traits. To understand the general rules governing convergence at both genomic and phenotypic levels, we systematically tested associations between 56 binary metabolic traits and gene count in 14,785 gene families from 993 Saccharomycotina yeasts. Using a recently developed phylogenetic approach that reduces spurious correlations, we found that gene family expansion and contraction were significantly linked to trait gain and loss in 45/56 (80%) traits. While 595/739 (81%) significant gene families were associated with only one trait, we also identified several “keystone” gene families that were significantly associated with up to 13/56 (23%) of all traits. Strikingly, most of these families are known to encode metabolic enzymes and transporters, including all members of the industrially relevant MALtose fermentation loci in the baker’s yeast Saccharomyces cerevisiae. These results indicate that convergent evolution on the gene family level may be more widespread across deeper timescales than previously believed.
  3. Mapping the genetic basis of complex traits is critical to uncovering the biological mechanisms that underlie disease and other phenotypes. Genome-wide association studies (GWAS) in humans and quantitative trait locus (QTL) mapping in model organisms can now explain much of the observed heritability in many traits, allowing us to predict phenotype from genotype. However, constraints on power due to statistical confounders in large GWAS and smaller sample sizes in QTL studies still limit our ability to resolve numerous small-effect variants, map them to causal genes, identify pleiotropic effects across multiple traits, and infer non-additive interactions between loci (epistasis). Here, we introduce barcoded bulk quantitative trait locus (BB-QTL) mapping, which allows us to construct, genotype, and phenotype 100,000 offspring of a budding yeast cross, two orders of magnitude larger than the previous state of the art. We use this panel to map the genetic basis of eighteen complex traits, finding that the genetic architecture of these traits involves hundreds of small-effect loci densely spaced throughout the genome, many with widespread pleiotropic effects across multiple traits. Epistasis plays a central role, with thousands of interactions that provide insight into genetic networks. By dramatically increasing sample size, BB-QTL mapping demonstrates the potential of natural variants in high-powered QTL studies to reveal the highly polygenic, pleiotropic, and epistatic architecture of complex traits. 
  4. Svensson, Sarah L (Ed.)
    ABSTRACT In starving Bacillus subtilis bacteria, the initiation of two survival programs (biofilm formation and sporulation) is controlled by the same phosphorylated master regulator, Spo0A~P. Its gene, spo0A, is transcribed from two promoters, Pv and Ps, that are, respectively, regulated by RNA polymerase (RNAP) holoenzymes bearing σA and σH. Notably, transcription is directly autoregulated by Spo0A~P binding sites known as the 0A1, 0A2, and 0A3 boxes, located in between the two promoters. It remains unclear whether, at the onset of starvation, these boxes activate or repress spo0A expression, and whether the Spo0A~P transcriptional feedback plays a role in the increase in spo0A expression. Based on the experimental data of the promoter activities under systematic perturbation of the promoter architecture, we developed a biophysical model of transcriptional regulation of spo0A by Spo0A~P binding to each of the 0A boxes. The model predicts that Spo0A~P binding to its boxes does not affect the RNAP recruitment to the promoters but instead affects the transcriptional initiation rate. Moreover, the effects of Spo0A~P binding to 0A boxes are mainly repressive and saturated early at the onset of starvation. Therefore, the increase in spo0A expression is mainly driven by the increase in RNAP holoenzyme levels. Additionally, we reveal that Spo0A~P affinity to 0A boxes is strongest at 0A3 and weakest at 0A2 and that there are attractive forces between the occupied 0A boxes. Our findings, in addition to clarifying how the sporulation master regulator is controlled, offer a framework to predict regulatory outcomes of complex gene-regulatory mechanisms.
    IMPORTANCE Cell differentiation is often critical for survival. In bacteria, differentiation decisions are controlled by transcriptional master regulators under transcriptional feedback control. Therefore, understanding how master regulators are transcriptionally regulated is required to understand differentiation. However, in many cases, the underlying regulation is complex, with multiple transcription factor binding sites and multiple promoters, making it challenging to dissect the exact mechanisms. Here, we address this problem for the Bacillus subtilis master regulator Spo0A. Using a biophysical model, we quantitatively characterize the effect of individual transcription factor binding sites on each spo0A promoter. Furthermore, the model allows us to identify the specific transcription step that is affected by transcription factor binding. Such a model is promising for the quantitative study of a wide range of master regulators involved in transcriptional feedback.
  5. Abstract Transcriptome-wide association studies (TWASs) integrate expression quantitative trait loci (eQTLs) studies with genome-wide association studies (GWASs) to prioritize candidate target genes for complex traits. Several statistical methods have been recently proposed to improve the performance of TWASs in gene prioritization by integrating the expression regulatory information imputed from multiple tissues, and made significant achievements in improving the ability to detect gene-trait associations. Unfortunately, most existing multi-tissue methods focus on prioritization of candidate genes, and cannot directly infer the specific functional effects of candidate genes across different tissues. Here, we propose a tissue-specific collaborative mixed model (TisCoMM) for TWASs, leveraging the co-regulation of genetic variations across different tissues explicitly via a unified probabilistic model. TisCoMM not only performs hypothesis testing to prioritize gene-trait associations, but also detects the tissue-specific role of candidate target genes in complex traits. To make full use of widely available GWASs summary statistics, we extend TisCoMM to use summary-level data, namely, TisCoMM-S2. Using extensive simulation studies, we show that type I error is controlled at the nominal level, the statistical power of identifying associated genes is greatly improved, and the false-positive rate (FPR) for non-causal tissues is well controlled at decent levels. We further illustrate the benefits of our methods in applications to summary-level GWASs data of 33 complex traits. Notably, apart from better identifying potential trait-associated genes, we can elucidate the tissue-specific role of candidate target genes. The follow-up pathway analysis from tissue-specific genes for asthma shows that the immune system plays an essential function for asthma development in both thyroid and lung tissues. 