skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.

Attention:

The NSF Public Access Repository (PAR) system and access will be unavailable from 11:00 PM ET on Friday, July 11 until 2:00 AM ET on Saturday, July 12 due to maintenance. We apologize for the inconvenience.


Title: Telescoping bimodal latent Dirichlet allocation to identify expression QTLs across tissues
Expression quantitative trait loci (eQTLs), or single-nucleotide polymorphisms that affect average gene expression levels, provide important insights into context-specific gene regulation. Classic eQTL analyses use one-to-one association tests, which test gene–variant pairs individually and ignore correlations induced by gene regulatory networks and linkage disequilibrium. Probabilistic topic models, such as latent Dirichlet allocation, estimate latent topics for a collection of count observations. Prior multimodal frameworks that bridge genotype and expression data assume matched sample numbers between modalities. However, many data sets have a nested structure where one individual has several associated gene expression samples and a single germline genotype vector. Here, we build a telescoping bimodal latent Dirichlet allocation (TBLDA) framework to learn shared topics across gene expression and genotype data that allows multiple RNA sequencing samples to correspond to a single individual’s genotype. By using raw count data, our model avoids possible adulteration via normalization procedures. Ancestral structure is captured in a genotype-specific latent space, effectively removing it from shared components. Using GTEx v8 expression data across 10 tissues and genotype data, we show that the estimated topics capture meaningful and robust biological signal in both modalities and identify associations within and across tissue types. We identify 4,645 cis-eQTLs and 995 trans-eQTLs by conducting eQTL mapping between the most informative features in each topic. Our TBLDA model is able to identify associations using raw sequencing count data when the samples in two separate data modalities are matched one-to-many, as is often the case in biological data. Our code is freely available at https://github.com/gewirtz/TBLDA .  more » « less
Award ID(s):
2243341
PAR ID:
10401080
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Life Science Alliance
Volume:
5
Issue:
12
ISSN:
2575-1077
Page Range / eLocation ID:
e202101297
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract The regulation of gene expression is central to many biological processes. Gene regulatory networks (GRNs) link transcription factors (TFs) to their target genes and represent maps of potential transcriptional regulation. Here, we analyzed a large number of publically available maize (Zea mays) transcriptome data sets including >6000 RNA sequencing samples to generate 45 coexpression-based GRNs that represent potential regulatory relationships between TFs and other genes in different populations of samples (cross-tissue, cross-genotype, and tissue-and-genotype samples). While these networks are all enriched for biologically relevant interactions, different networks capture distinct TF-target associations and biological processes. By examining the power of our coexpression-based GRNs to accurately predict covarying TF-target relationships in natural variation data sets, we found that presence/absence changes rather than quantitative changes in TF gene expression are more likely associated with changes in target gene expression. Integrating information from our TF-target predictions and previous expression quantitative trait loci (eQTL) mapping results provided support for 68 TFs underlying 74 previously identified trans-eQTL hotspots spanning a variety of metabolic pathways. This study highlights the utility of developing multiple GRNs within a species to detect putative regulators of important plant pathways and provides potential targets for breeding or biotechnological applications. 
    more » « less
  2. Background: Coronary artery disease (CAD) is the leading cause of death worldwide. Recent meta-analyses of genome-wide association studies have identified over 175 loci associated with CAD. The majority of these loci are in noncoding regions and are predicted to regulate gene expression. Given that vascular smooth muscle cells (SMCs) play critical roles in the development and progression of CAD, we aimed to identify the subset of the CAD loci associated with the regulation of transcription in distinct SMC phenotypes. Methods: We measured gene expression in SMCs isolated from the ascending aortas of 151 heart transplant donors of various genetic ancestries in quiescent or proliferative conditions and calculated the association of their expression and splicing with ~6.3 million imputed single-nucleotide polymorphism markers across the genome. Results: We identified 4910 expression and 4412 splicing quantitative trait loci (sQTLs) representing regions of the genome associated with transcript abundance and splicing. A total of 3660 expression quantitative trait loci (eQTLs) had not been observed in the publicly available Genotype-Tissue Expression dataset. Further, 29 and 880 eQTLs were SMC-specific and sex-biased, respectively. We made these results available for public query on a user-friendly website. To identify the effector transcript(s) regulated by CAD loci, we used 4 distinct colocalization approaches. We identified 84 eQTL and 164 sQTL that colocalized with CAD loci, highlighting the importance of genetic regulation of mRNA splicing as a molecular mechanism for CAD genetic risk. Notably, 20% and 35% of the eQTLs were unique to quiescent or proliferative SMCs, respectively. One CAD locus colocalized with a sex-specific eQTL ( TERF2IP ), and another locus colocalized with SMC-specific eQTL ( ALKBH8 ). The most significantly associated CAD locus, 9p21, was an sQTL for the long noncoding RNA CDKN2B-AS1 , also known as ANRIL , in proliferative SMCs. Conclusions: Collectively, our results provide evidence for the molecular mechanisms of genetic susceptibility to CAD in distinct SMC phenotypes. 
    more » « less
  3. The MHC region is highly associated with autoimmune and infectious diseases. Here we conduct an in-depth interrogation of associations between genetic variation, gene expression and disease. We create a comprehensive map of regulatory variation in the MHC region using WGS from 419 individuals to call eight-digit HLA types and RNA-seq data from matched iPSCs. Building on this regulatory map, we explored GWAS signals for 4083 traits, detecting colocalization for 180 disease loci with eQTLs. We show that eQTL analyses taking HLA type haplotypes into account have substantially greater power compared with only using single variants. We examined the association between the 8.1 ancestral haplotype and delayed colonization in Cystic Fibrosis, postulating that downregulation of RNF5 expression is the likely causal mechanism. Our study provides insights into the genetic architecture of the MHC region and pinpoints disease associations that are due to differential expression of HLA genes and non-HLA genes. 
    more » « less
  4. Abstract Allele-specific expression quantification from RNA-seq reads provides opportunities to study the control of gene regulatory networks bycis-acting andtrans-acting genetic variants. Many existing methods performed a single-gene and single-SNP association analysis to identify expression quantitative trait loci (eQTLs), and placed the eQTLs against known gene networks for functional interpretation. Instead, we view eQTL data as a capture of the effects of perturbation of gene regulatory system by a large number of genetic variants and reconstruct a gene network perturbed by eQTLs. We introduce a statistical framework called CiTruss for simultaneously learning a gene network andcis-acting andtrans-acting eQTLs that perturb this network, given population allele-specific expression and SNP data. CiTruss uses a multi-level conditional Gaussian graphical model to modeltrans-acting eQTLs perturbing the expression of both alleles in gene network at the top level andcis-acting eQTLs perturbing the expression of each allele at the bottom level. We derive a transformation of this model that allows efficient learning for large-scale human data. Our analysis of the GTEx and LG×SM advanced intercross line mouse data for multiple tissue types with CiTruss provides new insights into genetics of gene regulation. CiTruss revealed that gene networks consist of local subnetworks over proximally located genes and global subnetworks over genes scattered across genome, and that several aspects of gene regulation by eQTLs such as the impact of genetic diversity, pleiotropy, tissue-specific gene regulation, and local and long-range linkage disequilibrium among eQTLs can be explained through these local and global subnetworks. 
    more » « less
  5. null (Ed.)
    “Long tail” data are considered to be smaller, heterogeneous, researcher‐held data, which present unique data management and scholarly communication challenges. These data are presumably concentrated within relatively lower‐funded projects due to insufficient resources for curation. To better understand the nature and distribution of long tail data, we examine National Science Foundation (NSF) funding patterns using Latent Dirichlet Allocation (LDA) and bibliographic data. We also introduce the concept of “Topic Investment” to capture differences in topics across funding levels and to illuminate the distribution of funding across topics. This study uses the discipline of astronomy as a case study, overall exploring possible associations between topic, funding level and research output, with implications for research policy and practice. We find that while different topics demonstrate different funding levels and publication patterns, dynamics predicted by the “long tail” theoretical framework presented here can be observed within NSF‐funded topics in astronomy. 
    more » « less