Title: Evaluating methods for integrating single-cell data and genetics to understand inflammatory disease complexity
Background: Understanding the genetic underpinnings of immune-mediated inflammatory diseases is crucial to improving treatments. Single-cell RNA sequencing (scRNA-seq) identifies cell states expanded in disease, but often overlooks genetic causality due to cost and small genotyping cohorts. Conversely, large genome-wide association studies (GWAS) are commonly accessible.
Methods: We present a robust 3-step benchmarking analysis of integrating GWAS and scRNA-seq to identify genetically relevant cell states and genes in inflammatory diseases. First, we applied and compared the results of three recent algorithms, based on pathways (scGWAS), single-cell disease scores (scDRS), or both (scPagwas), according to accuracy/sensitivity and interpretability. While previous studies focused on coarse cell types, we used disease-specific, fine-grained single-cell atlases (183,742 and 228,211 cells) and GWAS data (Ns of 97,173 and 45,975) for rheumatoid arthritis (RA) and ulcerative colitis (UC). Second, given the lack of scRNA-seq for many diseases with GWAS, we further tested the tools’ resolution limits by differentiating between similar diseases with only one fine-grained scRNA-seq atlas. Lastly, we provide a novel evaluation of noncoding SNP incorporation methods by testing which enabled the highest sensitivity/accuracy of known cell-state calls.
Results: We first found that the single-cell based tools scDRS and scPagwas called greater numbers of supported cell states that were overlooked by scGWAS. While scGWAS and scPagwas were advantageous for gene exploration, scDRS effectively accounted for batch effects and captured cellular heterogeneity of disease relevance without single-cell genotyping. For noncoding SNP integration, we found a key trade-off between statistical power and confidence with positional (e.g. MAGMA) and non-positional approaches (e.g. chromatin-interaction, eQTL). Even when directly incorporating noncoding SNPs through 5’ scRNA-seq measures of regulatory elements, non-disease-specific atlases gave misleading results because they did not contain disease-tissue-specific transcriptomic patterns. Despite the critical importance of tissue-specific scRNA-seq, we showed that scDRS enabled deconvolution of two similar diseases with a single fine-grained scRNA-seq atlas and separate GWAS. Indeed, we identified supported and novel genetic-phenotype linkages separating RA and ankylosing spondylitis, and UC and Crohn’s disease. Overall, while noting evolving single-cell technologies, our study provides key findings for integrating expanding fine-grained scRNA-seq, GWAS, and noncoding SNP resources to unravel the complexities of inflammatory diseases.
Award ID(s):
2022138
PAR ID:
10630681
Publisher / Repository:
Pubmed
Journal Name:
Frontiers in Immunology
Volume:
15
ISSN:
1664-3224
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. INTRODUCTION Genome-wide association studies (GWASs) have identified thousands of human genetic variants associated with diverse diseases and traits, and most of these variants map to noncoding loci with unknown target genes and function. Current approaches to understand which GWAS loci harbor causal variants and to map these noncoding regulators to target genes suffer from low throughput. With newer multiancestry GWASs from individuals of diverse ancestries, there is a pressing and growing need to scale experimental assays to connect GWAS variants with molecular mechanisms. Here, we combined biobank-scale GWASs, massively parallel CRISPR screens, and single-cell sequencing to discover target genes of noncoding variants for blood trait loci with systematic targeting and inhibition of noncoding GWAS loci with single-cell sequencing (STING-seq). RATIONALE Blood traits are highly polygenic, and GWASs have identified thousands of noncoding loci that map to candidate cis-regulatory elements (CREs). By combining CRE-silencing CRISPR perturbations and single-cell readouts, we targeted hundreds of GWAS loci in a single assay, revealing target genes in cis and in trans. For select CREs that regulate target genes, we performed direct variant insertion. Although silencing the CRE can identify the target gene, direct variant insertion can identify the magnitude and direction of effect on gene expression for the GWAS variant. In select cases in which the target gene was a transcription factor or microRNA, we also investigated the gene-regulatory networks altered upon CRE perturbation and how these networks differ across blood cell types. RESULTS We inhibited candidate CREs from fine-mapped blood trait GWAS variants (from ~750,000 individuals of diverse ancestries) in human erythroid progenitors.
In total, we targeted 543 variants (254 loci) mapping to candidate CREs, generating multimodal single-cell data including transcriptome, direct CRISPR gRNA capture, and cell surface proteins. We identified target genes in cis (within 500 kb) for 134 CREs. In most cases, we found that the target gene was the closest gene and that specific enhancer-associated biochemical hallmarks (H3K27ac and accessible chromatin) are essential for CRE function. Using multiple perturbations at the same locus, we were able to distinguish causal variants from noncausal variants in linkage disequilibrium. For a subset of validated CREs, we also inserted specific GWAS variants using base-editing STING-seq (beeSTING-seq) and quantified the effect size and direction of GWAS variants on gene expression. Given our transcriptome-wide data, we examined dosage effects in cis and trans in cases in which the cis target is a transcription factor or microRNA. We found that trans target genes are also enriched for GWAS loci, and identified gene clusters within trans gene networks with distinct biological functions and expression patterns in primary human blood cells. CONCLUSION In this work, we investigated noncoding GWAS variants at scale, identifying target genes in single cells. These methods can help to address the variant-to-function challenges that are a barrier for translation of GWAS findings (e.g., drug targets for diseases with a genetic basis) and greatly expand our ability to understand mechanisms underlying GWAS loci. Identifying causal variants and their target genes with STING-seq. Uncovering causal variants and their target genes or function is a major challenge for GWASs. STING-seq combines perturbation of noncoding loci with multimodal single-cell sequencing to profile hundreds of GWAS loci in parallel. This approach can identify target genes in cis and trans, measure dosage effects, and decipher gene-regulatory networks.
  2. Multi-modal single-cell RNA assays capture RNA content as well as other data modalities, such as spatial cell position or the electrophysiological properties of cells. Compared to dedicated scRNA-seq assays, however, they may unintentionally capture RNA from multiple adjacent cells, exhibit lower RNA sequencing depth, or lack genome-wide RNA measurements. We present scProjection, a method for mapping individual multi-modal RNA measurements to deeply sequenced scRNA-seq atlases to extract cell type-specific, single-cell gene expression profiles. We demonstrate several use cases of scProjection, including the identification of spatial motifs from spatial transcriptome assays, distinguishing RNA contributions from neighboring cells in both spatial and multi-modal single-cell assays, and imputing expression measurements of unmeasured genes from gene markers. scProjection therefore combines the advantages of both multi-modal and scRNA-seq assays to yield precise multi-modal measurements of single cells.
  3. Blaser, Martin J (Ed.)
    ABSTRACT Haemophilus ducreyi causes the genital ulcer disease chancroid and cutaneous ulcers in children. To study its pathogenesis, we developed a human challenge model in which we infect the skin on the upper arm of human volunteers with H. ducreyi to the pustular stage of disease. The model has been used to define lesional architecture, describe the immune infiltrate into the infected sites using flow cytometry, and explore the molecular basis of the immune response using bulk RNA-seq. Here, we used single-cell RNA-seq (scRNA-seq) and spatial transcriptomics to simultaneously characterize multiple cell types within infected human skin and determine the cellular origin of differentially expressed transcripts that we had previously identified by bulk RNA-seq. We obtained paired biopsies of pustules and wounded (mock infected) sites from five volunteers for scRNA-seq. We identified 13 major cell types, including T- and NK-like cells, macrophages, dendritic cells, as well as other cell types typically found in the skin. Immune cell types were enriched in pustules, and some subtypes within the major cell types were exclusive to pustules. Sufficient tissue specimens for spatial transcriptomics were available from four of the volunteers. T- and NK-like cells were highly associated with multiple antigen presentation cell types. In pustules, type I interferon stimulation was high in areas that were high in antigen presentation, especially in macrophages near the abscess, compared to wounds. Together, our data provide a high-resolution view of the cellular immune response to the infection of the skin with a human pathogen. IMPORTANCE A high-resolution view of the immune infiltrate due to infection with an extracellular bacterial pathogen in human skin has not yet been defined.
Here, we used the human skin pathogen Haemophilus ducreyi in a human challenge model to identify on a single-cell level the types of cells that are present in volunteers who fail to spontaneously clear infection and form pustules. We identified 13 major cell types. Immune cells and immune-activated stromal cells were enriched in pustules compared to wounded (mock infected) sites. Pustules formed despite the expression of multiple pro-inflammatory cytokines, such as IL-1β and type I interferon. Interferon stimulation was most evident in macrophages, which were proximal to the abscess. The pro-inflammatory response within the pustule may be tempered by regulatory T cells and cells that express indoleamine 2,3-dioxygenase, leading to failure of the immune system to clear H. ducreyi.
  4. Wren, Jonathan (Ed.)
    Abstract Motivation: Gene set enrichment (GSE) analysis allows for an interpretation of gene expression through pre-defined gene set databases and is a critical step in understanding different phenotypes. With the rapid development of single-cell RNA sequencing (scRNA-seq) technology, GSE analysis can be performed on fine-grained gene expression data to gain a nuanced understanding of phenotypes of interest. However, with the cellular heterogeneity in single-cell gene profiles, current statistical GSE analysis methods sometimes fail to identify enriched gene sets. Meanwhile, deep learning has gained traction in applications like clustering and trajectory inference in single-cell studies due to its prowess in capturing complex data patterns. However, its use in GSE analysis remains limited, due to interpretability challenges. Results: In this paper, we present DeepGSEA, an explainable deep gene set enrichment analysis approach which leverages the expressiveness of interpretable, prototype-based neural networks to provide an in-depth analysis of GSE. DeepGSEA learns the ability to capture GSE information through our designed classification tasks, and significance tests can be performed on each gene set, enabling the identification of enriched sets. The underlying distribution of a gene set learned by DeepGSEA can be explicitly visualized using the encoded cell and cellular prototype embeddings. We demonstrate the performance of DeepGSEA over commonly used GSE analysis methods by examining their sensitivity and specificity with four simulation studies. In addition, we test our model on three real scRNA-seq datasets and illustrate the interpretability of DeepGSEA by showing how its results can be explained. Availability and implementation: https://github.com/Teddy-XiongGZ/DeepGSEA
  5. Abstract Motivation: Short-read single-cell RNA-sequencing (scRNA-seq) has been used to study cellular heterogeneity, cellular fate, and transcriptional dynamics. Modeling splicing dynamics in scRNA-seq data is challenging, with inherent difficulty in even the seemingly straightforward task of elucidating the splicing status of the molecules from which sequenced fragments are drawn. This difficulty arises, in part, from the limited read length and positional biases, which substantially reduce the specificity of the sequenced fragments. As a result, the splicing status of many reads in scRNA-seq is ambiguous because of a lack of definitive evidence. We are therefore in need of methods that can recover the splicing status of ambiguous reads which, in turn, can lead to more accuracy and confidence in downstream analyses. Results: We develop Forseti, a predictive model to probabilistically assign a splicing status to scRNA-seq reads. Our model has two key components. First, we train a binding affinity model to assign a probability that a given transcriptomic site is used in fragment generation. Second, we fit a robust fragment length distribution model that generalizes well across datasets deriving from different species and tissue types. Forseti combines these two trained models to predict the splicing status of the molecule of origin of reads by scoring putative fragments that associate each alignment of sequenced reads with proximate potential priming sites. Using both simulated and experimental data, we show that our model can precisely predict the splicing status of many reads and identify the true gene origin of multi-gene mapped reads. Availability and implementation: Forseti and the code used for producing the results are available at https://github.com/COMBINE-lab/forseti under a BSD 3-clause license.