skip to main content

Search for: All records

Creators/Authors contains: "Engelhardt, Barbara E"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Abstract

    Spatially resolved genomic technologies have allowed us to study the physical organization of cells and tissues, and promise an understanding of local interactions between cells. However, it remains difficult to precisely align spatial observations across slices, samples, scales, individuals and technologies. Here, we propose a probabilistic model that aligns spatially-resolved samples onto a known or unknown common coordinate system (CCS) with respect to phenotypic readouts (for example, gene expression). Our method, Gaussian Process Spatial Alignment (GPSA), consists of a two-layer Gaussian process: the first layer maps observed samples’ spatial locations onto a CCS, and the second layer maps from the CCS to the observed readouts. Our approach enables complex downstream spatially aware analyses that are impossible or inaccurate with unaligned data, including an analysis of variance, creation of a dense three-dimensional (3D) atlas from sparse two-dimensional (2D) slices or association tests across data modalities.

    more » « less
  2. Abstract Background

    Single-cell RNA-sequencing (scRNA-seq) technologies allow for the study of gene expression in individual cells. Often, it is of interest to understand how transcriptional activity is associated with cell-specific covariates, such as cell type, genotype, or measures of cell health. Traditional approaches for this type of association mapping assume independence between the outcome variables (or genes), and perform a separate regression for each. However, these methods are computationally costly and ignore the substantial correlation structure of gene expression. Furthermore, count-based scRNA-seq data pose challenges for traditional models based on Gaussian assumptions.


    We aim to resolve these issues by developing a reduced-rank regression model that identifies low-dimensional linear associations between a large number of cell-specific covariates and high-dimensional gene expression readouts. Our probabilistic model uses a Poisson likelihood in order to account for the unique structure of scRNA-seq counts. We demonstrate the performance of our model using simulations, and we apply our model to a scRNA-seq dataset, a spatial gene expression dataset, and a bulk RNA-seq dataset to show its behavior in three distinct analyses.


    We show that our statistical modeling approach, which is based on reduced-rank regression, captures associations between gene expression and cell- and sample-specific covariates by leveraging low-dimensional representations of transcriptional states.

    more » « less
  3. Abstract

    Nonnegative matrix factorization (NMF) is widely used to analyze high-dimensional count data because, in contrast to real-valued alternatives such as factor analysis, it produces an interpretable parts-based representation. However, in applications such as spatial transcriptomics, NMF fails to incorporate known structure between observations. Here, we present nonnegative spatial factorization (NSF), a spatially-aware probabilistic dimension reduction model based on transformed Gaussian processes that naturally encourages sparsity and scales to tens of thousands of observations. NSF recovers ground truth factors more accurately than real-valued alternatives such as MEFISTO in simulations, and has lower out-of-sample prediction error than probabilistic NMF on three spatial transcriptomics datasets from mouse brain and liver. Since not all patterns of gene expression have spatial correlations, we also propose a hybrid extension of NSF that combines spatial and nonspatial components, enabling quantification of spatial importance for both observations and features. A TensorFlow implementation of NSF is available from

    more » « less
  4. Expression quantitative trait loci (eQTLs), or single-nucleotide polymorphisms that affect average gene expression levels, provide important insights into context-specific gene regulation. Classic eQTL analyses use one-to-one association tests, which test gene–variant pairs individually and ignore correlations induced by gene regulatory networks and linkage disequilibrium. Probabilistic topic models, such as latent Dirichlet allocation, estimate latent topics for a collection of count observations. Prior multimodal frameworks that bridge genotype and expression data assume matched sample numbers between modalities. However, many data sets have a nested structure where one individual has several associated gene expression samples and a single germline genotype vector. Here, we build a telescoping bimodal latent Dirichlet allocation (TBLDA) framework to learn shared topics across gene expression and genotype data that allows multiple RNA sequencing samples to correspond to a single individual’s genotype. By using raw count data, our model avoids possible adulteration via normalization procedures. Ancestral structure is captured in a genotype-specific latent space, effectively removing it from shared components. Using GTEx v8 expression data across 10 tissues and genotype data, we show that the estimated topics capture meaningful and robust biological signal in both modalities and identify associations within and across tissue types. We identify 4,645 cis-eQTLs and 995 trans-eQTLs by conducting eQTL mapping between the most informative features in each topic. Our TBLDA model is able to identify associations using raw sequencing count data when the samples in two separate data modalities are matched one-to-many, as is often the case in biological data. Our code is freely available at . 
    more » « less
  5. Multicellular organisms rely on spatial signaling among cells to drive their organization, development, and response to stimuli. Several models have been proposed to capture the behavior of spatial signaling in multicellular systems, but existing approaches fail to capture both the autonomous behavior of single cells and the interactions of a cell with its neighbors simultaneously. We propose a spatiotemporal model of dynamic cell signaling based on Hawkes processes—self-exciting point processes—that model the signaling processes within a cell and spatial couplings between cells. With this cellular point process (CPP), we capture both the single-cell pathway activation rate and the magnitude and duration of signaling between cells relative to their spatial location. Furthermore, our model captures tissues composed of heterogeneous cell types with different bursting rates and signaling behaviors across multiple signaling proteins. We apply our model to epithelial cell systems that exhibit a range of autonomous and spatial signaling behaviors basally and under pharmacological exposure. Our model identifies known drug-induced signaling deficits, characterizes signaling changes across a wound front, and generalizes to multichannel observations. 
    more » « less
  6. Abstract

    Single-cell technologies characterize complex cell populations across multiple data modalities at unprecedented scale and resolution. Multi-omic data for single cell gene expression, in situ hybridization, or single cell chromatin states are increasingly available across diverse tissue types. When isolating specific cell types from a sample of disassociated cells or performing in situ sequencing in collections of heterogeneous cells, one challenging task is to select a small set of informative markers that robustly enable the identification and discrimination of specific cell types or cell states as precisely as possible. Given single cell RNA-seq data and a set of cellular labels to discriminate, scGeneFit selects gene markers that jointly optimize cell label recovery using label-aware compressive classification methods. This results in a substantially more robust and less redundant set of markers than existing methods, most of which identify markers that separate each cell label from the rest. When applied to a data set given a hierarchy of cell types as labels, the markers found by our method improves the recovery of the cell type hierarchy with fewer markers than existing methods using a computationally efficient and principled optimization.

    more » « less
  7. Abstract

    Histopathological images are used to characterize complex phenotypes such as tumor stage. Our goal is to associate features of stained tissue images with high-dimensional genomic markers. We use convolutional autoencoders and sparse canonical correlation analysis (CCA) on paired histological images and bulk gene expression to identify subsets of genes whose expression levels in a tissue sample correlate with subsets of morphological features from the corresponding sample image. We apply our approach, ImageCCA, to two TCGA data sets, and find gene sets associated with the structure of the extracellular matrix and cell wall infrastructure, implicating uncharacterized genes in extracellular processes. We find sets of genes associated with specific cell types, including neuronal cells and cells of the immune system. We apply ImageCCA to the GTEx v6 data, and find image features that capture population variation in thyroid and in colon tissues associated with genetic variants (image morphology QTLs, or imQTLs), suggesting that genetic variation regulates population variation in tissue morphological traits.

    more » « less
  8. null (Ed.)
    Online algorithms for detecting changepoints, or abrupt shifts in the behavior of a time series, are often deployed with limited resources, e.g., to edge computing settings such as mobile phones or industrial sensors. In these scenarios it may be beneficial to trade the cost of collecting an environmental measurement against the quality or "fidelity" of this measurement and how the measurement affects changepoint estimation. For instance, one might decide between inertial measurements or GPS to determine changepoints for motion. A Bayesian approach to changepoint detection is particularly appealing because we can represent our posterior uncertainty about changepoints and make active, cost-sensitive decisions about data fidelity to reduce this posterior uncertainty. Moreover, the total cost could be dramatically lowered through active fidelity switching, while remaining robust to changes in data distribution. We propose a multi-fidelity approach that makes cost-sensitive decisions about which data fidelity to collect based on maximizing information gain with respect to changepoints. We evaluate this framework on synthetic, video, and audio data and show that this information-based approach results in accurate predictions while reducing total cost. 
    more » « less