Title: Demystifying “drop-outs” in single-cell UMI data
Abstract

Many existing pipelines for scRNA-seq data apply pre-processing steps such as normalization or imputation to account for excessive zeros or “drop-outs.” Here, we extensively analyze diverse UMI data sets to show that clustering should be the foremost step of the workflow. We observe that most drop-outs disappear once cell-type heterogeneity is resolved, while imputing or normalizing heterogeneous data can introduce unwanted noise. We propose a novel framework, HIPPO (Heterogeneity-Inspired Pre-Processing tOol), that leverages zero proportions to explain cellular heterogeneity and integrates feature selection with iterative clustering. HIPPO leads to downstream analysis with greater flexibility and interpretability compared to alternatives.
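
As a rough illustration of the claim that apparent drop-outs shrink once cell-type heterogeneity is resolved, the sketch below simulates one gene in two hypothetical cell types and compares the observed zero proportion with the proportion expected under a Poisson model, exp(-mean). The Poisson null, the simulated means, and the function name `zero_proportion_excess` are assumptions made for this example; this is not the HIPPO implementation.

```python
import numpy as np

def zero_proportion_excess(counts):
    """Per-gene zero-proportion summary for a cells x genes count matrix.

    Returns (observed, expected, excess), where `expected` is the zero
    fraction a Poisson distribution with the same mean would produce.
    The Poisson null is an assumption of this sketch.
    """
    counts = np.asarray(counts, dtype=float)
    gene_mean = counts.mean(axis=0)
    observed = (counts == 0).mean(axis=0)
    expected = np.exp(-gene_mean)
    return observed, expected, observed - expected

rng = np.random.default_rng(0)

# One gene, two hypothetical cell types with very different mean expression.
type_a = rng.poisson(0.2, size=(500, 1))
type_b = rng.poisson(5.0, size=(500, 1))
mixed = np.vstack([type_a, type_b])

# Excess zeros in the pooled data versus within each cell type.
_, _, excess_mixed = zero_proportion_excess(mixed)
_, _, excess_a = zero_proportion_excess(type_a)
_, _, excess_b = zero_proportion_excess(type_b)

print("pooled excess:", excess_mixed[0])
print("type A excess:", excess_a[0])
print("type B excess:", excess_b[0])
```

In this simulation the pooled population shows many more zeros than a single Poisson mean predicts, while each cell type taken alone shows almost no excess. That is the pattern the abstract describes, without any imputation step.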

 
Award ID(s):
1712933
NSF-PAR ID:
10480493
Author(s) / Creator(s):
; ;
Publisher / Repository:
Genome Biology
Date Published:
Journal Name:
Genome Biology
Volume:
21
Issue:
1
ISSN:
1474-760X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Clinical notes present a wealth of information for applications in the clinical domain, but heterogeneity across clinical institutions and settings presents challenges for their processing. The clinical natural language processing field has made strides in overcoming domain heterogeneity, while pretrained deep learning models present opportunities to transfer knowledge from one task to another. Pretrained models have performed well when transferred to new tasks; however, it is not well understood if these models generalize across differences in institutions and settings within the clinical domain. We explore if institution or setting specific pretraining is necessary for pretrained models to perform well when transferred to new tasks. We find no significant performance difference between models pretrained across institutions and settings, indicating that clinically pretrained models transfer well across such boundaries. Given a clinically pretrained model, clinical natural language processing researchers may forgo the time-consuming pretraining step without a significant performance drop.

     
  2. Abstract

    We develop a convex‐optimization clustering algorithm for heterogeneous financial networks, in the presence of arbitrary or even adversarial outliers. In the stochastic block model with heterogeneity parameters, we penalize nodes whose degrees exhibit unusual behavior beyond inlier heterogeneity. We prove that under mild conditions, this method achieves exact recovery of the underlying clusters. In the absence of any assumption on outliers, they are shown not to hinder the clustering of the inliers. We test the performance of the algorithm on semi‐synthetic heterogeneous networks reconstructed to match aggregate data on the Korean financial sector. Our method allows for recovery of sub‐sectors with significantly lower error rates compared to existing algorithms. For overlapping portfolio networks, we uncover a clustering structure supporting diversification effects in investment management.

     
  3. Abstract

    Earthquake stress drop is an important source parameter that directly links to strong ground motion and fundamental questions in earthquake physics. Stress drop estimations may contain significant uncertainties due to such factors as variations in material properties and data limitations, which limit the applications of stress drop interpretations. Using a high‐resolution borehole network, we estimate stress drop for 4551 (M0‐4) earthquakes on the San Andreas Fault at Parkfield, California, between 2001 and 2016 using spectral decomposition and an improved stacking method. To evaluate the influence of spatiotemporal variations of material properties on stress drop estimations, we apply different strategies to account for spatial variations of velocity and attenuation changes, and divide earthquakes into three separate time periods to correct temporal variations of attenuation. These results show that appropriate corrections can significantly reduce the scatter in stress drop estimates, and decrease apparent depth and magnitude dependence. We find that insufficient bandwidth can cause systematic underestimation of stress drop estimates and increased scatter. The stress drop measurements from the high‐frequency borehole recordings exhibit complex stable spatial patterns with no clear correlation with the nature of fault slip, or the slip distribution of the 2004 M6 earthquake. Temporal variations are significantly smaller, less well resolved and varying spatially. They do not affect the long‐term stress drop spatial variations, suggesting local material properties may control the spatial heterogeneity of stress drop.

     
  4. Abstract

    Multimodal integration combines information from different sources or modalities to gain a more comprehensive understanding of a phenomenon. The challenges in multi-omics data analysis lie in the complexity, high dimensionality, and heterogeneity of the data, which demand sophisticated computational tools and visualization methods for proper interpretation. In this paper, we propose a novel method, termed Orthogonal Multimodality Integration and Clustering (OMIC), for analyzing CITE-seq data. Our approach enables researchers to integrate multiple sources of information while accounting for the dependence among them. We demonstrate the effectiveness of our approach using CITE-seq data sets for cell clustering. Our results show that our approach outperforms existing methods in terms of accuracy, computational efficiency, and interpretability. We conclude that our proposed OMIC method provides a powerful tool for multimodal data analysis that greatly improves the feasibility and reliability of integrated data.

     
  5. Abstract

    While technologies for multiplexed imaging have provided an unprecedented understanding of tissue composition in health and disease, interpreting this data remains a significant computational challenge. To understand the spatial organization of tissue and how it relates to disease processes, imaging studies typically focus on cell-level phenotypes. However, images can capture biologically important objects that are outside of cells, such as the extracellular matrix. Here, we describe a pipeline, Pixie, that achieves robust and quantitative annotation of pixel-level features using unsupervised clustering and show its application across a variety of biological contexts and multiplexed imaging platforms. Furthermore, current cell phenotyping strategies that rely on unsupervised clustering can be labor intensive and require large amounts of manual cluster adjustments. We demonstrate how pixel clusters that lie within cells can be used to improve cell annotations. We comprehensively evaluate pre-processing steps and parameter choices to optimize clustering performance and quantify the reproducibility of our method. Importantly, Pixie is open source and easily customizable through a user-friendly interface.

     