Abstract Clinical notes present a wealth of information for applications in the clinical domain, but heterogeneity across clinical institutions and settings presents challenges for their processing. The clinical natural language processing field has made strides in overcoming domain heterogeneity, while pretrained deep learning models present opportunities to transfer knowledge from one task to another. Pretrained models have performed well when transferred to new tasks; however, it is not well understood whether these models generalize across differences in institutions and settings within the clinical domain. We explore whether institution- or setting-specific pretraining is necessary for pretrained models to perform well when transferred to new tasks. We find no significant performance difference between models pretrained across institutions and settings, indicating that clinically pretrained models transfer well across such boundaries. Given a clinically pretrained model, clinical natural language processing researchers may forgo the time-consuming pretraining step without a significant performance drop.
Demystifying “drop-outs” in single-cell UMI data
Abstract Many existing pipelines for scRNA-seq data apply pre-processing steps such as normalization or imputation to account for excessive zeros or “drop-outs.” Here, we extensively analyze diverse UMI data sets to show that clustering should be the foremost step of the workflow. We observe that most drop-outs disappear once cell-type heterogeneity is resolved, while imputing or normalizing heterogeneous data can introduce unwanted noise. We propose a novel framework, HIPPO (Heterogeneity-Inspired Pre-Processing tOol), that leverages zero proportions to explain cellular heterogeneity and integrates feature selection with iterative clustering. HIPPO leads to downstream analysis with greater flexibility and interpretability compared to alternatives.
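The idea of using zero proportions for feature selection can be illustrated with a minimal sketch. It is not HIPPO's actual implementation: the Poisson null model (expected zero proportion `exp(-mean)`), the z-score form, the threshold of 2.0, and the function names `zero_proportion_z_scores` and `select_heterogeneous_genes` are all assumptions introduced here for illustration.

```python
import math


def zero_proportion_z_scores(counts):
    """Score each gene by its excess of zeros over a Poisson expectation.

    counts: list of per-gene lists of non-negative UMI counts (one per cell).
    Assumption: homogeneous cells would follow a Poisson model, so the
    expected zero proportion for a gene with mean m is exp(-m).
    """
    scores = []
    for gene_counts in counts:
        n_cells = len(gene_counts)
        mean = sum(gene_counts) / n_cells
        observed_zero = sum(1 for c in gene_counts if c == 0) / n_cells
        expected_zero = math.exp(-mean)
        # Binomial standard error of the zero proportion under the null;
        # the floor avoids division by zero for all-zero genes.
        se = math.sqrt(expected_zero * (1.0 - expected_zero) / n_cells)
        scores.append((observed_zero - expected_zero) / max(se, 1e-12))
    return scores


def select_heterogeneous_genes(counts, z_threshold=2.0):
    """Return indices of genes with more zeros than the null predicts."""
    z_scores = zero_proportion_z_scores(counts)
    return [i for i, z in enumerate(z_scores) if z > z_threshold]


counts = [
    [1] * 100,              # uniform expression: no excess zeros
    [0] * 50 + [10] * 50,   # mixture of two groups: large excess of zeros
]
print(select_heterogeneous_genes(counts))  # prints [1]
```

In an iterative scheme, genes selected this way would be used to split the cells into clusters, and the scoring would then be repeated within each cluster; the excess zeros of the second gene above vanish once its two groups are separated.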
- Award ID(s):
- 1712933
- PAR ID:
- 10480493
- Publisher / Repository:
- Genome Biology
- Date Published:
- Journal Name:
- Genome Biology
- Volume:
- 21
- Issue:
- 1
- ISSN:
- 1474-760X
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Abstract We develop a convex‐optimization clustering algorithm for heterogeneous financial networks, in the presence of arbitrary or even adversarial outliers. In the stochastic block model with heterogeneity parameters, we penalize nodes whose degrees exhibit unusual behavior beyond inlier heterogeneity. We prove that under mild conditions, this method achieves exact recovery of the underlying clusters. In the absence of any assumption on outliers, they are shown not to hinder the clustering of the inliers. We test the performance of the algorithm on semi‐synthetic heterogeneous networks reconstructed to match aggregate data on the Korean financial sector. Our method allows for recovery of sub‐sectors with significantly lower error rates compared to existing algorithms. For overlapping portfolio networks, we uncover a clustering structure supporting diversification effects in investment management.
-
Abstract Mutations in the TP53 tumor suppressor gene occur in >80% of triple-negative or basal-like breast cancers. To test whether neomorphic functions of specific TP53 missense mutations contribute to phenotypic heterogeneity, we characterized phenotypes of non-transformed MCF10A-derived cell lines expressing the ten most common missense mutant p53 proteins and observed a wide spectrum of phenotypic changes in cell survival, resistance to apoptosis and anoikis, cell migration, invasion and 3D mammosphere architecture. The p53 mutants R248W, R273C, R248Q, and Y220C are the most aggressive while G245S and Y234C are the least, which correlates with survival rates of basal-like breast cancer patients. Interestingly, a crucial amino acid difference at one position, R273C vs. R273H, drastically changes cellular phenotype. RNA-Seq and ChIP-Seq analyses show distinct DNA binding properties of different p53 mutants, yielding heterogeneous transcriptomics profiles, and MD simulation provided a structural basis for differential DNA binding of different p53 mutants. Integrative statistical and machine-learning-based pathway analysis on gene expression profiles with phenotype vectors across the mutant cell lines identifies quantitative association of multiple pathways, including the Hippo/YAP/TAZ pathway, with phenotypic aggressiveness. Further, comparative analyses of large transcriptomics datasets on breast cancer cell lines and tumors suggest that dysregulation of the Hippo/YAP/TAZ pathway plays a key role in driving the cellular phenotypes towards basal-like in the presence of more aggressive p53 mutants. Overall, our study describes distinct gain-of-function impacts on protein functions, transcriptional profiles, and cellular behaviors of different p53 missense mutants, which contribute to clinical phenotypic heterogeneity of triple-negative breast tumors.
-
Abstract Multimodal integration combines information from different sources or modalities to gain a more comprehensive understanding of a phenomenon. The challenges in multi-omics data analysis lie in the complexity, high dimensionality, and heterogeneity of the data, which demand sophisticated computational tools and visualization methods for proper interpretation. In this paper, we propose a novel method, termed Orthogonal Multimodality Integration and Clustering (OMIC), for analyzing CITE-seq data. Our approach enables researchers to integrate multiple sources of information while accounting for the dependence among them. We demonstrate the effectiveness of our approach using CITE-seq data sets for cell clustering. Our results show that our approach outperforms existing methods in terms of accuracy, computational efficiency, and interpretability. We conclude that our proposed OMIC method provides a powerful tool for multimodal data analysis that greatly improves the feasibility and reliability of integrated data.
-
ABSTRACT Stress drop is a fundamental parameter related to earthquake source physics, but is hard to measure accurately. To better understand how different factors influence stress-drop measurements, we compare two different methods using the Ridgecrest stress-drop validation data set: spectral decomposition (SD) and spectral ratio (SR), each with different processing options. We also examine the influence of spectral complexity on source parameter measurement. Applying the SD method, we find that frequency bandwidth and time-window length could influence spectral magnitude calibration, while depth-dependent attenuation is important to correctly map stress-drop variations. For the SR method, we find that the selected source model has limited influence on the measurements; however, the Boatwright model tends to produce a smaller standard deviation and a larger magnitude dependence than the Brune model. Variance reduction threshold, frequency bandwidth, and time-window length, if chosen within an appropriate parameter range, have limited influence on source parameter measurement. For both methods, wave type, attenuation correction, and spectral complexity strongly influence the result. The scale factor that quantifies the magnitude dependence of stress drop shows large variations with different processing options, and earthquakes with complex source spectra deviating from the Brune-type source models tend to have a larger scale factor than earthquakes without complexity. Based on these detailed comparisons, we make a few specific suggestions for data processing workflows that could help future studies of source parameters and interpretations.

