Analyzing multiple studies jointly makes it possible to leverage data from a range of sources and populations, but until recently there have been few methods for the joint unsupervised analysis of multiple high-dimensional studies. A recent method, Bayesian Multi-Study Factor Analysis (BMSFA), identifies latent factors common to all studies, as well as latent factors specific to individual studies. However, BMSFA does not allow for partially shared factors, i.e., latent factors shared by more than one but fewer than all studies. We extend BMSFA by introducing a new method, Tetris, for Bayesian combinatorial multi-study factor analysis, which identifies latent factors that can be shared by any combination of studies. We model the subsets of studies that share latent factors with an Indian Buffet Process. We test our method with an extensive range of simulations and showcase its utility not only in dimension reduction but also in covariance estimation. Finally, we apply Tetris to high-dimensional gene expression datasets to identify patterns in breast cancer gene expression, both within and across known classes defined by germline mutations.
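As a rough illustration of how an Indian Buffet Process can describe which studies share which factors, the sketch below draws a binary study-by-factor matrix using the standard one-parameter IBP generative scheme. This is an assumption made for illustration only: the abstract does not specify the exact prior or parameterization used in Tetris, and the function name `sample_ibp` and the value of `alpha` are hypothetical.

```python
import numpy as np


def sample_ibp(num_studies, alpha, rng):
    """Draw a binary study-by-factor matrix from a one-parameter IBP.

    Rows are studies ("customers"); columns are latent factors ("dishes").
    Entry [s, k] == 1 means study s uses factor k.
    Illustrative sketch only, not the prior used by any specific package.
    """
    columns = []  # one list of 0/1 memberships per factor
    for s in range(num_studies):
        # Existing factors: study s joins factor k with probability
        # m_k / (s + 1), where m_k counts earlier studies using factor k.
        for col in columns:
            m_k = sum(col)
            col.append(int(rng.random() < m_k / (s + 1)))
        # New factors: study s introduces Poisson(alpha / (s + 1)) of them.
        for _ in range(rng.poisson(alpha / (s + 1))):
            columns.append([0] * s + [1])
    if not columns:
        return np.zeros((num_studies, 0), dtype=int)
    return np.array(columns, dtype=int).T


rng = np.random.default_rng(0)
Z = sample_ibp(num_studies=4, alpha=2.0, rng=rng)

# Classify each factor by how many studies share it.
shared_by = Z.sum(axis=0)
labels = np.where(
    shared_by == Z.shape[0],
    "common",
    np.where(shared_by == 1, "study-specific", "partially shared"),
)
print(Z)
print(labels)
```

In this representation, a column of all ones corresponds to a common factor, a column with a single one to a study-specific factor, and anything in between to a partially shared factor.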
                    
                            
                            Bayesian Multi-study Factor Analysis for High-throughput Biological Data
                        
                    
    
            This paper presents a new modeling strategy for the joint unsupervised analysis of multiple high-throughput biological studies. As in Multi-study Factor Analysis, our goals are to identify both common factors shared across studies and study-specific factors. Our approach is motivated by the growing body of high-throughput studies in biomedical research, as exemplified by the comprehensive set of expression data on breast tumors considered in our case study. To handle high-dimensional studies, we extend Multi-study Factor Analysis using a Bayesian approach that imposes sparsity. Specifically, we generalize the sparse Bayesian infinite factor model to multiple studies. We also devise novel solutions for the identification of the loading matrices: we recover the loading matrices of interest ex post by adapting the orthogonal Procrustes approach. Computationally, we propose an efficient Gibbs sampling approach. Through an extensive simulation analysis, we show that the proposed approach performs very well in a range of different scenarios and outperforms standard factor analysis in all of them, identifying replicable signal in unsupervised genomic applications. Our analysis of breast cancer gene expression across seven studies identified replicable gene patterns clearly related to well-known breast cancer pathways. An R package is implemented and available on GitHub.
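The following sketch illustrates two ideas from the abstract under stated assumptions. It assumes the usual multi-study factor model in which each observation in study s is generated as common loadings times common factors, plus study-specific loadings times study-specific factors, plus noise, so that the study covariance is `Phi @ Phi.T + Lambda_s @ Lambda_s.T + diag(psi_s)`. It then shows why loadings are only identified up to rotation and how an orthogonal Procrustes step can align a rotated loading matrix to a reference. All names and sizes (`Phi`, `Lambda_s`, `psi_s`, `P`, `K`, `J`) are hypothetical, and this is not the paper's Gibbs sampler or its exact post-processing procedure.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes


def simulate_study(Phi, Lambda_s, psi_s, n, rng):
    """Simulate n observations from an assumed multi-study factor model."""
    K, J = Phi.shape[1], Lambda_s.shape[1]
    f = rng.normal(size=(n, K))          # common factor scores
    l = rng.normal(size=(n, J))          # study-specific factor scores
    e = rng.normal(size=(n, Phi.shape[0])) * np.sqrt(psi_s)  # idiosyncratic noise
    return f @ Phi.T + l @ Lambda_s.T + e


rng = np.random.default_rng(1)
P, K, J = 20, 3, 2                       # variables, common and specific factors
Phi = rng.normal(size=(P, K))            # loadings shared by every study

rel_errors = []
for n in (5000, 5000):                   # two hypothetical studies
    Lambda_s = rng.normal(size=(P, J))   # loadings specific to this study
    psi_s = rng.uniform(0.5, 1.0, size=P)
    X = simulate_study(Phi, Lambda_s, psi_s, n, rng)
    implied = Phi @ Phi.T + Lambda_s @ Lambda_s.T + np.diag(psi_s)
    sample = np.cov(X, rowvar=False)
    rel_errors.append(np.linalg.norm(sample - implied) / np.linalg.norm(implied))

# Rotation ambiguity: Phi @ Q yields the same common covariance for any
# orthogonal Q, so the loadings themselves are not identified.
Q, _ = np.linalg.qr(rng.normal(size=(K, K)))
Phi_rot = Phi @ Q
same_cov = np.allclose(Phi_rot @ Phi_rot.T, Phi @ Phi.T)

# Orthogonal Procrustes: find the orthogonal R minimizing ||Phi_rot @ R - Phi||_F.
R, _ = orthogonal_procrustes(Phi_rot, Phi)
aligned = Phi_rot @ R

print("relative covariance errors:", rel_errors)
print("rotation leaves covariance unchanged:", same_cov)
print("aligned loadings match reference:", np.allclose(aligned, Phi))
```

In a sampler, a step of this kind would typically be applied to each posterior draw of the loadings so that draws can be averaged in a common orientation; the choice of reference matrix is a design decision that this sketch does not address.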
        
    
                            - Award ID(s):
- 1810829
- PAR ID:
- 10293521
- Date Published:
- Journal Name:
- The Annals of Applied Statistics
- Volume:
- 15
- Issue:
- 4
- ISSN:
- 1932-6157
- Page Range / eLocation ID:
- 1723-1741
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            Abstract Breast cancer progression is marked by extracellular matrix (ECM) remodeling, including increased stiffness, faster stress relaxation, and elevated collagen levels. In vitro experiments have revealed a role for each of these factors to individually promote malignant behavior, but their combined effects remain unclear. To address this, we developed alginate-collagen hydrogels with independently tunable stiffness, stress relaxation, and collagen density. We show that these combined tumor-mimicking ECM cues reinforced invasive morphologies and promoted spheroid invasion in breast cancer and mammary epithelial cells. High stiffness and low collagen density in slow-relaxing matrices led to the greatest cell migration speed and displacement. RNA-seq revealed Sp1 target gene enrichment in response to both individual and combined ECM cues, with a greater enrichment observed under multiple cues. Notably, high expression of Sp1 target genes upregulated by fast stress relaxation correlated with poor patient survival. Mechanistically, we found that phosphorylated-Sp1 (T453) was increasingly located in the nucleus in stiff and/or fast relaxing matrices, which was regulated by PI3K and ERK1/2 signaling, as well as actomyosin contractility. This study emphasizes how multiple ECM cues in complex microenvironments reinforce malignant traits and supports an emerging role for Sp1 as a mechanoresponsive transcription factor.
- 
            Abstract Motivation: Detecting cancer gene expression and transcriptome changes with mRNA-sequencing (RNA-Seq) or array-based data is important for understanding the molecular mechanisms underlying carcinogenesis and cellular events during cancer progression. In previous studies, differentially expressed genes were detected across patients in one cancer type. These studies ignored the role of mRNA expression changes in driving tumorigenic mechanisms that are either universal or specific to different tumor types. To address this problem, we introduce two network-based multi-task learning frameworks, NetML and NetSML, to discover common differentially expressed genes shared across different cancer types as well as differentially expressed genes specific to each cancer type. The proposed frameworks consider the common latent gene co-expression modules and gene-sample biclusters underlying the multiple cancer datasets to learn knowledge that crosses different tumor types. Results: Large-scale experiments on simulations and real cancer high-throughput datasets validate that the proposed network-based multi-task learning frameworks achieve better sample classification than models without knowledge sharing across different cancer types. The common and cancer-specific molecular signatures detected by the multi-task learning frameworks on TCGA ovarian cancer, breast cancer, and prostate cancer datasets are correlated with known marker genes and enriched in cancer-relevant KEGG pathways and Gene Ontology terms. Availability and Implementation: Source code is available at: https://github.com/compbiolabucf/NetML Supplementary information: Supplementary data are available at Bioinformatics.
- 
            Abstract Motivation: Predictive biological signatures provide utility as biomarkers for disease diagnosis and prognosis, as well as prediction of responses to vaccination or therapy. These signatures are identified from high-throughput profiling assays through a combination of dimensionality reduction and machine learning techniques. The genes, proteins, metabolites, and other biological analytes that compose signatures also generate hypotheses on the underlying mechanisms driving biological responses, thus improving biological understanding. Dimensionality reduction is a critical step in signature discovery to address the large number of analytes in omics datasets, especially for multi-omics profiling studies with tens of thousands of measurements. Latent factor models, which can account for the structural heterogeneity across diverse assays, effectively integrate multi-omics data and reduce dimensionality to a small number of factors that capture correlations and associations among measurements. These factors provide biologically interpretable features for predictive modeling. However, multi-omics integration and predictive modeling are generally performed independently in sequential steps, leading to suboptimal factor construction. Combining these steps can yield better multi-omics signatures that are more predictive while still being biologically meaningful. Results: We developed a supervised variational Bayesian factor model that extracts multi-omics signatures from high-throughput profiling datasets that can span multiple data types. Signature-based multiPle-omics intEgration via lAtent factoRs (SPEAR) adaptively determines factor rank, emphasis on factor structure, data relevance and feature sparsity. The method improves the reconstruction of underlying factors in synthetic examples and prediction accuracy of coronavirus disease 2019 severity and breast cancer tumor subtypes.
Availability and implementation: SPEAR is a publicly available R-package hosted at https://bitbucket.org/kleinstein/SPEAR.
- 
            Abstract Susceptibility to breast cancer is significantly increased in individuals with germ line mutations in RECQ1 (also known as RECQL or RECQL1), a gene encoding a DNA helicase essential for genome maintenance. We previously reported that RECQ1 expression predicts clinical outcomes for sporadic breast cancer patients stratified by estrogen receptor (ER) status. Here, we utilized an unbiased integrative genomics approach to delineate a cross talk between RECQ1 and ERα, a known master regulatory transcription factor in breast cancer. We found that expression of ESR1, the gene encoding ERα, is directly activated by RECQ1. More than 35% of RECQ1 binding sites were cobound by ERα genome-wide. Mechanistically, RECQ1 cooperates with FOXA1, the pioneer transcription factor for ERα, to enhance chromatin accessibility at the ESR1 regulatory regions in a helicase activity-dependent manner. In clinical ERα-positive breast cancers treated with endocrine therapy, high RECQ1 and high FOXA1 coexpressing tumors were associated with better survival. Collectively, these results identify RECQ1 as a novel cofactor for ERα and uncover a previously unknown mechanism by which RECQ1 regulates disease-driving gene expression in ER-positive breast cancer cells.