NSF PAR Search | NSF Public Access Repository

Interpretable Model Summaries Using the Wasserstein Distance

https://doi.org/10.1080/00031305.2025.2551223

Dunipace, Eric; Trippa, Lorenzo (September 2025, The American Statistician)

Free, publicly-accessible full text available September 2, 2026
Comment: Advancing Clinical Trials with Novel Designs and Implementations

https://doi.org/10.1214/23-STS865C

Trippa, Lorenzo; Xu, Yanxun (May 2023, Statistical Science)

Full Text Available
Prediction of hereditary cancers using neural networks

https://doi.org/10.1214/21-AOAS1510

Guan, Zoe; Parmigiani, Giovanni; Braun, Danielle; Trippa, Lorenzo (March 2022, The Annals of Applied Statistics)

Full Text Available
Bayesian Multi-study Factor Analysis for High-throughput Biological Data

https://doi.org/10.1214/21-AOAS1456

De Vito, Roberta; Bellio, Ruggero; Trippa, Lorenzo; Parmigiani, Giovanni (January 2021, The Annals of Applied Statistics)

This paper presents a new modeling strategy for joint unsupervised analysis of multiple high-throughput biological studies. As in Multi-study Factor Analysis, our goals are to identify both common factors shared across studies and study-specific factors. Our approach is motivated by the growing body of high-throughput studies in biomedical research, as exemplified by the comprehensive set of expression data on breast tumors considered in our case study. To handle high-dimensional studies, we extend Multi-study Factor Analysis using a Bayesian approach that imposes sparsity. Specifically, we generalize the sparse Bayesian infinite factor model to multiple studies. We also devise novel solutions for the identification of the loading matrices: we recover the loading matrices of interest ex post, by adapting the orthogonal Procrustes approach. Computationally, we propose an efficient and fast Gibbs sampling approach. Through an extensive simulation analysis, we show that the proposed approach performs very well in a range of different scenarios and outperforms standard factor analysis in all scenarios, identifying replicable signal in unsupervised genomic applications. Our analysis of breast cancer gene expression across seven studies identified replicable gene patterns clearly related to well-known breast cancer pathways. An R package is implemented and available on GitHub.
Full Text Available
Integration of Survival Data from Multiple Studies

Ventz, Steffen; Mazumder, Rahul; Trippa, Lorenzo (July 2020, arXiv.org)

We introduce a statistical procedure that integrates survival data from multiple biomedical studies, to improve the accuracy of predictions of survival or other events, based on individual clinical and genomic profiles, compared to models developed leveraging only a single study or meta-analytic methods. The method accounts for potential differences in the relation between predictors and outcomes across studies, due to distinct patient populations, treatments and technologies to measure outcomes and biomarkers. These differences are modeled explicitly with study-specific parameters. We use hierarchical regularization to shrink the study-specific parameters towards each other and to borrow information across studies. Shrinkage of the study-specific parameters is controlled by a similarity matrix, which summarizes differences and similarities of the relations between covariates and outcomes across studies. We illustrate the method in a simulation study and using a collection of gene-expression datasets in ovarian cancer. We show that the proposed model increases the accuracy of survival prediction compared to alternative meta-analytic methods.
Full Text Available
Bayesian Combinatorial Multi-Study Factor Analysis

Grabski, Isabella N; De Vito, Roberta; Trippa, Lorenzo; Parmigiani, Giovanni (July 2020, arXiv.org)

Analyzing multiple studies allows leveraging data from a range of sources and populations, but until recently, there have been limited methodologies to approach the joint unsupervised analysis of multiple high-dimensional studies. A recent method, Bayesian Multi-Study Factor Analysis (BMSFA), identifies latent factors common to all studies, as well as latent factors specific to individual studies. However, BMSFA does not allow for partially shared factors, i.e., latent factors shared by more than one but fewer than all studies. We extend BMSFA by introducing a new method, Tetris, for Bayesian combinatorial multi-study factor analysis, which identifies latent factors that can be shared by any combination of studies. We model the subsets of studies that share latent factors with an Indian Buffet Process. We test our method with an extensive range of simulations, and showcase its utility not only in dimension reduction but also in covariance estimation. Finally, we apply Tetris to high-dimensional gene expression datasets to identify patterns in breast cancer gene expression, both within and across known classes defined by germline mutations.
Full Text Available
Cross-study Learning for Generalist and Specialist Predictions

Ren, Boyu; Patil, Prasad; Dominici, Francesca; Parmigiani, Giovanni; Trippa, Lorenzo (July 2020, arXiv.org)

Jointly using data from multiple similar sources to train prediction models is becoming an increasingly important task in many fields of science. In this paper, we propose a framework for generalist and specialist predictions that leverages multiple datasets, with potential heterogeneity in the relationships between predictors and outcomes. Our framework uses ensembling with stacking, and includes three major components: 1) training of the ensemble members using one or more datasets, 2) a no-data-reuse technique for stacking weights estimation and 3) task-specific utility functions. We prove that under certain regularity conditions, our framework produces a stacked prediction function with the oracle property. We also provide analytically the conditions under which the proposed no-data-reuse technique will increase the prediction accuracy of the stacked prediction function compared to using the full data. We perform a simulation study to numerically verify and illustrate these results and apply our framework to predicting mortality based on a collection of variables including long-term exposure to common air pollutants.
Full Text Available
Integration of survival data from multiple studies

https://doi.org/10.1111/biom.13517

Ventz, Steffen; Mazumder, Rahul; Trippa, Lorenzo (September 2021, Biometrics)

We introduce a statistical procedure that integrates datasets from multiple biomedical studies to predict patients' survival, based on individual clinical and genomic profiles. The proposed procedure accounts for potential differences in the relation between predictors and outcomes across studies, due to distinct patient populations, treatments and technologies to measure outcomes and biomarkers. These differences are modeled explicitly with study‐specific parameters. We use hierarchical regularization to shrink the study‐specific parameters towards each other and to borrow information across studies. The estimation of the study‐specific parameters utilizes a similarity matrix, which summarizes differences and similarities of the relations between covariates and outcomes across studies. We illustrate the method in a simulation study and using a collection of gene expression datasets in ovarian cancer. We show that the proposed model increases the accuracy of survival predictions compared to alternative meta‐analytic methods.