NSF PAR Search | NSF Public Access Repository

Interpretable Model Summaries Using the Wasserstein Distance

https://doi.org/10.1080/00031305.2025.2551223

Dunipace, Eric; Trippa, Lorenzo (September 2025, The American Statistician)

Free, publicly-accessible full text available September 2, 2026
Optimal ensemble construction for multistudy prediction with applications to mortality estimation

https://doi.org/10.1002/sim.10006

Loewinger, Gabriel; Nunez, Rolando Acosta; Mazumder, Rahul; Parmigiani, Giovanni (February 2024, Statistics in Medicine)

It is increasingly common to encounter prediction tasks in the biomedical sciences for which multiple datasets are available for model training. Common approaches such as pooling datasets before model fitting can produce poor out‐of‐study prediction performance when datasets are heterogeneous. Theoretical and applied work has shown multistudy ensembling to be a viable alternative that leverages the variability across datasets in a manner that promotes model generalizability. Multistudy ensembling uses a two‐stage stacking strategy which fits study‐specific models and estimates ensemble weights separately. This approach ignores, however, the ensemble properties at the model‐fitting stage, potentially resulting in performance losses. Motivated by challenges in the estimation of COVID‐attributable mortality, we propose optimal ensemble construction, an approach to multistudy stacking whereby we jointly estimate ensemble weights and parameters associated with study‐specific models. We prove that limiting cases of our approach yield existing methods such as multistudy stacking and pooling datasets before model fitting. We propose an efficient block coordinate descent algorithm to optimize the loss function. We use our method to perform multicountry COVID‐19 baseline mortality prediction. We show that when little data is available for a country before the onset of the pandemic, leveraging data from other countries can substantially improve prediction accuracy. We further compare and characterize the method's performance in data‐driven simulations and other numerical experiments. Our method remains competitive with or outperforms multistudy stacking and other earlier methods in the COVID‐19 data application and in a range of simulation settings.
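The two-stage stacking baseline described in this abstract can be illustrated with a minimal sketch. This is not the authors' code and not their joint "optimal ensemble construction" estimator: it assumes ordinary least squares for each study-specific model and an unconstrained least-squares fit for the ensemble weights (stacking weights are often constrained to be nonnegative, which is omitted here). All function names are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients (intercept first) for a single study."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict(beta, X):
    """Predictions of one study-specific linear model on features X."""
    return np.column_stack([np.ones(len(X)), X]) @ beta

def two_stage_stacking(studies):
    """studies: list of (X, y) pairs, one per study, sharing the same features."""
    # Stage 1: fit one model per study, using that study's data only.
    betas = [fit_ols(X, y) for X, y in studies]
    # Stage 2: predict every observation with every study-specific model,
    # then regress the pooled outcomes on those predictions to get the weights.
    X_all = np.vstack([X for X, _ in studies])
    y_all = np.concatenate([y for _, y in studies])
    P = np.column_stack([predict(b, X_all) for b in betas])
    weights, *_ = np.linalg.lstsq(P, y_all, rcond=None)
    return betas, weights

def ensemble_predict(betas, weights, X):
    """Weighted combination of the study-specific predictions."""
    P = np.column_stack([predict(b, X) for b in betas])
    return P @ weights
```

Because the two stages are fitted separately, the study-specific coefficients never see the ensemble weights; that separation is the limitation the abstract says the joint approach is meant to address.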
Defining Replicability of Prediction Rules

https://doi.org/10.1214/23-STS891

Parmigiani, Giovanni (November 2023, Statistical Science)

Full Text Available
A pairwise strategy for imputing predictive features when combining multiple datasets

https://doi.org/10.1093/bioinformatics/btac839

Wu, Yujie; Ren, Boyu; Patil, Prasad (January 2023, Bioinformatics)
Wren, Jonathan (Ed.)
Motivation: In the training of predictive models using high-dimensional genomic data, multiple studies’ worth of data are often combined to increase sample size and improve generalizability. A drawback of this approach is that there may be different sets of features measured in each study due to variations in expression measurement platform or technology. It is often common practice to work only with the intersection of features measured in common across all studies, which results in the blind discarding of potentially useful feature information that is measured in individual or subsets of studies.
Results: We characterize the loss in predictive performance incurred by using only the intersection of feature information available across all studies when training predictors using gene expression data from microarray and sequencing datasets. We study the properties of linear and polynomial regression for imputing discarded features and demonstrate improvements in the external performance of prediction functions through simulation and in gene expression data collected on breast cancer patients. To improve this process, we propose a pairwise strategy that applies any imputation algorithm to two studies at a time and averages imputed features across pairs. We demonstrate that the pairwise strategy is preferable to first merging all datasets together and imputing any resulting missing features. Finally, we provide insights on which subsets of intersected and study-specific features should be used so that missing-feature imputation best promotes cross-study replicability.
Availability and implementation: The code is available at https://github.com/YujieWuu/Pairwise_imputation.
Supplementary information: Supplementary information is available at Bioinformatics online.
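One possible reading of the pairwise strategy in this abstract is sketched below. It is an illustration under stated assumptions, not the code from the linked repository: each study is assumed to be a dict mapping feature names to 1-D arrays, the imputation algorithm is assumed to be linear regression on the features shared by the two studies in a pair, and every function name is hypothetical.

```python
import numpy as np

def design(study, names):
    """Design matrix with an intercept column and the named feature columns."""
    n = len(next(iter(study.values())))
    return np.column_stack([np.ones(n)] + [study[k] for k in names])

def impute_from_donor(target, donor, feature):
    """Impute `feature` in `target` using a regression fitted on one donor study."""
    # Only features measured in both studies of the pair can serve as predictors.
    shared = sorted((set(target) & set(donor)) - {feature})
    coef, *_ = np.linalg.lstsq(design(donor, shared), donor[feature], rcond=None)
    return design(target, shared) @ coef

def pairwise_impute(target, donors, feature):
    """Average the imputed values over all (target, donor) pairs."""
    usable = [d for d in donors if feature in d]  # donors that measured the feature
    if not usable:
        raise ValueError(f"no donor study measures {feature!r}")
    return np.mean([impute_from_donor(target, d, feature) for d in usable], axis=0)
```

Working one pair at a time lets each regression use every feature the two studies share, rather than only the features common to all studies at once.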
Full Text Available
