

Search for: All records

Creators/Authors contains: "Parmigiani, Giovanni"


  1. This paper presents a new modeling strategy for the joint unsupervised analysis of multiple high-throughput biological studies. As in Multi-study Factor Analysis, our goals are to identify both common factors shared across studies and study-specific factors. Our approach is motivated by the growing body of high-throughput studies in biomedical research, as exemplified by the comprehensive set of expression data on breast tumors considered in our case study. To handle high-dimensional studies, we extend Multi-study Factor Analysis using a Bayesian approach that imposes sparsity. Specifically, we generalize the sparse Bayesian infinite factor model to multiple studies. We also devise novel solutions for the identification of the loading matrices: we recover the loading matrices of interest ex post by adapting the orthogonal Procrustes approach. Computationally, we propose an efficient and fast Gibbs sampling approach. Through an extensive simulation analysis, we show that the proposed approach performs very well in a range of different scenarios, and outperforms standard factor analysis in all scenarios, identifying replicable signal in unsupervised genomic applications. Our analysis of breast cancer gene expression across seven studies identified replicable gene patterns, clearly related to well-known breast cancer pathways. An R package is implemented and available on GitHub.
  2. The benefit of integrating batches of genomic data to increase statistical power is often hindered by batch effects, or unwanted variation in data caused by differences in technical factors across batches. It is therefore critical to address batch effects in genomic data effectively. Many existing methods for batch-effect adjustment assume the data follow a continuous, bell-shaped Gaussian distribution. However, in RNA-seq studies the data are typically skewed, over-dispersed counts, so this assumption is not appropriate and may lead to erroneous results. Negative binomial regression models have been used previously to better capture the properties of counts. We developed a batch correction method, ComBat-seq, using a negative binomial regression model that retains the integer nature of count data in RNA-seq studies, making the batch-adjusted data compatible with common differential expression software packages that require integer counts. We show in realistic simulations that ComBat-seq-adjusted data yield better statistical power and control of false positives in differential expression than data adjusted by the other available methods. We further demonstrate in a real data example that ComBat-seq successfully removes batch effects and recovers the biological signal in the data.
  3. Jointly using data from multiple similar sources to train prediction models is increasingly becoming an important task in many fields of science. In this paper, we propose a framework for generalist and specialist predictions that leverages multiple datasets, with potential heterogeneity in the relationships between predictors and outcomes. Our framework uses ensembling with stacking, and includes three major components: 1) training of the ensemble members using one or more datasets, 2) a no-data-reuse technique for estimating the stacking weights, and 3) task-specific utility functions. We prove that under certain regularity conditions, our framework produces a stacked prediction function with the oracle property. We also provide analytic conditions under which the proposed no-data-reuse technique increases the prediction accuracy of the stacked prediction function compared to using the full data. We perform a simulation study to numerically verify and illustrate these results, and apply our framework to predicting mortality based on a collection of variables including long-term exposure to common air pollutants.
  4. Analyzing multiple studies allows leveraging data from a range of sources and populations, but until recently there have been limited methodologies for the joint unsupervised analysis of multiple high-dimensional studies. A recent method, Bayesian Multi-Study Factor Analysis (BMSFA), identifies latent factors common to all studies, as well as latent factors specific to individual studies. However, BMSFA does not allow for partially shared factors, i.e., latent factors shared by more than one but fewer than all studies. We extend BMSFA by introducing a new method, Tetris, for Bayesian combinatorial multi-study factor analysis, which identifies latent factors that can be shared by any combination of studies. We model the subsets of studies that share latent factors with an Indian Buffet Process. We test our method with an extensive range of simulations, and showcase its utility not only in dimension reduction but also in covariance estimation. Finally, we apply Tetris to high-dimensional gene expression datasets to identify patterns in breast cancer gene expression, both within and across known classes defined by germline mutations.
  5. We investigate the power of censoring techniques, first developed for learning "fair representations", to address domain generalization. We examine adversarial censoring techniques for learning invariant representations from multiple "studies" (or domains), where each study is drawn according to a distribution on domains. The mapping is used at test time to classify instances from a new domain. In many contexts, such as medical forecasting, domain generalization from studies in populous areas (where data are plentiful) to geographically remote populations (for which no training data exist) provides fairness of a different flavor, not anticipated in previous work on algorithmic fairness. We study an adversarial loss function for k domains and precisely characterize its limiting behavior as k grows, formalizing and proving the intuition, backed by experiments, that observing data from a larger number of domains helps. The limiting results are accompanied by non-asymptotic learning-theoretic bounds. Furthermore, we obtain sufficient conditions for good worst-case prediction performance of our algorithm on previously unseen domains. Finally, we decompose our mappings into two components and provide a complete characterization of invariance in terms of this decomposition. To our knowledge, our results provide the first formal guarantees of these kinds for adversarial invariant domain generalization.
  6. Genomic data are often produced in batches due to practical restrictions, which may lead to unwanted variation caused by discrepancies across processing batches. Such "batch effects" often have a negative impact on downstream biological analysis and need careful consideration. In practice, batch effects are usually addressed by specifically designed software, which merges the data from different batches, then estimates batch effects and removes them from the data. Here we focus on classification and prediction problems, and propose a different strategy based on ensemble learning. We first develop prediction models within each batch, then integrate them through ensemble weighting methods. In contrast to the typical approach of removing batch effects from the merged data, our method integrates predictions rather than data. We provide a systematic comparison between these two strategies, using studies targeting diverse populations infected with tuberculosis. In one study, we simulated increasing levels of heterogeneity across random subsets of the study, which we treat as simulated batches. We then use the two methods to develop a genomic classifier for the binary indicator of disease status. We evaluate the accuracy of prediction in another independent study targeting a different population cohort. We observed a turning point in the level of heterogeneity, after which our strategy of integrating predictions yields better discrimination in independent validation than the traditional method of integrating the data. These observations provide practical guidelines for handling batch effects in the development and evaluation of genomic classifiers.
  7. A critical decision point when training predictors using multiple studies is whether the studies should be combined or treated separately. We compare two multi-study learning approaches in the presence of potential heterogeneity in predictor-outcome relationships across datasets. We consider 1) merging all of the datasets and training a single learner, and 2) cross-study learning, which involves training a separate learner on each dataset and combining the resulting predictions. In a linear regression setting, we show analytically and confirm via simulation that merging yields lower prediction error than cross-study learning when the predictor-outcome relationships are relatively homogeneous across studies. However, as heterogeneity increases, there exists a transition point beyond which cross-study learning outperforms merging. We provide analytic expressions for the transition point in various scenarios and study asymptotic properties.
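The sparse Bayesian multi-study factor analysis abstract above recovers loading matrices of interest ex post by adapting the orthogonal Procrustes approach. A minimal sketch of just the alignment step follows — the classical Procrustes solution, not the paper's full post-processing; `procrustes_align` and the simulated matrices are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def procrustes_align(L_draw, L_ref):
    """Orthogonal Procrustes: find the orthogonal R minimizing
    ||L_draw @ R - L_ref||_F, and return the rotated L_draw @ R."""
    U, _, Vt = np.linalg.svd(L_draw.T @ L_ref)
    return L_draw @ (U @ Vt)

# Loading matrices are only identified up to rotation: a rotated draw
# L_ref @ Q implies the same covariance L L' as L_ref itself.
p, k = 50, 4
L_ref = rng.normal(size=(p, k))
Q, _ = np.linalg.qr(rng.normal(size=(k, k)))  # random orthogonal rotation
L_draw = L_ref @ Q

L_aligned = procrustes_align(L_draw, L_ref)  # recovers L_ref up to fp error
```

In an MCMC setting, each posterior draw of a loading matrix would be aligned to a common reference in this way before summarizing across draws.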
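ComBat-seq's distinguishing design point, per the abstract above, is that adjusted data remain integer counts and so stay compatible with differential expression tools. The toy below illustrates only that property with a crude per-batch mean rescaling — it is not ComBat-seq's negative binomial regression model, and `adjust_counts` is a hypothetical name:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy counts: 5 genes x 8 samples; batch 1 carries a 3x depth artifact.
counts = rng.poisson(lam=10, size=(5, 8)).astype(float)
batch = np.array([0, 0, 0, 0, 1, 1, 1, 1])
counts[:, batch == 1] *= 3

def adjust_counts(counts, batch):
    """Crude batch adjustment: rescale each batch to the pooled per-gene
    mean, then round so the output stays integer-valued (the property
    ComBat-seq preserves; the rescaling itself is NOT ComBat-seq)."""
    adjusted = counts.astype(float).copy()
    pooled = counts.mean(axis=1, keepdims=True)
    for b in np.unique(batch):
        idx = batch == b
        batch_mean = counts[:, idx].mean(axis=1, keepdims=True)
        adjusted[:, idx] = counts[:, idx] * pooled / np.maximum(batch_mean, 1e-9)
    return np.rint(adjusted).astype(int)

adj = adjust_counts(counts, batch)  # integer counts, per-gene batch means equalized
```

The real method instead fits gene-wise negative binomial models with batch covariates and, roughly, maps each count to the corresponding quantile of a batch-free distribution, so the output is still a table of non-negative integers.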
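The generalist/specialist stacking abstract above trains ensemble members per dataset and then estimates combination weights. A minimal sketch of that two-stage shape, using plain nonnegativity-truncated least squares for the weights rather than the paper's no-data-reuse estimator (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_ols(X, y):
    """Least-squares fit of a linear ensemble member (no intercept)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Simulate K studies with mildly heterogeneous predictor-outcome relationships.
K, n, p = 3, 100, 4
studies = []
for _ in range(K):
    beta = np.ones(p) + 0.5 * rng.normal(size=p)  # study-specific coefficients
    X = rng.normal(size=(n, p))
    studies.append((X, X @ beta + rng.normal(size=n)))

# 1) One ensemble member per study.
members = [fit_ols(X, y) for X, y in studies]

# 2) Stacking weights: regress pooled outcomes on member predictions
#    (truncated least squares; the paper's no-data-reuse scheme
#    estimates these weights more carefully).
Z = np.vstack([np.column_stack([X @ b for b in members]) for X, _ in studies])
y_all = np.concatenate([y for _, y in studies])
w = np.maximum(np.linalg.lstsq(Z, y_all, rcond=None)[0], 0.0)
w = w / w.sum() if w.sum() > 0 else np.full(K, 1.0 / K)

# 3) The stacked ("generalist") prediction function.
def stacked_predict(X_new):
    return np.column_stack([X_new @ b for b in members]) @ w
```

Each member doubles as a "specialist" for its own study, while the weighted combination serves as the generalist predictor for new populations.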
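The Tetris abstract above models which subsets of studies share each latent factor with an Indian Buffet Process. A minimal sampler for the binary study-by-factor membership matrix under the standard IBP scheme (not Tetris's full posterior; `sample_ibp` is an illustrative name) might look like:

```python
import numpy as np

def sample_ibp(num_studies, alpha, rng):
    """Draw a binary study-by-factor matrix from the Indian Buffet Process:
    study s reuses an existing factor with probability (#users)/s, then
    introduces Poisson(alpha/s) brand-new factors of its own."""
    dishes = []  # number of studies using each factor so far
    rows = []
    for s in range(1, num_studies + 1):
        row = [int(rng.random() < c / s) for c in dishes]
        for i, taken in enumerate(row):
            dishes[i] += taken
        new = rng.poisson(alpha / s)
        row += [1] * new
        dishes += [1] * new
        rows.append(row)
    Z = np.zeros((num_studies, len(dishes)), dtype=int)
    for s, row in enumerate(rows):
        Z[s, :len(row)] = row
    return Z

Z = sample_ibp(6, 2.0, np.random.default_rng(3))
```

Because each study can add new factors, the total number of factors (columns of Z) is inferred rather than fixed in advance, which is what allows factors shared by any combination of studies.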