Recent advances in the automation of metadata normalization, together with the introduction of a unified schema (Brick), alleviate the metadata-normalization challenge for deploying portable applications across buildings. Yet the lack of compatibility between existing metadata normalization methods precludes comparing or combining them. While generic machine learning (ML) frameworks such as MLJAR and OpenML provide versatile interfaces for standard ML problems, they cannot easily accommodate metadata normalization tasks for buildings because of heterogeneity in inference scope, required input data types, evaluation metrics, and the building-specific human-in-the-loop learning procedure. We propose Plaster, an open and modular framework that incorporates existing advances in building metadata normalization. It provides unified programming interfaces for various types of learning methods for metadata normalization and defines standardized data models for building metadata and time series data. It thus enables integrating different methods into a workflow, benchmarking methods via unified interfaces, and rapidly prototyping new algorithms. With Plaster, we 1) show three examples of workflow integration that deliver better performance than the individual algorithms, 2) benchmark and analyze five algorithms over five common buildings, and 3) exemplify the process of developing a new algorithm involving time series features. We believe Plaster will facilitate the development of new algorithms and expedite the adoption of standard metadata schemata such as Brick, enabling seamless smart building applications in the future.
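An illustrative sketch of what such a unified programming interface might look like. The names here (`Normalizer`, `fit`, `predict`, `query_examples`) are hypothetical and not Plaster's actual API; the point is only that every normalization method exposes the same learning loop, so methods can be chained in a workflow or benchmarked head-to-head.

```python
from abc import ABC, abstractmethod

class Normalizer(ABC):
    """Hypothetical unified interface for metadata-normalization methods,
    in the spirit of Plaster's design (illustrative only)."""

    @abstractmethod
    def fit(self, raw_metadata, timeseries=None, labels=None):
        """Learn from raw point names, optional time series data, and any
        human-provided normalized (e.g., Brick) labels."""

    @abstractmethod
    def predict(self, raw_metadata):
        """Map raw metadata strings to normalized classes."""

    def query_examples(self, raw_metadata, k=10):
        """Active-learning hook: choose the k points an expert should label
        next. Default: the first k; real methods would rank by uncertainty."""
        return list(raw_metadata)[:k]
```

Because every method shares `fit`/`predict`, the output of one normalizer can feed the next in a workflow, and all of them can be scored by a single benchmarking harness.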
Batch normalization followed by merging is powerful for phenotype prediction integrating multiple heterogeneous studies
Heterogeneity across genomic studies compromises the performance of machine learning models in cross-study phenotype prediction. Overcoming this heterogeneity when combining studies is a challenging and critical step toward machine learning algorithms whose prediction performance is reproducible on independent datasets. We investigated the best approaches to integrating different studies of the same type of omics data under a variety of heterogeneities. We developed a comprehensive workflow to simulate different types of heterogeneity and to evaluate integration methods together with batch normalization using ComBat. We also demonstrated the results in realistic applications on six colorectal cancer (CRC) metagenomic studies and six tuberculosis (TB) gene expression studies, respectively. We showed that heterogeneity across genomic studies can markedly degrade a machine learning classifier's reproducibility. ComBat normalization improved prediction performance when heterogeneous populations were present and successfully removed batch effects within the same population. We also showed that a classifier's prediction accuracy can decrease markedly as the underlying disease model becomes more different between the training and test populations. Comparing merging and integration strategies, we found that each can outperform the other depending on the scenario. In the realistic applications, prediction accuracy improved when ComBat normalization was applied with either merging or integration in both the CRC and TB studies. We illustrated that batch normalization is essential for mitigating both population differences across studies and batch effects. We also showed that both the merging strategy and integration methods can achieve good performance when combined with batch normalization. In addition, we explored the potential of boosting phenotype prediction performance with rank aggregation methods and showed that they performed similarly to other ensemble learning approaches.
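The location/scale adjustment at the core of the "normalize, then merge" workflow can be sketched with numpy. The function below is a covariate-free simplification of what ComBat estimates (ComBat additionally shrinks the per-batch estimates with empirical Bayes and models known covariates): it aligns each batch's per-feature mean and standard deviation to the pooled data before the batches are merged.

```python
import numpy as np

def batch_adjust(X, batches):
    """Location/scale batch adjustment: rescale each batch's features to the
    pooled mean and standard deviation.

    A simplified, covariate-free stand-in for ComBat (no empirical Bayes
    shrinkage). X: (n_samples, n_features); batches: (n_samples,) labels.
    """
    X = np.asarray(X, dtype=float)
    grand_mean = X.mean(axis=0)
    grand_std = X.std(axis=0)
    X_adj = np.empty_like(X)
    for b in np.unique(batches):
        idx = batches == b
        mu = X[idx].mean(axis=0)
        sd = X[idx].std(axis=0)
        sd[sd == 0] = 1.0          # guard against constant features in a batch
        X_adj[idx] = (X[idx] - mu) / sd * grand_std + grand_mean
    return X_adj
```

After adjustment, every batch shares the same per-feature mean and spread, so a classifier trained on the merged matrix no longer sees the batch label encoded in feature locations.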
- Award ID(s): 2125142
- PAR ID: 10478146
- Editor(s): Roy, Sushmita
- Publisher / Repository: PLoS
- Date Published:
- Journal Name: PLOS Computational Biology
- Volume: 19
- Issue: 10
- ISSN: 1553-7358
- Page Range / eLocation ID: e1010608
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Data integration is a powerful tool for facilitating a comprehensive and generalizable understanding of microbial communities and their association with outcomes of interest. However, integrating data sets from different studies remains challenging because of severe batch effects, unobserved confounding variables, and high heterogeneity across data sets. We propose a new data integration method called MetaDICT, which initially estimates batch effects using weighting methods from the causal inference literature and then refines the estimates via a novel shared dictionary learning. Compared with existing methods, MetaDICT better avoids overcorrection of batch effects and preserves biological variation when unobserved confounding variables exist or data sets are highly heterogeneous across studies. Furthermore, MetaDICT can generate comparable embeddings at both the taxon and sample levels, which can be used to unravel the hidden structure of the integrated data and improve integrative analysis. Applications to synthetic and real microbiome data sets demonstrate the robustness and effectiveness of MetaDICT in integrative analysis. Using MetaDICT, we characterize microbial interactions, identify generalizable microbial signatures, and enhance the accuracy of disease prediction in an integrative analysis of colorectal cancer metagenomics studies.
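A drastically simplified sketch of the shared-representation idea behind such methods. Here the "dictionary" is just the top left singular vectors of the pooled data matrix, with none of MetaDICT's sample weighting or batch-effect refinement; it only shows how one shared basis yields comparable sample embeddings across data sets.

```python
import numpy as np

def shared_embedding(datasets, rank=2):
    """Project several taxa-by-sample matrices onto one shared low-rank basis
    so their samples live in a comparable space.

    A simplified stand-in for shared dictionary learning: the basis is the
    truncated SVD of the pooled data (no weighting, no batch correction).
    """
    pooled = np.hstack(datasets)                     # taxa x (all samples)
    U, _, _ = np.linalg.svd(pooled, full_matrices=False)
    D = U[:, :rank]                                  # shared basis ("dictionary")
    embeddings = [D.T @ X for X in datasets]         # per-dataset sample codes
    return D, embeddings
```

Because every data set is expressed in the same basis `D`, sample embeddings from different studies can be compared or clustered jointly, which is the property the abstract highlights.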
- Genomic data are often produced in batches due to practical restrictions, which may introduce unwanted variation caused by discrepancies across processing batches. Such "batch effects" often negatively impact downstream biological analysis and need careful consideration. In practice, batch effects are usually addressed by specifically designed software that merges the data from different batches, estimates the batch effects, and removes them from the data. Here we focus on classification and prediction problems and propose a different strategy based on ensemble learning: we first develop prediction models within each batch, then integrate them through ensemble weighting methods. In contrast to the typical approach of removing batch effects from the merged data, our method integrates predictions rather than data. We provide a systematic comparison between these two strategies using studies targeting diverse populations infected with tuberculosis. In one study, we simulated increasing levels of heterogeneity across random subsets of the study, which we treated as simulated batches, and used the two methods to develop a genomic classifier for the binary indicator of disease status. We evaluated prediction accuracy on an independent study targeting a different population cohort. We observed a turning point in the level of heterogeneity, beyond which our strategy of integrating predictions yields better discrimination in independent validation than the traditional method of merging the data. These observations provide practical guidelines for handling batch effects in the development and evaluation of genomic classifiers.
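The strategy of integrating predictions rather than data can be sketched in a few lines. The nearest-centroid classifier and the uniform weights below are illustrative stand-ins, assumed for the sketch; the study itself uses real genomic classifiers and ensemble weighting methods.

```python
import numpy as np

class CentroidClassifier:
    """Toy binary classifier (nearest class centroid); a stand-in for any
    per-batch genomic classifier."""
    def fit(self, X, y):
        self.c0 = X[y == 0].mean(axis=0)
        self.c1 = X[y == 1].mean(axis=0)
        return self
    def predict(self, X):
        d0 = ((X - self.c0) ** 2).sum(axis=1)
        d1 = ((X - self.c1) ** 2).sum(axis=1)
        return (d1 < d0).astype(int)

def ensemble_predict(models, weights, X):
    """Integrate predictions, not data: a weighted vote across models that
    were each trained inside a single batch."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    votes = np.stack([m.predict(X) for m in models])   # (n_models, n_samples)
    return (w @ votes >= 0.5).astype(int)
```

Note that no merged, batch-corrected matrix is ever built: each model only ever sees its own batch, and the combination happens purely in prediction space.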
- A critical decision when training predictors on multiple studies is whether the studies should be combined or treated separately. We compare two multi-study learning approaches in the presence of potential heterogeneity in predictor-outcome relationships across datasets: 1) merging all of the datasets and training a single learner, and 2) cross-study learning, which trains a separate learner on each dataset and combines the resulting predictions. In a linear regression setting, we show analytically and confirm via simulation that merging yields lower prediction error than cross-study learning when predictor-outcome relationships are relatively homogeneous across studies. However, as heterogeneity increases, there is a transition point beyond which cross-study learning outperforms merging. We provide analytic expressions for the transition point in various scenarios and study its asymptotic properties.
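A minimal simulation harness for comparing the two approaches, assuming a no-intercept linear model and a simple average of per-study predictions in place of more refined combination weights (both simplifications relative to the paper's setting).

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients (no intercept, for simplicity)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def compare_strategies(studies, X_test, y_test):
    """Merging vs cross-study learning on held-out data.

    studies: list of (X, y) pairs. Returns (mse_merged, mse_cross_study).
    """
    # Strategy 1: merge all datasets and train a single learner.
    X_all = np.vstack([X for X, _ in studies])
    y_all = np.concatenate([y for _, y in studies])
    pred_merged = X_test @ fit_ols(X_all, y_all)

    # Strategy 2: one learner per study; average their predictions.
    preds = [X_test @ fit_ols(X, y) for X, y in studies]
    pred_csl = np.mean(preds, axis=0)

    mse = lambda p: float(((p - y_test) ** 2).mean())
    return mse(pred_merged), mse(pred_csl)
```

Sweeping a heterogeneity parameter (e.g., adding study-specific noise to the coefficients) in such a harness is how one would empirically locate the transition point the abstract describes.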
- Normalization is a critical step in quantitative analyses of biological processes. Recent work shows that cross-platform integration and normalization enable machine learning (ML) training on RNA microarray and RNA-seq data, but no independent datasets were used in those studies, so it is unclear how to improve ML performance on independent microarray- and RNA-seq-based datasets. Inspired by the housekeeping genes commonly used in experimental biology, this study tests the hypothesis that non-differentially expressed genes (NDEG) may improve normalization of transcriptomic data and, subsequently, the cross-platform performance of ML models. Microarray and RNA-seq datasets of TCGA breast cancer were used as independent training and test datasets, respectively, to classify the molecular subtypes of breast cancer. NDEG (p > 0.85) and differentially expressed genes (DEG, p < 0.05) were selected based on the p values of an ANOVA analysis and used for data normalization and classification, respectively. Models trained on data from one platform were tested on the other. Our data show that NDEG and DEG gene selection can effectively improve model classification performance. Normalization methods based on parametric statistics were inferior to those based on nonparametric statistics, and the LOG_QN and LOG_QNZ normalization methods combined with a neural network classifier achieved the best performance in this study. NDEG-based normalization therefore appears useful for cross-platform testing on completely independent datasets, although more studies are required to examine whether it can improve ML classification performance on other datasets and other omics data types.
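As one concrete piece of such a pipeline, here is a numpy sketch of quantile normalization (the "QN" in LOG_QN). The log transform, z-scoring, and NDEG-based gene selection that the study layers on top are omitted; this shows only the core rank-to-reference mapping.

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize the columns (samples) of X to a shared reference
    distribution: the mean of the sorted columns.

    After this step every sample has exactly the same set of values, only
    permuted, which removes platform-wide distributional differences.
    """
    X = np.asarray(X, dtype=float)
    order = np.argsort(X, axis=0)
    ranks = np.argsort(order, axis=0)                 # rank of each entry per column
    reference = np.sort(X, axis=0).mean(axis=1)       # mean value at each rank
    return reference[ranks]
```

Running microarray and RNA-seq samples through the same reference distribution is what makes a model trained on one platform applicable to the other.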