In recent years, incomplete multi-view clustering (IMVC), which studies the challenging multi-view clustering problem on missing views, has received growing research interests. Previous IMVC methods suffer from the following issues: (1) the inaccurate imputation for missing data, which leads to suboptimal clustering performance, and (2) most existing IMVC models merely consider the explicit presence of graph structure in data, ignoring the fact that latent graphs of different views also provide valuable information for the clustering task. To overcome such challenges, we present a novel method, termed Adaptive feature imputation with latent graph for incomplete multi-view clustering (AGDIMC). Specifically, it captures the embbedded features of each view by incorporating the view-specific deep encoders. Then, we construct partial latent graphs on complete data, which can consolidate the intrinsic relationships within each view while preserving the topological information. With the aim of estimating the missing sample based on the available information, we utilize an adaptive imputation layer to impute the embedded feature of missing data by using cross-view soft cluster assignments and global cluster centroids. As the imputation progresses, the portion of complete data increases, contributing to enhancing the discriminative information contained in global pseudo-labels. Meanwhile, to alleviate the negative impact caused by inferior impute samples and the discrepancy of cluster structures, we further design an adaptive imputation strategy based on the global pseudo-label and the local cluster assignment. Experimental results on multiple real-world datasets demonstrate the effectiveness of our method over existing approaches.
more »
« less
Multi-omics Integrative Analysis for Incomplete Data Using Weighted p-Value Adjustment Approaches
Abstract The advancements in high-throughput technologies provide exciting opportunities to obtain multi-omics data from the same individuals in a biomedical study, and joint analyses of data from multiple sources offer many benefits. However, the occurrence of missing values is an inevitable issue in multi-omics data because measurements such as mRNA gene expression levels often require invasive tissue sampling from patients. Common approaches for addressing missing measurements include analyses based on observations with complete data or multiple imputation methods. In this paper, we propose a novel integrative multi-omics analytical framework based onp-value weight adjustment in order to incorporate observations with incomplete data into the analysis. By splitting the data into a complete set with full information and an incomplete set with missing measurements, we introduce mechanisms to derive weights and weight-adjustedp-values from the two sets. Through simulation analyses, we demonstrate that the proposed framework achieves considerable statistical power gains compared to a complete case analysis or multiple imputation approaches. We illustrate the implementation of our proposed framework in a study of preterm infant birth weights by a joint analysis of DNA methylation, mRNA, and the phenotypic outcome. Supplementary materials accompanying this paper appear online.
more »
« less
- Award ID(s):
- 2318925
- PAR ID:
- 10534150
- Publisher / Repository:
- Springer Nature
- Date Published:
- Journal Name:
- Journal of Agricultural, Biological and Environmental Statistics
- ISSN:
- 1085-7117
- Subject(s) / Keyword(s):
- Weighted p-value adjustment Missing value Incomplete Data Integrative multi-omics analysis Omnibus test
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
This paper develops a query-time missing value imputation framework, entitled ZIP, that modifies relational operators to be imputation aware in order to minimize the joint cost of imputing and query processing. The modified operators use a cost-based decision function to determine whether to invoke imputation or to defer to downstream operators to resolve missing values. The modified query processing logic ensures results with deferred imputations are identical to those produced if all missing values were imputed first. ZIP includes a novel outer-join based approach to preserve missing values during execution, and a bloom filter based index to optimize the space and running overhead. Extensive experiments on both real and synthetic data sets demonstrate 10 to 25 times improvement when augmenting the state-of-the-art technology, ImputeDB, with ZIP-based deferred imputation. ZIP also outperforms the offline approach by up to 19607 times in a real data set.more » « less
-
Machine learning with missing data has been approached in two different ways, including feature imputation where missing feature values are estimated based on observed values and label prediction where downstream labels are learned directly from incomplete data. However, existing imputation models tend to have strong prior assumptions and cannot learn from downstream tasks, while models targeting label prediction often involve heuristics and can encounter scalability issues. Here we propose GRAPE, a graph-based framework for feature imputation as well as label prediction. GRAPE tackles the missing data problem using a graph representation, where the observations and features are viewed as two types of nodes in a bipartite graph, and the observed feature values as edges. Under the GRAPE framework, the feature imputation is formulated as an edge-level prediction task and the label prediction as a node-level prediction task. These tasks are then solved with Graph Neural Networks. Experimental results on nine benchmark datasets show that GRAPE yields 20% lower mean absolute error for imputation tasks and 10% lower for label prediction tasks, compared with existing state-of-the-art methods.more » « less
-
As a pervasive issue, missing data may influence the data modeling performance and lead to more difficulties of completing the desired tasks. Many approaches have been developed for missing data imputation. Recently, by taking advantage of the emerging generative adversarial network (GAN), an effective missing data imputation approach termed generative adversarial imputation nets (GAIN) was developed. However, its modeling architecture may still lead to significant imputation bias. In addition, with the GAN structure, the training process of GAIN may be unstable and the imputation variation may be high. Hence, to address these two limitations, the ensemble GAIN with selective multi-generator (ESM-GAIN) is proposed to improve the imputation accuracy and robustness. The contributions of the proposed ESM-GAIN consist of two aspects: (1) a selective multi-generation framework is proposed to identify high-quality imputations; (2) an ensemble learning framework is incorporated for GAIN imputation to improve the imputation robustness. The effectiveness of the proposed ESM-GAIN is validated by both numerical simulation and two real-world breast cancer datasets.more » « less
-
null (Ed.)Spatial classification with limited observations is important in geographical applications where only a subset of sensors are deployed at certain spots or partial responses are collected in field surveys. For example, in observation-based flood inundation mapping, there is a need to map the full flood extent on geographic terrains based on earth imagery that partially covers a region. Existing research mostly focuses on addressing incomplete or missing data through data cleaning and imputation or modeling missing values as hidden variables in the EM algorithm. These methods, however, assume that missing feature observations are rare and thus are ineffective in problems whereby the vast majority of feature observations are missing. To address this issue, we recently proposed a new approach that incorporates physics-aware structural constraint into the model representation. We design efficient learning and inference algorithms. This paper extends our recent approach by allowing feature values of samples in each class to follow a multi-modal distribution. Evaluations on real-world flood mapping applications show that our approach significantly outperforms baseline methods in classification accuracy, and the multi-modal extension is more robust than our early single-modal version. Computational experiments show that the proposed solution is computationally efficient on large datasets.more » « less