Accurate cancer subtype prediction is crucial for personalized medicine. Integrating multi-omics data represents a viable approach to comprehending the intricate pathophysiology of complex diseases like cancer. Conventional machine learning techniques are not ideal for analyzing the complex interrelationships among different categories of omics data. Numerous models have been suggested using graph-based learning to uncover veiled representations and network formations unique to distinct types of omics data to heighten predictions regarding cancers and characterize patients’ profiles, amongst other applications aimed at improving disease management in medical research. The existing graph-based state-of-the-art multi-omics integration approaches for cancer subtype prediction, MOGONET, and SUPREME, use a graph convolutional network (GCN), which fails to consider the level of importance of neighboring nodes on a particular node. To address this gap, we hypothesize that paying attention to each neighbor or providing appropriate weights to neighbors based on their importance might improve the cancer subtype prediction. The natural choice to determine the importance of each neighbor of a node in a graph is to explore the graph attention network (GAT). Here, we propose MOGAT, a novel multi-omics integration approach, leveraging GAT models that incorporate graph-based learning with an attention mechanism. MOGAT utilizes a multi-head attention mechanism to extract appropriate information for a specific sample by assigning unique attention coefficients to neighboring samples. Based on our knowledge, our group is the first to explore GAT in multi-omics integration for cancer subtype prediction. To evaluate the performance of MOGAT in predicting cancer subtypes, we explored two sets of breast cancer data from TCGA and METABRIC. Our proposed approach, MOGAT, outperforms MOGONET by 32% to 46% and SUPREME by 2% to 16% in cancer subtype prediction in different scenarios, supporting our hypothesis. Our results also showed that GAT embeddings provide a better prognosis in differentiating the high-risk group from the low-risk group than raw features.
more »
« less
Bayesian multi-domain learning for cancer subtype discovery from next-generation sequencing count data
Abstract Precision medicine aims for personalized prognosis and therapeutics by utilizing recent genome-scale high-throughput profiling techniques, including next-generation sequencing (NGS). However, translating NGS data faces several challenges. First, NGS count data are often overdispersed, requiring appropriate modeling. Second, compared to the number of involved molecules and system complexity, the number of available samples for studying complex disease, such as cancer, is often limited, especially considering disease heterogeneity. The key question is whether we may integrate available data from all different sources or domains to achieve reproducible disease prognosis based on NGS count data. In this paper, we develop a Bayesian Multi-Domain Learning (BMDL) model that derives domain-dependent latent representations of overdispersed count data based on hierarchical negative binomial factorization for accurate cancer subtyping even if the number of samples for a specific cancer type is small. Experimental results from both our simulated and NGS datasets from The Cancer Genome Atlas (TCGA) demonstrate the promising potential of BMDL for effective multi-domain learning without negative transfer effects often seen in existing multi-task learning and transfer learning methods.
more »
« less
- Award ID(s):
- 1812641
- PAR ID:
- 10096019
- Date Published:
- Journal Name:
- Advances in Neural Information Processing Systems 31 (NIPS 2018)
- Volume:
- 31
- Page Range / eLocation ID:
- 9133-9142
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Low-photon count imaging has been typically modeled by Poisson statistics. This discrete probability distribution model assumes that the mean and variance of a signal are equal. In the presence of greater variability in a dataset than what is expected, the negative binomial distribution is a suitable overdispersed alternative to the Poisson distribution. In this work, we present a framework for reconstructing sparse signals in these low-count overdispersed settings. Specifically, we describe a gradient-based sequential quadratic optimization approach that minimizes the negative log-likelihood corresponding to the negative binomial distribution coupled with a sparsity-promoting regularization term. Numerical experiments on 1D and 2D sparse/compressible signals are presented.more » « less
-
A transfer learning approach based on gradient boosting machine for diagnosis of Alzheimer’s diseaseEarly detection of Alzheimer’s disease (AD) during the Mild Cognitive Impairment (MCI) stage could enable effective intervention to slow down disease progression. Computer-aided diagnosis of AD relies on a sufficient amount of biomarker data. When this requirement is not fulfilled, transfer learning can be used to transfer knowledge from a source domain with more amount of labeled data than available in the desired target domain. In this study, an instance-based transfer learning framework is presented based on the gradient boosting machine (GBM). In GBM, a sequence of base learners is built, and each learner focuses on the errors (residuals) of the previous learner. In our transfer learning version of GBM (TrGB), a weighting mechanism based on the residuals of the base learners is defined for the source instances. Consequently, instances with different distribution than the target data will have a lower impact on the target learner. The proposed weighting scheme aims to transfer as much information as possible from the source domain while avoiding negative transfer. The target data in this study was obtained from the Mount Sinai dataset which is collected and processed in a collaborative 5-year project at the Mount Sinai Medical Center. The Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset was used as the source domain. The experimental results showed that the proposed TrGB algorithm could improve the classification accuracy by 1.5 and 4.5% for CN vs. MCI and multiclass classification, respectively, as compared to the conventional methods. Also, using the TrGB model and transferred knowledge from the CN vs. AD classification of the source domain, the average score of early MCI vs. late MCI classification improved by 5%.more » « less
-
Abstract MotivationPredictive biological signatures provide utility as biomarkers for disease diagnosis and prognosis, as well as prediction of responses to vaccination or therapy. These signatures are identified from high-throughput profiling assays through a combination of dimensionality reduction and machine learning techniques. The genes, proteins, metabolites, and other biological analytes that compose signatures also generate hypotheses on the underlying mechanisms driving biological responses, thus improving biological understanding. Dimensionality reduction is a critical step in signature discovery to address the large number of analytes in omics datasets, especially for multi-omics profiling studies with tens of thousands of measurements. Latent factor models, which can account for the structural heterogeneity across diverse assays, effectively integrate multi-omics data and reduce dimensionality to a small number of factors that capture correlations and associations among measurements. These factors provide biologically interpretable features for predictive modeling. However, multi-omics integration and predictive modeling are generally performed independently in sequential steps, leading to suboptimal factor construction. Combining these steps can yield better multi-omics signatures that are more predictive while still being biologically meaningful. ResultsWe developed a supervised variational Bayesian factor model that extracts multi-omics signatures from high-throughput profiling datasets that can span multiple data types. Signature-based multiPle-omics intEgration via lAtent factoRs (SPEAR) adaptively determines factor rank, emphasis on factor structure, data relevance and feature sparsity. The method improves the reconstruction of underlying factors in synthetic examples and prediction accuracy of coronavirus disease 2019 severity and breast cancer tumor subtypes. Availability and implementationSPEAR is a publicly available R-package hosted at https://bitbucket.org/kleinstein/SPEAR.more » « less
-
Aneuploidy, or an incorrect chromosome number, is ubiquitous among cancers. Whole-genome duplication, resulting in tetraploidy, often occurs during the evolution of aneuploid tumors. Cancers that evolve through a tetraploid intermediate tend to be highly aneuploid and are associated with poor patient prognosis. The identification and enrichment of tetraploid cells from mixed populations is necessary to understand the role these cells play in cancer progression. Dielectrophoresis (DEP), a label-free electrokinetic technique, can distinguish cells based on their intracellular properties when stimulated above 10 MHz, but DEP has not been shown to distinguish tetraploid and/or aneuploid cancer cells from mixed tumor cell populations. Here, we used high-frequency DEP to distinguish cell subpopulations that differ in ploidy and nuclear size under flow conditions. We used impedance analysis to quantify the level of voltage decay at high frequencies and its impact on the DEP force acting on the cell. High-frequency DEP distinguished diploid cells from tetraploid clones due to their size and intracellular composition at frequencies above 40 MHz. Our findings demonstrate that high-frequency DEP can be a useful tool for identifying and distinguishing subpopulations with nuclear differences to determine their roles in disease progression.more » « less
An official website of the United States government

