Title: Prediction in Cancer Genomics Using Topological Signatures and Machine Learning
Copy number aberrations (gains and losses of genomic regions) are a hallmark of cancer and can be detected experimentally using microarray comparative genomic hybridization (aCGH). In previous work, we developed a topology-based method to analyze aCGH data; its output is the set of genomic regions where copy number is altered in patients with a predetermined cancer phenotype. We call this method Topological Analysis of array CGH (TAaCGH). Here we combine TAaCGH with machine learning techniques to build classifiers using copy number aberrations. To illustrate this approach, we apply logistic regression to two binary phenotypes related to breast cancer. The first is over-expression of the ERBB2 gene, which is commonly regulated by a copy number gain in chromosome arm 17q. TAaCGH found the region 17q11-q22 to be associated with the phenotype, and using logistic regression we narrowed this region to 17q12-q21.31, correctly classifying 78% of the ERBB2-positive individuals (sensitivity) in a validation data set. The second is over-expression of the Estrogen Receptor (ER), a phenotype commonly observed in breast cancer patients, for which we found that the region 5p14.3-12 together with six full arms was associated with the phenotype. Our method identified 4p, 6p and 16q as the strongest predictors, correctly classifying 76% of ER positives in our validation data set. However, for this set there was a significant increase in the false positive rate (that is, a loss of specificity). We suggest that topological and machine learning methods can be combined for prediction of phenotypes using genetic data.
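The abstract reports classifier performance as sensitivity and comments on the false positive rate. As a reminder of how these quantities relate, here is a minimal sketch in plain Python. The function names and the toy labels are hypothetical and are not taken from the paper or its data; the only point illustrated is that the false positive rate equals one minus specificity.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fn, tn, fp) for binary labels coded 1 (positive) and 0 (negative)."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp, fn, tn, fp


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true positives that the classifier labels positive."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of true negatives that the classifier labels negative."""
    return tn / (tn + fp)


def false_positive_rate(tn: int, fp: int) -> float:
    """Fraction of true negatives that the classifier labels positive."""
    return fp / (fp + tn)


# Hypothetical toy labels (NOT data from the paper): 1 = phenotype present.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, fn, tn, fp = confusion_counts(y_true, y_pred)
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
fpr = false_positive_rate(tn, fp)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} false_positive_rate={fpr:.3f}")
```

A rise in the false positive rate therefore corresponds to a drop in specificity, even when sensitivity stays high.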
Award ID(s):
1854770
PAR ID:
10172038
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Topological Data Analysis
Volume:
15
Page Range / eLocation ID:
247-276
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Przytycka, Teresa M. (Ed.)
    Copy-number aberrations (CNAs) are genetic alterations that amplify or delete the number of copies of large genomic segments. Although they are ubiquitous in cancer and, thus, a critical area of current cancer research, CNA identification from DNA sequencing data is challenging because it requires partitioning of the genome into complex segments with the same copy-number states that may not be contiguous. Existing segmentation algorithms address these challenges either by leveraging the local information among neighboring genomic regions, or by globally grouping genomic regions that are affected by similar CNAs across the entire genome. However, both approaches have limitations: overclustering in the case of local segmentation, or the omission of clusters corresponding to focal CNAs in the case of global segmentation. Importantly, inaccurate segmentation will lead to inaccurate identification of CNAs. For this reason, most pan-cancer research studies rely on manual procedures of quality control and anomaly correction. To improve copy-number segmentation, we introduce CNAViz, a web-based tool that enables the user to simultaneously perform local and global segmentation, thus overcoming the limitations of each approach. Using simulated data, we demonstrate that by several metrics, CNAViz allows the user to obtain more accurate segmentation relative to existing local and global segmentation methods. Moreover, we analyze six bulk DNA sequencing samples from three breast cancer patients. By validating with parallel single-cell DNA sequencing data from the same samples, we show that by using CNAViz, our user was able to obtain more accurate segmentation and improved accuracy in downstream copy-number calling.
  2. Copy number changes play an important role in the development of cancer and are commonly associated with changes in gene expression. Persistence curves, such as Betti curves, have been used to detect copy number changes; however, these curves are known to be unstable with respect to small perturbations in the data. We address the stability of lifespan and Betti curves by providing bounds on the distance between persistence curves of Vietoris–Rips filtrations built on data and slightly perturbed data in terms of the bottleneck distance. Next, we perform simulations to compare the ability of Betti curves, lifespan curves (conditionally stable) and stable persistence landscapes to detect copy number aberrations. We use these methods to identify significant chromosome regions associated with the four major molecular subtypes of breast cancer: Luminal A, Luminal B, Basal and HER2 positive. Identified segments are then used as predictor variables to build machine learning models which classify patients as one of the four subtypes. We find that no single persistence curve outperforms the others and instead suggest a complementary approach using a suite of persistence curves. In this study, we identified new cytobands associated with three of the subtypes: 1q21.1-q25.2, 2p23.2-p16.3 and 23q26.2-q28 with the Basal subtype; 8p22-p11.1 with Luminal B; and 2q12.1-q21.1 and 5p14.3-p12 with Luminal A. These segments are validated by the TCGA BRCA cohort dataset except for those found for Luminal A.
  3. Background: Neoadjuvant chemotherapy (NACT) is an increasingly used approach for treatment of breast cancer. The pathological complete response (pCR) is considered a good predictor of disease-specific survival. This study investigated whether circulating exosomal microRNAs could predict pCR in breast cancer patients treated with NACT. Method: Plasma samples of 20 breast cancer patients treated with NACT were collected prior to and after the first cycle. RNA sequencing was used to profile microRNAs. The Cancer Genome Atlas (TCGA) was used to explore the expression patterns and survival associations of the candidate miRNAs, and their potential targets based on the expression levels and copy number variation (CNV) data. Results: Three miRNAs measured before NACT (miR-30b, miR-328 and miR-423) predicted pCR in all of the analyzed samples. Upregulation of miR-127 correlated with pCR in triple-negative breast cancer (TNBC). After the first NACT dose, pCR was predicted by exo-miR-141, while miR-34a, exo-miR-182, and exo-miR-183 predicted non-pCR. A significant correlation was observed between the candidate miRNAs and overall survival, subtype, and metastasis in breast cancer, suggesting their potential role as predictive biomarkers of pCR. Conclusions: If the miRNAs identified in this study are validated in a large cohort of patients, they might serve as predictive non-invasive liquid biopsy biomarkers for monitoring pCR to NACT in breast cancer.
  4. Breast cancer is highly sporadic and heterogeneous in nature. Even patients with the same clinical stage do not cluster together in terms of genomic profiles such as mRNA expression. In order to prevent and cure breast cancer completely, it is essential to decipher the detailed heterogeneity of breast cancer at the genomic level. Putting the cancer patients on a time scale, which represents the trajectory of cancer development, may help discover the detailed heterogeneity. This in turn would help establish the mechanisms for prevention and complete cure of breast cancer. The goal of this study is to discover the heterogeneity of breast cancer by ordering the cancer patients using pseudotime. This is achieved through two objectives: first, a computational framework is developed to place the cancer patients on a time scale, that is, to construct a trajectory of cancer development, by inferring pseudotime from static mRNA expression data; second, breast cancer heterogeneity is characterized at different time periods of the trajectory using statistical and machine learning techniques. In this study, the trajectory of breast cancer progression was constructed using static mRNA expression profiles of 1072 breast cancer patients by inferring pseudotime. Three sets of key genes discovered using supervised machine learning techniques are used to develop the trajectories. The first set consists of the PAM50 genes, which are available in the literature. The second and third sets of genes were discovered in the present study using the clinical stages of breast cancer (Stage-I, Stage-II, Stage-III, and Stage-IV). The proposed computational framework has the capability of deciphering heterogeneity in breast cancer at a granular level. The results also show the existence of multiple parallel trajectories at different time periods of cancer development or progression.
  5. Background: We investigated the association between reproductive risk factors and breast cancer subtype in Black women. On the basis of the previous literature, we hypothesized that the relative prevalence of specific breast cancer subtypes might differ according to reproductive factors. Methods: We conducted a pooled analysis of 2,188 (591 premenopausal, 1,597 postmenopausal) Black women with a primary diagnosis of breast cancer from four studies in the southeastern United States. Breast cancers were classified by clinical subtype. Case-only polytomous logistic regression models were used to estimate ORs and 95% confidence intervals (CI) for HER2+ and triple-negative breast cancer (TNBC) status in relation to estrogen receptor–positive (ER+)/HER2− status (referent) for reproductive risk factors. Results: Relative to women who had ER+/HER2− tumors, women who were age 19–24 years at first birth (OR, 1.78; 95% CI, 1.22–2.59) were more likely to have TNBC. Parous women were less likely to be diagnosed with HER2+ breast cancer and more likely to be diagnosed with TNBC relative to ER+/HER2− breast cancer. Postmenopausal parous women who breastfed were less likely to have TNBC [OR, 0.65 (95% CI, 0.43–0.99)]. Conclusions: This large pooled study of Black women with breast cancer revealed etiologic heterogeneity among breast cancer subtypes. Impact: Black parous women who do not breastfeed are more likely to be diagnosed with TNBC, which has a worse prognosis, than with ER+/HER2− breast cancer. 