Title: CNRein: an evolution-aware deep reinforcement learning algorithm for single-cell DNA copy number calling
Abstract Low-pass single-cell DNA sequencing technologies and algorithmic advancements have enabled haplotype-specific copy number calling on thousands of cells within tumors. However, measurement uncertainty may result in spurious CNAs inconsistent with realistic evolutionary constraints. We introduce evolution-aware copy number calling via deep reinforcement learning (CNRein). Our simulations demonstrate CNRein infers more accurate copy-number profiles and better recapitulates ground truth clonal structure than existing methods. On sequencing data of breast and ovarian cancer, CNRein produces more parsimonious solutions than existing methods while maintaining agreement with single-nucleotide variants. Additionally, CNRein shows consistency on a breast cancer patient sequenced with distinct low-pass technologies.
Award ID(s):
2046488
PAR ID:
10581465
Author(s) / Creator(s):
;
Publisher / Repository:
Springer Science + Business Media
Date Published:
Journal Name:
Genome Biology
Volume:
26
Issue:
1
ISSN:
1474-760X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Przytycka, Teresa M. (Ed.)
    Copy-number aberrations (CNAs) are genetic alterations that amplify or delete the number of copies of large genomic segments. Although they are ubiquitous in cancer and, thus, a critical area of current cancer research, CNA identification from DNA sequencing data is challenging because it requires partitioning of the genome into complex segments with the same copy-number states that may not be contiguous. Existing segmentation algorithms address these challenges either by leveraging the local information among neighboring genomic regions, or by globally grouping genomic regions that are affected by similar CNAs across the entire genome. However, both approaches have limitations: overclustering in the case of local segmentation, or the omission of clusters corresponding to focal CNAs in the case of global segmentation. Importantly, inaccurate segmentation will lead to inaccurate identification of CNAs. For this reason, most pan-cancer research studies rely on manual procedures of quality control and anomaly correction. To improve copy-number segmentation, we introduce CNAViz, a web-based tool that enables the user to simultaneously perform local and global segmentation, thus overcoming the limitations of each approach. Using simulated data, we demonstrate that by several metrics, CNAViz allows the user to obtain more accurate segmentation relative to existing local and global segmentation methods. Moreover, we analyze six bulk DNA sequencing samples from three breast cancer patients. By validating with parallel single-cell DNA sequencing data from the same samples, we show that by using CNAViz, our user was able to obtain more accurate segmentation and improved accuracy in downstream copy-number calling.
  2. Somatic copy number alterations (sCNAs) are valuable phylogenetic markers for inferring evolutionary relationships among tumor cell subpopulations. Advances in single-cell DNA sequencing technologies are making it possible to obtain such sCNA datasets at ever-larger scales. However, existing methods for reconstructing phylogenies from sCNAs are often too slow for large datasets. We propose two new distance-based methods, DICE-bar and DICE-star, for reconstructing single-cell tumor phylogenies from sCNA data. Using carefully simulated datasets, we find that DICE-bar matches or exceeds the accuracies of all other methods on noise-free datasets and that DICE-star shows exceptional robustness to noise and outperforms all other methods on noisy datasets. Both methods are also orders of magnitude faster than many existing methods. Our experimental analysis also reveals how noise/error in copy number inference, as expected for real datasets, can drastically impact the accuracies of most methods. We apply DICE-star, the most accurate method on error-prone datasets, to several real single-cell breast and ovarian cancer datasets and find that it rapidly produces phylogenies of equivalent or greater reliability compared with existing methods.
  3. Abstract Intra-tumor heterogeneity renders the identification of somatic single-nucleotide variants (SNVs) a challenging problem. In particular, low-frequency SNVs are hard to distinguish from sequencing artifacts. While the increasing availability of multi-sample tumor DNA sequencing data holds the potential for more accurate variant calling, there is a lack of high-sensitivity multi-sample SNV callers that utilize these data. Here we report Moss, a method to identify low-frequency SNVs that recur in multiple sequencing samples from the same tumor. Moss provides any existing single-sample SNV caller the ability to support multiple samples with little additional time overhead. We demonstrate that Moss improves recall while maintaining high precision in a simulated dataset. On multi-sample hepatocellular carcinoma, acute myeloid leukemia and colorectal cancer datasets, Moss identifies new low-frequency variants that meet manual review criteria and are consistent with the tumor’s mutational signature profile. In addition, Moss detects the presence of variants in more samples of the same tumor than reported by the single-sample caller. Moss’ improved sensitivity in SNV calling will enable more detailed downstream analyses in cancer genomics. 
  4. Abstract Motivation: Single-nucleotide variants (SNVs) are the most common variations in the human genome. Recently developed methods for SNV detection from single-cell DNA sequencing data, such as SCIΦ and scVILP, leverage the evolutionary history of the cells to overcome the technical errors associated with single-cell sequencing protocols. Despite being accurate, these methods are not scalable to the extensive genomic breadth of single-cell whole-genome (scWGS) and whole-exome sequencing (scWES) data. Results: Here, we report on a new scalable method, Phylovar, which extends the phylogeny-guided variant calling approach to sequencing datasets containing millions of loci. Through benchmarking on simulated datasets under different settings, we show that Phylovar outperforms SCIΦ in terms of running time while being more accurate than Monovar (which is not phylogeny-aware) in terms of SNV detection. Furthermore, we applied Phylovar to two real biological datasets: an scWES triple-negative breast cancer dataset consisting of 32 cells and 3375 loci, and an scWGS dataset of neuron cells from a normal human brain containing 16 cells and approximately 2.5 million loci. For the cancer data, Phylovar detected somatic SNVs with high or moderate functional impact that were also supported by a bulk sequencing dataset, and for the neuron dataset, Phylovar identified 5745 SNVs with non-synonymous effects, some of which were associated with neurodegenerative diseases. Availability and implementation: Phylovar is implemented in Python and is publicly available at https://github.com/NakhlehLab/Phylovar.
  5. Abstract Motivation: Advances in whole-genome single-cell DNA sequencing (scDNA-seq) have led to the development of numerous methods for detecting copy number aberrations (CNAs), a key driver of genetic heterogeneity in cancer. While most of these methods are limited to the inference of total copy number, some recent approaches now infer allele-specific CNAs using innovative techniques for estimating allele frequencies in low-coverage scDNA-seq data. However, these existing allele-specific methods are limited in their segmentation strategies, a crucial step in the CNA detection pipeline. Results: We present SEACON (Single-cell Estimation of Allele-specific COpy Numbers), an allele-specific copy number profiler for scDNA-seq data. SEACON uses a Gaussian Mixture Model to identify latent copy number states and breakpoints between contiguous segments across cells, filters the segments for high-quality breakpoints using an ensemble technique, and adopts several strategies for tolerating noisy read-depth and allele frequency measurements. Using a wide array of both real and simulated datasets, we show that SEACON derives accurate copy numbers and surpasses existing approaches under numerous experimental conditions, and we identify its strengths and weaknesses. Availability and implementation: SEACON is implemented in Python and is freely available as open source from https://github.com/NabaviLab/SEACON and https://doi.org/10.5281/zenodo.12727008.