Inspired by recent efforts to model cancer evolution with phylogenetic trees, we consider the problem of finding a consensus tumor evolution tree from a set of conflicting input trees. In contrast to traditional phylogenetic trees, the tumor trees we consider contain features such as mutation labels on internal vertices (in addition to the leaves) and allow multiple mutations to label a single vertex. We describe several distance measures between these tumor trees and present an algorithm to solve the consensus problem called GraPhyC. Our approach uses a weighted directed graph where vertices are sets of mutations and edges are weighted using a function that depends on the number of times a parental relationship is observed between their constituent mutations in the set of input trees. We find a minimum weight spanning arborescence in this graph and prove that the resulting tree minimizes the total distance to all input trees for one of our presented distance measures. We evaluate our GraPhyC method using both simulated and real data. On simulated data we show that our method outperforms a baseline method at finding an appropriate representative tree. Using a set of tumor trees derived from both whole-genome and deep sequencing data from a Chronic Lymphocytic Leukemia patient we find that our approach identifies a tree not included in the set of input trees, but that contains characteristics supported by other reported evolutionary reconstructions of this tumor.
more »
« less
Modeling and predicting cancer clonal evolution with reinforcement learning
Cancer results from an evolutionary process that typically yields multiple clones with varying sets of mutations within the same tumor. Accurately modeling this process is key to understanding and predicting cancer evolution. Here, we introduce clone to mutation (CloMu), a flexible and low-parameter tree generative model of cancer evolution. CloMu uses a two-layer neural network trained via reinforcement learning to determine the probability of new mutations based on the existing mutations on a clone. CloMu supports several prediction tasks, including the determination of evolutionary trajectories, tree selection, causality and interchangeability between mutations, and mutation fitness. Importantly, previous methods support only some of these tasks, and many suffer from overfitting on data sets with a large number of mutations. Using simulations, we show that CloMu either matches or outperforms current methods on a wide variety of prediction tasks. In particular, for simulated data with interchangeable mutations, current methods are unable to uncover causal relationships as effectively as CloMu. On breast cancer and leukemia cohorts, we show that CloMu determines similarities and causal relationships between mutations as well as the fitness of mutations. We validate CloMu's inferred mutation fitness values for the leukemia cohort by comparing them to clonal proportion data not used during training, showing high concordance. In summary, CloMu's low-parameter model facilitates a wide range of prediction tasks regarding cancer evolution on increasingly available cohort-level data sets.
more »
« less
- Award ID(s):
- 2046488
- PAR ID:
- 10500552
- Publisher / Repository:
- Genome Research
- Date Published:
- Journal Name:
- Genome Research
- ISSN:
- 1088-9051
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
null (Ed.)Abstract Many models of evolution are implicitly causal processes. Features such as causal feedback between evolutionary variables and evolutionary processes acting at multiple levels, though, mean that conventional causal models miss important phenomena. We develop here a general theoretical framework for analyzing evolutionary processes drawing on recent approaches to causal modeling developed in the machine-learning literature, which have extended Pearls do-calculus to incorporate cyclic causal interactions and multilevel causation. We also develop information-theoretic notions necessary to analyze causal information dynamics in our framework, introducing a causal generalization of the Partial Information Decomposition framework. We show how our causal framework helps to clarify conceptual issues in the contexts of complex trait analysis and cancer genetics, including assigning variation in an observed trait to genetic, epigenetic and environmental sources in the presence of epigenetic and environmental feedback processes, and variation in fitness to mutation processes in cancer using a multilevel causal model respectively, as well as relating causally-induced to observed variation in these variables via information theoretic bounds. In the process, we introduce a general class of multilevel causal evolutionary processes which connect evolutionary processes at multiple levels via coarse-graining relationships. Further, we show how a range of fitness models can be formulated in our framework, as well as a causal analog of Prices equation (generalizing the probabilistic Rice equation), clarifying the relationships between realized/probabilistic fitness and direct/indirect selection. Finally, we consider the potential relevance of our framework to foundational issues in biology and evolution, including supervenience, multilevel selection and individuality. Particularly, we argue that our class of multilevel causal evolutionary processes, in conjunction with a minimum description length principle, provides a conceptual framework in which identification of multiple levels of selection may be reduced to a model selection problem.more » « less
-
Abstract Motivation In recent years, the well-known Infinite Sites Assumption (ISA) has been a fundamental feature of computational methods devised for reconstructing tumor phylogenies and inferring cancer progressions. However, recent studies leveraging Single-Cell Sequencing (SCS) techniques have shown evidence of the widespread recurrence and, especially, loss of mutations in several tumor samples. While there exist established computational methods that infer phylogenies with mutation losses, there remain some advancements to be made. Results We present SASC (Simulated Annealing Single-Cell inference): a new and robust approach based on simulated annealing for the inference of cancer progression from SCS data sets. In particular, we introduce an extension of the model of evolution where mutations are only accumulated, by allowing also a limited amount of mutation loss in the evolutionary history of the tumor: the Dollo-k model. We demonstrate that SASC achieves high levels of accuracy when tested on both simulated and real data sets and in comparison with some other available methods. Availability The Simulated Annealing Single-Cell inference (SASC) tool is open source and available at https://github.com/sciccolella/sasc. Supplementary information Supplementary data are available at Bioinformatics online.more » « less
-
null (Ed.)Abstract Motivation While each cancer is the result of an isolated evolutionary process, there are repeated patterns in tumorigenesis defined by recurrent driver mutations and their temporal ordering. Such repeated evolutionary trajectories hold the potential to improve stratification of cancer patients into subtypes with distinct survival and therapy response profiles. However, current cancer phylogeny methods infer large solution spaces of plausible evolutionary histories from the same sequencing data, obfuscating repeated evolutionary patterns. Results To simultaneously resolve ambiguities in sequencing data and identify cancer subtypes, we propose to leverage common patterns of evolution found in patient cohorts. We first formulate the Multiple Choice Consensus Tree problem, which seeks to select a tumor tree for each patient and assign patients into clusters in such a way that maximizes consistency within each cluster of patient trees. We prove that this problem is NP-hard and develop a heuristic algorithm, Revealing Evolutionary Consensus Across Patients (RECAP), to solve this problem in practice. Finally, on simulated data, we show RECAP outperforms existing methods that do not account for patient subtypes. We then use RECAP to resolve ambiguities in patient trees and find repeated evolutionary trajectories in lung and breast cancer cohorts. Availability and implementation https://github.com/elkebir-group/RECAP. Supplementary information Supplementary data are available at Bioinformatics online.more » « less
-
Understanding how mutations arise and spread through individuals and populations is fundamental to evolutionary biology. Most organisms have a life cycle with unicellular bottlenecks during reproduction. However, some organisms like plants, fungi, or colonial animals can grow indefinitely, changing the manner in which mutations spread throughout both the individual and the population. Furthermore, clonally reproducing organisms may also achieve exceedingly long lifespans, making somatic mutation an important mechanism of creating heritable variation for Darwinian evolution by natural selection. Yet, little is known about intra-organism mutation rates and evolutionary trajectories in long-lived species. Here, we study the Pando aspen clone, the largest known quaking aspen (Populus tremuloides) clone founded by a single seedling and thought to be one of the oldest studied organisms. Aspen reproduce vegetatively via new root-borne stems forming clonal patches, sometimes spanning several hectares. To study the evolutionary history of the Pando clone, we collected and sequenced over 500 samples from Pando and neighboring clones, as well as from various tissue types within Pando, including leaves, roots, and bark. We applied a series of filters to distinguish somatic mutations from the pool of both somatic and germline mutations, incorporating a technical replicate sequencing approach to account for uncertainty in somatic mutation detection. Despite root spreading being spatially constrained, we observed only a modest positive correlation between genetic and spatial distance, suggesting the presence of a mechanism preventing the accumulation and spread of mutations across units. Phylogenetic models estimate the age of the clone to between ~16,000-80,000 years. This age is generally corroborated by the near-continuous presence of aspen pollen in a lake sediment record collected from Fish Lake near Pando. Overall, this work enhances understanding of mutation accumulation and dispersal within and between ramets of long-lived, clonally-reproducing organisms. Significance StatementThis study enhances our understanding of evolutionary processes in long-lived clonal organisms by investigating somatic mutation accumulation and dispersal patterns within the iconic Pando aspen clone. The authors estimated the clone to be between 10,000 and 80,000 years old and uncovered a modest spatial genetic structure in the 42.6-hectare clone, suggesting localized mutation build-up rather than dispersal along tissue lineages. This work sheds light on an ancient organism and how plants may evolve to preserve genetic integrity in meristems fueling indefinite growth, with implications for our comprehension of adaptive strategies in long-lived perennials.more » « less
An official website of the United States government

