skip to main content

Title: A Consensus Approach to Infer Tumor Evolutionary Histories
Inspired by recent efforts to model cancer evolution with phylogenetic trees, we consider the problem of finding a consensus tumor evolution tree from a set of conflicting input trees. In contrast to traditional phylogenetic trees, the tumor trees we consider contain features such as mutation labels on internal vertices (in addition to the leaves) and allow multiple mutations to label a single vertex. We describe several distance measures between these tumor trees and present an algorithm to solve the consensus problem called GraPhyC. Our approach uses a weighted directed graph where vertices are sets of mutations and edges are weighted using a function that depends on the number of times a parental relationship is observed between their constituent mutations in the set of input trees. We find a minimum weight spanning arborescence in this graph and prove that the resulting tree minimizes the total distance to all input trees for one of our presented distance measures. We evaluate our GraPhyC method using both simulated and real data. On simulated data we show that our method outperforms a baseline method at finding an appropriate representative tree. Using a set of tumor trees derived from both whole-genome and deep sequencing data from more » a Chronic Lymphocytic Leukemia patient we find that our approach identifies a tree not included in the set of input trees, but that contains characteristics supported by other reported evolutionary reconstructions of this tumor. « less
; ;
Award ID(s):
Publication Date:
Journal Name:
BCB '18 Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
Page Range or eLocation-ID:
63 to 72
Sponsoring Org:
National Science Foundation
More Like this
  1. Vehicle routing problems are a broad class of combinatorial optimization problems that can be formulated as the problem of finding a tour in a weighted graph that optimizes some function of the visited vertices. For instance, a canonical and extensively studied vehicle routing problem is the orienteering problem where the goal is to find a tour that maximizes the number of vertices visited by a given deadline. In this paper, we consider the computational tractability of a well-known generalization of the orienteering problem called the Orient-MTW problem. The input to Orient-MTW consists of a weighted graph G(V, E) where formore »each vertex v ∊ V we are given a set of time instants Tv ⊆ [T], and a source vertex s. A tour starting at s is said to visit a vertex v if it transits through v at any time in the set Tv. The goal is to find a tour starting at the source vertex that maximizes the number of vertices visited. It is known that this problem admits a quasi-polynomial time O(log OPT)-approximation ratio where OPT is the optimal solution value but until now no hardness better than an APX-hardness was known for this problem. Our main result is an -hardness for this problem that holds even when the underlying graph G is an undirected tree. This is the first super-constant hardness result for the Orient-MTW problem. The starting point for our result is the hardness of the SetCover problem which is known to hold on instances with a special structure. We exploit this special structure of the hard SetCover instances to first obtain a new proof of the APX-hardness result for Orient-MTW that holds even on trees of depth 2. We then recursively amplify this constant factor hardness to an -hardness, while keeping the resulting topology to be a tree. Our amplified hardness proof crucially utilizes a delicate concavity property which shows that in our encoding of SetCover instances as instances of the Orient-MTW problem, whenever the optimal cost for SetCover instance is large, any tour, no matter how it allocates its time across different sub-trees, can not visit too many vertices overall. We believe that this reduction template may also prove useful in showing hardness of other vehicle routing problems.« less
  2. Abstract Motivation There has been recent increased interest in using algorithmic methods to infer the evolutionary tree underlying the developmental history of a tumor. Quantitative measures that compare such trees are vital to a number of different applications including benchmarking tree inference methods and evaluating common inheritance patterns across patients. However, few appropriate distance measures exist, and those that do have low resolution for differentiating trees or do not fully account for the complex relationship between tree topology and the inheritance of the mutations labeling that topology. Results Here we present two novel distance measures, Common Ancestor Set distance (CASet)more »and Distinctly Inherited Set Comparison distance (DISC), that are specifically designed to account for the subclonal mutation inheritance patterns characteristic of tumor evolutionary trees. We apply CASet and DISC to multiple simulated datasets and two breast cancer datasets and show that our distance measures allow for more nuanced and accurate delineation between tumor evolutionary trees than existing distance measures. Availability and implementation Implementations of CASet and DISC are freely available at: Supplementary information Supplementary data are available at Bioinformatics online.« less
  3. Abstract Motivation While each cancer is the result of an isolated evolutionary process, there are repeated patterns in tumorigenesis defined by recurrent driver mutations and their temporal ordering. Such repeated evolutionary trajectories hold the potential to improve stratification of cancer patients into subtypes with distinct survival and therapy response profiles. However, current cancer phylogeny methods infer large solution spaces of plausible evolutionary histories from the same sequencing data, obfuscating repeated evolutionary patterns. Results To simultaneously resolve ambiguities in sequencing data and identify cancer subtypes, we propose to leverage common patterns of evolution found in patient cohorts. We first formulate themore »Multiple Choice Consensus Tree problem, which seeks to select a tumor tree for each patient and assign patients into clusters in such a way that maximizes consistency within each cluster of patient trees. We prove that this problem is NP-hard and develop a heuristic algorithm, Revealing Evolutionary Consensus Across Patients (RECAP), to solve this problem in practice. Finally, on simulated data, we show RECAP outperforms existing methods that do not account for patient subtypes. We then use RECAP to resolve ambiguities in patient trees and find repeated evolutionary trajectories in lung and breast cancer cohorts. Availability and implementation Supplementary information Supplementary data are available at Bioinformatics online.« less
  4. Abstract Motivation Consider a simple computational problem. The inputs are (i) the set of mixed reads generated from a sample that combines two organisms and (ii) separate sets of reads for several reference genomes of known origins. The goal is to find the two organisms that constitute the mixed sample. When constituents are absent from the reference set, we seek to phylogenetically position them with respect to the underlying tree of the reference species. This simple yet fundamental problem (which we call phylogenetic double-placement) has enjoyed surprisingly little attention in the literature. As genome skimming (low-pass sequencing of genomes atmore »low coverage, precluding assembly) becomes more prevalent, this problem finds wide-ranging applications in areas as varied as biodiversity research, food production and provenance, and evolutionary reconstruction. Results We introduce a model that relates distances between a mixed sample and reference species to the distances between constituents and reference species. Our model is based on Jaccard indices computed between each sample represented as k-mer sets. The model, built on several assumptions and approximations, allows us to formalize the phylogenetic double-placement problem as a non-convex optimization problem that decomposes mixture distances and performs phylogenetic placement simultaneously. Using a variety of techniques, we are able to solve this optimization problem numerically. We test the resulting method, called MIxed Sample Analysis tool (MISA), on a varied set of simulated and biological datasets. Despite all the assumptions used, the method performs remarkably well in practice. Availability and implementation The software and data are available at and Supplementary information Supplementary data are available at Bioinformatics online.« less
  5. Abstract Despite the ubiquitous use of statistical models for phylogenomic and population genomic inferences, this model-based rigor is rarely applied to post hoc comparison of trees. In a recent study, Garba et al. derived new methods for measuring the distance between two gene trees computed as the difference in their site pattern probability distributions. Unlike traditional metrics that compare trees solely in terms of geometry, these measures consider gene trees and associated parameters as probabilistic models that can be compared using standard information theoretic approaches. Consequently, probabilistic measures of phylogenetic tree distance can be far more informative than simply comparisonsmore »of topology and/or branch lengths alone. However, in their current form, these distance measures are not suitable for the comparison of species tree models in the presence of gene tree heterogeneity. Here, we demonstrate an approach for how the theory of Garba et al. (2018), which is based on gene tree distances, can be extended naturally to the comparison of species tree models. Multispecies coalescent (MSC) models parameterize the discrete probability distribution of gene trees conditioned upon a species tree with a particular topology and set of divergence times (in coalescent units), and thus provide a framework for measuring distances between species tree models in terms of their corresponding gene tree topology probabilities. We describe the computation of probabilistic species tree distances in the context of standard MSC models, which assume complete genetic isolation postspeciation, as well as recent theoretical extensions to the MSC in the form of network-based MSC models that relax this assumption and permit hybridization among taxa. We demonstrate these metrics using simulations and empirical species tree estimates and discuss both the benefits and limitations of these approaches. We make our species tree distance approach available as an R package called pSTDistanceR, for open use by the community.« less