Just as a phylogeny encodes the evolutionary relationships among a group of organisms, a cophylogeny represents the coevolutionary relationships among symbiotic partners. Both are primarily reconstructed using computational analysis of biomolecular sequence data. The most widely used cophylogenetic reconstruction methods utilize an important simplifying assumption: species phylogenies for each set of coevolved taxa are required as input and assumed to be correct. Many studies have shown that this assumption is rarely – if ever – satisfied, and the consequences for cophylogenetic studies are poorly understood. To address this gap, we conduct a comprehensive performance study that quantifies the relationship between species tree estimation error and downstream cophylogenetic estimation accuracy. We study the performance of state-of-the-art methods for cophylogenetic reconstruction using in silico model-based simulations. Our investigation also assessed cophylogenetic reproducibility using genomic sequence data from two important models of symbiosis: soil-associated fungi and their endosymbiotic bacteria, and bobtail squid and their bioluminescent bacterial symbionts. Our findings conclusively demonstrate the major impact that upstream phylogenetic estimation error has on downstream cophylogenetic reconstruction. Relative to other experimental factors such as cophylogenetic estimation method choice and coevolutionary event costs, phylogenetic estimation error ranked highest in importance based on a random forest-based variable importance assessment. We conclude with practical guidance and future research directions. Among the many considerations needed for accurate cophylogenetic reconstruction – choice of computational method, method settings, sampling design, and others – just as much attention must be paid to careful species phylogeny estimation using modern best practices.
FastNet: Fast and Accurate Statistical Inference of Phylogenetic Networks Using Large-Scale Genomic Sequence Data
An emerging discovery in phylogenomics is that interspecific gene flow has played a major role in the evolution of many different organisms. To what extent is the Tree of Life not truly a tree reflecting strict “vertical” divergence, but rather a more general graph structure known as a phylogenetic network which also captures “horizontal” gene flow? The answer to this fundamental question not only depends upon densely sampled and divergent genomic sequence data, but also computational methods which are capable of accurately and efficiently inferring phylogenetic networks from large-scale genomic sequence datasets. Recent methodological advances have attempted to address this gap. However, in the 2016 performance study of Hejase and Liu, state-of-the-art methods fell well short of the scalability requirements of existing phylogenomic studies. The methodological gap remains: how can phylogenetic networks be accurately and efficiently inferred using genomic sequence data involving many dozens or hundreds of taxa? In this study, we address this gap by proposing a new phylogenetic divide-and-conquer method which we call FastNet. We conduct a performance study involving a range of evolutionary scenarios, and we demonstrate that FastNet outperforms state-of-the-art methods in terms of computational efficiency and topological accuracy.
- PAR ID: 10105945
- Date Published:
- Journal Name: RECOMB-CG 2018. Lecture Notes in Computer Science
- Volume: 11183
- Page Range / eLocation ID: 242-259
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
The development of statistical methods to infer species phylogenies with reticulations (species networks) has led to many discoveries of gene flow between distinct species. These methods typically assume only incomplete lineage sorting and introgression. Given that phylogenetic networks can be arbitrarily complex, these methods might compensate for model misspecification by increasing the number of dimensions beyond the true value. Herein, we explore the effect of potential model misspecification, including the neglect of gene tree estimation error (GTEE) and the assumption of a single substitution rate for all genomic loci, on the accuracy of phylogenetic network inference using both simulated and biological data. In particular, we assess the accuracy of estimated phylogenetic networks as well as test statistics for determining whether a network is the correct evolutionary history, as opposed to the simpler model that is a tree. We found that while GTEE negatively impacts the performance of test statistics to determine the “treeness” of the evolutionary history of a data set, running those tests on triplets of taxa and correcting for multiple testing significantly ameliorates the problem. We also found that accounting for substitution rate heterogeneity improves the reliability of full Bayesian inference methods of phylogenetic networks, whereas summary statistic methods are robust to GTEE and rate heterogeneity, though they currently require manual inspection to determine the network complexity.
-
Species tree estimation is a basic part of many biological research projects, ranging from answering basic evolutionary questions (e.g., how did a group of species adapt to their environments?) to addressing questions in functional biology. Yet, species tree estimation is very challenging, due to processes such as incomplete lineage sorting, gene duplication and loss, horizontal gene transfer, and hybridization, which can make gene trees differ from each other and from the overall evolutionary history of the species. Over the last 10–20 years, there has been tremendous growth in methods and mathematical theory for estimating species trees and phylogenetic networks, and some of these methods are now in wide use. In this survey, we provide an overview of the current state of the art, identify the limitations of existing methods and theory, and propose additional research problems and directions.
-
Efficiently capturing the long-range patterns in sequential data sources salient to a given task -- such as classification and generative modeling -- poses a fundamental challenge. Popular approaches in the space trade off among the memory burden of brute-force enumeration and comparison, as in transformers, the computational burden of complicated sequential dependencies, as in recurrent neural networks, and the parameter burden of convolutional networks with many or large filters. We instead take inspiration from wavelet-based multiresolution analysis to define a new building block for sequence modeling, which we call a MultiresLayer. The key component of our model is the multiresolution convolution, capturing multiscale trends in the input sequence. Our MultiresConv can be implemented with shared filters across a dilated causal convolution tree. Thus it garners the computational advantages of convolutional networks and the principled theoretical motivation of wavelet decompositions. Our MultiresLayer is straightforward to implement, requires significantly fewer parameters, and maintains at most an O(N log N) memory footprint for a length N sequence. Yet, by stacking such layers, our model yields state-of-the-art performance on a number of sequence classification and autoregressive density estimation tasks using CIFAR-10, ListOps, and PTB-XL datasets.
-
Phylogenomic data from a rapidly increasing number of studies provide new evidence for resolving relationships in recently radiated clades, but they also pose new challenges for inferring evolutionary histories. Most existing methods for reconstructing phylogenetic hypotheses rely solely on algorithms that only consider incomplete lineage sorting (ILS) as a cause of intra- or intergenomic discordance. Here, we utilize a variety of methods, including those to infer phylogenetic networks, to account for both ILS and introgression as a cause for nuclear and cytoplasmic-nuclear discordance using phylogenomic data from the recently radiated flowering plant genus Polemonium (Polemoniaceae), an ecologically diverse genus in Western North America with known and suspected gene flow between species. We find evidence for widespread discordance among nuclear loci that can be explained by both ILS and reticulate evolution in the evolutionary history of Polemonium. Furthermore, the histories of organellar genomes show strong discordance with the inferred species tree from the nuclear genome. Discordance between the nuclear and plastid genome is not completely explained by ILS, and only one case of discordance is explained by detected introgression events. Our results suggest that multiple processes have been involved in the evolutionary history of Polemonium and that the plastid genome does not accurately reflect species relationships. We discuss several potential causes for this cytoplasmic-nuclear discordance, which emerging evidence suggests is more widespread across the Tree of Life than previously thought. [Cyto-nuclear discordance, genomic discordance, phylogenetic networks, plastid capture, Polemoniaceae, Polemonium, reticulations.]

