skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Novel symmetry-preserving neural network model for phylogenetic inference
Abstract Scientists world-wide are putting together massive efforts to understand how the biodiversity that we see on Earth evolved from single-cell organisms at the origin of life and this diversification process is represented through the Tree of Life. Low sampling rates and high heterogeneity in the rate of evolution across sites and lineages produce a phenomenon denoted “long branch attraction” (LBA) in which long non-sister lineages are estimated to be sisters regardless of their true evolutionary relationship. LBA has been a pervasive problem in phylogenetic inference affecting different types of methodologies from distance-based to likelihood-based. Here, we present a novel neural network model that outperforms standard phylogenetic methods and other neural network implementations under LBA settings. Furthermore, unlike existing neural network models in phylogenetics, our model naturally accounts for the tree isomorphisms via permutation invariant functions which ultimately result in lower memory and allows the seamless extension to larger trees.  more » « less
Award ID(s):
2144367 2012292
PAR ID:
10501229
Author(s) / Creator(s):
; ; ; ;
Editor(s):
Ouangraoua, Aida
Publisher / Repository:
Bioinformatics Advances
Date Published:
Journal Name:
Bioinformatics Advances
ISSN:
2635-0041
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Pupko, Tal (Ed.)
    Abstract Long-branch attraction is a systematic artifact that results in erroneous groupings of fast-evolving taxa. The combination of short, deep internodes in tandem with LBA artifacts has produced empirically intractable parts of the Tree of Life. One such group is the arthropod subphylum Chelicerata, whose backbone phylogeny has remained unstable despite improvements in phylogenetic methods and genome-scale datasets. Pseudoscorpion placement is particularly variable across datasets and analytical frameworks, with this group either clustering with other long-branch orders or with Arachnopulmonata (scorpions and tetrapulmonates). To surmount LBA, we investigated the effect of taxonomic sampling via sequential deletion of basally branching pseudoscorpion superfamilies, as well as varying gene occupancy thresholds in supermatrices. We show that concatenated supermatrices and coalescent-based summary species tree approaches support a sister group relationship of pseudoscorpions and scorpions, when more of the basally branching taxa are sampled. Matrix completeness had demonstrably less influence on tree topology. As an external arbiter of phylogenetic placement, we leveraged the recent discovery of an ancient genome duplication in the common ancestor of Arachnopulmonata as a litmus test for competing hypotheses of pseudoscorpion relationships. We generated a high-quality developmental transcriptome and the first genome for pseudoscorpions to assess the incidence of arachnopulmonate-specific duplications (e.g., homeobox genes and miRNAs). Our results support the inclusion of pseudoscorpions in Arachnopulmonata (new definition), as the sister group of scorpions. Panscorpiones (new name) is proposed for the clade uniting Scorpiones and Pseudoscorpiones. 
    more » « less
  2. Abstract AimAmazonia hosts more tree species from numerous evolutionary lineages, both young and ancient, than any other biogeographic region. Previous studies have shown that tree lineages colonized multiple edaphic environments and dispersed widely across Amazonia, leading to a hypothesis, which we test, that lineages should not be strongly associated with either geographic regions or edaphic forest types. LocationAmazonia. TaxonAngiosperms (Magnoliids; Monocots; Eudicots). MethodsData for the abundance of 5082 tree species in 1989 plots were combined with a mega‐phylogeny. We applied evolutionary ordination to assess how phylogenetic composition varies across Amazonia. We used variation partitioning and Moran's eigenvector maps (MEM) to test and quantify the separate and joint contributions of spatial and environmental variables to explain the phylogenetic composition of plots. We tested the indicator value of lineages for geographic regions and edaphic forest types and mapped associations onto the phylogeny. ResultsIn the terra firme and várzea forest types, the phylogenetic composition varies by geographic region, but the igapó and white‐sand forest types retain a unique evolutionary signature regardless of region. Overall, we find that soil chemistry, climate and topography explain 24% of the variation in phylogenetic composition, with 79% of that variation being spatially structured (R2 = 19% overall for combined spatial/environmental effects). The phylogenetic composition also shows substantial spatial patterns not related to the environmental variables we quantified (R2 = 28%). A greater number of lineages were significant indicators of geographic regions than forest types. Main ConclusionNumerous tree lineages, including some ancient ones (>66 Ma), show strong associations with geographic regions and edaphic forest types of Amazonia. This shows that specialization in specific edaphic environments has played a long‐standing role in the evolutionary assembly of Amazonian forests. Furthermore, many lineages, even those that have dispersed across Amazonia, dominate within a specific region, likely because of phylogenetically conserved niches for environmental conditions that are prevalent within regions. 
    more » « less
  3. Abstract MotivationThe abundance of gene flow in the Tree of Life challenges the notion that evolution can be represented with a fully bifurcating process which cannot capture important biological realities like hybridization, introgression, or horizontal gene transfer. Coalescent-based network methods are increasingly popular, yet not scalable for big data, because they need to perform a heuristic search in the space of networks as well as numerical optimization that can be NP-hard. Here, we introduce a novel method to reconstruct phylogenetic networks based on algebraic invariants. While there is a long tradition of using algebraic invariants in phylogenetics, our work is the first to define phylogenetic invariants on concordance factors (frequencies of four-taxon splits in the input gene trees) to identify level-1 phylogenetic networks under the multispecies coalescent model. ResultsOur novel hybrid detection methodology is optimization-free as it only requires the evaluation of polynomial equations, and as such, it bypasses the traversal of network space, yielding a computational speed at least 10 times faster than the fastest-to-date network methods. We illustrate our method’s performance on simulated and real data from the genus Canis. Availability and implementationWe present an open-source publicly available Julia package PhyloDiamond.jl available at https://github.com/solislemuslab/PhyloDiamond.jl with broad applicability within the evolutionary community. 
    more » « less
  4. Burbrink, Frank (Ed.)
    Abstract In cryptic amphibian complexes, there is a growing trend to equate high levels of genetic structure with hidden cryptic species diversity. Typically, phylogenetic structure and distance-based approaches are used to demonstrate the distinctness of clades and justify the recognition of new cryptic species. However, this approach does not account for gene flow, spatial, and environmental processes that can obfuscate phylogenetic inference and bias species delimitation. As a case study, we sequenced genome-wide exons and introns to evince the processes that underlie the diversification of Philippine Puddle Frogs—a group that is widespread, phenotypically conserved, and exhibits high levels of geographically based genetic structure. We showed that widely adopted tree- and distance-based approaches inferred up to 20 species, compared to genomic analyses that inferred an optimal number of five distinct genetic groups. Using a suite of clustering, admixture, and phylogenetic network analyses, we demonstrate extensive admixture among the five groups and elucidate two specific ways in which gene flow can cause overestimations of species diversity: 1) admixed populations can be inferred as distinct lineages characterized by long branches in phylograms; and 2) admixed lineages can appear to be genetically divergent, even from their parental populations when simple measures of genetic distance are used. We demonstrate that the relationship between mitochondrial and genome-wide nuclear $$p$$-distances is decoupled in admixed clades, leading to erroneous estimates of genetic distances and, consequently, species diversity. Additionally, genetic distance was also biased by spatial and environmental processes. Overall, we showed that high levels of genetic diversity in Philippine Puddle Frogs predominantly comprise metapopulation lineages that arose through complex patterns of admixture, isolation-by-distance, and isolation-by-environment as opposed to species divergence. Our findings suggest that speciation may not be the major process underlying the high levels of hidden diversity observed in many taxonomic groups and that widely adopted tree- and distance-based methods overestimate species diversity in the presence of gene flow. [Cryptic species; gene flow; introgression; isolation-by-distance; isolation-by-environment; phylogenetic network; species delimitation.] 
    more » « less
  5. Abstract Placing new sequences onto reference phylogenies is increasingly used for analyzing environmental samples, especially microbiomes. Existing placement methods assume that query sequences have evolved under specific models directly on the reference phylogeny. For example, they assume single-gene data (e.g., 16S rRNA amplicons) have evolved under the GTR model on a gene tree. Placement, however, often has a more ambitious goal: extending a (genome-wide) species tree given data from individual genes without knowing the evolutionary model. Addressing this challenging problem requires new directions. Here, we introduce Deep-learning Enabled Phylogenetic Placement (DEPP), an algorithm that learns to extend species trees using single genes without prespecified models. In simulations and on real data, we show that DEPP can match the accuracy of model-based methods without any prior knowledge of the model. We also show that DEPP can update the multilocus microbial tree-of-life with single genes with high accuracy. We further demonstrate that DEPP can combine 16S and metagenomic data onto a single tree, enabling community structure analyses that take advantage of both sources of data. [Deep learning; gene tree discordance; metagenomics; microbiome analyses; neural networks; phylogenetic placement.] 
    more » « less