Title: Phylogenetic inference using generative adversarial networks
Abstract Motivation: The application of machine learning approaches in phylogenetics has been impeded by the vast model space associated with inference. Supervised machine learning approaches require data from across this space to train models. Because of this, previous approaches have typically been limited to inferring relationships among unrooted quartets of taxa, where there are only three possible topologies. Here, we explore the potential of generative adversarial networks (GANs) to address this limitation. GANs consist of a generator and a discriminator: at each step, the generator aims to create data that is similar to real data, while the discriminator attempts to distinguish generated and real data. By using an evolutionary model as the generator, we use GANs to make evolutionary inferences. Since a new model can be considered at each iteration, heuristic searches of complex model spaces are possible. Thus, GANs offer a potential solution to the challenges of applying machine learning in phylogenetics. Results: We developed phyloGAN, a GAN that infers phylogenetic relationships among species. phyloGAN takes as input a concatenated alignment, or a set of gene alignments, and infers a phylogenetic tree either considering or ignoring gene tree heterogeneity. We explored the performance of phyloGAN for up to 15 taxa in the concatenation case and 6 taxa when considering gene tree heterogeneity. Error rates are relatively low in these simple cases. However, run times are slow and performance metrics suggest issues during training. Future work should explore novel architectures that may result in more stable and efficient GANs for phylogenetics. Availability and implementation: phyloGAN is available on GitHub: https://github.com/meganlsmith/phyloGAN/.
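Illustrative note (not part of the published abstract): the "vast model space" mentioned above can be made concrete by counting tree topologies. The sketch below uses the standard double-factorial count of unrooted, fully bifurcating topologies on n labeled taxa, (2n - 5)!!. The function name `num_unrooted_topologies` is a hypothetical helper chosen for this example and is not taken from phyloGAN.

```python
def num_unrooted_topologies(n_taxa: int) -> int:
    """Count unrooted, fully bifurcating topologies on n_taxa labeled tips.

    Uses the standard formula (2n - 5)!! for n >= 3. This is an
    illustrative helper, not part of the phyloGAN code base.
    """
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd factors 3, 5, ..., 2n - 5 (empty product for n = 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    # Taxon counts mentioned in the abstract: quartets, 6 taxa, 15 taxa.
    for n in (4, 6, 15):
        print(n, num_unrooted_topologies(n))
```

For four taxa the function returns 3, matching the quartet case described in the abstract; for 15 taxa it returns 7,905,853,580,625, which suggests why training data cannot easily cover every topology.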
Award ID(s):
1936187
PAR ID:
10462820
Author(s) / Creator(s):
; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Bioinformatics
Volume:
39
Issue:
9
ISSN:
1367-4811
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Motivation: The abundance of gene flow in the Tree of Life challenges the notion that evolution can be represented with a fully bifurcating process which cannot capture important biological realities like hybridization, introgression, or horizontal gene transfer. Coalescent-based network methods are increasingly popular, yet not scalable for big data, because they need to perform a heuristic search in the space of networks as well as numerical optimization that can be NP-hard. Here, we introduce a novel method to reconstruct phylogenetic networks based on algebraic invariants. While there is a long tradition of using algebraic invariants in phylogenetics, our work is the first to define phylogenetic invariants on concordance factors (frequencies of four-taxon splits in the input gene trees) to identify level-1 phylogenetic networks under the multispecies coalescent model. Results: Our novel hybrid detection methodology is optimization-free as it only requires the evaluation of polynomial equations, and as such, it bypasses the traversal of network space, yielding a computational speed at least 10 times faster than the fastest-to-date network methods. We illustrate our method’s performance on simulated and real data from the genus Canis. Availability and implementation: We present an open-source publicly available Julia package PhyloDiamond.jl available at https://github.com/solislemuslab/PhyloDiamond.jl with broad applicability within the evolutionary community.
  2. Davalos, Liliana (Ed.)
    Abstract African cichlids (subfamily: Pseudocrenilabrinae) are among the most diverse vertebrates, and their propensity for repeated rapid radiation has made them a celebrated model system in evolutionary research. Nonetheless, despite numerous studies, phylogenetic uncertainty persists, and riverine lineages remain comparatively underrepresented in higher-level phylogenetic studies. Heterogeneous gene histories resulting from incomplete lineage sorting (ILS) and hybridization are likely sources of uncertainty, especially during episodes of rapid speciation. We investigate the relationships of Pseudocrenilabrinae and its close relatives while accounting for multiple sources of genetic discordance using species tree and hybrid network analyses with hundreds of single-copy exons. We improve sequence recovery for distant relatives, thereby extending the taxonomic reach of our probes, with a hybrid reference guided/de novo assembly approach. Our analyses provide robust hypotheses for most higher-level relationships and reveal widespread gene heterogeneity, including in riverine taxa. ILS and past hybridization are identified as the sources of genetic discordance in different lineages. Sampling of various Blenniiformes (formerly Ovalentaria) adds strong phylogenomic support for convict blennies (Pholidichthyidae) as sister to Cichlidae and points to other potentially useful protein-coding markers across the order. A reliable phylogeny with representatives from diverse environments will support ongoing taxonomic and comparative evolutionary research in the cichlid model system. [African cichlids; Blenniiformes; Gene tree heterogeneity; Hybrid assembly; Phylogenetic network; Pseudocrenilabrinae; Species tree.] 
  3. Abstract Premise: To date, phylogenetic relationships within the monogeneric Brunelliaceae have been based on morphological evidence, which does not provide sufficient phylogenetic resolution. Here we use target‐enriched nuclear data to improve our understanding of phylogenetic relationships in the family. Methods: We used the Angiosperms353 toolkit for targeted recovery of exonic regions and supercontigs (exons + introns) from low copy nuclear genes from 53 of 70 species in Brunellia, and several outgroup taxa. We removed loci that indicated biased inference of relationships and applied concatenated and coalescent methods to infer Brunellia phylogeny. We identified conflicts among gene trees that may reflect hybridization or incomplete lineage sorting events and assessed their impact on phylogenetic inference. Finally, we performed ancestral‐state reconstructions of morphological traits and assessed the homology of character states used to define sections and subsections in Brunellia. Results: Brunellia comprises two major clades and several subclades. Most of these clades/subclades do not correspond to previous infrageneric taxa. There is high topological incongruence among the subclades across analyses. Conclusions: Phylogenetic reconstructions point to rapid species diversification in Brunelliaceae, reflected in very short branches between successive species splits. The removal of putatively biased loci slightly improves phylogenetic support for individual clades. Reticulate evolution due to hybridization and/or incomplete lineage sorting likely both contribute to gene‐tree discordance. Morphological characters used to define taxa in current classification schemes are homoplastic in the ancestral character‐state reconstructions. While target enrichment data allows us to broaden our understanding of diversification in Brunellia, the relationships among subclades remain incompletely understood.
  4. Premise: Large genomic data sets offer the promise of resolving historically recalcitrant species relationships. However, different methodologies can yield conflicting results, especially when clades have experienced ancient, rapid diversification. Here, we analyzed the ancient radiation of Ericales and explored sources of uncertainty related to species tree inference, conflicting gene tree signal, and the inferred placement of gene and genome duplications. Methods: We used a hierarchical clustering approach, with tree‐based homology and orthology detection, to generate six filtered phylogenomic matrices consisting of data from 97 transcriptomes and genomes. Support for species relationships was inferred from multiple lines of evidence including shared gene duplications, gene tree conflict, gene‐wise edge‐based analyses, concatenation, and coalescent‐based methods, and is summarized in a consensus framework. Results: Our consensus approach supported a topology largely concordant with previous studies, but suggests that the data are not capable of resolving several ancient relationships because of lack of informative characters, sensitivity to methodology, and extensive gene tree conflict correlated with paleopolyploidy. We found evidence of a whole‐genome duplication before the radiation of all or most ericalean families, and demonstrate that tree topology and heterogeneous evolutionary rates affect the inferred placement of genome duplications. Conclusions: We provide several hypotheses regarding the history of Ericales, and confidently resolve most nodes, but demonstrate that a series of ancient divergences are unresolvable with these data. Whether paleopolyploidy is a major source of the observed phylogenetic conflict warrants further investigation.
  5. Abstract Premise: Reticulate evolution, often accompanied by polyploidy, is prevalent in plants, and particularly in the ferns. Resolving the resulting non‐bifurcating histories remains a major challenge for plant phylogenetics. Here, we present a phylogenomic investigation into the complex evolutionary history of the vining ferns, Lygodium (Lygodiaceae, Schizaeales). Methods: Using a targeted enrichment approach with the GoFlag 408 flagellate land plant probe set, we generated large nuclear and plastid sequence datasets for nearly all taxa in the genus and constructed the most comprehensive phylogeny of the family to date using concatenated maximum likelihood and coalescence approaches. We integrated this phylogeny with cytological and spore data to explore karyotype evolution and generate hypotheses about the origins of putative polyploids and hybrids. Results: Our data and analyses support the origins of several putative allopolyploids (e.g., L. cubense, L. heterodoxum) and hybrids (e.g., L. ×fayae) and also highlight the potential prevalence of autopolyploidy in this clade (e.g., L. articulatum, L. flexuosum, and L. longifolium). Conclusions: Our robust phylogenetic framework provides valuable insights into dynamic reticulate evolution in this clade and demonstrates the utility of target‐capture data for resolving these complex relationships.