Title: Phylogenetic inference using generative adversarial networks
The application of machine learning approaches in phylogenetics has been impeded by the vast model space associated with inference. Supervised machine learning approaches require data from across this space to train models. Because of this, previous approaches have typically been limited to inferring relationships among unrooted quartets of taxa, where there are only three possible topologies. Here, we explore the potential of generative adversarial networks (GANs) to address this limitation. GANs consist of a generator and a discriminator: at each step, the generator aims to create data that is similar to real data, while the discriminator attempts to distinguish generated and real data. By using an evolutionary model as the generator, we use GANs to make evolutionary inferences. Since a new model can be considered at each iteration, heuristic searches of complex model spaces are possible. Thus, GANs offer a potential solution to the challenges of applying machine learning in phylogenetics. Results: We developed phyloGAN, a GAN that infers phylogenetic relationships among species. phyloGAN takes as input a concatenated alignment, or a set of gene alignments, and infers a phylogenetic tree either considering or ignoring gene tree heterogeneity. We explored the performance of phyloGAN for up to 15 taxa in the concatenation case and 6 taxa when considering gene tree heterogeneity. Error rates are relatively low in these simple cases. However, run times are slow and performance metrics suggest issues during training. Future work should explore novel architectures that may result in more stable and efficient GANs for phylogenetics.
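To make the scale of the model space concrete, the short Python sketch below counts the unrooted, fully bifurcating tree topologies for a given number of labeled taxa, using the standard (2n-5)!! formula. It is an illustration added to this record, not part of phyloGAN, and the function name num_unrooted_topologies is hypothetical.

```python
def num_unrooted_topologies(n_taxa: int) -> int:
    """Count unrooted, fully bifurcating topologies on n labeled taxa.

    Uses the standard double-factorial formula (2n - 5)!!.
    The function name is illustrative, not taken from phyloGAN.
    """
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd factors 3, 5, ..., 2n - 5 (empty product for n = 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (4, 6, 15):
        print(n, num_unrooted_topologies(n))
```

For four taxa this returns the three quartet topologies mentioned in the abstract; for 15 taxa, the largest concatenation case explored, it returns roughly 7.9 × 10^12 topologies, which is why exhaustive training data across the space is impractical.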
Award ID(s):
1936187
PAR ID:
10510511
Author(s) / Creator(s):
;
Editor(s):
Schwartz, Russell
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Bioinformatics
Volume:
39
Issue:
9
ISSN:
1367-4811
Page Range / eLocation ID:
btad543
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Davalos, Liliana (Ed.)
    African cichlids (subfamily: Pseudocrenilabrinae) are among the most diverse vertebrates, and their propensity for repeated rapid radiation has made them a celebrated model system in evolutionary research. Nonetheless, despite numerous studies, phylogenetic uncertainty persists, and riverine lineages remain comparatively underrepresented in higher-level phylogenetic studies. Heterogeneous gene histories resulting from incomplete lineage sorting (ILS) and hybridization are likely sources of uncertainty, especially during episodes of rapid speciation. We investigate the relationships of Pseudocrenilabrinae and its close relatives while accounting for multiple sources of genetic discordance using species tree and hybrid network analyses with hundreds of single-copy exons. We improve sequence recovery for distant relatives, thereby extending the taxonomic reach of our probes, with a hybrid reference guided/de novo assembly approach. Our analyses provide robust hypotheses for most higher-level relationships and reveal widespread gene heterogeneity, including in riverine taxa. ILS and past hybridization are identified as the sources of genetic discordance in different lineages. Sampling of various Blenniiformes (formerly Ovalentaria) adds strong phylogenomic support for convict blennies (Pholidichthyidae) as sister to Cichlidae and points to other potentially useful protein-coding markers across the order. A reliable phylogeny with representatives from diverse environments will support ongoing taxonomic and comparative evolutionary research in the cichlid model system. [African cichlids; Blenniiformes; Gene tree heterogeneity; Hybrid assembly; Phylogenetic network; Pseudocrenilabrinae; Species tree.]
  2. Knowledge of the internal phylogeny and evolutionary history of ants (Formicidae), the world’s most species-rich clade of eusocial organisms, has dramatically improved since the advent of molecular phylogenetics. A number of relationships at the subfamily level, however, remain uncertain. Key unresolved issues include placement of the root of the ant tree of life and the relationships among the so-called poneroid subfamilies. Here we assemble a new data set to attempt a resolution of these two problems and carry out divergence dating, focusing on the age of the root node of crown Formicidae. For the phylogenetic analyses we included data from 110 ant species, including the key species Martialis heureka. We focused taxon sampling on non-formicoid lineages of ants to gain insight about deep nodes in the ant phylogeny. For divergence dating we retained a subset of 62 extant taxa and 42 fossils in order to approximate diversified sampling in the context of the fossilized birth-death process. We sequenced 11 nuclear gene fragments for a total of ∼7.5 kb and investigated the DNA sequence data for the presence of among-taxon compositional heterogeneity, a property known to mislead phylogenetic inference, and for its potential to affect the rooting of the ant phylogeny. We found sequences of the Leptanillinae and several outgroup taxa to be rich in adenine and thymine (51% average AT content) compared to the remaining ants (45% average). To investigate whether this heterogeneity could bias phylogenetic inference we performed outgroup removal experiments, analysis of compositionally homogeneous sites, and a simulation study. We found that compositional heterogeneity indeed appears to affect the placement of the root of the ant tree but has limited impact on more recent nodes. Our findings have implications for outgroup choice in phylogenetics, which should be made not only on the basis of close relationship to the ingroup, but should also take into account sequence divergence and other properties relative to the ingroup. We put forward a hypothesis regarding the rooting of the ant phylogeny, in which Martialis and the Leptanillinae together constitute a clade that is sister to all other ants. After correcting for compositional heterogeneity this emerges as the best-supported hypothesis of relationships at deep nodes in the ant tree. The results of our divergence dating under the fossilized birth-death process and diversified sampling suggest that the crown Formicidae originated during the Albian or Aptian ages of the Lower Cretaceous (103–124 Ma). In addition, we found support for monophyletic poneroids comprising the subfamilies Agroecomyrmecinae, Amblyoponinae, Apomyrminae, Paraponerinae, Ponerinae, and Proceratiinae, and well-supported relationships among these subfamilies except for the placement of Proceratiinae and (Amblyoponinae+Apomyrminae). Our phylogeny also highlights the non-monophyly of several ant genera, including Protanilla and Leptanilla in the Leptanillinae, Proceratium in the Proceratiinae, and Cryptopone, Euponera, and Mesoponera within the Ponerinae.
  3. The development of statistical methods to infer species phylogenies with reticulations (species networks) has led to many discoveries of gene flow between distinct species. These methods typically assume only incomplete lineage sorting and introgression. Given that phylogenetic networks can be arbitrarily complex, these methods might compensate for model misspecification by increasing the number of dimensions beyond the true value. Herein, we explore the effect of potential model misspecification, including the neglect of gene tree estimation error (GTEE) and the assumption of a single substitution rate for all genomic loci, on the accuracy of phylogenetic network inference using both simulated and biological data. In particular, we assess the accuracy of estimated phylogenetic networks as well as test statistics for determining whether a network is the correct evolutionary history, as opposed to the simpler model that is a tree. We found that while GTEE negatively impacts the performance of test statistics to determine the “treeness” of the evolutionary history of a data set, running those tests on triplets of taxa and correcting for multiple testing significantly ameliorates the problem. We also found that accounting for substitution rate heterogeneity improves the reliability of full Bayesian inference methods of phylogenetic networks, whereas summary statistic methods are robust to GTEE and rate heterogeneity, though they currently require manual inspection to determine the network complexity.
  4. Premise: Reticulate evolution, often accompanied by polyploidy, is prevalent in plants, and particularly in the ferns. Resolving the resulting non‐bifurcating histories remains a major challenge for plant phylogenetics. Here, we present a phylogenomic investigation into the complex evolutionary history of the vining ferns, Lygodium (Lygodiaceae, Schizaeales). Methods: Using a targeted enrichment approach with the GoFlag 408 flagellate land plant probe set, we generated large nuclear and plastid sequence datasets for nearly all taxa in the genus and constructed the most comprehensive phylogeny of the family to date using concatenated maximum likelihood and coalescence approaches. We integrated this phylogeny with cytological and spore data to explore karyotype evolution and generate hypotheses about the origins of putative polyploids and hybrids. Results: Our data and analyses support the origins of several putative allopolyploids (e.g., L. cubense, L. heterodoxum) and hybrids (e.g., L. × fayae) and also highlight the potential prevalence of autopolyploidy in this clade (e.g., L. articulatum, L. flexuosum, and L. longifolium). Conclusions: Our robust phylogenetic framework provides valuable insights into dynamic reticulate evolution in this clade and demonstrates the utility of target‐capture data for resolving these complex relationships.
  5. Machine learning has increasingly been applied to a wide range of questions in phylogenetic inference. Supervised machine learning approaches that rely on simulated training data have been used to infer tree topologies and branch lengths, to select substitution models, and to perform downstream inferences of introgression and diversification. Here, we review how researchers have used several promising machine learning approaches to make phylogenetic inferences. Despite the promise of these methods, several barriers prevent supervised machine learning from reaching its full potential in phylogenetics. We discuss these barriers and potential paths forward. In the future, we expect that the application of careful network designs and data encodings will allow supervised machine learning to accommodate the complex processes that continue to confound traditional phylogenetic methods. 