skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: A divide-and-conquer method for scalable phylogenetic network inference from multilocus data
Abstract MotivationReticulate evolutionary histories, such as those arising in the presence of hybridization, are best modeled as phylogenetic networks. Recently developed methods allow for statistical inference of phylogenetic networks while also accounting for other processes, such as incomplete lineage sorting. However, these methods can only handle a small number of loci from a handful of genomes. ResultsIn this article, we introduce a novel two-step method for scalable inference of phylogenetic networks from the sequence alignments of multiple, unlinked loci. The method infers networks on subproblems and then merges them into a network on the full set of taxa. To reduce the number of trinets to infer, we formulate a Hitting Set version of the problem of finding a small number of subsets, and implement a simple heuristic to solve it. We studied their performance, in terms of both running time and accuracy, on simulated as well as on biological datasets. The two-step method accurately infers phylogenetic networks at a scale that is infeasible with existing methods. The results are a significant and promising step towards accurate, large-scale phylogenetic network inference. Availability and implementationWe implemented the algorithms in the publicly available software package PhyloNet (https://bioinfocs.rice.edu/PhyloNet). Supplementary informationSupplementary data are available at Bioinformatics online.  more » « less
Award ID(s):
1800723
PAR ID:
10425982
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Bioinformatics
Volume:
35
Issue:
14
ISSN:
1367-4803
Page Range / eLocation ID:
p. i370-i378
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract PremiseRubiaceae is among the most species‐rich plant families, as well as one of the most morphologically and geographically diverse. Currently available phylogenies have mostly relied on few genomic and plastid loci, as opposed to large‐scale genomic data. Target enrichment provides the ability to generate sequence data for hundreds to thousands of phylogenetically informative, single‐copy loci, which often leads to improved phylogenetic resolution at both shallow and deep taxonomic scales; however, a publicly accessible Rubiaceae‐specific probe set that allows for comparable phylogenetic inference across clades is lacking. MethodsHere, we use publicly accessible genomic resources to identify putatively single‐copy nuclear loci for target enrichment in two Rubiaceae groups: tribe Hillieae (Cinchonoideae) and tribal complex Palicoureeae+Psychotrieae (Rubioideae). We sequenced 2270 exonic regions corresponding to 1059 loci in our target clades and generated in silico target enrichment sequences for other Rubiaceae taxa using our designed probe set. To test the utility of our probe set for phylogenetic inference across Rubiaceae, we performed a coalescent‐aware phylogenetic analysis using a subset of 27 Rubiaceae taxa from 10 different tribes and three subfamilies, and one outgroup in Apocynaceae. ResultsWe recovered an average of 75% and 84% of targeted exons and loci, respectively, per Rubiaceae sample. Probes designed using genomic resources from a particular subfamily were most efficient at targeting sequences from taxa in that subfamily. The number of paralogs recovered during assembly varied for each clade. Phylogenetic inference of Rubiaceae with our target regions resolves relationships at various scales. Relationships are largely consistent with previous studies of relationships in the family with high support (≥0.98 local posterior probability) at nearly all nodes and evidence of gene tree discordance. DiscussionOur probe set, which we call Rubiaceae2270x, was effective for targeting loci in species across and even outside of Rubiaceae. This probe set will facilitate phylogenomic studies in Rubiaceae and advance systematics and macroevolutionary studies in the family. 
    more » « less
  2. The reconstruction of phylogenetic networks is an important but challenging problem in phylogenetics and genome evolution, as the space of phylogenetic networks is vast and cannot be sampled well. One approach to the problem is to solve the minimum phylogenetic network problem, in which phylogenetic trees are first inferred, and then the smallest phylogenetic network that displays all the trees is computed. The approach takes advantage of the fact that the theory of phylogenetic trees is mature, and there are excellent tools available for inferring phylogenetic trees from a large number of biomolecular sequences. A tree–child network is a phylogenetic network satisfying the condition that every nonleaf node has at least one child that is of indegree one. Here, we develop a new method that infers the minimum tree–child network by aligning lineage taxon strings in the phylogenetic trees. This algorithmic innovation enables us to get around the limitations of the existing programs for phylogenetic network inference. Our new program, named ALTS, is fast enough to infer a tree–child network with a large number of reticulations for a set of up to 50 phylogenetic trees with 50 taxa that have only trivial common clusters in about a quarter of an hour on average. 
    more » « less
  3. Abstract MotivationProtein function prediction, based on the patterns of connection in a protein–protein interaction (or association) network, is perhaps the most studied of the classical, fundamental inference problems for biological networks. A highly successful set of recent approaches use random walk-based low-dimensional embeddings that tend to place functionally similar proteins into coherent spatial regions. However, these approaches lose valuable local graph structure from the network when considering only the embedding. We introduce GLIDER, a method that replaces a protein–protein interaction or association network with a new graph-based similarity network. GLIDER is based on a variant of our previous GLIDE method, which was designed to predict missing links in protein–protein association networks, capturing implicit local and global (i.e. embedding-based) graph properties. ResultsGLIDER outperforms competing methods on the task of predicting GO functional labels in cross-validation on a heterogeneous collection of four human protein–protein association networks derived from the 2016 DREAM Disease Module Identification Challenge, and also on three different protein–protein association networks built from the STRING database. We show that this is due to the strong functional enrichment that is present in the local GLIDER neighborhood in multiple different types of protein–protein association networks. Furthermore, we introduce the GLIDER graph neighborhood as a way for biologists to visualize the local neighborhood of a disease gene. As an application, we look at the local GLIDER neighborhoods of a set of known Parkinson’s Disease GWAS genes, rediscover many genes which have known involvement in Parkinson’s disease pathways, plus suggest some new genes to study. Availability and implementationAll code is publicly available and can be accessed here: https://github.com/kap-devkota/GLIDER. Supplementary informationSupplementary data are available at Bioinformatics online. 
    more » « less
  4. Abstract MotivationReconstruction of genome-scale networks from gene expression data is an actively studied problem. A wide range of methods that differ between the types of interactions they uncover with varying trade-offs between sensitivity and specificity have been proposed. To leverage benefits of multiple such methods, ensemble network methods that combine predictions from resulting networks have been developed, promising results better than or as good as the individual networks. Perhaps owing to the difficulty in obtaining accurate training examples, these ensemble methods hitherto are unsupervised. ResultsIn this article, we introduce EnGRaiN, the first supervised ensemble learning method to construct gene networks. The supervision for training is provided by small training datasets of true edge connections (positives) and edges known to be absent (negatives) among gene pairs. We demonstrate the effectiveness of EnGRaiN using simulated datasets as well as a curated collection of Arabidopsis thaliana datasets we created from microarray datasets available from public repositories. EnGRaiN shows better results not only in terms of receiver operating characteristic and PR characteristics for both real and simulated datasets compared with unsupervised methods for ensemble network construction, but also generates networks that can be mined for elucidating complex biological interactions. Availability and implementationEnGRaiN software and the datasets used in the study are publicly available at the github repository: https://github.com/AluruLab/EnGRaiN. Supplementary informationSupplementary data are available at Bioinformatics online. 
    more » « less
  5. Abstract MotivationThe application of machine learning approaches in phylogenetics has been impeded by the vast model space associated with inference. Supervised machine learning approaches require data from across this space to train models. Because of this, previous approaches have typically been limited to inferring relationships among unrooted quartets of taxa, where there are only three possible topologies. Here, we explore the potential of generative adversarial networks (GANs) to address this limitation. GANs consist of a generator and a discriminator: at each step, the generator aims to create data that is similar to real data, while the discriminator attempts to distinguish generated and real data. By using an evolutionary model as the generator, we use GANs to make evolutionary inferences. Since a new model can be considered at each iteration, heuristic searches of complex model spaces are possible. Thus, GANs offer a potential solution to the challenges of applying machine learning in phylogenetics. ResultsWe developed phyloGAN, a GAN that infers phylogenetic relationships among species. phyloGAN takes as input a concatenated alignment, or a set of gene alignments, and infers a phylogenetic tree either considering or ignoring gene tree heterogeneity. We explored the performance of phyloGAN for up to 15 taxa in the concatenation case and 6 taxa when considering gene tree heterogeneity. Error rates are relatively low in these simple cases. However, run times are slow and performance metrics suggest issues during training. Future work should explore novel architectures that may result in more stable and efficient GANs for phylogenetics. Availability and implementationphyloGAN is available on github: https://github.com/meganlsmith/phyloGAN/. 
    more » « less