skip to main content

Attention:

The NSF Public Access Repository (PAR) system and access will be unavailable from 8:00 PM ET on Friday, March 21 until 8:00 AM ET on Saturday, March 22 due to maintenance. We apologize for the inconvenience.


Title: A divide-and-conquer method for scalable phylogenetic network inference from multilocus data
Abstract Motivation

Reticulate evolutionary histories, such as those arising in the presence of hybridization, are best modeled as phylogenetic networks. Recently developed methods allow for statistical inference of phylogenetic networks while also accounting for other processes, such as incomplete lineage sorting. However, these methods can only handle a small number of loci from a handful of genomes.

Results

In this article, we introduce a novel two-step method for scalable inference of phylogenetic networks from the sequence alignments of multiple, unlinked loci. The method infers networks on subproblems and then merges them into a network on the full set of taxa. To reduce the number of trinets to infer, we formulate a Hitting Set version of the problem of finding a small number of subsets, and implement a simple heuristic to solve it. We studied their performance, in terms of both running time and accuracy, on simulated as well as on biological datasets. The two-step method accurately infers phylogenetic networks at a scale that is infeasible with existing methods. The results are a significant and promising step towards accurate, large-scale phylogenetic network inference.

Availability and implementation

We implemented the algorithms in the publicly available software package PhyloNet (https://bioinfocs.rice.edu/PhyloNet).

Supplementary information

Supplementary data are available at Bioinformatics online.

 
more » « less
Award ID(s):
1800723
PAR ID:
10425982
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Bioinformatics
Volume:
35
Issue:
14
ISSN:
1367-4803
Page Range / eLocation ID:
p. i370-i378
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Premise

    Rubiaceae is among the most species‐rich plant families, as well as one of the most morphologically and geographically diverse. Currently available phylogenies have mostly relied on few genomic and plastid loci, as opposed to large‐scale genomic data. Target enrichment provides the ability to generate sequence data for hundreds to thousands of phylogenetically informative, single‐copy loci, which often leads to improved phylogenetic resolution at both shallow and deep taxonomic scales; however, a publicly accessible Rubiaceae‐specific probe set that allows for comparable phylogenetic inference across clades is lacking.

    Methods

    Here, we use publicly accessible genomic resources to identify putatively single‐copy nuclear loci for target enrichment in two Rubiaceae groups: tribe Hillieae (Cinchonoideae) and tribal complex Palicoureeae+Psychotrieae (Rubioideae). We sequenced 2270 exonic regions corresponding to 1059 loci in our target clades and generated in silico target enrichment sequences for other Rubiaceae taxa using our designed probe set. To test the utility of our probe set for phylogenetic inference across Rubiaceae, we performed a coalescent‐aware phylogenetic analysis using a subset of 27 Rubiaceae taxa from 10 different tribes and three subfamilies, and one outgroup in Apocynaceae.

    Results

    We recovered an average of 75% and 84% of targeted exons and loci, respectively, per Rubiaceae sample. Probes designed using genomic resources from a particular subfamily were most efficient at targeting sequences from taxa in that subfamily. The number of paralogs recovered during assembly varied for each clade. Phylogenetic inference of Rubiaceae with our target regions resolves relationships at various scales. Relationships are largely consistent with previous studies of relationships in the family with high support (≥0.98 local posterior probability) at nearly all nodes and evidence of gene tree discordance.

    Discussion

    Our probe set, which we call Rubiaceae2270x, was effective for targeting loci in species across and even outside of Rubiaceae. This probe set will facilitate phylogenomic studies in Rubiaceae and advance systematics and macroevolutionary studies in the family.

     
    more » « less
  2. Abstract Motivation

    A phylogenetic network is a powerful model to represent entangled evolutionary histories with both divergent (speciation) and convergent (e.g. hybridization, reassortment, recombination) evolution. The standard approach to inference of hybridization networks is to (i) reconstruct rooted gene trees and (ii) leverage gene tree discordance for network inference. Recently, we introduced a method called RF-Net for accurate inference of virus reassortment and hybridization networks from input gene trees in the presence of errors commonly found in phylogenetic trees. While RF-Net demonstrated the ability to accurately infer networks with up to four reticulations from erroneous input gene trees, its application was limited by the number of reticulations it could handle in a reasonable amount of time. This limitation is particularly restrictive in the inference of the evolutionary history of segmented RNA viruses such as influenza A virus (IAV), where reassortment is one of the major mechanisms shaping the evolution of these pathogens.

    Results

    Here, we expand the functionality of RF-Net that makes it significantly more applicable in practice. Crucially, we introduce a fast extension to RF-Net, called Fast-RF-Net, that can handle large numbers of reticulations without sacrificing accuracy. In addition, we develop automatic stopping criteria to select the appropriate number of reticulations heuristically and implement a feature for RF-Net to output error-corrected input gene trees. We then conduct a comprehensive study of the original method and its novel extensions and confirm their efficacy in practice using extensive simulation and empirical IAV evolutionary analyses.

    Availability and implementation

    RF-Net 2 is available at https://github.com/flu-crew/rf-net-2.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  3. Abstract Motivation

    Network-based analyses of high-throughput genomics data provide a holistic, systems-level understanding of various biological mechanisms for a common population. However, when estimating multiple networks across heterogeneous sub-populations, varying sample sizes pose a challenge in the estimation and inference, as network differences may be driven by differences in power. We are particularly interested in addressing this challenge in the context of proteomic networks for related cancers, as the number of subjects available for rare cancer (sub-)types is often limited.

    Results

    We develop NExUS (Network Estimation across Unequal Sample sizes), a Bayesian method that enables joint learning of multiple networks while avoiding artefactual relationship between sample size and network sparsity. We demonstrate through simulations that NExUS outperforms existing network estimation methods in this context, and apply it to learn network similarity and shared pathway activity for groups of cancers with related origins represented in The Cancer Genome Atlas (TCGA) proteomic data.

    Availability and implementation

    The NExUS source code is freely available for download at https://github.com/priyamdas2/NExUS.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  4. Abstract Motivation

    Phylogenetic networks represent reticulate evolutionary histories. Statistical methods for their inference under the multispecies coalescent have recently been developed. A particularly powerful approach uses data that consist of bi-allelic markers (e.g. single nucleotide polymorphism data) and allows for exact likelihood computations of phylogenetic networks while numerically integrating over all possible gene trees per marker. While the approach has good accuracy in terms of estimating the network and its parameters, likelihood computations remain a major computational bottleneck and limit the method’s applicability.

    Results

    In this article, we first demonstrate why likelihood computations of networks take orders of magnitude more time when compared to trees. We then propose an approach for inference of phylogenetic networks based on pseudo-likelihood using bi-allelic markers. We demonstrate the scalability and accuracy of phylogenetic network inference via pseudo-likelihood computations on simulated data. Furthermore, we demonstrate aspects of robustness of the method to violations in the underlying assumptions of the employed statistical model. Finally, we demonstrate the application of the method to biological data. The proposed method allows for analyzing larger datasets in terms of the numbers of taxa and reticulation events. While pseudo-likelihood had been proposed before for data consisting of gene trees, the work here uses sequence data directly, offering several advantages as we discuss.

    Availability and implementation

    The methods have been implemented in PhyloNet (http://bioinfocs.rice.edu/phylonet).

     
    more » « less
  5. The reconstruction of phylogenetic networks is an important but challenging problem in phylogenetics and genome evolution, as the space of phylogenetic networks is vast and cannot be sampled well. One approach to the problem is to solve the minimum phylogenetic network problem, in which phylogenetic trees are first inferred, and then the smallest phylogenetic network that displays all the trees is computed. The approach takes advantage of the fact that the theory of phylogenetic trees is mature, and there are excellent tools available for inferring phylogenetic trees from a large number of biomolecular sequences. A tree–child network is a phylogenetic network satisfying the condition that every nonleaf node has at least one child that is of indegree one. Here, we develop a new method that infers the minimum tree–child network by aligning lineage taxon strings in the phylogenetic trees. This algorithmic innovation enables us to get around the limitations of the existing programs for phylogenetic network inference. Our new program, named ALTS, is fast enough to infer a tree–child network with a large number of reticulations for a set of up to 50 phylogenetic trees with 50 taxa that have only trivial common clusters in about a quarter of an hour on average.

     
    more » « less