skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Identifiability of Level-1 Species Networks from Gene Tree Quartets
Abstract When hybridization or other forms of lateral gene transfer have occurred, evolutionary relationships of species are better represented by phylogenetic networks than by trees. While inference of such networks remains challenging, several recently proposed methods are based on quartet concordance factors—the probabilities that a tree relating a gene sampled from the species displays the possible 4-taxon relationships. Building on earlier results, we investigate what level-1 network features are identifiable from concordance factors under the network multispecies coalescent model. We obtain results on both topological features of the network, and numerical parameters, uncovering a number of failures of identifiability related to 3-cycles in the network. Addressing these identifiability issues is essential for designing statistically consistent inference methods.  more » « less
Award ID(s):
2051760 2331660
PAR ID:
10526649
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Springer Science + Business Media
Date Published:
Journal Name:
Bulletin of Mathematical Biology
Volume:
86
Issue:
9
ISSN:
0092-8240
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Reticulations in a phylogenetic network represent processes such as gene flow, admixture, recombination and hybrid speciation. Extending definitions from the tree setting, an anomalous network is one in which some unrooted tree topology displayed in the network appears in gene trees with a lower frequency than a tree not displayed in the network. We investigate anomalous networks under the Network Multispecies Coalescent Model with possible correlated inheritance at reticulations. Focusing on subsets of 4 taxa, we describe a new algorithm to calculate quartet concordance factors on networks of any level, faster than previous algorithms because of its focus on 4 taxa. We then study topological properties required for a 4-taxon network to be anomalous, uncovering the key role of 32-cycles: cycles of 3 edges parent to a sister group of 2 taxa. Under the model of common inheritance, that is, when each gene tree coalesces within a species tree displayed in the network, we prove that 4-taxon networks are never anomalous. Under independent and various levels of correlated inheritance, we use simulations under realistic parameters to quantify the prevalence of anomalous 4-taxon networks, finding that truly anomalous networks are rare. At the same time, however, we find a significant fraction of networks close enough to the anomaly zone to appear anomalous, when considering the quartet concordance factors observed from a few hundred genes. These apparent anomalies may challenge network inference methods. 
    more » « less
  2. Abstract Rapidly evolving taxa are excellent models for understanding the mechanisms that give rise to biodiversity. However, developing an accurate historical framework for comparative analysis of such lineages remains a challenge due to ubiquitous incomplete lineage sorting (ILS) and introgression. Here, we use a whole-genome alignment, multiple locus-sampling strategies, and summary-tree and single nucleotide polymorphism-based species-tree methods to infer a species tree for eastern North American Neodiprion species, a clade of pine-feeding sawflies (Order: Hymenopteran; Family: Diprionidae). We recovered a well-supported species tree that—except for three uncertain relationships—was robust to different strategies for analyzing whole-genome data. Nevertheless, underlying gene-tree discordance was high. To understand this genealogical variation, we used multiple linear regression to model site concordance factors estimated in 50-kb windows as a function of several genomic predictor variables. We found that site concordance factors tended to be higher in regions of the genome with more parsimony-informative sites, fewer singletons, less missing data, lower GC content, more genes, lower recombination rates, and lower D-statistics (less introgression). Together, these results suggest that ILS, introgression, and genotyping error all shape the genomic landscape of gene-tree discordance in Neodiprion. More generally, our findings demonstrate how combining phylogenomic analysis with knowledge of local genomic features can reveal mechanisms that produce topological heterogeneity across genomes. 
    more » « less
  3. Abstract Phylogenetic networks encode a broader picture of evolution by the inclusion of reticulate processes such as hybridization, introgression, or horizontal gene transfer. Each hybridization event is represented by a ‘hybridization cycle’. Here, we investigate the statistical identifiability of the position of the hybrid node in a 4-node hybridization cycle in a semi-directed level-1 phylogenetic network. That is, we investigate if our model is able to detect the correct placement of the hybrid node in the hybridization cycle using quartet concordance factors as data. In the current study, we prove that the correct placement of the hybrid node in 4-node hybridization cycles, included in level-1 phylogenetic networks, is generically identifiable if the assumptions are non-restrictive such as t∈(0,∞) for all branch (or edge) lengths and γ∈(0,1) for the inheritance probability of the hybrid edges. However, simulations show that accurate detection of these cycles can be complicated by inadequate sampling, small sample size, or gene tree estimation error. We identify practical advice for evolutionary biologists on best sampling strategies to improve the detection of this type of hybridization cycle. 
    more » « less
  4. Abstract Inferring gene regulatory networks (GRNs) from single-cell data is challenging due to heuristic limitations. Existing methods also lack estimates of uncertainty. Here we present Probabilistic Matrix Factorization for Gene Regulatory Network Inference (PMF-GRN). Using single-cell expression data, PMF-GRN infers latent factors capturing transcription factor activity and regulatory relationships. Using variational inference allows hyperparameter search for principled model selection and direct comparison to other generative models. We extensively test and benchmark our method using real single-cell datasets and synthetic data. We show that PMF-GRN infers GRNs more accurately than current state-of-the-art single-cell GRN inference methods, offering well-calibrated uncertainty estimates. 
    more » « less
  5. Inferring gene regulatory networks (GRNs) from single-cell gene expression datasets is a challenging task. Existing methods are often designed heuristically for specific datasets and lack the flexibility to incorporate additional information or compare against other algorithms. Further, current GRN inference methods do not provide uncertainty estimates with respect to the interactions that they predict, making inferred networks challenging to interpret. To overcome these challenges, we introduce Probabilistic Matrix Factorization for Gene Regulatory Network inference (PMF-GRN). PMF-GRN uses single-cell gene expression data to learn latent factors representing transcription factor activity as well as regulatory relationships between transcription factors and their target genes. This approach incorporates available experimental evidence into prior distributions over latent factors and scales well to single-cell gene expression datasets. By utilizing variational inference, we facilitate hyperparameter search for principled model selection and direct comparison to other generative models. To assess the accuracy of our method, we evaluate PMF-GRN using the model organisms Saccharomyces cerevisiae and Bacillus subtilis, benchmarking against database-derived gold standard interactions. We discover that, on average, PMF-GRN infers GRNs more accurately than current state-of-the-art single-cell GRN inference methods. Moreover, our PMF-GRN approach offers well-calibrated uncertainty estimates, as it performs gene regulatory network (GRN) inference in a probabilistic setting. These estimates are valuable for validation purposes, particularly when validated interactions are limited or a gold standard is incomplete. 
    more » « less