skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Assessing the fit of the multi-species network coalescent to multi-locus data
Abstract Motivation With growing genome-wide molecular datasets from next-generation sequencing, phylogenetic networks can be estimated using a variety of approaches. These phylogenetic networks include events like hybridization, gene flow or horizontal gene transfer explicitly. However, the most accurate network inference methods are computationally heavy. Methods that scale to larger datasets do not calculate a full likelihood, such that traditional likelihood-based tools for model selection are not applicable to decide how many past hybridization events best fit the data. We propose here a goodness-of-fit test to quantify the fit between data observed from genome-wide multi-locus data, and patterns expected under the multi-species coalescent model on a candidate phylogenetic network. Results We identified weaknesses in the previously proposed TICR test, and proposed corrections. The performance of our new test was validated by simulations on real-world phylogenetic networks. Our test provides one of the first rigorous tools for model selection, to select the adequate network complexity for the data at hand. The test can also work for identifying poorly inferred areas on a network. Availability and implementation Software for the goodness-of-fit test is available as a Julia package at https://github.com/cecileane/QuartetNetworkGoodnessFit.jl. Supplementary information Supplementary data are available at Bioinformatics online.  more » « less
Award ID(s):
1902892 2023239
PAR ID:
10225145
Author(s) / Creator(s):
;
Editor(s):
Russell, Schwartz
Date Published:
Journal Name:
Bioinformatics
ISSN:
1367-4803
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Phylogenetic networks extend phylogenetic trees to allow for modeling reticulate evolutionary processes such as hybridization. They take the shape of a rooted, directed, acyclic graph, and when parameterized with evolutionary parameters, such as divergence times and population sizes, they form a generative process of molecular sequence evolution. Early work on computational methods for phylogenetic network inference focused exclusively on reticulations and sought networks with the fewest number of reticulations to fit the data. As processes such as incomplete lineage sorting (ILS) could be at play concurrently with hybridization, work in the last decade has shifted to computational approaches for phylogenetic network inference in the presence of ILS. In such a short period, significant advances have been made on developing and implementing such computational approaches. In particular, parsimony, likelihood, and Bayesian methods have been devised for estimating phylogenetic networks and associated parameters using estimated gene trees as data. Use of those inference methods has been augmented with statistical tests for specific hypotheses of hybridization, like the D-statistic. Most recently, Bayesian approaches for inferring phylogenetic networks directly from sequence data were developed and implemented. In this chapter, we survey such advances and discuss model assumptions as well as methods’ strengths and limitations. We also discuss parallel efforts in the population genetics community aimed at inferring similar structures. Finally, we highlight major directions for future research in this area. 
    more » « less
  2. Martelli, Pier Luigi (Ed.)
    Abstract Motivation Reconstruction of genome-scale networks from gene expression data is an actively studied problem. A wide range of methods that differ between the types of interactions they uncover with varying trade-offs between sensitivity and specificity have been proposed. To leverage benefits of multiple such methods, ensemble network methods that combine predictions from resulting networks have been developed, promising results better than or as good as the individual networks. Perhaps owing to the difficulty in obtaining accurate training examples, these ensemble methods hitherto are unsupervised. Results In this article, we introduce EnGRaiN, the first supervised ensemble learning method to construct gene networks. The supervision for training is provided by small training datasets of true edge connections (positives) and edges known to be absent (negatives) among gene pairs. We demonstrate the effectiveness of EnGRaiN using simulated datasets as well as a curated collection of Arabidopsis thaliana datasets we created from microarray datasets available from public repositories. EnGRaiN shows better results not only in terms of receiver operating characteristic and PR characteristics for both real and simulated datasets compared with unsupervised methods for ensemble network construction, but also generates networks that can be mined for elucidating complex biological interactions. Availability and implementation EnGRaiN software and the datasets used in the study are publicly available at the github repository: https://github.com/AluruLab/EnGRaiN. Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less
  3. Abstract Motivation Detecting cancer gene expression and transcriptome changes with mRNA-sequencing (RNA-Seq) or array-based data are important for understanding the molecular mechanisms underlying carcinogenesis and cellular events during cancer progression. In previous studies, the differentially expressed genes were detected across patients in one cancer type. These studies ignored the role of mRNA expression changes in driving tumorigenic mechanisms that are either universal or specific in different tumor types. To address the problem, we introduce two network-based multi-task learning frameworks, NetML and NetSML, to discover common differentially expressed genes shared across different cancer types as well as differentially expressed genes specific to each cancer type. The proposed frameworks consider the common latent gene co-expression modules and gene-sample biclusters underlying the multiple cancer datasets to learn the knowledge crossing different tumor types. Results Large-scale experiments on simulations and real cancer high-throughput datasets validate that the proposed network-based multi-task learning frameworks perform better sample classification compared with the models without the knowledge sharing across different cancer types. The common and cancer specific molecular signatures detected by multi-task learning frameworks on TCGA ovarian cancer, breast cancer, and prostate cancer datasets are correlated with the known marker genes and enriched in cancer relevant KEGG pathways and Gene Ontology terms. Availability and Implementation Source code is available at: https://github.com/compbiolabucf/NetML Supplementary information Supplementary data are available at Bioinformatics 
    more » « less
  4. Robinson, Peter (Ed.)
    Abstract Motivation Accurate disease phenotype prediction plays an important role in the treatment of heterogeneous diseases like cancer in the era of precision medicine. With the advent of high throughput technologies, more comprehensive multi-omics data is now available that can effectively link the genotype to phenotype. However, the interactive relation of multi-omics datasets makes it particularly challenging to incorporate different biological layers to discover the coherent biological signatures and predict phenotypic outcomes. In this study, we introduce omicsGAN, a generative adversarial network model to integrate two omics data and their interaction network. The model captures information from the interaction network as well as the two omics datasets and fuse them to generate synthetic data with better predictive signals. Results Large-scale experiments on The Cancer Genome Atlas breast cancer, lung cancer and ovarian cancer datasets validate that (i) the model can effectively integrate two omics data (e.g. mRNA and microRNA expression data) and their interaction network (e.g. microRNA-mRNA interaction network). The synthetic omics data generated by the proposed model has a better performance on cancer outcome classification and patients survival prediction compared to original omics datasets. (ii) The integrity of the interaction network plays a vital role in the generation of synthetic data with higher predictive quality. Using a random interaction network does not allow the framework to learn meaningful information from the omics datasets; therefore, results in synthetic data with weaker predictive signals. Availability and implementation Source code is available at: https://github.com/CompbioLabUCF/omicsGAN. Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less
  5. Abstract Phylogenetic networks can represent evolutionary events that cannot be described by phylogenetic trees. These networks are able to incorporate reticulate evolutionary events such as hybridization, introgression, and lateral gene transfer. Recently, network-based Markov models of DNA sequence evolution have been introduced along with model-based methods for reconstructing phylogenetic networks. For these methods to be consistent, the network parameter needs to be identifiable from data generated under the model. Here, we show that the semi-directed network parameter of a triangle-free, level-1 network model with any fixed number of reticulation vertices is generically identifiable under the Jukes–Cantor, Kimura 2-parameter, or Kimura 3-parameter constraints. 
    more » « less