Title: A PhyloFisher utility for nucleotide-based phylogenomic matrix construction; nucl_matrix_constructor.py
ABSTRACT Phylogenies built from multiple genes have become a common component of evolutionary biology studies. Molecular phylogenomic matrices used to build multi-gene phylogenies can be built from either nucleotide or protein matrices. Nucleotide-based analyses are often more appropriate for addressing phylogenetic questions at evolutionarily shallow timescales (i.e., less than 100 million years) while protein-based analyses are often more appropriate for addressing deep phylogenetic questions. PhyloFisher is a phylogenomic software package written in Python3. The manually curated PhyloFisher database contains 240 protein-coding genes from 304 eukaryotic taxa. Here we present nucl_matrix_constructor.py, an expansion of the PhyloFisher starting database, and an update to PhyloFisher that maintains DNA sequences. This combination will allow users to easily build nucleotide phylogenomic matrices while retaining the benefits of protein-based pre-processing used to identify contaminants and paralogy.
Award ID(s):
2100888
PAR ID:
10574813
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
bioRxiv
Date Published:
Format(s):
Medium: X
Institution:
bioRxiv
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract PhyloFisher is a software package written primarily in Python3 that can be used for the creation, analysis, and visualization of phylogenomic datasets that consist of protein sequences from eukaryotic organisms. Unlike many existing phylogenomic pipelines, PhyloFisher comes with a manually curated database of 240 protein‐coding genes, a subset of a previous phylogenetic dataset sampled from 304 eukaryotic taxa. The software package can also utilize a user‐created database of eukaryotic proteins, which may be more appropriate for shallow evolutionary questions. PhyloFisher is also equipped with a set of utilities to aid in running routine analyses, such as the prediction of alternative genetic codes, removal of genes and/or taxa based on occupancy/completeness of the dataset, testing for amino acid compositional heterogeneity among sequences, removal of heterotachious and/or fast‐evolving sites, removal of fast‐evolving taxa, supermatrix creation from randomly resampled genes, and supermatrix creation from nucleotide sequences. © 2024 Wiley Periodicals LLC. Basic Protocol 1: Constructing a phylogenomic dataset Basic Protocol 2: Performing phylogenomic analyses Support Protocol 1: Installing PhyloFisher Support Protocol 2: Creating a custom phylogenomic database 
  2. Hejnol, Andreas (Ed.)
    Phylogenomic analyses of hundreds of protein-coding genes aimed at resolving phylogenetic relationships is now a common practice. However, no software currently exists that includes tools for dataset construction and subsequent analysis with diverse validation strategies to assess robustness. Furthermore, there are no publicly available high-quality curated databases designed to assess deep (>100 million years) relationships in the tree of eukaryotes. To address these issues, we developed an easy-to-use software package, PhyloFisher ( https://github.com/TheBrownLab/PhyloFisher ), written in Python 3. PhyloFisher includes a manually curated database of 240 protein-coding genes from 304 eukaryotic taxa covering known eukaryotic diversity, a novel tool for ortholog selection, and utilities that will perform diverse analyses required by state-of-the-art phylogenomic investigations. Through phylogenetic reconstructions of the tree of eukaryotes and of the Saccharomycetaceae clade of budding yeasts, we demonstrate the utility of the PhyloFisher workflow and the provided starting database to address phylogenetic questions across a large range of evolutionary time points for diverse groups of organisms. We also demonstrate that undetected paralogy can remain in phylogenomic “single-copy orthogroup” datasets constructed using widely accepted methods such as all vs. all BLAST searches followed by Markov Cluster Algorithm (MCL) clustering and application of automated tree pruning algorithms. Finally, we show how the PhyloFisher workflow helps detect inadvertent paralog inclusions, allowing the user to make more informed decisions regarding orthology assignments, leading to a more accurate final dataset. 
  3. Abstract Placing new sequences onto reference phylogenies is increasingly used for analyzing environmental samples, especially microbiomes. Existing placement methods assume that query sequences have evolved under specific models directly on the reference phylogeny. For example, they assume single-gene data (e.g., 16S rRNA amplicons) have evolved under the GTR model on a gene tree. Placement, however, often has a more ambitious goal: extending a (genome-wide) species tree given data from individual genes without knowing the evolutionary model. Addressing this challenging problem requires new directions. Here, we introduce Deep-learning Enabled Phylogenetic Placement (DEPP), an algorithm that learns to extend species trees using single genes without prespecified models. In simulations and on real data, we show that DEPP can match the accuracy of model-based methods without any prior knowledge of the model. We also show that DEPP can update the multilocus microbial tree-of-life with single genes with high accuracy. We further demonstrate that DEPP can combine 16S and metagenomic data onto a single tree, enabling community structure analyses that take advantage of both sources of data. [Deep learning; gene tree discordance; metagenomics; microbiome analyses; neural networks; phylogenetic placement.] 
  4. Blaimer, Bonnie (Ed.)
    Abstract A rapid proliferation in the availability of whole genome sequences (WGS), often with relatively low read depth, offers an unprecedented opportunity for phylogenomic advances using publicly available data, but there are several key challenges in applying these data. Using low‐coverage WGS data for the ant species of Formica, we conducted detailed comparisons on two different analytical pipelines (reference‐based vs. de novo genome assembly), four types of datasets (5‐kbp‐window, ultra‐conserved element [UCE], single‐copy ortholog [BUSCO] and mitogenome), and a series of analytical procedures (e.g. concatenation vs. coalescent analyses) to identify which are robust to typical WGS data. The results show that at a shallow scale of phylogenetic relationships of closely related species 5‐kbp‐windows from the reference‐based pipeline and UCEs from the de novo assemblies are more successful than the BUSCOs in recovering informative markers for phylogenetic inference. Compared with concatenation analyses, coalescent analyses often resulted in disparate deeper relationships in the phylogeny. This study also uncovers evident mito‐nuclear discordance and demonstrates genome‐wide gene conflicts in phylogenetic signals, both pointing to possible incomplete lineage sorting and/or hybridization during the early, rapid radiation of Formica ants. Divergence dating analyses show that different types of data and analytical methods could result in inconsistent time estimates, highlighting the potential need for multiple approaches to better understand species divergence. The strengths and weaknesses of different analytical pipelines and strategies are discussed. Findings from this study provide valuable insights for large‐scale phylogenomic projects using WGS data.
  5. PhyloFisher is a software package, written in Python3, that contains a protocol designed for phylogenomic dataset assembly and data exploration. This software package aids in the construction and curation of protein sequence-based phylogenomic datasets, conducts post-assembly analyses, and allows visualization of the results. In addition, PhyloFisher currently includes a manually curated starting dataset of 240 proteins from 304 eukaryotic taxa representing the full breadth of known diversity in the eukaryotic tree of life. Importantly, this dataset also includes identified paralogs of each of the 240 proteins from all investigated taxa which is crucial for the identification of probable orthologs. Although PhyloFisher includes this pan-eukaryotic dataset, the tool is flexible and can work with any dataset consisting of protein sequences derived from eukaryotes. The combination of all of the foregoing features makes PhyloFisher a broadly-useful, user-friendly software tool for sophisticated phylogenomic analyses of eukaryotes.
    PROJECT WEBSITE: http://amoeba.msstate.edu/phylofisher/
    PROJECT GITHUB: http://github.com/TheBrownLab/PhyloFisher
    This dataset contains files for end users to retrieve for installation of PhyloFisher as well as accompanying data from the PhyloFisher manuscript.
    Tice_etal.PhyloFisher.archives.tar.gz | Installation requirements for PIP installation
    Tice_etal.PhyloFisher1.FINAL_DATASET_RENAMED.tar.gz | File dataset associated with the manuscript including matrices and phylogenetic analyses
    Tice_etal.PhyloFisher_v1.0_input_proteomes_LongNames.tar.gz | Input proteome data from taxa that was used to construct PhyloFisher v1.0
    Tice_etal.PhyloFisherDatabase_v1.0_Jan.28.2021.tar.gz | PhyloFisher v1.0 starting database
    Tice_etal.PhyloFisher_FOR_CUSTOM_DATASET_Jan.28.2021.tar.gz | Necessary files and directory structure to be used in custom database construction.
    Tice_etal.PhyloFisher.DATA.tgz | All data associated with the figures (Fig 3, 4, A-Y) along with all phylogenomic trees and analyses.