ABSTRACT Phylogenies built from multiple genes have become a common component of evolutionary biology studies. Molecular phylogenomic matrices used to build multi-gene phylogenies can be built from either nucleotide or protein sequences. Nucleotide-based analyses are often more appropriate for addressing phylogenetic questions at evolutionarily shallow timescales (i.e., less than 100 million years), while protein-based analyses are often more appropriate for addressing deep phylogenetic questions. PhyloFisher is a phylogenomic software package written in Python 3. The manually curated PhyloFisher database contains 240 protein-coding genes from 304 eukaryotic taxa. Here we present nucl_matrix_constructor.py, an expansion of the PhyloFisher starting database, and an update to PhyloFisher that maintains DNA sequences. This combination will allow users to easily build nucleotide phylogenomic matrices while retaining the benefits of protein-based pre-processing used to identify contaminants and paralogy.
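One common way to obtain a nucleotide matrix from protein-based pre-processing is to thread each coding sequence onto its gapped protein alignment row, codon by codon. The following is a minimal sketch of that general idea, not the PhyloFisher implementation; the function name `thread_codons` and the example sequences are hypothetical, and the sketch assumes the coding sequence starts in frame with the first aligned residue.

```python
def thread_codons(aligned_protein: str, cds: str) -> str:
    """Map an unaligned coding sequence onto a gapped protein alignment row.

    Hypothetical helper for illustration: each residue consumes one codon,
    and each protein gap ('-') becomes a three-column nucleotide gap.
    """
    residues = aligned_protein.replace("-", "")
    if len(cds) < 3 * len(residues):
        raise ValueError("coding sequence is too short for the protein row")

    codons = []
    pos = 0  # index of the next unused nucleotide in the coding sequence
    for symbol in aligned_protein:
        if symbol == "-":
            codons.append("---")
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)


if __name__ == "__main__":
    # Toy example: protein row "M-KV" with coding sequence ATG AAA GTT.
    print(thread_codons("M-KV", "ATGAAAGTT"))
```

Because the protein alignment fixes the column structure, any filtering done at the protein level carries over to the nucleotide matrix unchanged.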
This content will become publicly available on August 1, 2026
Phyling: phylogenetic inference from annotated genomes
Phyling is a fast, scalable, and user-friendly tool supporting phylogenomic reconstruction of species phylogenies directly from protein-encoded genomic data. It identifies orthologous genes by searching a sample's protein sequences against a Hidden Markov Model marker set of single-copy orthologs retrieved from the BUSCO database. In the final step, users can choose between consensus and concatenation strategies to construct the species tree from the aligned orthologs. Phyling efficiently resolves large phylogenies by optimizing memory usage and data processing. Its checkpoint system enables users to incrementally add or remove samples without repeating the entire search process. For analyses involving closely related taxa, Phyling supports the use of nucleotide coding sequences, which may capture phylogenetic signals missed by protein sequences. The benchmark results show that Phyling runs substantially faster than OrthoFinder, a Reciprocal Best Hit-based method, while achieving equal or better accuracy.
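The concatenation strategy mentioned above can be illustrated with a short sketch: per-gene alignments are joined end to end into one supermatrix, and taxa missing from a gene are padded with gaps. This is a generic illustration under stated assumptions, not Phyling's code; the function name `concatenate_alignments` and the toy gene and taxon names are hypothetical.

```python
from typing import Dict


def concatenate_alignments(alignments: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Build a supermatrix from per-gene alignments.

    `alignments` maps gene name -> {taxon name -> aligned sequence}.
    Hypothetical helper: genes are joined in sorted name order, and a taxon
    absent from a gene receives a run of gaps of that gene's length.
    """
    taxa = sorted({taxon for aln in alignments.values() for taxon in aln})
    parts = {taxon: [] for taxon in taxa}

    for gene in sorted(alignments):
        aln = alignments[gene]
        length = len(next(iter(aln.values())))
        if any(len(seq) != length for seq in aln.values()):
            raise ValueError(f"sequences in {gene} have unequal lengths")
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, "-" * length))

    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


if __name__ == "__main__":
    toy = {
        "geneA": {"sp1": "MK-", "sp2": "MKV"},
        "geneB": {"sp1": "LL"},  # sp2 is missing from geneB
    }
    for taxon, row in concatenate_alignments(toy).items():
        print(taxon, row)
```

A consensus strategy would instead infer one tree per gene and summarize those trees, so it does not need this padding step.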
- Award ID(s): 2215705
- PAR ID: 10656443
- Publisher / Repository: bioRxiv
- Date Published:
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
The history of lamprey evolution has been contentious due to limited morphological differentiation and limited genetic data. Available data have produced inconsistent results, including in the relationships among northern and southern species and the monophyly of putative clades. Here we use whole genome sequence data sourced from a public database to identify orthologs for 11 lamprey species from across the globe and build phylogenies. The phylogeny showed a clear separation between northern and southern lamprey species, which contrasts with some prior work. We also find that the phylogenetic relationships of our samples of two genera, Lethenteron and Eudontomyzon, deviate from the taxonomic classification of these species, suggesting that they require reclassification.
-
Abstract Background: Universal single-copy orthologs are the most conserved components of genomes. Although they are routinely used for studying evolutionary histories and assessing new assemblies, current methods do not incorporate information from available genomic data. Results: Here, we first determine the influence of evolutionary history on universal gene content and find that across 11,098 genomes of plants, fungi, and animals comprising 2606 taxonomic groups, 215 groups significantly vary from their respective lineages in terms of BUSCO (Benchmarking Universal Single Copy Orthologs) completeness. Additionally, 169 groups display an elevated complement of duplicated orthologs, likely from ancestral whole genome duplication events. Secondly, we investigate the extent of taxonomic congruence in broad BUSCO-derived phylogenies. For 275 suitable families out of 543 tested, sites evolving at higher rates produce at most 23.84% more taxonomically concordant, and at least 46.15% less terminally variable phylogenies compared to lower-rate sites. We find that BUSCO concatenated and coalescent trees have comparable accuracy and conclude that higher rate sites from concatenated alignments produce the most congruent and least variable phylogenies. Finally, we show that undetected, yet pervasive BUSCO gene loss events lead to misrepresentations of assembly quality. To overcome this, we filter a Curated set of BUSCOs (CUSCOs) that provide up to 6.99% fewer false positives compared to the standard search and introduce novel methods for comparing assemblies using gene synteny. Conclusions: Overall, we highlight the importance of considering evolutionary histories during assembly evaluations and release the phyca software toolkit that reconstructs consistent phylogenies and offers more precise assembly assessments.
-
Abstract Pesticidal proteins derived from the bacterium Bacillus thuringiensis have provided the bases for a diverse array of pest management tools ranging from natural products used in organic agriculture to modern biotechnological approaches. With advances in genome sequencing technologies and protein structure determination, an increasing number of pesticidal proteins from myriad bacterial species have been identified. The Bacterial Pesticidal Protein Resource Center (BPPRC) has been established to provide informational and analytical resources on the wide range of pesticidal proteins derived from bacteria that have potential utility for arthropod management. In association with a revised nomenclature for these proteins, BPPRC contains a database that allows users to browse and download sequences. Users can search the database for the best matches to sequences of interest and can incorporate their own sequences into basic informatic analyses. These analyses include the ability to draw and export guide trees from either whole protein sequences or, in the case of the three-domain Cry proteins, from individual domains. The associated website also provides a portal for users to submit protein sequences for naming. The BPPRC provides a single authoritative source of information to which all stakeholders can be referred, including academics, government regulatory bodies, and research and development personnel in the industrial sector. The database provides information on more than 1060 pesticidal proteins derived from 13 species of bacteria, including insecticidal activities for a subset of these proteins. Database URL: www.bpprc.org and www.bpprc-db.org/
-
Abstract Methods for rapidly inferring the evolutionary history of species or populations with genome-wide data are progressing, but computational constraints still limit our abilities in this area. We developed an alignment-free method to infer genome-wide phylogenies and implemented it in the Python package TopicContml. The method uses probabilistic topic modeling (specifically, Latent Dirichlet Allocation) to extract topic frequencies from k-mers, which are derived from multilocus DNA sequences. These extracted frequencies then serve as an input for the program Contml in the PHYLIP package, which is used to generate a species tree. We evaluated the performance of TopicContml on simulated datasets with gaps and three biological datasets: 1) 14 DNA sequence loci from two Australian bird species distributed across nine populations, 2) 5162 loci from 80 mammal species, and 3) raw, unaligned, nonorthologous PacBio sequences from 12 bird species. We also assessed the uncertainty of the estimated relationships among clades using a bootstrap procedure. Our empirical results and simulated data suggest that our method is efficient and statistically robust.
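The first stage of such an alignment-free approach, turning a DNA sequence into k-mer frequencies, can be sketched as follows. This is only an illustration of k-mer extraction and is not taken from TopicContml; the topic-modeling and Contml steps are omitted, the function name `kmer_frequencies` is hypothetical, and skipping k-mers that contain gaps or ambiguous bases is an assumption made here for simplicity.

```python
from collections import Counter
from typing import Dict


def kmer_frequencies(sequence: str, k: int) -> Dict[str, float]:
    """Return relative frequencies of overlapping k-mers in a DNA sequence.

    Hypothetical helper: k-mers containing anything other than A, C, G, or T
    (for example gaps or N) are skipped, which is an assumption of this sketch.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")

    sequence = sequence.upper()
    valid = set("ACGT")
    counts = Counter(
        sequence[i:i + k]
        for i in range(len(sequence) - k + 1)
        if set(sequence[i:i + k]) <= valid
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


if __name__ == "__main__":
    for kmer, freq in sorted(kmer_frequencies("ACGTAC", 2).items()):
        print(kmer, round(freq, 3))
```

Frequency vectors of this kind, computed per locus and per sample, are the sort of count data a topic model can then summarize into a smaller number of topic frequencies.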
