PhyloFisher is a software package written primarily in Python3 that can be used for the creation, analysis, and visualization of phylogenomic datasets that consist of protein sequences from eukaryotic organisms. Unlike many existing phylogenomic pipelines, PhyloFisher comes with a manually curated database of 240 protein‐coding genes, a subset of a previous phylogenetic dataset sampled from 304 eukaryotic taxa. The software package can also utilize a user‐created database of eukaryotic proteins, which may be more appropriate for shallow evolutionary questions. PhyloFisher is also equipped with a set of utilities to aid in running routine analyses, such as the prediction of alternative genetic codes, removal of genes and/or taxa based on occupancy/completeness of the dataset, testing for amino acid compositional heterogeneity among sequences, removal of heterotachious and/or fast‐evolving sites, removal of fast‐evolving taxa, supermatrix creation from randomly resampled genes, and supermatrix creation from nucleotide sequences. © 2024 Wiley Periodicals LLC.
Redefining Possible: Combining Phylogenomic and Supersparse Data in Frogs
Abstract The data available for reconstructing molecular phylogenies have become wildly disparate. Phylogenomic studies can generate data for thousands of genetic markers for dozens of species, but for hundreds of other taxa, data may be available from only a few genes. Can these two types of data be integrated to combine the advantages of both, addressing the relationships of hundreds of species with thousands of genes? Here, we show that this is possible, using data from frogs. We generated a phylogenomic data set for 138 ingroup species and 3,784 nuclear markers (ultraconserved elements [UCEs]), including new UCE data from 70 species. We also assembled a supermatrix data set, including data from 97% of frog genera (441 total), with 1–307 genes per taxon. We then produced a combined phylogenomic–supermatrix data set (a “gigamatrix”) containing 441 ingroup taxa and 4,091 markers but with 86% missing data overall. Likelihood analysis of the gigamatrix yielded a generally well-supported tree among families, largely consistent with trees from the phylogenomic data alone. All terminal taxa were placed in the expected families, even though 42.5% of these taxa each had >99.5% missing data and 70.2% had >90% missing data. Our results show that missing data need not be an impediment to successfully combining very large phylogenomic and supermatrix data sets, and they open the door to new studies that simultaneously maximize sampling of genes and taxa.
more »
« less
- Award ID(s):
- 1655812
- PAR ID:
- 10448830
- Editor(s):
- O'Connell, Mary
- Date Published:
- Journal Name:
- Molecular Biology and Evolution
- Volume:
- 40
- Issue:
- 5
- ISSN:
- 0737-4038
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Abstract Basic Protocol 1 : Constructing a phylogenomic datasetBasic Protocol 2 : Performing phylogenomic analysesSupport Protocol 1 : Installing PhyloFisherSupport Protocol 2 : Creating a custom phylogenomic database -
Abstract Alignment is a crucial issue in molecular phylogenetics because different alignment methods can potentially yield very different topologies for individual genes. But it is unclear if the choice of alignment methods remains important in phylogenomic analyses, which incorporate data from dozens, hundreds, or thousands of genes. For example, problematic biases in alignment might be multiplied across many loci, whereas alignment errors in individual genes might become irrelevant. The issue of alignment trimming (i.e. removing poorly aligned regions or missing data from individual genes) is also poorly explored. Here, we test the impact of 12 different combinations of alignment and trimming methods on phylogenomic analyses. We compare these methods using published phylogenomic data from ultraconserved elements (UCEs) from squamate reptiles (lizards and snakes), birds, and tetrapods. We compare the properties of alignments generated by different alignment and trimming methods (e.g., length, informative sites, missing data). We also test whether these datasets can recover well-established clades when analyzed with concatenated (RAxML) and species-tree methods (ASTRAL-III), using the full data (∼5,000 loci) and subsampled datasets (10% and 1% of loci). We show that different alignment and trimming methods can significantly impact various aspects of phylogenomic datasets (e.g. length, informative sites). However, these different methods generally had little impact on the recovery and support values for well-established clades, even across very different numbers of loci. Nevertheless, our results suggest several “best practices” for alignment and trimming. Intriguingly, the choice of phylogenetic methods impacted the results most strongly, with concatenated analyses recovering significantly more well-established clades (with stronger support) than the species-tree analyses.more » « less
-
Hejnol, Andreas (Ed.)Molecular evolution studies, such as phylogenomic studies and genome-wide surveys of selection, often rely on gene families of single-copy orthologs (SC-OGs). Large gene families with multiple homologs in 1 or more species—a phenomenon observed among several important families of genes such as transporters and transcription factors—are often ignored because identifying and retrieving SC-OGs nested within them is challenging. To address this issue and increase the number of markers used in molecular evolution studies, we developed OrthoSNAP, a software that uses a phylogenetic framework to simultaneously split gene families into SC-OGs and prune species-specific inparalogs. We term SC-OGs identified by OrthoSNAP as SNAP-OGs because they are identified using a s plitti n g a nd p runing procedure analogous to snapping branches on a tree. From 415,129 orthologous groups of genes inferred across 7 eukaryotic phylogenomic datasets, we identified 9,821 SC-OGs; using OrthoSNAP on the remaining 405,308 orthologous groups of genes, we identified an additional 10,704 SNAP-OGs. Comparison of SNAP-OGs and SC-OGs revealed that their phylogenetic information content was similar, even in complex datasets that contain a whole-genome duplication, complex patterns of duplication and loss, transcriptome data where each gene typically has multiple transcripts, and contentious branches in the tree of life. OrthoSNAP is useful for increasing the number of markers used in molecular evolution data matrices, a critical step for robustly inferring and exploring the tree of life.more » « less
-
Kolodny, Rachel (Ed.)Phylogenomic studies of prokaryotic taxa often assume conserved marker genes are homologous across their length. However, processes such as horizontal gene transfer or gene duplication and loss may disrupt this homology by recombining only parts of genes, causing gene fission or fusion. We show using simulation that it is necessary to delineate homology groups in a set of bacterial genomes without relying on gene annotations to define the boundaries of homologous regions. To solve this problem, we have developed a graph-based algorithm to partition a set of bacterial genomes into Maximal Homologous Groups of sequences ( MHGs ) where each MHG is a maximal set of maximum-length sequences which are homologous across the entire sequence alignment. We applied our algorithm to a dataset of 19 Enterobacteriaceae species and found that MHGs cover much greater proportions of genomes than markers and, relatedly, are less biased in terms of the functions of the genes they cover. We zoomed in on the correlation between each individual marker and their overlapping MHGs, and show that few phylogenetic splits supported by the markers are supported by the MHGs while many marker-supported splits are contradicted by the MHGs. A comparison of the species tree inferred from marker genes with the species tree inferred from MHGs suggests that the increased bias and lack of genome coverage by markers causes incorrect inferences as to the overall relationship between bacterial taxa.more » « less
-
null (Ed.)Since the initial success of genome-wide association studies (GWAS) in 2005, tens of thousands of genetic variants have been identified for hundreds of human diseases and traits. In a GWAS, genotype information at up to millions of genetic markers is collected from up to hundreds of thousands of individuals, together with their phenotype information. Several scientific goals can be accomplished through the analysis of GWAS data, including the identification of variants, genes, and pathways associated with diseases and traits of interest; the inference of the genetic architecture of these traits; and the development of genetic risk prediction models. In this review, we provide an overview of the statistical challenges in achieving these goals and recent progress in statistical methodology to address these challenges.more » « less