Title: Data Associated with PhyloFisher
PhyloFisher is a software package, written in Python3, that contains a protocol designed for phylogenomic dataset assembly and data exploration. This software package aids in the construction and curation of protein sequence-based phylogenomic datasets, conducts post-assembly analyses, and allows visualization of the results. In addition, PhyloFisher currently includes a manually curated starting dataset of 240 proteins from 304 eukaryotic taxa representing the full breadth of known diversity in the eukaryotic tree of life. Importantly, this dataset also includes identified paralogs of each of the 240 proteins from all investigated taxa, which is crucial for the identification of probable orthologs. Although PhyloFisher includes this pan-eukaryotic dataset, the tool is flexible and can work with any dataset consisting of protein sequences derived from eukaryotes. The combination of all of the foregoing features makes PhyloFisher a broadly useful, user-friendly software tool for sophisticated phylogenomic analyses of eukaryotes.

PROJECT WEBSITE: http://amoeba.msstate.edu/phylofisher/
PROJECT GITHUB: http://github.com/TheBrownLab/PhyloFisher

This dataset contains files for end users to retrieve for installation of PhyloFisher as well as accompanying data from the PhyloFisher manuscript.

Tice_etal.PhyloFisher.archives.tar.gz | Installation requirements for PIP installation
Tice_etal.PhyloFisher1.FINAL_DATASET_RENAMED.tar.gz | File dataset associated with the manuscript including matrices and phylogenetic analyses
Tice_etal.PhyloFisher_v1.0_input_proteomes_LongNames.tar.gz | Input proteome data from taxa that were used to construct PhyloFisher v1.0
Tice_etal.PhyloFisherDatabase_v1.0_Jan.28.2021.tar.gz | PhyloFisher v1.0 starting database
Tice_etal.PhyloFisher_FOR_CUSTOM_DATASET_Jan.28.2021.tar.gz | Necessary files and directory structure to be used in custom database construction.
Tice_etal.PhyloFisher.DATA.tgz | All data associated with the figures (Fig 3, 4, A-Y) along with all phylogenomic trees and analyses.
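
Because the archives listed above total about 9 GB, it can help to look inside one before unpacking it. The following is a minimal sketch, not part of PhyloFisher itself, that uses Python's standard tarfile module to list the first few member names of a gzip-compressed tar archive. The function name list_archive and the limit parameter are hypothetical, chosen for illustration only.

```python
import tarfile
from typing import List


def list_archive(path: str, limit: int = 20) -> List[str]:
    """Return up to `limit` member names from a gzip-compressed tar archive.

    Hypothetical helper for illustration; works for both .tar.gz and .tgz files.
    """
    names: List[str] = []
    # "r:gz" opens the archive for reading with gzip decompression.
    with tarfile.open(path, mode="r:gz") as archive:
        # Iterating reads member headers one at a time, so a large archive
        # does not need to be scanned in full when only a preview is wanted.
        for member in archive:
            names.append(member.name)
            if len(names) >= limit:
                break
    return names
```

For example, list_archive("Tice_etal.PhyloFisher.DATA.tgz") would return the first 20 member names, assuming that file has already been downloaded to the working directory.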
Award ID(s):
2100888
PAR ID:
10320935
Author(s) / Creator(s):
Publisher / Repository:
figshare
Date Published:
Subject(s) / Keyword(s):
Evolutionary Biology
Format(s):
Medium: X Size: 9166077571 Bytes
Size(s):
9166077571 Bytes
Sponsoring Org:
National Science Foundation
More Like this
  1. Hejnol, Andreas (Ed.)
    Phylogenomic analyses of hundreds of protein-coding genes aimed at resolving phylogenetic relationships are now common practice. However, no software currently exists that includes tools for dataset construction and subsequent analysis with diverse validation strategies to assess robustness. Furthermore, there are no publicly available high-quality curated databases designed to assess deep (>100 million years) relationships in the tree of eukaryotes. To address these issues, we developed an easy-to-use software package, PhyloFisher (https://github.com/TheBrownLab/PhyloFisher), written in Python 3. PhyloFisher includes a manually curated database of 240 protein-coding genes from 304 eukaryotic taxa covering known eukaryotic diversity, a novel tool for ortholog selection, and utilities that will perform diverse analyses required by state-of-the-art phylogenomic investigations. Through phylogenetic reconstructions of the tree of eukaryotes and of the Saccharomycetaceae clade of budding yeasts, we demonstrate the utility of the PhyloFisher workflow and the provided starting database to address phylogenetic questions across a large range of evolutionary time points for diverse groups of organisms. We also demonstrate that undetected paralogy can remain in phylogenomic “single-copy orthogroup” datasets constructed using widely accepted methods such as all vs. all BLAST searches followed by Markov Cluster Algorithm (MCL) clustering and application of automated tree pruning algorithms. Finally, we show how the PhyloFisher workflow helps detect inadvertent paralog inclusions, allowing the user to make more informed decisions regarding orthology assignments, leading to a more accurate final dataset.
  2. PhyloFisher is a software package written primarily in Python3 that can be used for the creation, analysis, and visualization of phylogenomic datasets that consist of protein sequences from eukaryotic organisms. Unlike many existing phylogenomic pipelines, PhyloFisher comes with a manually curated database of 240 protein‐coding genes, a subset of a previous phylogenetic dataset sampled from 304 eukaryotic taxa. The software package can also utilize a user‐created database of eukaryotic proteins, which may be more appropriate for shallow evolutionary questions. PhyloFisher is also equipped with a set of utilities to aid in running routine analyses, such as the prediction of alternative genetic codes, removal of genes and/or taxa based on occupancy/completeness of the dataset, testing for amino acid compositional heterogeneity among sequences, removal of heterotachious and/or fast‐evolving sites, removal of fast‐evolving taxa, supermatrix creation from randomly resampled genes, and supermatrix creation from nucleotide sequences. © 2024 Wiley Periodicals LLC.
    Basic Protocol 1: Constructing a phylogenomic dataset
    Basic Protocol 2: Performing phylogenomic analyses
    Support Protocol 1: Installing PhyloFisher
    Support Protocol 2: Creating a custom phylogenomic database
  3. Phylogenies built from multiple genes have become a common component of evolutionary biology studies. Molecular phylogenomic matrices used to build multi-gene phylogenies can be built from either nucleotide or protein matrices. Nucleotide-based analyses are often more appropriate for addressing phylogenetic questions in evolutionarily shallow timescales (i.e., less than 100 million years) while protein-based analyses are often more appropriate for addressing deep phylogenetic questions. PhyloFisher is a phylogenomic software package written in Python3. The manually curated PhyloFisher database contains 240 protein-coding genes from 304 eukaryotic taxa. Here we present nucl_matrix_constructor.py, an expansion of the PhyloFisher starting database, and an update to PhyloFisher that maintains DNA sequences. This combination will allow users to easily build nucleotide phylogenomic matrices while retaining the benefits of protein-based pre-processing used to identify contaminants and paralogy.
  4. A biodiversity dataset graph: DataONE

    The intended use of this archive is to facilitate meta-analysis of the Data Observation Network for Earth (DataONE, [1]). DataONE is a distributed infrastructure that provides information about earth observation data.

    This dataset provides versioned snapshots of the DataONE network as tracked by Preston [2] between 17 October 2018 and 7 July 2019.

    The archive consists of 256 individual parts (e.g., preston-00.tar.gz, preston-01.tar.gz, ...) to allow for parallel file downloads. The archive contains three types of files: index files, provenance files and data files. Only two index and provenance files are included and have been individually included in this dataset publication. Index files provide a way to link provenance files in time to establish a versioning mechanism. Provenance files describe how, when and where the DataONE meta-data files were retrieved. For more information, please visit https://preston.guoda.bio or https://doi.org/10.5281/zenodo.1410543.

    To retrieve and verify the downloaded DataONE biodiversity dataset graph, first concatenate all the downloaded preston-*.tar.gz files (e.g., cat preston-*.tar.gz > preston.tar.gz). Then, extract the archives into a "data" folder. Alternatively, you can use the preston [2] command-line tool to "clone" this dataset using:

        $ java -jar preston.jar clone --remote https://zenodo.org/record/3277312/files

    After that, verify the index of the archive by reproducing the following result:

        $ java -jar preston.jar history
        <0659a54f-b713-4f86-a917-5be166a14110> <http://purl.org/pav/hasVersion> <hash://sha256/8c67e0741d1c90db54740e08d2e39d91dfd73566ea69c1f2da0d9ab9780a9a9f> .
        <hash://sha256/3ed3acaca7ac57f546d0b8877c1927ab5e08c23eccaa8219600c59c77a72c685> <http://purl.org/pav/previousVersion> <hash://sha256/8c67e0741d1c90db54740e08d2e39d91dfd73566ea69c1f2da0d9ab9780a9a9f> .
        <hash://sha256/857753997a7595a1b372b05641b58a25d9408b7ff08d557ce1fe8b73e4bd383f> <http://purl.org/pav/previousVersion> <hash://sha256/3ed3acaca7ac57f546d0b8877c1927ab5e08c23eccaa8219600c59c77a72c685> .
        <hash://sha256/7ee0376f4c3f7aeeda36927a5211395e5da8201e810e8c7e638a0fe23d001e88> <http://purl.org/pav/previousVersion> <hash://sha256/857753997a7595a1b372b05641b58a25d9408b7ff08d557ce1fe8b73e4bd383f> .
        <hash://sha256/68b4974d8ab7c4c7a7a4305065839b60ba460aaa862590b34c67877738feba90> <http://purl.org/pav/previousVersion> <hash://sha256/7ee0376f4c3f7aeeda36927a5211395e5da8201e810e8c7e638a0fe23d001e88> .
        <hash://sha256/060a76d56255bf9482c951748c91291fddeeb20f180632132be1344e081b2372> <http://purl.org/pav/previousVersion> <hash://sha256/68b4974d8ab7c4c7a7a4305065839b60ba460aaa862590b34c67877738feba90> .
        <hash://sha256/29357bdfab4548025f8a5743301f5c3c9146fa436c39e3c9e019fb9409ac9c42> <http://purl.org/pav/previousVersion> <hash://sha256/060a76d56255bf9482c951748c91291fddeeb20f180632132be1344e081b2372> .
        <hash://sha256/3669cd95100d1d533eb8953ff4ec5092cbd8addb8879b3e6262191148a8a3ebb> <http://purl.org/pav/previousVersion> <hash://sha256/29357bdfab4548025f8a5743301f5c3c9146fa436c39e3c9e019fb9409ac9c42> .
        <hash://sha256/8dc1663299359d271cb1b4c14ad521d0f1be67743689dd18016543dc1e097efb> <http://purl.org/pav/previousVersion> <hash://sha256/3669cd95100d1d533eb8953ff4ec5092cbd8addb8879b3e6262191148a8a3ebb> .
        <hash://sha256/dc4903e8afee651db1d9bf509f20503bf9c8e89679c4bcffb46d5b97440cb6de> <http://purl.org/pav/previousVersion> <hash://sha256/8dc1663299359d271cb1b4c14ad521d0f1be67743689dd18016543dc1e097efb> .

    To check the integrity of the extracted archive, confirm that each line produced by the command "preston verify" has the form shown below and includes "CONTENT_PRESENT_VALID_HASH". Depending on hardware capacity, this may take a while.

        $ java -jar preston.jar verify
        hash://sha256/e55c1034d985740926564e94decd6dc7a70f779a33e7deb931553739cda16945    file:/home/preston/preston-dataone/data/e5/5c/e55c1034d985740926564e94decd6dc7a70f779a33e7deb931553739cda16945    OK    CONTENT_PRESENT_VALID_HASH    21580
        hash://sha256/d0ddcc2111b6134a570bcc7d89375920ef4d754130cecc0727c79d2b05a9f81f    file:/home/preston/preston-dataone/data/d0/dd/d0ddcc2111b6134a570bcc7d89375920ef4d754130cecc0727c79d2b05a9f81f    OK    CONTENT_PRESENT_VALID_HASH    2035
        hash://sha256/472de9d1c9fd7e044aac409abfbfff9f12c6b69359df995d431009580ffb0f53    file:/home/preston/preston-dataone/data/47/2d/472de9d1c9fd7e044aac409abfbfff9f12c6b69359df995d431009580ffb0f53    OK    CONTENT_PRESENT_VALID_HASH    1935
        hash://sha256/b29879462cd43862129c5cf9b149c41ecd33ffef284a4dbea4ac1c0f90108687    file:/home/preston/preston-dataone/data/b2/98/b29879462cd43862129c5cf9b149c41ecd33ffef284a4dbea4ac1c0f90108687    OK    CONTENT_PRESENT_VALID_HASH    1553

    Note that a copy of the java program "preston", preston.jar, is included in this publication. The program runs on a java 8+ virtual machine using "java -jar preston.jar", or in short "preston".

    Files in this data publication:

    README - this file
    preston.jar - executable java jar containing preston [2] v0.1.1.
    preston-[00-ff].tar.gz - preston archives containing DataONE meta-data files, their provenance and a provenance index.
    2a5de79372318317a382ea9a2cef069780b852b01210ef59e06b640a3539cb5a - preston index file
    2aecaf289def0e23a27058bf7715f226ef9189905f0be13228174825633125cf - preston index file
    3d38b70198e448674be6a63d14b9817f3a956f48bba7418fa7baa086a56c05b7 - preston index file
    66ad3e5e904740f1e835ac6718dda4279e0c24b204ea0d1113cda1352a5072ba - preston index file
    8bf062872ce958545d361e9d53a552ffb025ac29ab875caad1157c0995d34f66 - preston index file
    d9378616636be3686bbabd5bf29d50f0ef0e5ceb5ddd7dfce47f7e755b596b7d - preston index file
    da26fa6e7371385ed3f61af9a766221c833060d59dfd4869bbd7110f95f288db - preston index file
    e4103a75627857de3ee2e317429108611c244fc448c01d1d7bf652115c3b8a55 - preston index file
    eb368fedb8f100210dd968edcf80f4d13cab3dd64135a6ab744102cf15e68c94 - preston index file
    ff92b6c06ae5286bd2f1db679e0fcc4da294acb9bc01b2e9522378d99218c2e3 - preston index file

    [1] DataONE, https://www.dataone.org
    [2] https://preston.guoda.bio, https://doi.org/10.5281/zenodo.1410543 . DataONE was crawled via Preston with "preston update -u https://dataone.org".

    This work is funded in part by grant NSF OAC 1839201 from the National Science Foundation.
  5. A biodiversity dataset graph: BHL

    The intended use of this archive is to facilitate (meta-)analysis of the Biodiversity Heritage Library (BHL). The Biodiversity Heritage Library improves research methodology by collaboratively making biodiversity literature openly available to the world as part of a global biodiversity community.

    This dataset provides versioned snapshots of the BHL network as tracked by Preston [2] between 2019-05-19 and 2020-05-09 using "preston update -u https://biodiversitylibrary.org".

    The archive consists of 256 individual parts (e.g., preston-00.tar.gz, preston-01.tar.gz, ...) to allow for parallel file downloads. The archive contains three types of files: index files, provenance logs and data files. In addition, index files have been individually included in this dataset publication to facilitate remote access. Index files provide a way to link provenance files in time to establish a versioning mechanism. Provenance files describe how, when, what and where the BHL content was retrieved. For more information, please visit https://preston.guoda.bio or https://doi.org/10.5281/zenodo.1410543 .

    To retrieve and verify the downloaded BHL biodiversity dataset graph, first concatenate all the downloaded preston-*.tar.gz files (e.g., cat preston-*.tar.gz > preston.tar.gz). Then, extract the archives into a "data" folder. Alternatively, you can use the preston [2] command-line tool to "clone" this dataset using:

        $ java -jar preston.jar clone --remote https://zenodo.org/record/3849560/files

    After that, verify the index of the archive by reproducing the following provenance log history:

        $ java -jar preston.jar history
        <0659a54f-b713-4f86-a917-5be166a14110> <http://purl.org/pav/hasVersion> <hash://sha256/89926f33157c0ef057b6de73f6c8be0060353887b47db251bfd28222f2fd801a> .
        <hash://sha256/41b19aa9456fc709de1d09d7a59c87253bc1f86b68289024b7320cef78b3e3a4> <http://purl.org/pav/previousVersion> <hash://sha256/89926f33157c0ef057b6de73f6c8be0060353887b47db251bfd28222f2fd801a> .
        <hash://sha256/7582d5ba23e0d498ca4f55c29408c477d0d92b4fdcea139e8666f4d78c78a525> <http://purl.org/pav/previousVersion> <hash://sha256/41b19aa9456fc709de1d09d7a59c87253bc1f86b68289024b7320cef78b3e3a4> .
        <hash://sha256/a70774061ccded1a45389b9e6063eb3abab3d42813aa812391f98594e7e26687> <http://purl.org/pav/previousVersion> <hash://sha256/7582d5ba23e0d498ca4f55c29408c477d0d92b4fdcea139e8666f4d78c78a525> .
        <hash://sha256/007e065ba4b99867751d688754aa3d33fa96e6e03133a2097e8a368d613cd93a> <http://purl.org/pav/previousVersion> <hash://sha256/a70774061ccded1a45389b9e6063eb3abab3d42813aa812391f98594e7e26687> .
        <hash://sha256/4fb4b4d8f1ae2961311fb0080e817adb2faa746e7eae15249a3772fbe2d662a1> <http://purl.org/pav/previousVersion> <hash://sha256/007e065ba4b99867751d688754aa3d33fa96e6e03133a2097e8a368d613cd93a> .
        <hash://sha256/67cc329e74fd669945f503917fbb942784915ab7810ddc41105a82ebe6af5482> <http://purl.org/pav/previousVersion> <hash://sha256/4fb4b4d8f1ae2961311fb0080e817adb2faa746e7eae15249a3772fbe2d662a1> .
        <hash://sha256/e46cd4b0d7fdb51ea789fa3c5f7b73591aca62d2d8f913346d71aa6cf0745c9f> <http://purl.org/pav/previousVersion> <hash://sha256/67cc329e74fd669945f503917fbb942784915ab7810ddc41105a82ebe6af5482> .
        <hash://sha256/9215d543418a80510e78d35a0cfd7939cc59f0143d81893ac455034b5e96150a> <http://purl.org/pav/previousVersion> <hash://sha256/e46cd4b0d7fdb51ea789fa3c5f7b73591aca62d2d8f913346d71aa6cf0745c9f> .
        <hash://sha256/1448656cc9f339b4911243d7c12f3ba5366b54fff3513640306682c50f13223d> <http://purl.org/pav/previousVersion> <hash://sha256/9215d543418a80510e78d35a0cfd7939cc59f0143d81893ac455034b5e96150a> .
        <hash://sha256/7ee6b16b7a5e9b364776427d740332d8552adf5041d48018eeb3c0e13ccebf27> <http://purl.org/pav/previousVersion> <hash://sha256/1448656cc9f339b4911243d7c12f3ba5366b54fff3513640306682c50f13223d> .
        <hash://sha256/34ccd7cf7f4a1ea35ac6ae26a458bb603b2f6ee8ad36e1a58aa0261105d630b1> <http://purl.org/pav/previousVersion> <hash://sha256/7ee6b16b7a5e9b364776427d740332d8552adf5041d48018eeb3c0e13ccebf27> .

    To check the integrity of the extracted archive, confirm that each line produced by the command "preston verify" has the form shown below and includes "CONTENT_PRESENT_VALID_HASH". Depending on hardware capacity, this may take a while.

        $ java -jar preston.jar verify
        hash://sha256/e0c131ebf6ad2dce71ab9a10aa116dcedb219ae4539f9e5bf0e57b84f51f22ca    file:/home/preston/preston-bhl/data/e0/c1/e0c131ebf6ad2dce71ab9a10aa116dcedb219ae4539f9e5bf0e57b84f51f22ca    OK    CONTENT_PRESENT_VALID_HASH    49458087    hash://sha256/e0c131ebf6ad2dce71ab9a10aa116dcedb219ae4539f9e5bf0e57b84f51f22ca
        hash://sha256/1a57e55a780b86cff38697cf1b857751ab7b389973d35113564fe5a9a58d6a99    file:/home/preston/preston-bhl/data/1a/57/1a57e55a780b86cff38697cf1b857751ab7b389973d35113564fe5a9a58d6a99    OK    CONTENT_PRESENT_VALID_HASH    25745    hash://sha256/1a57e55a780b86cff38697cf1b857751ab7b389973d35113564fe5a9a58d6a99
        hash://sha256/85efeb84c1b9f5f45c7a106dd1b5de43a31b3248a211675441ff584a7154b61c    file:/home/preston/preston-bhl/data/85/ef/85efeb84c1b9f5f45c7a106dd1b5de43a31b3248a211675441ff584a7154b61c    OK    CONTENT_PRESENT_VALID_HASH    519892    hash://sha256/85efeb84c1b9f5f45c7a106dd1b5de43a31b3248a211675441ff584a7154b61c
        hash://sha256/251e5032afce4f1e44bfdc5a8f0316ca1b317e8af41bdbf88163ab5bd2b52743    file:/home/preston/preston-bhl/data/25/1e/251e5032afce4f1e44bfdc5a8f0316ca1b317e8af41bdbf88163ab5bd2b52743    OK    CONTENT_PRESENT_VALID_HASH    787414    hash://sha256/251e5032afce4f1e44bfdc5a8f0316ca1b317e8af41bdbf88163ab5bd2b52743

    Note that a copy of the java program "preston", preston.jar, is included in this publication. The program runs on a java 8+ virtual machine using "java -jar preston.jar", or in short "preston".

    Files in this data publication:

    --- start of file descriptions ---

    -- description of archive and its contents (this file) --
    README

    -- executable java jar containing preston [2] v0.1.15. --
    preston.jar

    -- preston archives containing BHL data files, associated provenance logs and a provenance index --
    preston-[00-ff].tar.gz

    -- individual provenance index files --
    2a5de79372318317a382ea9a2cef069780b852b01210ef59e06b640a3539cb5a
    2b1104cb7749e818c9afca78391b2d0099bbb0a32f2b348860a335cd2f8f6800
    4081bc59dff58d63f6a86c623cb770f01e9a355a42495b205bcb538cd526190f
    47a2816f8b5600b24487093adcddfea12434cc4f270f3ab09d9215fbdd546cd2
    6f99a1388823fca745c9e22ac21e2da909a219aa1ace55170fa9248c0276903c
    7ae46d7cd9b5a0f5889ba38bac53c82e591b0bdf8b605f5e48c0dce8fb7b717f
    82903464889fea7c53f53daedf4e41fa31092f82619edeb3415eb2b473f74af3
    9e8c86243df39dd4fe82a3f814710eccf73aa9291d050415408e346fa2b09e70
    a8308fbf4530e287927c471d881ce0fc852f16543d46e1ee26f1caba48815f3a
    bcec6df2ea7f74e9a6e2830d0072e6b2fbe65323d9ddb022dd6e1349c23996e2
    cfe47c25ec0210ac73c06b407beb20d9c58355cb15bae427fdc7541870ca2e4e
    f73fc9e70bce8f21f0c96b8ef0903749d8f223f71343ab5a8910968f99c9b8b6

    --- end of file descriptions ---

    References

    [1] Biodiversity Heritage Library (BHL, https://biodiversitylibrary.org) accessed from 2019-05-19 to 2020-05-09 with provenance hash://sha256/34ccd7cf7f4a1ea35ac6ae26a458bb603b2f6ee8ad36e1a58aa0261105d630b1.
    [2] https://preston.guoda.bio, https://doi.org/10.5281/zenodo.1410543 .

    This work is funded in part by grant NSF OAC 1839201 from the National Science Foundation.
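
The "preston verify" output quoted in the two Preston records above suggests a simple layout: each content hash hash://sha256/<hex> is stored at data/<first two hex characters>/<next two>/<hex>, and a healthy line carries the token CONTENT_PRESENT_VALID_HASH. The sketch below is not part of Preston; it is a hedged Python illustration of that layout and of a check over saved verify output. The function names content_path, failed_lines and sha256_matches are hypothetical, and the assumption that the hash is the SHA-256 digest of the file's bytes is inferred from the "hash://sha256/" prefix rather than stated in the records.

```python
import hashlib
from pathlib import Path
from typing import List

PREFIX = "hash://sha256/"
VALID_TOKEN = "CONTENT_PRESENT_VALID_HASH"


def content_path(hash_uri: str, data_dir: str = "data") -> Path:
    """Map hash://sha256/<hex> to <data_dir>/<hex[0:2]>/<hex[2:4]>/<hex>.

    The layout is read off the file: paths in the quoted verify output.
    """
    if not hash_uri.startswith(PREFIX):
        raise ValueError("expected a hash://sha256/ URI, got %r" % hash_uri)
    hex_digest = hash_uri[len(PREFIX):]
    return Path(data_dir) / hex_digest[:2] / hex_digest[2:4] / hex_digest


def failed_lines(verify_output: str) -> List[str]:
    """Return non-empty lines that lack the CONTENT_PRESENT_VALID_HASH token.

    Any echoed command line (for example "$ java -jar preston.jar verify")
    is also returned, so pass only the program's output.
    """
    return [
        line
        for line in verify_output.splitlines()
        if line.strip() and VALID_TOKEN not in line.split()
    ]


def sha256_matches(path: str, hash_uri: str) -> bool:
    """Recompute a file's SHA-256 and compare it with the expected hash URI.

    Assumes the URI encodes the SHA-256 digest of the raw file content.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large data files are not loaded whole.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return PREFIX + digest.hexdigest() == hash_uri
```

An empty list from failed_lines means every reported line contained the expected token; sha256_matches offers an independent spot check on individual files when the preston tool is not at hand.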