ABSTRACT Motivation: Here, we make available a second version of the BioTIME database, which compiles records of abundance estimates for species in sample events of ecological assemblages through time. The updated version expands version 1.0 of the database by doubling the number of studies and includes substantial additional curation to the taxonomic accuracy of the records, as well as the metadata. Moreover, we now provide an R package (BioTIMEr) to facilitate use of the database. Main Types of Variables Included: The database is composed of one main data table containing the abundance records and 11 metadata tables. The data are organised in a hierarchy of scales where 11,989,233 records are nested in 1,603,067 sample events, from 553,253 sampling locations, which are nested in 708 studies. A study is defined as a sampling methodology applied to an assemblage for a minimum of 2 years. Spatial Location and Grain: Sampling locations in BioTIME are distributed across the planet, including marine, terrestrial and freshwater realms. Spatial grain size and extent vary across studies depending on sampling methodology. We recommend gridding of sampling locations into areas of consistent size. Time Period and Grain: The earliest time series in BioTIME start in 1874, and the most recent records are from 2023. Temporal grain and duration vary across studies. We recommend doing sample-level rarefaction to ensure consistent sampling effort through time before calculating any diversity metric. Major Taxa and Level of Measurement: The database includes any eukaryotic taxa, with a combined total of 56,400 taxa. Software Format: .csv and .SQL.
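The sample-level rarefaction recommended above can be illustrated with a minimal Python sketch. It assumes records arrive as `(year, sample_id, species, abundance)` tuples; that layout, the function name and the choice to rarefy to the smallest yearly sample count are illustrative assumptions, not the BioTIME schema or the BioTIMEr implementation.

```python
import random
from collections import defaultdict


def sample_based_rarefaction(records, seed=0):
    """Equalise sampling effort across years by subsampling sample events.

    records: iterable of (year, sample_id, species, abundance) tuples
             (hypothetical layout, not the BioTIME table schema).
    Returns {year: {species: summed abundance}} where every year is built
    from the same number of randomly chosen samples.
    """
    rng = random.Random(seed)

    # Group the records of each sample event under its year.
    samples_by_year = defaultdict(lambda: defaultdict(list))
    for year, sample_id, species, abundance in records:
        samples_by_year[year][sample_id].append((species, abundance))
    if not samples_by_year:
        return {}

    # The year with the fewest samples sets the common effort.
    min_samples = min(len(samples) for samples in samples_by_year.values())

    rarefied = {}
    for year, samples in samples_by_year.items():
        # Draw min_samples sample events without replacement.
        chosen = rng.sample(sorted(samples), min_samples)
        totals = defaultdict(float)
        for sample_id in chosen:
            for species, abundance in samples[sample_id]:
                totals[species] += abundance
        rarefied[year] = dict(totals)
    return rarefied
```

Diversity metrics would then be computed on the returned per-year totals. In practice the draw is usually repeated many times and the metrics averaged, which this sketch leaves out.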
Updated site concordance factors minimize effects of homoplasy and taxon sampling
Abstract Motivation: Site concordance factors (sCFs) have become a widely used way to summarize discordance in phylogenomic datasets. However, the original version of sCFs was calculated by sampling a quartet of tip taxa and then applying parsimony-based criteria for discordance. This approach has the potential to be strongly affected by multiple hits at a site (homoplasy), especially when substitution rates are high or taxa are not closely related. Results: Here, we introduce a new method for calculating sCFs. The updated version uses likelihood to generate probability distributions of ancestral states at internal nodes of the phylogeny. By sampling from the states at internal nodes adjacent to a given branch, this approach substantially reduces, but does not abolish, the effects of homoplasy and taxon sampling. Availability and implementation: Updated sCFs are implemented in IQ-TREE 2.2.2. The software is freely available at https://github.com/iqtree/iqtree2/releases. Supplementary information: Supplementary information is available at Bioinformatics online.
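To make the original, parsimony-based calculation concrete, the following Python sketch samples quartets of tip taxa around one branch and counts the decisive sites that agree with it. It is a simplified illustration under stated assumptions, not the IQ-TREE code: the function names, the dictionary-based alignment and the rule that a site is decisive only when exactly two nucleotide states each occur twice in the quartet are all choices made for this example. The updated likelihood-based method described in the abstract is not shown.

```python
import random


def quartet_site_support(a, b, c, d):
    """Classify one alignment column for the quartet (a, b, c, d).

    Returns 0 if the column supports ab|cd (the focal branch),
    1 for ac|bd, 2 for ad|bc, and None if the column is not decisive.
    """
    if any(state not in "ACGT" for state in (a, b, c, d)):
        return None  # gaps and ambiguity codes are ignored
    if a == b and c == d and a != c:
        return 0
    if a == c and b == d and a != b:
        return 1
    if a == d and b == c and a != b:
        return 2
    return None


def parsimony_scf(alignment, clades, n_quartets=100, seed=0):
    """Percentage of decisive sites concordant with one branch.

    alignment: {taxon: sequence}, all sequences of equal length.
    clades: (A, B, C, D) lists of taxa; the branch of interest
            separates A and B from C and D (assumed layout).
    """
    rng = random.Random(seed)
    n_sites = len(next(iter(alignment.values())))
    per_quartet = []
    for _ in range(n_quartets):
        # Sample one tip taxon from each of the four clades.
        seqs = [alignment[rng.choice(group)] for group in clades]
        counts = [0, 0, 0]
        for i in range(n_sites):
            pattern = quartet_site_support(*(s[i] for s in seqs))
            if pattern is not None:
                counts[pattern] += 1
        decisive = sum(counts)
        if decisive:
            per_quartet.append(counts[0] / decisive)
    if not per_quartet:
        return float("nan")
    return 100.0 * sum(per_quartet) / len(per_quartet)
```

Because each column is scored only from the four sampled tip states, a site with multiple substitutions along long branches can mimic support for the wrong resolution, which is the homoplasy problem the abstract raises.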
- Award ID(s): 1936187
- PAR ID: 10388943
- Publisher / Repository: Oxford University Press
- Date Published:
- Journal Name: Bioinformatics
- Volume: 39
- Issue: 1
- ISSN: 1367-4811
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Abstract Motivation: Species tree inference from multi-copy gene trees has long been a challenge in phylogenomics. The recent method ASTRAL-Pro has made strides by enabling multi-copy gene family trees as input and has been quickly adopted. Yet, its scalability, especially memory usage, needs to improve to accommodate the ever-growing dataset size. Results: We present ASTRAL-Pro 2, an ultrafast and memory efficient version of ASTRAL-Pro that adopts a placement-based optimization algorithm for significantly better scalability without sacrificing accuracy. Availability and implementation: The source code and binary files are publicly available at https://github.com/chaoszhang/ASTER; data are available at https://github.com/chaoszhang/A-Pro2_data. Supplementary information: Supplementary data are available at Bioinformatics online.
- Abstract Motivation: Reticulate evolutionary histories, such as those arising in the presence of hybridization, are best modeled as phylogenetic networks. Recently developed methods allow for statistical inference of phylogenetic networks while also accounting for other processes, such as incomplete lineage sorting. However, these methods can only handle a small number of loci from a handful of genomes. Results: In this article, we introduce a novel two-step method for scalable inference of phylogenetic networks from the sequence alignments of multiple, unlinked loci. The method infers networks on subproblems and then merges them into a network on the full set of taxa. To reduce the number of trinets to infer, we formulate a Hitting Set version of the problem of finding a small number of subsets, and implement a simple heuristic to solve it. We studied their performance, in terms of both running time and accuracy, on simulated as well as on biological datasets. The two-step method accurately infers phylogenetic networks at a scale that is infeasible with existing methods. The results are a significant and promising step towards accurate, large-scale phylogenetic network inference. Availability and implementation: We implemented the algorithms in the publicly available software package PhyloNet (https://bioinfocs.rice.edu/PhyloNet). Supplementary information: Supplementary data are available at Bioinformatics online.
- Birol, Inanc (Ed.) Abstract Motivation: Single-cell RNA sequencing (scRNA-seq) is widely used for analyzing gene expression in multi-cellular systems and provides unprecedented access to cellular heterogeneity. scRNA-seq experiments aim to identify and quantify all cell types present in a sample. Measured single-cell transcriptomes are grouped by similarity and the resulting clusters are mapped to cell types based on cluster-specific gene expression patterns. While the process of generating clusters has become largely automated, annotation remains a laborious ad hoc effort that requires expert biological knowledge. Results: Here, we introduce CellMeSH, a new automated approach to identifying cell types for clusters based on prior literature. CellMeSH combines a database of gene–cell-type associations with a probabilistic method for database querying. The database is constructed by automatically linking gene and cell-type information from millions of publications using existing indexed literature resources. Compared to manually constructed databases, CellMeSH is more comprehensive and is easily updated with new data. The probabilistic query method enables reliable information retrieval even though the gene–cell-type associations extracted from the literature are noisy. CellMeSH is also able to optionally utilize prior knowledge about tissues or cells for further annotation improvement. CellMeSH achieves top-one and top-three accuracies on a number of mouse and human datasets that are consistently better than existing approaches. Availability and implementation: Web server at https://uncurl.cs.washington.edu/db_query and API at https://github.com/shunfumao/cellmesh. Supplementary information: Supplementary data are available at Bioinformatics online.
- Hancock, John (Ed.) Abstract Summary: Chromosomal copy number variation (CNV) refers to a polymorphism in which a DNA segment is deleted or duplicated in the population. The computational algorithms developed to identify this type of variation are usually of high computational complexity. Here we present a user-friendly R package, modSaRa, designed to identify copy number variants. The package is based on a change-point method with optimal computational complexity and desirable accuracy. The current version of the modSaRa package is a comprehensive tool that integrates preprocessing steps and the main CNV calling steps. Availability and Implementation: modSaRa is an R package written in R, C++ and Rcpp and is now freely available for download at http://c2s2.yale.edu/software/modSaRa. Supplementary information: Supplementary data are available at Bioinformatics online.
