skip to main content


Title: PyDREAM: high-dimensional parameter inference for biological models in python
Abstract Summary

Biological models contain many parameters whose values are difficult to measure directly via experimentation and therefore require calibration against experimental data. Markov chain Monte Carlo (MCMC) methods are suitable to estimate multivariate posterior model parameter distributions, but these methods may exhibit slow or premature convergence in high-dimensional search spaces. Here, we present PyDREAM, a Python implementation of the (Multiple-Try) Differential Evolution Adaptive Metropolis [DREAM(ZS)] algorithm developed by Vrugt and ter Braak (2008) and Laloy and Vrugt (2012). PyDREAM achieves excellent performance for complex, parameter-rich models and takes full advantage of distributed computing resources, facilitating parameter inference and uncertainty estimation of CPU-intensive biological models.

Availability and implementation

PyDREAM is freely available under the GNU GPLv3 license from the Lopez lab GitHub repository at http://github.com/LoLab-VU/PyDREAM.

Supplementary information

Supplementary data are available at Bioinformatics online.

 
more » « less
PAR ID:
10393385
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Bioinformatics
Volume:
34
Issue:
4
ISSN:
1367-4803
Page Range / eLocation ID:
p. 695-697
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Summary: While alignment has been the dominant approach for determining homology prior to phylogenetic inference, alignment-free methods can simplify the analysis, especially when analyzing genome-wide data. Furthermore, alignment-free methods present the only option for emerging forms of data, such as genome skims, which do not permit assembly. Despite the appeal, alignment-free methods have not been competitive with alignment-based methods in terms of accuracy. One limitation of alignment-free methods is their reliance on simplified models of sequence evolution such as Jukes–Cantor. If we can estimate frequencies of base substitutions in an alignment-free setting, we can compute pairwise distances under more complex models. However, since the strand of DNA sequences is unknown for many forms of genome-wide data, which arguably present the best use case for alignment-free methods, the most complex models that one can use are the so-called no strand-bias models. We show how to calculate distances under a four-parameter no strand-bias model called TK4 without relying on alignments or assemblies. The main idea is to replace letters in the input sequences and recompute Jaccard indices between k-mer sets. However, on larger genomes, we also need to compute the number of k-mer mismatches after replacement due to random chance as opposed to homology. We show in simulation that alignment-free distances can be highly accurate when genomes evolve under the assumed models and study the accuracy on assembled and unassembled biological data.

    Availability and implementation

    Our software is available open source at https://github.com/nishatbristy007/NSB.

    Supplementary information

    Supplementary data are available at Bioinformatics Advances online.

     
    more » « less
  2. Abstract Motivation

    Modeling single-cell gene expression trends along cell pseudotime is a crucial analysis for exploring biological processes. Most existing methods rely on nonparametric regression models for their flexibility; however, nonparametric models often provide trends too complex to interpret. Other existing methods use interpretable but restrictive models. Since model interpretability and flexibility are both indispensable for understanding biological processes, the single-cell field needs a model that improves the interpretability and largely maintains the flexibility of nonparametric regression models.

    Results

    Here, we propose the single-cell generalized trend model (scGTM) for capturing a gene’s expression trend, which may be monotone, hill-shaped or valley-shaped, along cell pseudotime. The scGTM has three advantages: (i) it can capture non-monotonic trends that are easy to interpret, (ii) its parameters are biologically interpretable and trend informative, and (iii) it can flexibly accommodate common distributions for modeling gene expression counts. To tackle the complex optimization problems, we use the particle swarm optimization algorithm to find the constrained maximum likelihood estimates for the scGTM parameters. As an application, we analyze several single-cell gene expression datasets using the scGTM and show that scGTM can capture interpretable gene expression trends along cell pseudotime and reveal molecular insights underlying biological processes.

    Availability and implementation

    The Python package scGTM is open-access and available at https://github.com/ElvisCuiHan/scGTM.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  3. Abstract Summary

    FunImageJ is a Lisp framework for scientific image processing built upon the ImageJ software ecosystem. The framework provides a natural functional-style for programming, while accounting for the performance requirements necessary in big data processing commonly encountered in biological image analysis.

    Availability and implementation

    Freely available plugin to Fiji (http://fiji.sc/#download). Installation and use instructions available at http://imagej.net/FunImageJ.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  4. Abstract Motivation

    The relative rates of amino acid interchanges over evolutionary time are likely to vary among proteins. Variation in those rates has the potential to reveal information about constraints on proteins. However, the most straightforward model that could be used to estimate relative rates of amino acid substitution is parameter-rich and it is therefore impractical to use for this purpose.

    Results

    A six-parameter model of amino acid substitution that incorporates information about the physicochemical properties of amino acids was developed. It showed that amino acid side chain volume, polarity and aromaticity have major impacts on protein evolution. It also revealed variation among proteins in the relative importance of those properties. The same general approach can be used to improve the fit of empirical models such as the commonly used PAM and LG models.

    Availability and implementation

    Perl code and test data are available from https://github.com/ebraun68/sixparam.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  5. Abstract Motivation

    The estimation of phylogenetic trees is a major part of many biological dataset analyses, but maximum likelihood approaches are NP-hard and Bayesian MCMC methods do not scale well to even moderate-sized datasets. Supertree methods, which are used to construct trees from trees computed on subsets, are critically important tools for enabling the statistical estimation of phylogenies for large and potentially heterogeneous datasets. Supertree estimation is itself NP-hard, and no current supertree method has sufficient accuracy and scalability to provide good accuracy on the large datasets that supertree methods were designed for, containing thousands of species and many subset trees.

    Results

    We present FastRFS, a new method based on a dynamic programming method we have developed to find an exact solution to the Robinson-Foulds Supertree problem within a constrained search space. FastRFS has excellent accuracy in terms of criterion scores and topological accuracy of the resultant trees, substantially improving on competing methods on a large collection of biological and simulated data. In addition, FastRFS is extremely fast, finishing in minutes on even very large datasets, and in under an hour on a biological dataset with 2228 species.

    Availability and Implementation

    FastRFS is available on github at https://github.com/pranjalv123/FastRFS

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less