Title: Modern simulation utilities for genetic analysis
Abstract Background

Statistical geneticists employ simulation to estimate the power of proposed studies, test new analysis tools, and evaluate properties of causal models. Although there are existing trait simulators, there is ample room for modernization. For example, most phenotype simulators are limited to Gaussian traits or traits transformable to normality, while ignoring qualitative traits and realistic, non-normal trait distributions. Also, modern computer languages, such as Julia, that accommodate parallelization and cloud-based computing are now mainstream but rarely used in older applications. To meet the challenges of contemporary big studies, it is important for geneticists to adopt new computational tools.

Results

We present an open-source Julia package that makes it trivial to quickly simulate phenotypes under a variety of genetic architectures. This package is integrated into our OpenMendel suite for easy downstream analyses. Julia was purpose-built for scientific programming and provides tremendous speed and memory efficiency, easy access to multi-CPU and GPU hardware, and support for distributed and cloud-based parallelization. The package is designed to encourage flexible trait simulation, including via the standard devices of applied statistics: generalized linear models (GLMs) and generalized linear mixed models (GLMMs). It also accommodates many study designs: unrelateds, sibships, pedigrees, or a mixture of all three. (Of course, for data with pedigrees or cryptic relationships, the simulation process must include the genetic dependencies among the individuals.) We consider an assortment of trait models and study designs to illustrate integrated simulation and analysis pipelines. Step-by-step instructions for these analyses are available in our electronic Jupyter notebooks on GitHub. These interactive notebooks are ideal for reproducible research.
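To make the idea of GLM-based trait simulation concrete, here is a minimal sketch in Python. It is not the package's API and not the authors' method. It assumes unrelated individuals, one biallelic variant in Hardy-Weinberg equilibrium, and a logistic link for a binary (qualitative) trait. The function names, parameter values, and seed are all hypothetical and chosen only for illustration.

```python
import math
import random


def simulate_genotypes(n, maf, rng):
    """Draw n additive genotypes (0, 1, or 2 minor alleles).

    Assumes Hardy-Weinberg equilibrium: each of the two alleles is the
    minor allele independently with probability `maf`.
    """
    return [int(rng.random() < maf) + int(rng.random() < maf) for _ in range(n)]


def simulate_binary_trait(genotypes, intercept, effect, rng):
    """Simulate a Bernoulli trait from a logistic GLM.

    The linear predictor is eta_i = intercept + effect * g_i, and the
    success probability is the inverse logit of eta_i.
    """
    traits = []
    for g in genotypes:
        eta = intercept + effect * g
        p = 1.0 / (1.0 + math.exp(-eta))
        traits.append(1 if rng.random() < p else 0)
    return traits


def prevalence_by_genotype(genotypes, traits):
    """Return the observed fraction of affected individuals per genotype."""
    result = {}
    for g in sorted(set(genotypes)):
        group = [y for x, y in zip(genotypes, traits) if x == g]
        result[g] = sum(group) / len(group)
    return result


# Hypothetical settings: 5000 unrelated individuals, MAF 0.3,
# baseline log-odds -1.0, per-allele log-odds ratio 0.5.
rng = random.Random(2021)
genotypes = simulate_genotypes(5000, maf=0.3, rng=rng)
traits = simulate_binary_trait(genotypes, intercept=-1.0, effect=0.5, rng=rng)
prevalence = prevalence_by_genotype(genotypes, traits)

for g, frac in prevalence.items():
    print(f"genotype {g}: observed prevalence {frac:.3f}")
```

Swapping the inverse link and the sampling distribution would give other GLM traits, for example a log link with Poisson counts. Simulating pedigrees or cryptically related samples would additionally need correlated random effects, which this sketch deliberately leaves out.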

Conclusion

The package has three main advantages. (1) It leverages the computational efficiency and ease of use of Julia to provide extremely fast, straightforward simulation of even the most complex genetic models, including GLMs and GLMMs. (2) It can be operated entirely within, but is not limited to, the integrated analysis pipeline of OpenMendel. And finally (3), by allowing a wider range of more realistic phenotype models, the package brings power calculations and diagnostic tools closer to what investigators might see in real-world analyses.

 
Award ID(s):
2054253
NSF-PAR ID:
10225766
Publisher / Repository:
Springer Science + Business Media
Journal Name:
BMC Bioinformatics
Volume:
22
Issue:
1
ISSN:
1471-2105
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Motivation

    This article introduces Vivarium—software born of the idea that it should be as easy as possible for computational biologists to define any imaginable mechanistic model, combine it with existing models and execute them together as an integrated multiscale model. Integrative multiscale modeling confronts the complexity of biology by combining heterogeneous datasets and diverse modeling strategies into unified representations. These integrated models are then run to simulate how the hypothesized mechanisms operate as a whole. But building such models has been a labor-intensive process that requires many contributors, and they are still primarily developed on a case-by-case basis with each project starting anew. New software tools that streamline the integrative modeling effort and facilitate collaboration are therefore essential for future computational biologists.

    Results

    Vivarium is a software tool for building integrative multiscale models. It provides an interface that makes individual models into modules that can be wired together in large composite models, parallelized across multiple CPUs and run with Vivarium’s discrete-event simulation engine. Vivarium’s utility is demonstrated by building composite models that combine several modeling frameworks: agent-based models, ordinary differential equations, stochastic reaction systems, constraint-based models, solid-body physics and spatial diffusion. This demonstrates just the beginning of what is possible—Vivarium will be able to support future efforts that integrate many more types of models and at many more biological scales.

    Availability and implementation

    The specific models, simulation pipelines and notebooks developed for this article are all available at the vivarium-notebooks repository: https://github.com/vivarium-collective/vivarium-notebooks. Vivarium-core is available at https://github.com/vivarium-collective/vivarium-core, and has been released on the Python Package Index. The Vivarium Collective (https://vivarium-collective.github.io) is a repository of freely available Vivarium processes and composites, including the processes used in Section 3. The Supplementary Materials provide an extensive methodology section with several code listings that demonstrate the basic interfaces.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  2. Abstract

    We propose a Bayesian model selection approach for generalized linear mixed models (GLMMs). We consider covariance structures for the random effects that are widely used in areas such as longitudinal studies, genome-wide association studies, and spatial statistics. Since the random effects cannot be integrated out of GLMMs analytically, we approximate the integrated likelihood function using a pseudo-likelihood approach. Our Bayesian approach assumes a flat prior for the fixed effects and includes both approximate reference prior and half-Cauchy prior choices for the variances of random effects. Since the flat prior on the fixed effects is improper, we develop a fractional Bayes factor approach to obtain posterior probabilities of the several competing models. Simulation studies with Poisson GLMMs with spatial random effects and overdispersion random effects show that our approach performs favorably when compared to widely used competing Bayesian methods including deviance information criterion and Watanabe–Akaike information criterion. We illustrate the usefulness and flexibility of our approach with three case studies including a Poisson longitudinal model, a Poisson spatial model, and a logistic mixed model. Our proposed approach is implemented in the R package GLMMselect that is available on CRAN.

     
  3. Abstract Objectives

    The use of dental metrics in phylogenetic reconstructions of fossil primates assumes variation in tooth size is highly heritable. Quantitative genetic studies in humans and baboons have estimated high heritabilities for dental traits, providing a preliminary view of the variability of dental trait heritability in nonhuman primate species. To expand upon this view, the heritabilities and evolvabilities of linear dental dimensions are estimated in brown‐mantled tamarins (Saguinus fuscicollis) and rhesus macaques (Macaca mulatta).

    Materials and methods

    Quantitative genetic analyses were performed on linear dental dimensions collected from 302 brown‐mantled tamarins and 364 rhesus macaques. Heritabilities were estimated in SOLAR using pedigrees from each population, and evolvabilities were calculated manually.

    Results

    Tamarin heritability estimates range from 0.19 to 0.99, and 25 of 26 tamarin estimates are significantly different from zero. Macaque heritability estimates range from 0.08 to 1.00, and 25 out of 28 estimates are significantly different from zero.

    Discussion

    Dental dimensions are highly heritable in captive brown‐mantled tamarins and free‐ranging rhesus macaques. The range of heritability estimates in these populations is broadly similar to those of baboons and humans. Evolvability tends to increase with heritability, although evolvability is high relative to heritability in some dimensions. Estimating evolvability helps to contextualize differences in heritability, and the observed relationship between evolvability and heritability in dental dimensions requires further investigation.

     
  4. Abstract Key message

    A population of lettuce that segregated for photoperiod sensitivity was planted under long-day and short-day conditions. Genetic mapping revealed two distinct sets of QTLs controlling daylength-independent and photoperiod-sensitive flowering time.

    Abstract

    The molecular mechanism of flowering time regulation in lettuce is of interest to both geneticists and breeders because of the extensive impact of this trait on agricultural production. Lettuce is a facultative long-day plant whose flowering time changes in response to photoperiod. Variations exist in both flowering time and the degree of photoperiod sensitivity among accessions of wild (Lactuca serriola) and cultivated (L. sativa) lettuce. An F6 population of 236 recombinant inbred lines (RILs) was previously developed from a cross between a late-flowering, photoperiod-sensitive L. serriola accession and an early-flowering, photoperiod-insensitive L. sativa accession. This population was planted under long-day (LD) and short-day (SD) conditions in a total of four field and screenhouse trials; the developmental phenotype was scored weekly in each trial. Using genotyping-by-sequencing (GBS) data of the RILs, quantitative trait loci (QTL) mapping revealed five flowering time QTLs that together explained more than 20% of the variation in flowering time under LD conditions. Using two independent statistical models to extract the photoperiod sensitivity phenotype from the LD and SD flowering time data, we identified an additional five QTLs that together explained more than 30% of the variation in photoperiod sensitivity in the population. Orthology and sequence analysis of genes within the nine QTLs revealed potential functional equivalents in the lettuce genome to the key regulators of flowering time and photoperiodism, FD and CONSTANS, respectively, in Arabidopsis.

     
  5. Abstract Summary

    With the advancement of high-throughput single-cell RNA-sequencing protocols, there has been a rapid increase in the tools available to perform an array of analyses on the gene expression data that result from such studies. For example, there exist methods for pseudo-time series analysis, differential cell usage, cell-type detection, RNA velocity in single cells, etc. Most analysis pipelines validate their results using known marker genes (which are not widely available for all types of analysis) and by using simulated data from gene-count-level simulators. Typically, the impact of using different read-alignment or unique molecular identifier (UMI) deduplication methods has not been widely explored. Assessments based on simulation tend to start from a simulated count matrix, ignoring the effect that different approaches for resolving UMI counts from the raw read data may produce. Here, we present minnow, a comprehensive sequence-level droplet-based single-cell RNA-sequencing (dscRNA-seq) experiment simulation framework. Minnow accounts for important sequence-level characteristics of experimental scRNA-seq datasets and models effects such as polymerase chain reaction amplification, cellular barcode (CB) and UMI selection, and sequence fragmentation and sequencing. It also closely matches the gene-level ambiguity characteristics that are observed in real scRNA-seq experiments. Using minnow, we explore the performance of some common processing pipelines that produce gene-by-cell count matrices from droplet-based scRNA-seq data, demonstrate the effect that realistic levels of gene-level sequence ambiguity can have on accurate quantification, and show a typical use case of minnow in assessing the output generated by different quantification pipelines on the simulated experiment.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     