
Title: Modern simulation utilities for genetic analysis
Abstract Background

Statistical geneticists employ simulation to estimate the power of proposed studies, test new analysis tools, and evaluate properties of causal models. Although there are existing trait simulators, there is ample room for modernization. For example, most phenotype simulators are limited to Gaussian traits or traits transformable to normality, while ignoring qualitative traits and realistic, non-normal trait distributions. Also, modern computer languages, such as Julia, that accommodate parallelization and cloud-based computing are now mainstream but rarely used in older applications. To meet the challenges of contemporary big studies, it is important for geneticists to adopt new computational tools.

Results

We present an open-source Julia package that makes it trivial to quickly simulate phenotypes under a variety of genetic architectures. This package is integrated into our OpenMendel suite for easy downstream analyses. Julia was purpose-built for scientific programming and provides tremendous speed and memory efficiency, easy access to multi-CPU and GPU hardware, and support for distributed and cloud-based parallelization. The package is designed to encourage flexible trait simulation, including via the standard devices of applied statistics, generalized linear models (GLMs) and generalized linear mixed models (GLMMs). It also accommodates many study designs: unrelateds, sibships, pedigrees, or a mixture of all three. (Of course, for data with pedigrees or cryptic relationships, the simulation process must include the genetic dependencies among the individuals.) We consider an assortment of trait models and study designs to illustrate integrated simulation and analysis pipelines. Step-by-step instructions for these analyses are available in our electronic Jupyter notebooks on GitHub. These interactive notebooks are ideal for reproducible research.
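To make the idea of GLMM-based trait simulation in related individuals concrete, here is a minimal sketch in Python using only the standard library. It is not the package's API: every function name and parameter value is hypothetical, and the shared family intercept is a deliberate simplification of the kinship-based dependence a full pedigree simulation would need. Siblings receive correlated genotypes through Mendelian transmission from simulated parents, and a binary trait is drawn through a logistic link.

```python
import math
import random


def draw_parent(maf, rng):
    """Return two alleles (1 = minor allele), assuming Hardy-Weinberg equilibrium."""
    return (int(rng.random() < maf), int(rng.random() < maf))


def draw_child_genotype(mother, father, rng):
    """Mendelian transmission: one randomly chosen allele from each parent."""
    return rng.choice(mother) + rng.choice(father)


def simulate_sibships(n_families, sibs_per_family, maf, beta, intercept, family_sd, rng):
    """Simulate (family_id, genotype, binary_trait) records for sibships.

    Hypothetical illustration: a logistic mixed model with one SNP effect
    (beta) and a Gaussian random intercept shared by all siblings in a family.
    """
    records = []
    for fam in range(n_families):
        mother = draw_parent(maf, rng)
        father = draw_parent(maf, rng)
        family_effect = rng.gauss(0.0, family_sd)  # shared within the sibship
        for _ in range(sibs_per_family):
            g = draw_child_genotype(mother, father, rng)
            eta = intercept + beta * g + family_effect  # linear predictor
            p = 1.0 / (1.0 + math.exp(-eta))            # inverse logit link
            y = 1 if rng.random() < p else 0            # Bernoulli draw
            records.append((fam, g, y))
    return records


rng = random.Random(2021)  # fixed seed so the simulation is reproducible
records = simulate_sibships(
    n_families=200, sibs_per_family=3, maf=0.3,
    beta=0.5, intercept=-1.0, family_sd=1.0, rng=rng,
)
```

Setting `family_sd` to zero and `sibs_per_family` to one reduces this sketch to an ordinary logistic GLM for unrelated individuals.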

Conclusion

The package has three main advantages. (1) It leverages the computational efficiency and ease of use of Julia to provide extremely fast, straightforward simulation of even the most complex genetic models, including GLMs and GLMMs. (2) It can be operated entirely within, but is not limited to, the integrated analysis pipeline of OpenMendel. And finally (3), by allowing a wider range of more realistic phenotype models, the package brings power calculations and diagnostic tools closer to what investigators might see in real-world analyses.
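The link between trait simulation and power calculation can be illustrated with a short Python sketch. It is an assumption-laden toy example, not the authors' pipeline. It simulates unrelated individuals with one SNP and a binary trait under a logistic model, then tests association with a trend statistic (sample size times the squared genotype-trait correlation, compared against the 5% critical value of a chi-square with one degree of freedom). Power is estimated as the fraction of replicates that reject. All names and parameter values are hypothetical.

```python
import math
import random

CHI2_1DF_CRIT_05 = 3.841  # 95th percentile of chi-square with 1 degree of freedom


def simulate_study(n, maf, beta, intercept, rng):
    """Simulate genotypes (0/1/2) and a binary trait for n unrelated people."""
    genos, traits = [], []
    for _ in range(n):
        g = (rng.random() < maf) + (rng.random() < maf)  # Hardy-Weinberg draw
        p = 1.0 / (1.0 + math.exp(-(intercept + beta * g)))
        genos.append(g)
        traits.append(1 if rng.random() < p else 0)
    return genos, traits


def trend_statistic(genos, traits):
    """Return n * r^2, where r is the genotype-trait Pearson correlation."""
    n = len(genos)
    mean_g = sum(genos) / n
    mean_y = sum(traits) / n
    sgg = sum((g - mean_g) ** 2 for g in genos)
    syy = sum((y - mean_y) ** 2 for y in traits)
    if sgg == 0 or syy == 0:
        return 0.0  # no variation, so no evidence of association
    sgy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genos, traits))
    return n * sgy ** 2 / (sgg * syy)


def estimate_power(n, maf, beta, intercept, n_reps, seed):
    """Fraction of simulated studies whose trend statistic exceeds the cutoff."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_reps):
        genos, traits = simulate_study(n, maf, beta, intercept, rng)
        if trend_statistic(genos, traits) > CHI2_1DF_CRIT_05:
            rejections += 1
    return rejections / n_reps


power_alt = estimate_power(n=400, maf=0.3, beta=0.6, intercept=-1.0, n_reps=200, seed=1)
power_null = estimate_power(n=400, maf=0.3, beta=0.0, intercept=-1.0, n_reps=200, seed=2)
```

Running the same routine with `beta=0.0` gives an empirical check of the type I error rate, which is a quick sanity test before trusting the power estimate.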

Award ID(s):
2054253
NSF-PAR ID:
10225766
Journal Name:
BMC Bioinformatics
Volume:
22
Issue:
1
ISSN:
1471-2105
Publisher:
Springer Science + Business Media
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Motivation

    This article introduces Vivarium—software born of the idea that it should be as easy as possible for computational biologists to define any imaginable mechanistic model, combine it with existing models and execute them together as an integrated multiscale model. Integrative multiscale modeling confronts the complexity of biology by combining heterogeneous datasets and diverse modeling strategies into unified representations. These integrated models are then run to simulate how the hypothesized mechanisms operate as a whole. But building such models has been a labor-intensive process that requires many contributors, and they are still primarily developed on a case-by-case basis with each project starting anew. New software tools that streamline the integrative modeling effort and facilitate collaboration are therefore essential for future computational biologists.

    Results

    Vivarium is a software tool for building integrative multiscale models. It provides an interface that makes individual models into modules that can be wired together in large composite models, parallelized across multiple CPUs and run with Vivarium’s discrete-event simulation engine. Vivarium’s utility is demonstrated by building composite models that combine several modeling frameworks: agent-based models, ordinary differential equations, stochastic reaction systems, constraint-based models, solid-body physics and spatial diffusion. This demonstrates just the beginning of what is possible: Vivarium will be able to support future efforts that integrate many more types of models and at many more biological scales.

    Availability and implementation

    The specific models, simulation pipelines and notebooks developed for this article are all available at the vivarium-notebooks repository: https://github.com/vivarium-collective/vivarium-notebooks. Vivarium-core is available at https://github.com/vivarium-collective/vivarium-core, and has been released on the Python Package Index. The Vivarium Collective (https://vivarium-collective.github.io) is a repository of freely available Vivarium processes and composites, including the processes used in Section 3. The Supplementary Materials provide an extensive methodology section, with several code listings that demonstrate the basic interfaces.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

  2. Abstract Key message

    A population of lettuce that segregated for photoperiod sensitivity was planted under long-day and short-day conditions. Genetic mapping revealed two distinct sets of QTLs controlling daylength-independent and photoperiod-sensitive flowering time.

    Abstract

    The molecular mechanism of flowering time regulation in lettuce is of interest to both geneticists and breeders because of the extensive impact of this trait on agricultural production. Lettuce is a facultative long-day plant whose flowering time changes in response to photoperiod. Variations exist in both flowering time and the degree of photoperiod sensitivity among accessions of wild (Lactuca serriola) and cultivated (L. sativa) lettuce. An F6 population of 236 recombinant inbred lines (RILs) was previously developed from a cross between a late-flowering, photoperiod-sensitive L. serriola accession and an early-flowering, photoperiod-insensitive L. sativa accession. This population was planted under long-day (LD) and short-day (SD) conditions in a total of four field and screenhouse trials; the developmental phenotype was scored weekly in each trial. Using genotyping-by-sequencing (GBS) data of the RILs, quantitative trait loci (QTL) mapping revealed five flowering time QTLs that together explained more than 20% of the variation in flowering time under LD conditions. Using two independent statistical models to extract the photoperiod sensitivity phenotype from the LD and SD flowering time data, we identified an additional five QTLs that together explained more than 30% of the variation in photoperiod sensitivity in the population. Orthology and sequence analysis of genes within the nine QTLs revealed potential functional equivalents in the lettuce genome to the key regulators of flowering time and photoperiodism, FD and CONSTANS, respectively, in Arabidopsis.

  3. Abstract Motivation

    This article presents libRoadRunner 2.0, an extensible, high-performance, cross-platform, open-source software library for the simulation and analysis of models expressed using the systems biology markup language (SBML).

    Results

    libRoadRunner is a self-contained library, able to run either as a component inside other tools via its C++, C and Python APIs, or interactively through its Python or Julia interface. libRoadRunner uses a custom just-in-time (JIT) compiler built on the widely used LLVM JIT compiler framework. It compiles SBML-specified models directly into native machine code for a large variety of processors, making it fast enough to simulate extremely large models or repeated runs in reasonable timeframes. libRoadRunner is flexible, supporting the bulk of the SBML specification (except for delay and non-linear algebraic equations) as well as several SBML extensions such as hierarchical composition and probability distributions. It offers multiple deterministic and stochastic integrators, as well as tools for steady-state, sensitivity, stability and structural analyses.

    Availability and implementation

    libRoadRunner binary distributions for Windows, Mac OS and Linux, Julia and Python bindings, source code and documentation are all available at https://github.com/sys-bio/roadrunner, and Python bindings are also available via pip. The source code can be compiled for the supported systems as well as, in principle, any system supported by LLVM-13, such as ARM-based computers like the Raspberry Pi. The library is licensed under the Apache License Version 2.0.

  4. Abstract Background

    Genetic association studies that seek to explain the inheritance of complex traits typically fail to explain a majority of the heritability of the trait under study. Thus, we are left with a gap in the map from genotype to phenotype. Several approaches have been used to fill this gap, including those that attempt to map endophenotypes, such as the transcriptome, proteome or metabolome, that underlie complex traits. Here we used metabolomics to explore the nature of genetic variation for hydrogen peroxide (H2O2) resistance in the sequenced inbred Drosophila Genetic Reference Panel (DGRP).

    Results

    We first studied genetic variation for H2O2 resistance in 179 DGRP lines and, along with identifying the insulin signaling modulator u-shaped and several regulators of feeding behavior, we estimate that a substantial amount of phenotypic variation can be explained by a polygenic model of genetic variation. We then profiled a portion of the aqueous metabolome in subsets of eight ‘high resistance’ lines and eight ‘low resistance’ lines. We used these lines to represent collections of genotypes that were either resistant or sensitive to the stressor, effectively modeling a discrete trait. Across the range of genotypes in both populations, flies exhibited surprising consistency in their metabolomic signature of resistance. Importantly, the resistance phenotype of these flies was more easily distinguished by their metabolome profiles than by their genotypes. Furthermore, we found a metabolic response to H2O2 in sensitive, but not in resistant genotypes. Metabolomic data further implicated at least two pathways, glycogen and folate metabolism, as determinants of sensitivity to H2O2. We also discovered a confounding effect of feeding behavior on assays involving supplemented food.

    Conclusions

    This work suggests that the metabolome can be a point of convergence for genetic variation influencing complex traits, and can efficiently elucidate mechanisms underlying trait variation.

  5. Abstract Motivation

    CpG sites within the same genomic region often share similar methylation patterns and tend to be co-regulated by multiple genetic variants that may interact with one another.

    Results

    We propose a multi-trait methylation random field (multi-MRF) method to evaluate the joint association between a set of CpG sites and a set of genetic variants. The proposed method has several advantages. First, it is a multi-trait method that allows flexible correlation structures between neighboring CpG sites (e.g. distance-based correlation). Second, it is also a multi-locus method that integrates the effect of multiple common and rare genetic variants. Third, it models the methylation traits with a beta distribution to characterize their bimodal and interval properties. Through simulations, we demonstrated that the proposed method had improved power over some existing methods under various disease scenarios. We further illustrated the proposed method via an application to a study of congenital heart defects (CHDs) with 83 cardiac tissue samples. Our results suggested that the gene BACE2, a methylation quantitative trait locus (QTL) candidate, colocalized with expression QTLs in artery tibial and harbored genetic variants with nominally significant associations in two genome-wide association studies of CHD.

    Availability and implementation

    https://github.com/chenlyu2656/Multi-MRF.

    Supplementary information

    Supplementary data are available at Bioinformatics online.