Abstract MotivationAccurate identification of genotypes is an essential part of the analysis of genomic data, including in identification of sequence polymorphisms, linking mutations with disease and determining mutation rates. Biological and technical processes that adversely affect genotyping include copy-number-variation, paralogous sequences, library preparation, sequencing error and reference-mapping biases, among others. ResultsWe modeled the read depth for all data as a mixture of Dirichlet-multinomial distributions, resulting in significant improvements over previously used models. In most cases the best model was comprised of two distributions. The major-component distribution is similar to a binomial distribution with low error and low reference bias. The minor-component distribution is overdispersed with higher error and reference bias. We also found that sites fitting the minor component are enriched for copy number variants and low complexity regions, which can produce erroneous genotype calls. By removing sites that do not fit the major component, we can improve the accuracy of genotype calls. Availability and ImplementationMethods and data files are available at https://github.com/CartwrightLab/WuEtAl2017/ (doi:10.5281/zenodo.256858). Supplementary informationSupplementary data is available at Bioinformatics online.
more »
« less
modSaRa: a computationally efficient R package for CNV identification
Abstract SummaryChromosomal copy number variation (CNV) refers to a polymorphism that a DNA segment presents deletion or duplication in the population. The computational algorithms developed to identify this type of variation are usually of high computational complexity. Here we present a user-friendly R package, modSaRa, designed to perform copy number variants identification. The package is developed based on a change-point based method with optimal computational complexity and desirable accuracy. The current version of modSaRa package is a comprehensive tool with integration of preprocessing steps and main CNV calling steps. Availability and ImplementationmodSaRa is an R package written in R, C ++ and Rcpp and is now freely available for download at http://c2s2.yale.edu/software/modSaRa. Supplementary informationSupplementary data are available at Bioinformatics online.
more »
« less
- Award ID(s):
- 1309507
- PAR ID:
- 10629177
- Editor(s):
- Hancock, John
- Publisher / Repository:
- Oxford University Press
- Date Published:
- Journal Name:
- Bioinformatics
- Volume:
- 33
- Issue:
- 15
- ISSN:
- 1367-4803
- Page Range / eLocation ID:
- 2384 to 2385
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Abstract MotivationThis article introduces Vivarium—software born of the idea that it should be as easy as possible for computational biologists to define any imaginable mechanistic model, combine it with existing models and execute them together as an integrated multiscale model. Integrative multiscale modeling confronts the complexity of biology by combining heterogeneous datasets and diverse modeling strategies into unified representations. These integrated models are then run to simulate how the hypothesized mechanisms operate as a whole. But building such models has been a labor-intensive process that requires many contributors, and they are still primarily developed on a case-by-case basis with each project starting anew. New software tools that streamline the integrative modeling effort and facilitate collaboration are therefore essential for future computational biologists. ResultsVivarium is a software tool for building integrative multiscale models. It provides an interface that makes individual models into modules that can be wired together in large composite models, parallelized across multiple CPUs and run with Vivarium’s discrete-event simulation engine. Vivarium’s utility is demonstrated by building composite models that combine several modeling frameworks: agent-based models, ordinary differential equations, stochastic reaction systems, constraint-based models, solid-body physics and spatial diffusion. This demonstrates just the beginning of what is possible—Vivarium will be able to support future efforts that integrate many more types of models and at many more biological scales. Availability and implementationThe specific models, simulation pipelines and notebooks developed for this article are all available at the vivarium-notebooks repository: https://github.com/vivarium-collective/vivarium-notebooks. Vivarium-core is available at https://github.com/vivarium-collective/vivarium-core, and has been released on Python Package Index. The Vivarium Collective (https://vivarium-collective.github.io) is a repository of freely available Vivarium processes and composites, including the processes used in Section 3. Supplementary Materials provide with an extensive methodology section, with several code listings that demonstrate the basic interfaces. Supplementary informationSupplementary data are available at Bioinformatics online.more » « less
-
Cheng, Jianlin (Ed.)Abstract MotivationA Multiple Sequence Alignment (MSA) contains fundamental evolutionary information that is useful in the prediction of structure and function of proteins and nucleic acids. The “Number of Effective Sequences” (NEFF) quantifies the diversity of sequences of an MSA. While several tools embed NEFF calculation with various options, none are standalone tools for this purpose, and they do not offer all the available options. ResultsWe developed NEFFy, the first software package to integrate all these options and calculate NEFF across diverse MSA formats for proteins, RNAs, and DNAs. It surpasses existing tools in functionality without compromising computational efficiency and scalability. NEFFy also offers per-residue NEFF calculation and supports NEFF computation for MSAs of multimeric proteins, with the capability to be extended to DNAs and RNAs. Availability and ImplementationNEFFy is released as open-source software under the GNU Public License v3.0. The source code in C ++ and a Python wrapper are available at https://github.com/Maryam-Haghani/NEFFy. To ensure users can fully leverage these capabilities, comprehensive documentation and examples are provided at https://Maryam-Haghani.github.io/NEFFy. Supplementary InformationSupplementary data are available at Bioinformatics online.more » « less
-
Abstract MotivationPolygenic risk score (PRS) has been widely exploited for genetic risk prediction due to its accuracy and conceptual simplicity. We introduce a unified Bayesian regression framework, NeuPred, for PRS construction, which accommodates varying genetic architectures and improves overall prediction accuracy for complex diseases by allowing for a wide class of prior choices. To take full advantage of the framework, we propose a summary-statistics-based cross-validation strategy to automatically select suitable chromosome-level priors, which demonstrates a striking variability of the prior preference of each chromosome, for the same complex disease, and further significantly improves the prediction accuracy. ResultsSimulation studies and real data applications with seven disease datasets from the Wellcome Trust Case Control Consortium cohort and eight groups of large-scale genome-wide association studies demonstrate that NeuPred achieves substantial and consistent improvements in terms of predictive r2 over existing methods. In addition, NeuPred has similar or advantageous computational efficiency compared with the state-of-the-art Bayesian methods. Availability and implementationThe R package implementing NeuPred is available at https://github.com/shuangsong0110/NeuPred. Supplementary informationSupplementary data are available at Bioinformatics online.more » « less
-
Abstract SummaryTranscription factors (TFs) are proteins that directly interpret the genome to regulate gene expression and determine cellular phenotypes. TF identification is a common first step in unraveling gene regulatory networks. We present CREPE, an R Shiny app to catalogue and annotate TFs. CREPE was benchmarked against curated human TF datasets. Next, we use CREPE to explore the TF repertoires of Heliconius erato and Heliconius melpomene butterflies. Availability and implementationCREPE is available as a Shiny app package available at GitHub (github.com/dirostri/CREPE). Supplementary informationSupplementary data are available at Bioinformatics Advances online.more » « less
An official website of the United States government

