Estimating error models for whole genome sequencing using mixtures of Dirichlet-multinomial distributions

Wu, Steven H.; Schwartz, Rachel S.; Winter, David J.; Conrad, Donald F.; Cartwright, Reed A.; Stegle, Oliver (ed.)

doi:10.1093/bioinformatics/btx133

Citation Details

Abstract

Motivation: Accurate identification of genotypes is an essential part of the analysis of genomic data, including identifying sequence polymorphisms, linking mutations with disease, and determining mutation rates. Biological and technical processes that adversely affect genotyping include copy-number variation, paralogous sequences, library preparation, sequencing error, and reference-mapping biases, among others.

Results: We modeled the read depth for all data as a mixture of Dirichlet-multinomial distributions, resulting in significant improvements over previously used models. In most cases the best model consisted of two distributions. The major-component distribution is similar to a binomial distribution with low error and low reference bias. The minor-component distribution is overdispersed, with higher error and reference bias. We also found that sites fitting the minor component are enriched for copy number variants and low-complexity regions, which can produce erroneous genotype calls. By removing sites that do not fit the major component, we can improve the accuracy of genotype calls.

Availability and Implementation: Methods and data files are available at https://github.com/CartwrightLab/WuEtAl2017/ (doi:10.5281/zenodo.256858).

Supplementary information: Supplementary data are available at Bioinformatics online.
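To make the modeling idea in the abstract concrete, the following is a minimal sketch of how a two-component Dirichlet-multinomial mixture could score a single site. It is not the authors' implementation (see the repository linked above for that). The function names, the three count categories (reference allele, alternate allele, other), and all parameter values are assumptions chosen for illustration only.

```python
import math


def dm_log_pmf(counts, alpha):
    """Log probability of read counts under a Dirichlet-multinomial.

    counts: non-negative integer read counts per category.
    alpha:  positive concentration parameters, one per category.
    """
    n = sum(counts)
    a0 = sum(alpha)
    # Multinomial coefficient: n! / prod(c_k!)
    log_p = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    # Gamma(a0) / Gamma(n + a0)
    log_p += math.lgamma(a0) - math.lgamma(n + a0)
    # prod Gamma(c_k + a_k) / Gamma(a_k)
    log_p += sum(math.lgamma(c + a) - math.lgamma(a)
                 for c, a in zip(counts, alpha))
    return log_p


def mixture_responsibilities(counts, weights, alphas):
    """Posterior probability that a site belongs to each mixture component."""
    logs = [math.log(w) + dm_log_pmf(counts, a)
            for w, a in zip(weights, alphas)]
    # Log-sum-exp for numerical stability.
    m = max(logs)
    log_total = m + math.log(sum(math.exp(v - m) for v in logs))
    return [math.exp(v - log_total) for v in logs]


# Hypothetical parameters, NOT estimates from the paper.
# Major component: large concentration, so it behaves almost like a
# multinomial with balanced alleles and a low error rate.
MAJOR_ALPHA = (495.0, 495.0, 10.0)
# Minor component: small concentration (overdispersed), skewed toward
# the reference allele, with a higher error proportion.
MINOR_ALPHA = (6.0, 3.0, 1.0)
WEIGHTS = (0.8, 0.2)

if __name__ == "__main__":
    site = (20, 20, 0)  # (reference, alternate, other) read counts
    resp = mixture_responsibilities(site, WEIGHTS, (MAJOR_ALPHA, MINOR_ALPHA))
    print("P(major), P(minor):", resp)
```

Under these assumptions, a filter in the spirit of the abstract would keep only sites whose major-component responsibility exceeds some chosen threshold. The threshold, like the parameters above, would have to come from fitting the mixture to real data.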

Award ID(s): 1356548

PAR ID: 10427324

Author(s) / Creator(s): Wu, Steven H.; Schwartz, Rachel S.; Winter, David J.; Conrad, Donald F.; Cartwright, Reed A.; Stegle, Oliver (ed.)

Publisher / Repository: Oxford University Press

Date Published: 2017-03-15

Journal Name: Bioinformatics

Volume: 33

Issue: 15

ISSN: 1367-4803

Page Range / eLocation ID: p. 2322-2329

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Journal Article:
https://doi.org/10.1093/bioinformatics/btx133
