Title: PySmooth: a Python tool for the removal and correction of genotyping errors
In genetic mapping studies involving many individuals, genome-wide markers such as single nucleotide polymorphisms (SNPs) can be detected using different methods. However, these methods introduce some genotyping errors. Some SNPs associated with diseases can be in regions encoding long noncoding RNAs (lncRNAs). Therefore, identifying and correcting the errors in a genotype file is crucial for accurate genetic mapping studies. We developed a Python tool called PySmooth that offers an easy-to-use command line interface for the removal and correction of genotyping errors. PySmooth uses the approach of a previous tool called SMOOTH, with some modifications. It takes a genotype file as input, detects errors, and corrects them. PySmooth provides additional features: it imputes missing data, is more user-friendly, generates summary and visualization files, has flexible parameters, and handles more genotype codes.
Award ID(s):
2135305 2135306
PAR ID:
10579975
Author(s) / Creator(s):
;
Publisher / Repository:
Elsevier
Date Published:
Journal Name:
BMC Research Notes
Volume:
17
Issue:
1
ISSN:
1756-0500
Subject(s) / Keyword(s):
SNPs, QTLs, SMOOTH, genotype mapping, and correction
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Schwartz, Russell (Ed.)
    Abstract Summary: Pool sequencing is an efficient method for capturing genome-wide allele frequencies from multiple individuals, with broad applications such as studying adaptation in Evolve-and-Resequence experiments, monitoring of genetic diversity in wild populations, and genotype-to-phenotype mapping. Here, we present grenedalf, a command line tool written in C++ that implements common population genetic statistics such as θ, Tajima’s D, and FST for Pool sequencing. It is orders of magnitude faster than current tools, and is focused on providing usability and scalability, while also offering a plethora of input file formats and convenience options. Availability and implementation: grenedalf is published under the GPL-3 and is freely available at github.com/lczech/grenedalf.
  2. Abstract Species living in changing environments require the acclimatization of individual organisms, which may be significantly influenced by allele specific expression (ASE). Data from RNA-seq experiments can be used to identify and quantify the expressed alleles. However, conventional allele matching to the reference genome creates a mapping bias towards the reference allele that prevents a reliable estimation of the allele counts. We developed a pipeline that allows identification and unbiased quantification of the alleles corresponding to an RNA-seq dataset, without any previous knowledge of the haplotype. To achieve the unbiased mapping, we generate two pseudogenomes by substituting the alternative alleles on the reference genome. The SNPs are further called against each pseudogenome, providing two SNP data-sets that are averaged for calculation of the allele depth to be merged in a final SNP calling file. The pipeline presented here can calculate ASE in non-model organisms and can be applied to previous RNA-seq data-sets for expanding studies in gene expression regulation. 
  3. Mapping the genetic basis of complex traits is critical to uncovering the biological mechanisms that underlie disease and other phenotypes. Genome-wide association studies (GWAS) in humans and quantitative trait locus (QTL) mapping in model organisms can now explain much of the observed heritability in many traits, allowing us to predict phenotype from genotype. However, constraints on power due to statistical confounders in large GWAS and smaller sample sizes in QTL studies still limit our ability to resolve numerous small-effect variants, map them to causal genes, identify pleiotropic effects across multiple traits, and infer non-additive interactions between loci (epistasis). Here, we introduce barcoded bulk quantitative trait locus (BB-QTL) mapping, which allows us to construct, genotype, and phenotype 100,000 offspring of a budding yeast cross, two orders of magnitude larger than the previous state of the art. We use this panel to map the genetic basis of eighteen complex traits, finding that the genetic architecture of these traits involves hundreds of small-effect loci densely spaced throughout the genome, many with widespread pleiotropic effects across multiple traits. Epistasis plays a central role, with thousands of interactions that provide insight into genetic networks. By dramatically increasing sample size, BB-QTL mapping demonstrates the potential of natural variants in high-powered QTL studies to reveal the highly polygenic, pleiotropic, and epistatic architecture of complex traits. 
  4. Abstract Genome‐wide expression quantitative trait loci (eQTL) mapping explores the relationship between gene expression and DNA variants, such as single‐nucleotide polymorphisms (SNPs), to understand the genetic basis of human diseases. Due to the large number of genes and SNPs that need to be assessed, current methods for eQTL mapping often suffer from low detection power, especially for identifying trans‐eQTLs. In this paper, we propose the idea of performing SNP ranking based on the higher criticism (HC) statistic, a summary statistic developed in large‐scale signal detection. We illustrate how the HC‐based SNP ranking can effectively prioritize eQTL signals over noise, greatly reduce the burden of joint modeling, and improve the power for eQTL mapping. Numerical results in simulation studies demonstrate the superior performance of our method compared to existing methods. The proposed method is also evaluated in HapMap eQTL data analysis and the results are compared to a database of known eQTLs.
  5. Abstract Motivation: Heritability, the proportion of variation in a trait that can be explained by genetic variation, is an important parameter in efforts to understand the genetic architecture of complex phenotypes as well as in the design and interpretation of genome-wide association studies. Attempts to understand the heritability of complex phenotypes attributable to genome-wide single nucleotide polymorphism (SNP) variation data have motivated the analysis of large datasets as well as the development of sophisticated tools to estimate heritability in these datasets. Linear mixed models (LMMs) have emerged as a key tool for heritability estimation where the parameters of the LMMs, i.e. the variance components, are related to the heritability attributable to the SNPs analyzed. Likelihood-based inference in LMMs, however, poses serious computational burdens. Results: We propose a scalable randomized algorithm for estimating variance components in LMMs. Our method is based on a method-of-moments estimator that has a runtime complexity O(NMB) for N individuals and M SNPs (where B is a parameter that controls the number of random matrix-vector multiplications). Further, by leveraging the structure of the genotype matrix, we can reduce the time complexity to O(NMB / max(log_3 N, log_3 M)). We demonstrate the scalability and accuracy of our method on simulated as well as on empirical data. On standard hardware, our method computes heritability on a dataset of 500 000 individuals and 100 000 SNPs in 38 min. Availability and implementation: The RHE-reg software is made freely available to the research community at: https://github.com/sriramlab/RHE-reg.