Title: Sparse Convolutional Denoising Autoencoders for Genotype Imputation
Genotype imputation, the computational inference of missing genotypes, is an essential tool in genomic analysis ranging from genome-wide association analysis to phenotype prediction. Traditional genotype imputation methods are typically based on haplotype-clustering algorithms, hidden Markov models (HMMs), and statistical inference. Deep learning-based methods have recently been reported to address missing-data problems well in various fields. To explore the performance of deep learning for genotype imputation, we propose a deep model called a sparse convolutional denoising autoencoder (SCDA) to impute missing genotypes. We constructed the SCDA model using a convolutional layer that can extract various correlation or linkage patterns in the genotype data, and applied a sparse weight matrix resulting from L1 regularization to handle high-dimensional data. We comprehensively evaluated the performance of the SCDA model in different genotype imputation scenarios on yeast and human genotype data. Our results showed that SCDA is robust and significantly outperforms popular reference-free imputation methods. This study thus points to another novel application of deep learning models for missing data imputation in genomic studies.
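The three ingredients named in the abstract (a denoising corruption step, a convolution along the SNP axis, and an L1 penalty on the weights) can be illustrated with a small sketch. This is not the authors' implementation: the one-hot encoding of genotype calls 0/1/2, the `MISSING` marker, the single untrained convolutional layer, and every function name below are assumptions made only for illustration.

```python
import math
import random

MISSING = -1  # hypothetical marker for a missing genotype call


def one_hot(genotypes, n_classes=3):
    """Encode calls 0/1/2 as one-hot rows; a missing call becomes an all-zero row."""
    return [[1.0 if g == c else 0.0 for c in range(n_classes)] for g in genotypes]


def corrupt(genotypes, rate, rng):
    """Denoising step: hide each call independently with probability `rate`."""
    return [MISSING if rng.random() < rate else g for g in genotypes]


def conv1d_same(x, kernel, bias):
    """Zero-padded 1D convolution along the SNP axis.

    x: L x C_in, kernel: K x C_in x C_out (K odd), bias: C_out -> L x C_out.
    """
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        row = list(bias)
        for k, taps in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(x):
                for a, weights in enumerate(taps):
                    for b, w in enumerate(weights):
                        row[b] += x[j][a] * w
        out.append(row)
    return out


def softmax(row):
    """Numerically stable softmax over one position's class scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def l1_penalty(kernel):
    """Sum of absolute weights; penalizing it pushes many weights toward zero."""
    return sum(abs(w) for taps in kernel for weights in taps for w in weights)


def denoising_loss(genotypes, kernel, bias, rate, l1_weight, rng):
    """Cross-entropy of reconstructing a fully observed row from its corrupted copy,
    plus the L1 term. Assumes `genotypes` contains no MISSING entries."""
    noisy = one_hot(corrupt(genotypes, rate, rng))
    probs = [softmax(r) for r in conv1d_same(noisy, kernel, bias)]
    cross_entropy = -sum(math.log(p[g]) for p, g in zip(probs, genotypes)) / len(genotypes)
    return cross_entropy + l1_weight * l1_penalty(kernel), probs


def impute(genotypes, kernel, bias):
    """Keep observed calls; fill each missing call with the most probable class."""
    probs = [softmax(r) for r in conv1d_same(one_hot(genotypes), kernel, bias)]
    return [g if g != MISSING else max(range(len(p)), key=p.__getitem__)
            for g, p in zip(genotypes, probs)]


# Usage with random, untrained weights (kernel width 5, 3 input and 3 output channels).
rng = random.Random(0)
kernel = [[[rng.uniform(-0.1, 0.1) for _ in range(3)] for _ in range(3)] for _ in range(5)]
bias = [0.0, 0.0, 0.0]
truth = [0, 1, 2, 2, 1, 0, 0, 1]
loss, probs = denoising_loss(truth, kernel, bias, rate=0.25, l1_weight=1e-3, rng=rng)
filled = impute([0, MISSING, 2, 2, MISSING, 0, 0, 1], kernel, bias)
```

A real model would train the weights by minimizing this loss over many individuals and would typically stack several layers; the sketch only shows how the pieces fit together.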
Award ID(s):
2001080
PAR ID:
10422322
Author(s) / Creator(s):
;
Date Published:
Journal Name:
Genes
Volume:
10
Issue:
9
ISSN:
2073-4425
Page Range / eLocation ID:
652
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Schwartz, Russell (Ed.)
    Motivation: Cells in an organism share a common evolutionary history, called cell lineage tree. Cell lineage tree can be inferred from single cell genotypes at genomic variation sites. Cell lineage tree inference from noisy single cell data is a challenging computational problem. Most existing methods for cell lineage tree inference assume uniform uncertainty in genotypes. A key missing aspect is that real single cell data usually has non-uniform uncertainty in individual genotypes. Moreover, existing methods are often sampling based and can be very slow for large data. Results: In this article, we propose a new method called ScisTree, which infers cell lineage tree and calls genotypes from noisy single cell genotype data. Different from most existing approaches, ScisTree works with genotype probabilities of individual genotypes (which can be computed by existing single cell genotype callers). ScisTree assumes the infinite sites model. Given uncertain genotypes with individualized probabilities, ScisTree implements a fast heuristic for inferring cell lineage tree and calling the genotypes that allow the so-called perfect phylogeny and maximize the likelihood of the genotypes. Through simulation, we show that ScisTree performs well on the accuracy of inferred trees, and is much more efficient than existing methods. The efficiency of ScisTree enables new applications including imputation of the so-called doublets. Availability and implementation: The program ScisTree is available for download at: https://github.com/yufengwudcs/ScisTree. Supplementary information: Supplementary data are available at Bioinformatics online.
  2. Kelso, Janet (Ed.)
    Motivation: Current methods for genotype imputation and phasing exploit the volume of data in haplotype reference panels and rely on hidden Markov models (HMMs). Existing programs all have essentially the same imputation accuracy, are computationally intensive and generally require prephasing the typed markers. Results: We introduce a novel data-mining method for genotype imputation and phasing that substitutes highly efficient linear algebra routines for HMM calculations. This strategy, embodied in our Julia program MendelImpute.jl, avoids explicit assumptions about recombination and population structure while delivering similar prediction accuracy, better memory usage and an order of magnitude or better run-times compared to the fastest competing method. MendelImpute operates on both dosage data and unphased genotype data and simultaneously imputes missing genotypes and phase at both the typed and untyped SNPs (single nucleotide polymorphisms). Finally, MendelImpute naturally extends to global and local ancestry estimation and lends itself to new strategies for data compression and hence faster data transport and sharing. Availability and implementation: Software, documentation and scripts to reproduce our results are available from https://github.com/OpenMendel/MendelImpute.jl. Supplementary information: Supplementary data are available at Bioinformatics online.
  3. Multivariate time series with missing values are common in areas such as healthcare and finance, and have grown in number and complexity over the years. This raises the question whether deep learning methodologies can outperform classical data imputation methods in this domain. However, naïve applications of deep learning fall short in giving reliable confidence estimates and lack interpretability. We propose a new deep sequential latent variable model for dimensionality reduction and data imputation. Our modeling assumption is simple and interpretable: the high dimensional time series has a lower-dimensional representation which evolves smoothly in time according to a Gaussian process. The nonlinear dimensionality reduction in the presence of missing data is achieved using a VAE approach with a novel structured variational approximation. We demonstrate that our approach outperforms several classical and deep learning-based data imputation methods on high-dimensional data from the domains of computer vision and healthcare, while additionally improving the smoothness of the imputations and providing interpretable uncertainty estimates. 
  4. Biobanks that collect deep phenotypic and genomic data across many individuals have emerged as a key resource in human genetics. However, phenotypes in biobanks are often missing across many individuals, limiting their utility. We propose AutoComplete, a deep learning-based imputation method to impute or ‘fill-in’ missing phenotypes in population-scale biobank datasets. When applied to collections of phenotypes measured across ~300,000 individuals from the UK Biobank, AutoComplete substantially improved imputation accuracy over existing methods. On three traits with notable amounts of missingness, we show that AutoComplete yields imputed phenotypes that are genetically similar to the originally observed phenotypes while increasing the effective sample size by about twofold on average. Further, genome-wide association analyses on the resulting imputed phenotypes led to a substantial increase in the number of associated loci. Our results demonstrate the utility of deep learning-based phenotype imputation to increase power for genetic discoveries in existing biobank datasets.