
Title: A novel nonlinear dimension reduction approach to infer population structure for low-coverage sequencing data
Abstract Background

Low-depth sequencing allows researchers to increase sample size at the expense of genotype accuracy. To incorporate these uncertainties while maintaining statistical power, we introduce a nonlinear dimension reduction approach to analyze population structure in low-depth sequencing data.


The method optimizes the choice of nonlinear transformations of dosages to maximize the Ky Fan norm of the covariance matrix. The transformation incorporates the uncertainty in distinguishing heterozygotes from the common homozygote at loci with a rare allele, and becomes more nearly linear when both alleles are common.
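The Ky Fan k-norm of a matrix is the sum of its k largest singular values, so maximizing it over transformations concentrates variance in the leading principal components. A minimal sketch of the objective in Python (the function names and the elementwise-transform interface are our illustration, not the package's API):

```python
import numpy as np

def ky_fan_norm(matrix, k):
    """Ky Fan k-norm: the sum of the k largest singular values."""
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    return float(np.sum(singular_values[:k]))

def transformed_covariance_objective(dosages, transform, k=2):
    """Apply an elementwise transform to a samples-by-loci dosage matrix,
    center each locus, and score the resulting covariance matrix by its
    Ky Fan k-norm (larger = more variance captured by the top k PCs)."""
    x = transform(np.asarray(dosages, dtype=float))
    x = x - x.mean(axis=0, keepdims=True)   # center per locus
    cov = (x.T @ x) / (x.shape[0] - 1)      # loci-by-loci covariance
    return ky_fan_norm(cov, k)
```

A search over a parametric family of transforms (e.g. power or sigmoid curves applied to dosages) would then pick the one maximizing this objective.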


We apply the method to samples from two indigenous Siberian populations and accurately reveal hidden population structure using only a single chromosome. The package is available on

Publisher / Repository: Springer Science + Business Media
Journal Name: BMC Bioinformatics
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract Background

Genetic barcoding provides a high-throughput way to simultaneously track the frequencies of large numbers of competing and evolving microbial lineages. However, making inferences about the nature of the evolution that is taking place remains a difficult task.


    Here we describe an algorithm for the inference of fitness effects and establishment times of beneficial mutations from barcode sequencing data, which builds upon a Bayesian inference method by enforcing self-consistency between the population mean fitness and the individual effects of mutations within lineages. By testing our inference method on a simulation of 40,000 barcoded lineages evolving in serial batch culture, we find that this new method outperforms its predecessor, identifying more adaptive mutations and more accurately inferring their mutational parameters.
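The self-consistency constraint can be illustrated with a deterministic toy model (the actual method is Bayesian and handles noisy counts from ~40,000 lineages; the function names, and the use of a known-neutral reference lineage to anchor the mean fitness, are our simplifying assumptions):

```python
import numpy as np

def propagate(f0, s, n_cycles):
    """Deterministic serial-dilution dynamics: each cycle multiplies a
    lineage's frequency by exp(s_i) and renormalizes, so per-cycle
    log-frequency changes equal s_i minus the mean fitness xbar(t)."""
    f = np.asarray(f0, dtype=float)
    traj = [f.copy()]
    for _ in range(n_cycles):
        grown = f * np.exp(s)
        f = grown / grown.sum()
        traj.append(f.copy())
    return np.array(traj)           # (n_cycles + 1, lineages)

def infer_fitness(traj, neutral=0):
    """Self-consistent inference sketch: log-ratios give s_i - xbar(t);
    a known-neutral lineage (s = 0) anchors xbar(t), and averaging over
    cycles returns each lineage's fitness effect."""
    log_ratio = np.diff(np.log(traj), axis=0)   # (cycles, lineages)
    xbar = -log_ratio[:, neutral]               # from the neutral anchor
    return (log_ratio + xbar[:, None]).mean(axis=0)
```

In noiseless data this recovers the fitness effects exactly; the published method's contribution is making the same constraint work on noisy barcode counts.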


    Our new algorithm is particularly suited to inference of mutational parameters when read depth is low. We have made Python code for our serial dilution evolution simulations, as well as both the old and new inference methods, available on GitHub (, in the hope that it can find broader use by the microbial evolution community.

  2. Abstract Background

    Given the economic and environmental importance of allopolyploids and other species with highly duplicated genomes, there is a need for methods to distinguish paralogs, i.e. duplicate sequences within a genome, from Mendelian loci, i.e. single copy sequences that pair at meiosis. The ratio of observed to expected heterozygosity is an effective tool for filtering loci but requires genotyping to be performed first at a high computational cost, whereas counting the number of sequence tags detected per genotype is computationally quick but very ineffective in inbred or polyploid populations. Therefore, new methods are needed for filtering paralogs.


    We introduce a novel statistic, Hind/HE, that uses the probability that two reads sampled from a genotype will belong to different alleles, instead of observed heterozygosity. The expected value of Hind/HE is the same across all loci in a dataset, regardless of read depth or allele frequency. In contrast to methods based on observed heterozygosity, it can be estimated and used for filtering loci prior to genotype calling. In addition to filtering paralogs, it can be used to filter loci with null alleles or high overdispersion, and to identify individuals with unexpected ploidy or hybrid status. We demonstrate that the statistic is useful at read depths as low as 5 to 10, well below the depth needed for accurate genotype calling in polyploid and outcrossing species.
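A rough sketch of the statistic (polyRAD's exact estimator may differ in how it averages across individuals and handles ploidy; names are ours): Hind is the probability that two reads drawn without replacement from one individual's reads at a locus belong to different alleles, and HE is the expected heterozygosity from the pooled allele frequencies.

```python
import numpy as np

def hind_he(read_counts):
    """Hind/HE sketch for one locus. read_counts: (individuals, alleles)
    array of read depths. Hind per individual = probability two reads
    drawn without replacement are different alleles; HE = 1 - sum(p^2)
    from pooled allele frequencies."""
    counts = np.asarray(read_counts, dtype=float)
    depth = counts.sum(axis=1)
    same = (counts * (counts - 1)).sum(axis=1)   # same-allele read pairs
    total = depth * (depth - 1)                  # all read pairs
    ok = total > 0                               # needs depth >= 2
    hind = 1.0 - same[ok] / total[ok]
    p = counts.sum(axis=0) / counts.sum()        # pooled allele freqs
    he = 1.0 - (p ** 2).sum()
    return float(hind.mean() / he)
```

Loci where this ratio is far above its expectation (e.g. collapsed paralogs inflating apparent heterozygosity) can then be filtered before any genotype calling.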


    Our methodology for estimating Hind/HE across loci and individuals, as well as for determining reasonable thresholds for filtering loci, is implemented in polyRAD v1.6. In large sequencing datasets, we anticipate that the ability to filter markers and identify problematic individuals prior to genotype calling will save researchers considerable computational time.

  3. Abstract

    Monitoring genetic diversity in wild populations is a central goal of ecological and evolutionary genetics and is critical for conservation biology. However, genetic studies of nonmodel organisms generally lack access to species‐specific genotyping methods (e.g. array‐based genotyping) and must instead use sequencing‐based approaches. Although costs are decreasing, high‐coverage whole‐genome sequencing (WGS), which produces the highest confidence genotypes, remains expensive. More economical reduced representation sequencing approaches fail to capture much of the genome, which can hinder downstream inference. Low‐coverage WGS combined with imputation using a high‐confidence reference panel is a cost‐effective alternative, but the accuracy of genotyping using low‐coverage WGS and imputation in nonmodel populations is still largely uncharacterized. Here, we empirically tested the accuracy of low‐coverage sequencing (0.1–10×) and imputation in two natural populations, one with a large (n = 741) reference panel, rhesus macaques (Macaca mulatta), and one with a smaller (n = 68) reference panel, gelada monkeys (Theropithecus gelada). Using samples sequenced to coverage as low as 0.5×, we could impute genotypes at >95% of the sites in the reference panel with high accuracy (median r² ≥ 0.92). We show that low‐coverage imputed genotypes can reliably calculate genetic relatedness and population structure. Based on these data, we also provide best practices and recommendations for researchers who wish to deploy this approach in other populations, with all code available on GitHub (‐for‐non‐model‐species). Our results demonstrate that genotype imputation from low‐coverage sequencing is accurate and effective, enabling the cost‐effective generation of population‐scale genetic datasets necessary for tackling many pressing challenges of wildlife conservation.
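The r² accuracy metric quoted above is the squared Pearson correlation, computed per site, between true genotypes and imputed dosages; a minimal sketch (the function name is ours):

```python
import numpy as np

def dosage_r2(true_genotypes, imputed_dosages):
    """Per-site squared Pearson correlation between true genotypes
    (0/1/2 counts of the alternate allele) and imputed dosages --
    the standard accuracy metric in imputation studies.
    Inputs: (samples, sites) arrays; returns one r^2 per site."""
    t = np.asarray(true_genotypes, dtype=float)
    d = np.asarray(imputed_dosages, dtype=float)
    tc = t - t.mean(axis=0)                     # center per site
    dc = d - d.mean(axis=0)
    num = (tc * dc).sum(axis=0) ** 2
    den = (tc ** 2).sum(axis=0) * (dc ** 2).sum(axis=0)
    return num / den
```

Sites with zero variance in either array would need to be masked in practice; this sketch assumes polymorphic sites.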


  4. This article has been awarded Open Materials, Open Data, Preregistered Research Designs Badges. All materials and data are publicly accessible via the Open Science Framework.

  5. Abstract Background

    Structural variation (SV), which ranges from 50 bp to ~3 Mb in size, is an important type of genetic variation. A deletion is a type of SV in which a part of a chromosome or a sequence of DNA is lost during DNA replication. Three types of signals, discordant read pairs, read depth, and split reads, are commonly used for SV detection from high-throughput sequence data. Many tools have been developed for detecting SVs using one or more of these signals.


    In this paper, we develop a new method called EigenDel for detecting germline submicroscopic genomic deletions. EigenDel first takes advantage of discordant read pairs and clipped reads to obtain initial deletion candidates, and then clusters similar candidates using unsupervised learning methods. After that, EigenDel uses a carefully designed approach to call true deletions from each cluster. We conduct various experiments to evaluate the performance of EigenDel on low-coverage sequence data.
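The clustering step can be sketched with a generic unsupervised grouping of candidates by their support features (the feature choice and the minimal 2-means procedure here are illustrative assumptions, not EigenDel's actual algorithm):

```python
import numpy as np

def kmeans2(features, n_iter=50):
    """Minimal 2-means with farthest-point initialization, grouping
    deletion candidates by support features, e.g. discordant-pair
    count, clipped-read count, and relative read depth inside the
    candidate interval. Returns (labels, cluster centers)."""
    x = np.asarray(features, dtype=float)
    c0 = x[0]                                            # first candidate
    c1 = x[np.argmax(((x - c0) ** 2).sum(axis=1))]       # farthest from it
    centers = np.stack([c0, c1])
    for _ in range(n_iter):
        # assign each candidate to its nearest center
        labels = np.argmin(((x[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(2):
            if np.any(labels == j):                      # recompute centers
                centers[j] = x[labels == j].mean(axis=0)
    return labels, centers
```

A filtering rule applied per cluster, e.g. calling only the cluster whose center shows strong read-pair support and a clear depth drop, would then separate likely true deletions from artifacts.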


    Our results show that EigenDel outperforms other major methods, better balancing accuracy and sensitivity while reducing bias. EigenDel can be downloaded from
