This content will become publicly available on April 12, 2026

Title: Polaris: Polarization of ancestral and derived polymorphic alleles for inferences of extended haplotype homozygosity in human populations
Statistical methods that measure the extent of haplotype homozygosity on chromosomes have been highly informative for identifying episodes of recent selection. For example, the integrated haplotype score (iHS) and the extended haplotype homozygosity (EHH) statistics detect long-range haplotype structure around derived and ancestral alleles indicative of classic and soft selective sweeps, respectively. However, to our knowledge, there are currently no publicly available methods that classify ancestral and derived alleles in genomic datasets for the purpose of quantifying the extent of haplotype homozygosity. Here, we introduce the Polaris package, which polarizes chromosomal variants into ancestral and derived alleles and creates corresponding genetic maps for analysis by selscan and HaploSweep, two versatile haplotype-based programs that perform scans for selection. With the input files generated by Polaris, selscan and/or HaploSweep can produce the appropriate sign (either positive or negative) for outlier iHS statistics, enabling users to distinguish between selection on derived or ancestral alleles. In addition, Polaris can convert the numerical output of these analyses into graphical representations of selective sweeps, increasing the functionality of our software. To demonstrate the utility of our approach, we applied the Polaris package to Chromosome 2 in the European Finnish, Middle Eastern Bedouin, and East African Maasai populations. More specifically, we examined the regulatory sequence in intron 13 of the MCM6 gene associated with lactase persistence (i.e. the ability to digest the lactose sugar present in fresh milk), a region of intense interest to human evolutionary geneticists. Our analyses showed that derived alleles (at known enhancers for lactase expression) sit on an extended haplotype background in the Finnish, Bedouin, and Maasai consistent with a classic selective sweep model as determined by iHS and EHH statistics. 
Importantly, we were able to immediately identify this target allele under selection based on the information generated by our software. We also explored outlier statistics across Chromosome 2 in two distinct datasets from these populations: (i) one containing polarized alleles generated with Polaris and (ii) the other containing unpolarized alleles in the original phased VCF file. Here, we found an excess of outlier statistics on Chromosome 2 in the unpolarized datasets, raising the possibility that a subset of these “hits” of selection may be unreliable. Overall, Polaris is a versatile package that enables users to efficiently explore, interpret, and report signals of recent selection in genomic datasets.
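The two ideas the abstract relies on, polarizing alleles and measuring extended haplotype homozygosity, can be illustrated with a minimal Python sketch. This is not the Polaris implementation and does not reproduce the selscan or HaploSweep interfaces; the function names `polarize` and `ehh`, the 0/1 allele coding, and the toy data are assumptions made for illustration only. The sketch recodes each site so that 0 means ancestral and 1 means derived, then computes EHH as the fraction of pairs of core-allele-carrying chromosomes that are identical between the core site and a second site.

```python
from collections import Counter


def polarize(ref, alt, ancestral, alleles):
    """Recode a site's 0/1 alleles so that 0 = ancestral and 1 = derived.

    Hypothetical helper: `alleles` holds one 0 (REF) or 1 (ALT) per phased
    chromosome. Returns None when the ancestral allele matches neither REF
    nor ALT, since such a site cannot be polarized.
    """
    anc = ancestral.upper()
    if anc == ref.upper():
        return list(alleles)            # REF is already ancestral
    if anc == alt.upper():
        return [1 - a for a in alleles]  # ALT is ancestral, so flip the coding
    return None


def ehh(haplotypes, core_idx, core_allele, end_idx):
    """EHH for chromosomes carrying `core_allele` at site `core_idx`.

    `haplotypes` is a list of chromosomes, each a sequence of 0/1 alleles
    ordered by position. The value is the proportion of carrier pairs that
    are identical at every site between `core_idx` and `end_idx` inclusive.
    """
    lo, hi = sorted((core_idx, end_idx))
    carriers = [tuple(h[lo:hi + 1]) for h in haplotypes
                if h[core_idx] == core_allele]
    n = len(carriers)
    if n < 2:
        return 0.0                       # no pairs to compare
    identical_pairs = sum(c * (c - 1) // 2 for c in Counter(carriers).values())
    return identical_pairs / (n * (n - 1) // 2)


# Toy data (made up): three sites as (REF, ALT, ancestral) and four chromosomes.
sites = [("A", "G", "A"), ("C", "T", "T"), ("G", "A", "G")]
raw = [[0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 0, 0]]   # one row per site

polarized = [polarize(r, a, anc, row) for (r, a, anc), row in zip(sites, raw)]
haps = [list(h) for h in zip(*polarized)]          # one row per chromosome

derived_ehh = ehh(haps, core_idx=1, core_allele=1, end_idx=2)
```

In this toy example the second site has its ALT allele as the ancestral state, so its coding is flipped before EHH is computed around the derived allele. The flip is what determines which allele each haplotype statistic is attributed to, and therefore the sign that a downstream iHS value would carry.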
Award ID(s):
2221920
PAR ID:
10635102
Author(s) / Creator(s):
;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Bioinformatics
Volume:
41
Issue:
6
ISSN:
1367-4811
Page Range / eLocation ID:
btaf171
Subject(s) / Keyword(s):
ancestral alleles derived alleles natural selection integrated haplotype score (iHS) extended haplotype homozygosity (EHH)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Positive selection causes beneficial alleles to rise to high frequency, resulting in a selective sweep of the diversity surrounding the selected sites. Accordingly, the signature of a selective sweep in an ancestral population may still remain in its descendants. Identifying signatures of selection in the ancestor that are shared among its descendants is important to contextualize the timing of a sweep, but few methods exist for this purpose. We introduce the statistic SS-H12, which can identify genomic regions under shared positive selection across populations and is based on the theory of the expected haplotype homozygosity statistic H12, which detects recent hard and soft sweeps from the presence of high-frequency haplotypes. SS-H12 is distinct from comparable statistics because it requires a minimum of only two populations, and properly identifies and differentiates between independent convergent sweeps and true ancestral sweeps, with high power and robustness to a variety of demographic models. Furthermore, we can apply SS-H12 in conjunction with the ratio of statistics we term Embedded Image and Embedded Image to further classify identified shared sweeps as hard or soft. Finally, we identified both previously reported and novel shared sweep candidates from human whole-genome sequences. Previously reported candidates include the well-characterized ancestral sweeps at LCT and SLC24A5 in Indo-Europeans, as well as GPHN worldwide. Novel candidates include an ancestral sweep at RGS18 in sub-Saharan Africans involved in regulating the platelet response and implicated in sudden cardiac death, and a convergent sweep at C2CD5 between European and East Asian populations that may explain their different insulin responses. 
  2. Betancourt, Andrea (Ed.)
    Local adaptation can lead to elevated genetic differentiation at the targeted genetic variant and nearby sites. Selective sweeps come in different forms, and depending on the initial and final frequencies of a favored variant, very different patterns of genetic variation may be produced. If local selection favors an existing variant that had already recombined onto multiple genetic backgrounds, then the width of elevated genetic differentiation (high FST) may be too narrow to detect using a typical windowed genome scan, even if the targeted variant becomes highly differentiated. We, therefore, used a simulation approach to investigate the power of SNP-level FST (specifically, the maximum SNP FST value within a window, or FST_MaxSNP) to detect diverse scenarios of local adaptation, and compared it against whole-window FST and the Comparative Haplotype Identity statistic. We found that FST_MaxSNP had superior power to detect complete or mostly complete soft sweeps, but lesser power than full-window statistics to detect partial hard sweeps. Nonetheless, the power of FST_MaxSNP depended highly on sample size, and confident outliers depend on robust precautions and quality control. To investigate the relative enrichment of FST_MaxSNP outliers from real data, we applied the two FST statistics to a panel of Drosophila melanogaster populations. We found that FST_MaxSNP had a genome-wide enrichment of outliers compared with demographic expectations, and though it yielded a lesser enrichment than window FST, it detected mostly unique outlier genes and functional categories. Our results suggest that FST_MaxSNP is highly complementary to typical window-based approaches for detecting local adaptation, and merits inclusion in future genome scans and methodologies.
  3. Kim, Yuseob (Ed.)
    Selective sweeps are frequent and varied signatures in the genomes of natural populations, and detecting them is consequently important in understanding mechanisms of adaptation by natural selection. Following a selective sweep, haplotypic diversity surrounding the site under selection decreases, and this deviation from the background pattern of variation can be applied to identify sweeps. Multiple methods exist to locate selective sweeps in the genome from haplotype data, but none leverages the power of a model-based approach to make their inference. Here, we propose a likelihood ratio test statistic T to probe whole-genome polymorphism data sets for selective sweep signatures. Our framework uses a simple but powerful model of haplotype frequency spectrum distortion to find sweeps and additionally make an inference on the number of presently sweeping haplotypes in a population. We found that the T statistic is suitable for detecting both hard and soft sweeps across a variety of demographic models, selection strengths, and ages of the beneficial allele. Accordingly, we applied the T statistic to variant calls from European and sub-Saharan African human populations, yielding primarily literature-supported candidates, including LCT, RSPH3, and ZNF211 in CEU, SYT1, RGS18, and NNT in YRI, and HLA genes in both populations. We also searched for sweep signatures in Drosophila melanogaster, finding expected candidates at Ace, Uhg1, and Pimet. Finally, we provide open-source software to compute the T statistic and the inferred number of presently sweeping haplotypes from whole-genome data.
  4. Evidence for a reduction in stature between Mesolithic foragers and Neolithic farmers has been interpreted as reflective of declines in health; however, our current understanding of this trend fails to account for the complexity of cultural and dietary transitions or the possible causes of phenotypic change. The agricultural transition was extended in primary centers of domestication and abrupt in regions characterized by demic diffusion. In regions such as Northern Europe where foreign domesticates were difficult to establish, there is strong evidence for natural selection for lactase persistence in relation to dairying. We employ broad-scale analyses of diachronic variation in stature and body mass in the Levant, Europe, the Nile Valley, South Asia, and China, to test three hypotheses about the timing of subsistence shifts and human body size, that: 1) the adoption of agriculture led to a decrease in stature; 2) there were different trajectories in regions of in situ domestication or cultural diffusion of agriculture; and 3) increases in stature and body mass are observed in regions with evidence for selection for lactase persistence. Our results demonstrate that 1) decreases in stature preceded the origins of agriculture in some regions; 2) the Levant and China, regions of in situ domestication of species and an extended period of mixed foraging and agricultural subsistence, had stable stature and body mass over time; and 3) stature and body mass increases in Central and Northern Europe coincide with the timing of selective sweeps for lactase persistence, providing support for the “Lactase Growth Hypothesis.”
  5. In recent years, advances in image processing and machine learning have fueled a paradigm shift in detecting genomic regions under natural selection. Early machine learning techniques employed population-genetic summary statistics as features, which focus on specific genomic patterns expected by adaptive and neutral processes. Though such engineered features are important when training data are limited, the ease with which simulated data can now be generated has led to the recent development of approaches that take in image representations of haplotype alignments and automatically extract important features using convolutional neural networks. Digital image processing methods termed α-molecules are a class of techniques for multiscale representation of objects that can extract a diverse set of features from images. One such α-molecule method, termed wavelet decomposition, lends greater control over high-frequency components of images. Another α-molecule method, termed curvelet decomposition, is an extension of the wavelet concept that considers events occurring along curves within images. We show that applying these α-molecule techniques to extract features from image representations of haplotype alignments yields a high true positive rate and accuracy in detecting hard and soft selective sweep signatures from genomic data with both linear and nonlinear machine learning classifiers. Moreover, we find that such models are easy to visualize and interpret, with performance rivaling that of contemporary deep learning approaches for detecting sweeps.