Title: A Likelihood Approach for Uncovering Selective Sweep Signatures from Haplotype Data
Abstract Selective sweeps are frequent and varied signatures in the genomes of natural populations, and detecting them is consequently important in understanding mechanisms of adaptation by natural selection. Following a selective sweep, haplotypic diversity surrounding the site under selection decreases, and this deviation from the background pattern of variation can be applied to identify sweeps. Multiple methods exist to locate selective sweeps in the genome from haplotype data, but none leverages the power of a model-based approach to make their inference. Here, we propose a likelihood ratio test statistic T to probe whole-genome polymorphism data sets for selective sweep signatures. Our framework uses a simple but powerful model of haplotype frequency spectrum distortion to find sweeps and additionally make an inference on the number of presently sweeping haplotypes in a population. We found that the T statistic is suitable for detecting both hard and soft sweeps across a variety of demographic models, selection strengths, and ages of the beneficial allele. Accordingly, we applied the T statistic to variant calls from European and sub-Saharan African human populations, yielding primarily literature-supported candidates, including LCT, RSPH3, and ZNF211 in CEU, SYT1, RGS18, and NNT in YRI, and HLA genes in both populations. We also searched for sweep signatures in Drosophila melanogaster, finding expected candidates at Ace, Uhg1, and Pimet. Finally, we provide open-source software to compute the T statistic and the inferred number of presently sweeping haplotypes from whole-genome data.
Award ID(s):
1949268 2001063
NSF-PAR ID:
10213844
Editor(s):
Kim, Yuseob
Journal Name:
Molecular Biology and Evolution
Volume:
37
Issue:
10
ISSN:
0737-4038
Page Range / eLocation ID:
3023 to 3046
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Positive selection causes beneficial alleles to rise to high frequency, resulting in a selective sweep of the diversity surrounding the selected sites. Accordingly, the signature of a selective sweep in an ancestral population may still remain in its descendants. Identifying signatures of selection in the ancestor that are shared among its descendants is important to contextualize the timing of a sweep, but few methods exist for this purpose. We introduce the statistic SS-H12, which can identify genomic regions under shared positive selection across populations and is based on the theory of the expected haplotype homozygosity statistic H12, which detects recent hard and soft sweeps from the presence of high-frequency haplotypes. SS-H12 is distinct from comparable statistics because it requires a minimum of only two populations, and properly identifies and differentiates between independent convergent sweeps and true ancestral sweeps, with high power and robustness to a variety of demographic models. Furthermore, we can apply SS-H12 in conjunction with the ratio of statistics we term Embedded Image and Embedded Image to further classify identified shared sweeps as hard or soft. Finally, we identified both previously reported and novel shared sweep candidates from human whole-genome sequences. Previously reported candidates include the well-characterized ancestral sweeps at LCT and SLC24A5 in Indo-Europeans, as well as GPHN worldwide. Novel candidates include an ancestral sweep at RGS18 in sub-Saharan Africans involved in regulating the platelet response and implicated in sudden cardiac death, and a convergent sweep at C2CD5 between European and East Asian populations that may explain their different insulin responses. 
  2. Kim, Yuseob (Ed.)
    Abstract Natural selection leaves a spatial pattern along the genome, with a haplotype distribution distortion near the selected locus that fades with distance. Evaluating the spatial signal of a population-genetic summary statistic across the genome allows for patterns of natural selection to be distinguished from neutrality. Considering the genomic spatial distribution of multiple summary statistics is expected to aid in uncovering subtle signatures of selection. In recent years, numerous methods have been devised that consider genomic spatial distributions across summary statistics, utilizing both classical machine learning and deep learning architectures. However, better predictions may be attainable by improving the way in which features are extracted from these summary statistics. We apply wavelet transform, multitaper spectral analysis, and S-transform to summary statistic arrays to achieve this goal. Each analysis method converts one-dimensional summary statistic arrays to two-dimensional images of spectral analysis, allowing simultaneous temporal and spectral assessment. We feed these images into convolutional neural networks and consider combining models using ensemble stacking. Our modeling framework achieves high accuracy and power across a diverse set of evolutionary settings, including population size changes and test sets of varying sweep strength, softness, and timing. A scan of central European whole-genome sequences recapitulated well-established sweep candidates and predicted novel cancer-associated genes as sweeps with high support. Given that this modeling framework is also robust to missing genomic segments, we believe that it will represent a welcome addition to the population-genomic toolkit for learning about adaptive processes from genomic data. 
  3. Buerkle, Alex (Ed.)
    The inference of positive selection in genomes is a problem of great interest in evolutionary genomics. By identifying putative regions of the genome that contain adaptive mutations, we are able to learn about the biology of organisms and their evolutionary history. Here we introduce a composite likelihood method that identifies recently completed or ongoing positive selection by searching for extreme distortions in the spatial distribution of the haplotype frequency spectrum along the genome relative to the genome-wide expectation taken as neutrality. Furthermore, the method simultaneously infers two parameters of the sweep: the number of sweeping haplotypes and the “width” of the sweep, which is related to the strength and timing of selection. We demonstrate that this method outperforms the leading haplotype-based selection statistics, though strong signals in low-recombination regions merit extra scrutiny. As a positive control, we apply it to two well-studied human populations from the 1000 Genomes Project and examine haplotype frequency spectrum patterns at the LCT and MHC loci. We also apply it to a data set of brown rats sampled in NYC and identify genes related to olfactory perception. To facilitate use of this method, we have implemented it in user-friendly open source software. 
  4. Betancourt, Andrea (Ed.)
    Abstract Local adaptation can lead to elevated genetic differentiation at the targeted genetic variant and nearby sites. Selective sweeps come in different forms, and depending on the initial and final frequencies of a favored variant, very different patterns of genetic variation may be produced. If local selection favors an existing variant that had already recombined onto multiple genetic backgrounds, then the width of elevated genetic differentiation (high FST) may be too narrow to detect using a typical windowed genome scan, even if the targeted variant becomes highly differentiated. We, therefore, used a simulation approach to investigate the power of SNP-level FST (specifically, the maximum SNP FST value within a window, or FST_MaxSNP) to detect diverse scenarios of local adaptation, and compared it against whole-window FST and the Comparative Haplotype Identity statistic. We found that FST_MaxSNP had superior power to detect complete or mostly complete soft sweeps, but lesser power than full-window statistics to detect partial hard sweeps. Nonetheless, the power of FST_MaxSNP depended highly on sample size, and confident outliers depend on robust precautions and quality control. To investigate the relative enrichment of FST_MaxSNP outliers from real data, we applied the two FST statistics to a panel of Drosophila melanogaster populations. We found that FST_MaxSNP had a genome-wide enrichment of outliers compared with demographic expectations, and though it yielded a lesser enrichment than window FST, it detected mostly unique outlier genes and functional categories. Our results suggest that FST_MaxSNP is highly complementary to typical window-based approaches for detecting local adaptation, and merits inclusion in future genome scans and methodologies. 
  5. Abstract The reduction of genetic diversity due to genetic hitchhiking is widely used to find past selective sweeps from sequencing data, but very little is known about how spatial structure affects hitchhiking. We use mathematical modeling and simulations to find the unfolded site frequency spectrum left by hitchhiking in the genomic region of a sweep in a population occupying a 1D range. For such populations, sweeps spread as Fisher waves, rather than logistically. We find that this leaves a characteristic 3-part site frequency spectrum at loci very close to the swept locus. Very low frequencies are dominated by recent mutations that occurred after the sweep and are unaffected by hitchhiking. At moderately low frequencies, there is a transition zone primarily composed of alleles that briefly “surfed” on the wave of the sweep before falling out of the wavefront, leaving a spectrum close to that expected in well-mixed populations. However, for moderate-to-high frequencies, there is a distinctive scaling regime of the site frequency spectrum produced by alleles that drifted to fixation in the wavefront and then were carried throughout the population. For loci slightly farther away from the swept locus on the genome, recombination is much more effective at restoring diversity in 1D populations than it is in well-mixed ones. We find that these signatures of space can be strong even in apparently well-mixed populations with negligible spatial genetic differentiation, suggesting that spatial structure may frequently distort the signatures of hitchhiking in natural populations.
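
Illustrative note: several of the abstracts above rely on the haplotype frequency spectrum and on the expected haplotype homozygosity statistic H12 named in item 1. The sketch below shows one common way such quantities are computed for a single genomic window, pooling the two most frequent haplotypes before squaring. It is a minimal example under stated assumptions, not the implementation used by any of the listed works: the function names `haplotype_frequencies`, `h1`, and `h12` are hypothetical, and haplotypes are assumed to be given as strings of phased alleles with no missing data.

```python
from collections import Counter


def haplotype_frequencies(haplotypes):
    """Return haplotype frequencies in one window, sorted from most to least common.

    `haplotypes` is assumed to be a non-empty list of hashable items
    (for example, strings of phased alleles), one per sampled chromosome.
    """
    if not haplotypes:
        raise ValueError("at least one haplotype is required")
    counts = Counter(haplotypes)
    n = len(haplotypes)
    return sorted((c / n for c in counts.values()), reverse=True)


def h1(haplotypes):
    """Expected haplotype homozygosity: the sum of squared haplotype frequencies."""
    return sum(p * p for p in haplotype_frequencies(haplotypes))


def h12(haplotypes):
    """Homozygosity after pooling the two most frequent haplotypes into one class."""
    freqs = haplotype_frequencies(haplotypes)
    pooled = sum(freqs[:2])  # combined frequency of the top two haplotypes
    return pooled ** 2 + sum(p * p for p in freqs[2:])


if __name__ == "__main__":
    # Toy window: three distinct haplotypes sampled 5, 3, and 2 times.
    window = ["0101"] * 5 + ["0111"] * 3 + ["1100"] * 2
    print(haplotype_frequencies(window))
    print(h1(window), h12(window))
```

In this toy window the frequencies are 0.5, 0.3, and 0.2. Pooling the top two classes raises the statistic relative to plain homozygosity, which is why a pooled statistic stays elevated when two haplotypes sweep together, as in a soft sweep.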