Title: Flexible Mixture Model Approaches That Accommodate Footprint Size Variability for Robust Detection of Balancing Selection
Abstract: Long-term balancing selection typically leaves narrow footprints of increased genetic diversity, and therefore most detection approaches only achieve optimal performances when sufficiently small genomic regions (i.e., windows) are examined. Such methods are sensitive to window sizes and suffer substantial losses in power when windows are large. Here, we employ mixture models to construct a set of five composite likelihood ratio test statistics, which we collectively term B statistics. These statistics are agnostic to window sizes and can operate on diverse forms of input data. Through simulations, we show that they exhibit comparable power to the best-performing current methods, and retain substantially high power regardless of window sizes. They also display considerable robustness to high mutation rates and uneven recombination landscapes, as well as an array of other common confounding scenarios. Moreover, we applied a specific version of the B statistics, termed B2, to a human population-genomic data set and recovered many top candidates from prior studies, including the then-uncharacterized STPG2 and CCDC169–SOHLH2, both of which are related to gamete functions. We further applied B2 on a bonobo population-genomic data set. In addition to the MHC-DQ genes, we uncovered several novel candidate genes, such as KLRD1, involved in viral defense, and SCN9A, associated with pain perception. Finally, we show that our methods can be extended to account for multiallelic balancing selection and integrated the set of statistics into open-source software named BalLeRMix for future applications by the scientific community.
Award ID(s):
1949268 2001063
PAR ID:
10213839
Author(s) / Creator(s):
Editor(s):
Satta, Yoko
Date Published:
Journal Name:
Molecular Biology and Evolution
Volume:
37
Issue:
11
ISSN:
0737-4038
Page Range / eLocation ID:
3267 to 3291
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The aye-aye (Daubentonia madagascariensis) is one of the 25 most endangered primate species in the world, maintaining amongst the lowest genetic diversity of any primate measured to date. Characterizing patterns of genetic variation within aye-aye populations, and the relative influences of neutral and selective processes in shaping that variation, is thus important for future conservation efforts. In this study, we performed the first whole-genome scans for recent positive and balancing selection in the species, utilizing high-coverage population genomic data from newly sequenced individuals. We generated null thresholds for our genomic scans by creating an evolutionarily appropriate baseline model that incorporates the demographic history of this aye-aye population, and identified a small number of candidate genes. Most notably, a suite of genes involved in olfaction — a key trait in these nocturnal primates — were identified as experiencing long-term balancing selection. We also conducted analyses to quantify the expected statistical power to detect positive and balancing selection in this population using site frequency spectrum-based inference methods, once accounting for the potentially confounding contributions of population history, recombination and mutation rate variation, and purifying and background selection. This work, presenting the first high-quality, genome-wide polymorphism data across the functional regions of the aye-aye genome, thus provides important insights into the landscape of episodic selective forces in this highly endangered species. 
  2. Kim, Yuseob (Ed.)
    Abstract Natural selection leaves a spatial pattern along the genome, with a haplotype distribution distortion near the selected locus that fades with distance. Evaluating the spatial signal of a population-genetic summary statistic across the genome allows for patterns of natural selection to be distinguished from neutrality. Considering the genomic spatial distribution of multiple summary statistics is expected to aid in uncovering subtle signatures of selection. In recent years, numerous methods have been devised that consider genomic spatial distributions across summary statistics, utilizing both classical machine learning and deep learning architectures. However, better predictions may be attainable by improving the way in which features are extracted from these summary statistics. We apply wavelet transform, multitaper spectral analysis, and S-transform to summary statistic arrays to achieve this goal. Each analysis method converts one-dimensional summary statistic arrays to two-dimensional images of spectral analysis, allowing simultaneous temporal and spectral assessment. We feed these images into convolutional neural networks and consider combining models using ensemble stacking. Our modeling framework achieves high accuracy and power across a diverse set of evolutionary settings, including population size changes and test sets of varying sweep strength, softness, and timing. A scan of central European whole-genome sequences recapitulated well-established sweep candidates and predicted novel cancer-associated genes as sweeps with high support. Given that this modeling framework is also robust to missing genomic segments, we believe that it will represent a welcome addition to the population-genomic toolkit for learning about adaptive processes from genomic data. 
  3. Abstract

    Because they are considered rare, balanced polymorphisms are often discounted as crucial constituents of genome‐wide variation in sequence diversity. Despite its perceived rarity, however, long‐term balancing selection can elevate genetic diversity and significantly affect observed divergence between species. Here, we discuss how ancestral balanced polymorphisms can be “sieved” by the speciation process, which sorts them unequally across descendant lineages. After speciation, ancestral balancing selection is revealed by genomic regions of high divergence between species. This signature, which resembles that of other evolutionary processes, can potentially confound genomic studies of population divergence and inferences of “islands of speciation.”

     
  4. Abstract

    Genetic diversity becomes structured among populations over time due to genetic drift and divergent selection. Although population structure is often treated as a uniform underlying factor, recent resequencing studies of wild populations have demonstrated that diversity in many regions of the genome may be structured quite dissimilar to the genome‐wide pattern. Here, we explored the adaptive and nonadaptive causes of such genomic heterogeneity using population‐level, whole genome resequencing data obtained from annual Mimulus guttatus individuals collected across a rugged environment landscape. We found substantial variation in how genetic differentiation is structured both within and between chromosomes, although, in contrast to other studies, known inversion polymorphisms appear to serve only minor roles in this heterogeneity. In addition, much of the genome can be clustered into eight among‐population genetic differentiation patterns, but only two of these clusters are particularly consistent with patterns of isolation by distance. By performing genotype‐environment association analysis, we also identified genomic intervals where local adaptation to specific climate factors has accentuated genetic differentiation among populations, and candidate genes in these windows indicate climate adaptation may proceed through changes affecting specialized metabolism, drought resistance, and development. Finally, by integrating our findings with previous studies, we show that multiple aspects of plant reproductive biology may be common targets of balancing selection and that variants historically involved in climate adaptation among populations have probably also fuelled rapid adaptation to microgeographic environmental variation within sites.

     
  5. Abstract Summary

    The growing availability of genomewide polymorphism data has fueled interest in detecting diverse selective processes affecting population diversity. However, no model-based approaches exist to jointly detect and distinguish the two complementary processes of balancing and positive selection. We extend the BalLeRMix B-statistic framework described in Cheng and DeGiorgio (2020) for detecting balancing selection and present BalLeRMix+, which implements five B statistic extensions based on mixture models to robustly identify both types of selection. BalLeRMix+ is implemented in Python and computes the composite likelihood ratios and associated model parameters for each genomic test position.

    Availability and implementation

    BalLeRMix+ is freely available at https://github.com/bioXiaoheng/BallerMixPlus.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     