
Title: Chromosome‐scale inference of hybrid speciation and admixture with convolutional neural networks

Inferring the frequency and mode of hybridization among closely related organisms is an important step for understanding the process of speciation and can help to uncover reticulated patterns of phylogeny more generally. Phylogenomic methods to test for the presence of hybridization come in many varieties and typically operate by leveraging expected patterns of genealogical discordance in the absence of hybridization. An important assumption made by these tests is that the data (genes or SNPs) are independent given the species tree. However, when the data are closely linked, it is especially important to consider their nonindependence. Recently, deep learning techniques such as convolutional neural networks (CNNs) have been used to perform population genetic inferences with linked SNPs coded as binary images. Here, we use CNNs for selecting among candidate hybridization scenarios using the tree topology (((P1,P2),P3), Out) and a matrix of pairwise nucleotide divergence (dXY) calculated in windows across the genome. Using coalescent simulations to train and independently test a neural network showed that our method, HyDe‐CNN, was able to accurately perform model selection for hybridization scenarios across a wide breadth of parameter space. We then used HyDe‐CNN to test models of admixture in Heliconius butterflies and compared it with phylogeny‐based introgression statistics. Given the flexibility of our approach, the dropping cost of long‐read sequencing and the continued improvement of CNN architectures, we anticipate that inferences of hybridization using deep learning methods like ours will help researchers to better understand patterns of admixture in their study organisms.
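To make the input described above more concrete, the following is a minimal sketch of how a matrix of pairwise dXY values in non-overlapping windows might be computed. It is not the HyDe‐CNN implementation: the function names (`dxy`, `windowed_dxy`), the 0/1 haplotype encoding, the window size, and the assumption that every site in the region (including invariant sites) is present in the arrays are all illustrative choices that the abstract does not specify.

```python
from itertools import combinations

import numpy as np


def dxy(a, b):
    """Average per-site number of differences between haplotypes of two populations.

    a, b: integer arrays of shape (n_haplotypes, n_sites) with 0/1 alleles
    (hypothetical encoding, assumed to include invariant sites).
    """
    # Compare every haplotype in a with every haplotype in b at every site.
    diffs = a[:, None, :] != b[None, :, :]  # shape (n_a, n_b, n_sites)
    # The mean over all pairs and sites is the per-site pairwise divergence.
    return diffs.mean()


def windowed_dxy(pops, window):
    """Return (pairs, matrix), where matrix has shape (n_pairs, n_windows)."""
    names = sorted(pops)
    pairs = list(combinations(names, 2))
    n_sites = pops[names[0]].shape[1]
    # Non-overlapping windows; a trailing partial window is dropped.
    starts = range(0, n_sites - window + 1, window)
    matrix = np.array(
        [
            [dxy(pops[x][:, s:s + window], pops[y][:, s:s + window]) for s in starts]
            for x, y in pairs
        ]
    )
    return pairs, matrix


# Random placeholder data for the four taxa in (((P1,P2),P3),Out); not real genotypes.
rng = np.random.default_rng(0)
demo = {name: rng.integers(0, 2, size=(5, 1000)) for name in ("P1", "P2", "P3", "Out")}
demo_pairs, demo_matrix = windowed_dxy(demo, window=100)
print(demo_matrix.shape)  # (6, 10): six population pairs by ten windows
```

Each row of the resulting matrix corresponds to one population pair and each column to one genomic window, so the matrix can be treated as a small image. How the published method arranges, scales, or orders these values before passing them to the network is not stated here and is left open in this sketch.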

Journal Name:
Molecular Ecology Resources
Page Range / eLocation ID:
p. 2676-2688
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Kubatko, Laura (Ed.)
    Abstract Evidence from natural systems suggests that hybridization between animal species is more common than traditionally thought, but the overall contribution of introgression to standing genetic variation within species remains unclear for most animal systems. Here, we use targeted exon capture to sequence thousands of nuclear loci and complete mitochondrial genomes from closely related chipmunk species in the Tamias quadrivittatus group that are distributed across the Great Basin and the central and southern Rocky Mountains of North America. This recent radiation includes six overlapping, ecologically distinct species (Tamias canipes, Tamias cinereicollis, Tamias dorsalis, T. quadrivittatus, Tamias rufus, and Tamias umbrinus) that show evidence for widespread introgression across species boundaries. Such evidence has historically been derived from a handful of markers, typically focused on mitochondrial loci, to describe patterns of introgression; consequently, the extent of introgression of nuclear genes is less well characterized. We conducted a series of phylogenomic and species-tree analyses to resolve the phylogeny of six species in this group. In addition, we performed several population-genomic analyses to characterize nuclear genomes and infer coancestry among individuals. Furthermore, we used emerging quartets-based approaches to simultaneously infer the species tree (SVDquartets) and identify introgression (HyDe). We found that, in spite of rampant introgression of mitochondrial genomes between some species pairs (and sometimes involving up to three species), there appears to be little to no evidence for nuclear introgression. These findings mirror other genomic results where complete mitochondrial capture has occurred between chipmunk species in the absence of appreciable nuclear gene flow. 
The underlying causes of recurrent massive cytonuclear discordance remain unresolved in this group, but mitochondrial DNA appears to be a highly misleading indicator of population histories as a whole. Collectively, it appears that chipmunk species boundaries are largely impermeable to nuclear gene flow and that hybridization, while pervasive with respect to mtDNA, has likely played a relatively minor role in the evolutionary history of this group. [Cytonuclear discordance; hybridization; introgression; phylogenomics; SVDquartets; Tamias.]
  2. Abstract

    Populus tremuloides is the widest‐ranging tree species in North America and an ecologically important component of mesic forest ecosystems displaced by the Pleistocene glaciations. Using phylogeographic analyses of genome‐wide SNPs (34,796 SNPs, 183 individuals) and ecological niche modeling, we inferred population structure, ploidy levels, admixture, and Pleistocene range dynamics of P. tremuloides, and tested several historical biogeographical hypotheses. We found three genetic lineages located mainly in coastal–Cascades (cluster 1), east‐slope Cascades–Sierra Nevadas–Northern Rockies (cluster 2), and U.S. Rocky Mountains through southern Canadian (cluster 3) regions of the P. tremuloides range, with tree graph relationships of the form ((cluster 1, cluster 2), cluster 3). Populations consisted mainly of diploids (86%) but also small numbers of triploids (12%) and tetraploids (1%), and ploidy did not adversely affect our genetic inferences. The main vector of admixture was from cluster 3 into cluster 2, with the admixture zone trending northwest through the Rocky Mountains along a recognized phenotypic cline (Utah to Idaho). Clusters 1 and 2 provided strong support for the “stable‐edge hypothesis” that unglaciated southwestern populations persisted in situ since the last glaciation. By contrast, despite a lack of clinal genetic variation, cluster 3 exhibited “trailing‐edge” dynamics from niche suitability predictions signifying complete northward postglacial expansion. Results were also consistent with the “inland dispersal hypothesis” predicting postglacial assembly of Pacific Northwestern forest ecosystems, but rejected the hypothesis that Pacific‐coastal populations were colonized during outburst flooding from glacial Lake Missoula. Overall, congruent patterns between our phylogeographic and ecological niche modeling results and fossil pollen data demonstrate complex mixtures of stable‐edge, refugial locations, and postglacial expansion within P. tremuloides. 
These findings confirm and refine previous genetic studies, while strongly supporting a distinct Pacific‐coastal genetic lineage of quaking aspen.


  3. The circum-galactic medium (CGM) can feasibly be mapped by multiwavelength surveys covering broad swaths of the sky. With multiple large data sets becoming available in the near future, we develop a likelihood-free Deep Learning technique using convolutional neural networks (CNNs) to infer broad-scale physical properties of a galaxy’s CGM and its halo mass for the first time. Using CAMELS (Cosmology and Astrophysics with MachinE Learning Simulations) data, including IllustrisTNG, SIMBA, and Astrid models, we train CNNs on Soft X-ray and 21-cm (H i) radio two-dimensional maps to trace hot and cool gas, respectively, around galaxies, groups, and clusters. Our CNNs offer the unique ability to train and test on ‘multifield’ data sets comprised of both H i and X-ray maps, providing complementary information about physical CGM properties and improved inferences. Applying eRASS:4 survey limits shows that X-ray is not powerful enough to infer individual haloes with masses log (Mhalo/M⊙) < 12.5. The multifield improves the inference for all halo masses. Generally, the CNN trained and tested on Astrid (SIMBA) can most (least) accurately infer CGM properties. Cross-simulation analysis – training on one galaxy formation model and testing on another – highlights the challenges of developing CNNs trained on a single model to marginalize over astrophysical uncertainties and perform robust inferences on real data. The next crucial step in improving the resulting inferences on the physical properties of CGM depends on our ability to interpret these deep-learning models.

  4. Phylogenomic investigations of biodiversity facilitate the detection of fine-scale population genetic structure and the demographic histories of species and populations. However, determining whether or not the genetic divergence measured among populations reflects species-level differentiation remains a central challenge in species delimitation. One potential solution is to compare genetic divergence between putative new species with other closely related species, sometimes referred to as a reference-based taxonomy. To be described as a new species, a population should be at least as divergent as other species. Here, we develop a reference-based taxonomy for Horned Lizards (Phrynosoma; 17 species) using phylogenomic data (ddRADseq data) to provide a framework for delimiting species in the Greater Short-horned Lizard species complex (P. hernandesi). Previous species delimitation studies of this species complex have produced conflicting results, with morphological data suggesting that P. hernandesi consists of five species, whereas mitochondrial DNA supports anywhere from 1 to 10+ species. To help address this conflict, we first estimated a time-calibrated species tree for P. hernandesi and close relatives using SNP data. These results support the paraphyly of P. hernandesi; we recommend the recognition of two species to promote a taxonomy that is consistent with species monophyly. There is strong evidence for three populations within P. hernandesi, and demographic modeling and admixture analyses suggest that these populations are not reproductively isolated, which is consistent with previous morphological analyses that suggest hybridization could be common. Finally, we characterize the population-species boundary by quantifying levels of genetic divergence for all 18 Phrynosoma species. Genetic divergence measures for western and southern populations of P. hernandesi failed to exceed those of other Phrynosoma species, but the relatively small population size estimated for the northern population causes it to appear as a relatively divergent species. These comparisons underscore the difficulties associated with putting a reference-based approach to species delimitation into practice. Nevertheless, the reference-based approach offers a promising framework for the consistent assessment of biodiversity within clades of organisms with similar life histories and ecological traits.
  5. Abstract

    Insights into the generation of diversity in both plants and animals have relied heavily on studying speciation in adaptive radiations. Russia's Lake Baikal has facilitated a putative adaptive radiation of cottid fishes (sculpins), some of which are highly specialized to inhabit novel niches created by the lake's unique geology and ecology. Here, we test evolutionary relationships and novel morphological adaptation in a piece of this radiation: the Baikal cottid genus, Cottocomephorus, a morphologically derived benthopelagic genus of three described species. We used a combination of mitochondrial DNA and restriction site associated DNA sequencing from all Cottocomephorus species. Analysis of mitochondrial cytochrome b haplotypes was only able to resolve two lineages: C. grewingkii and C. comephoroides/inermis. Phylogenetic inference, principal component analysis, and faststructure of genome‐wide SNPs uncovered three lineages within Cottocomephorus: C. comephoroides, C. inermis and C. grewingkii. We found recent divergence and admixture between C. comephoroides and C. inermis and deep divergence between these two species and C. grewingkii. In contrast to other fish radiations, we found no evidence of ancient hybridization among Cottocomephorus species. Digital morphology revealed highly derived pelagic phenotypes that reflect divergence by specialization to the benthopelagic niche in Cottocomephorus. Among Cottocomephorus species, we found evidence of ongoing adaptation to the pelagic zone. This pattern highlights the importance of speciation along a benthic‐pelagic gradient seen in Cottocomephorus and across other adaptive fish radiations.
