Title: Using natural history to guide supervised machine learning for cryptic species delimitation with genetic data
Abstract

The diversity of biological and ecological characteristics of organisms, and the underlying genetic patterns and processes of speciation, make the development of universally applicable genetic species delimitation methods challenging. Many approaches, like those incorporating the multispecies coalescent, sometimes delimit populations rather than species and so overestimate species numbers. This issue is exacerbated in taxa with inherently high population structure due to low dispersal ability, and in cryptic species resulting from nonecological speciation. These taxa present a conundrum when delimiting species: analyses rely heavily, if not entirely, on genetic data, which over-split species, while other lines of evidence lump. We showcase this conundrum in the harvester Theromaster brunneus, a low-dispersal taxon with a wide geographic distribution and high potential for cryptic species. Integrating morphology, mitochondrial, and sub-genomic (double-digest RADSeq and ultraconserved elements) data, we find high discordance across analyses and data types in the number of inferred species, with further evidence that multispecies coalescent approaches over-split. We demonstrate the power of a supervised machine learning approach in effectively delimiting cryptic species by creating a “custom” training data set derived from a well-studied lineage with similar biological characteristics as Theromaster. This novel approach uses known taxa with particular biological characteristics to inform unknown taxa with similar characteristics, using modern computational tools ideally suited for species delimitation. The approach also considers the natural history of organisms to make more biologically informed species delimitation decisions, and in principle is broadly applicable for taxa across the tree of life.
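
The idea of training on a well-studied lineage and then classifying an unknown one can be illustrated with a minimal sketch. It is not the authors' pipeline: the classifier (a plain k-nearest-neighbour vote), the two summary statistics, the function names, and every number below are assumptions chosen only to show the shape of the approach.

```python
from collections import Counter
from math import dist

# Hypothetical training set from a well-studied reference lineage.
# Each row is ((Fst, mean pairwise divergence), label) for one pair of
# populations. All values are invented for illustration.
TRAINING = [
    ((0.10, 0.004), "same"),
    ((0.22, 0.008), "same"),
    ((0.35, 0.012), "same"),
    ((0.60, 0.030), "different"),
    ((0.72, 0.041), "different"),
    ((0.85, 0.055), "different"),
]


def fit_ranges(rows):
    """Return per-feature minima and maxima of the training rows."""
    columns = list(zip(*(features for features, _ in rows)))
    return [min(c) for c in columns], [max(c) for c in columns]


def scale(vector, mins, maxs):
    """Min-max scale a feature vector so no statistic dominates the distance."""
    return tuple(
        (v - lo) / (hi - lo) if hi > lo else 0.0
        for v, lo, hi in zip(vector, mins, maxs)
    )


def classify_pair(features, training=TRAINING, k=3):
    """Label a population pair by majority vote of its k nearest training pairs."""
    mins, maxs = fit_ranges(training)
    query = scale(features, mins, maxs)
    neighbours = sorted(
        training, key=lambda row: dist(query, scale(row[0], mins, maxs))
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

Calling `classify_pair((0.15, 0.005))` on a pair from the focal taxon returns the label voted by the most similar reference pairs. The point of the design is that the labels come from a lineage whose species limits are already understood, so the decision boundary reflects that lineage's biology rather than a generic threshold.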

NSF-PAR ID:
10363189
Journal Name:
Frontiers in Zoology
Volume:
19
Issue:
1
ISSN:
1742-9994
Publisher:
Springer Science + Business Media
Sponsoring Org:
National Science Foundation
More Like this
  1. Barraclough, Timothy G. (Ed.)
    The “multispecies” coalescent (MSC) model that underlies many genomic species-delimitation approaches is problematic because it does not distinguish between genetic structure associated with species versus that of populations within species. Consequently, as both the genomic and spatial resolution of data increases, a proliferation of artifactual species results as within-species population lineages, detected due to restrictions in gene flow, are identified as distinct species. The toll of this extends beyond systematic studies, getting magnified across the many disciplines that rely upon an accurate framework of identified species. Here we present the first of a new class of approaches that addresses this issue by incorporating an extended speciation process for species delimitation. We model the formation of population lineages and their subsequent development into independent species as separate processes and provide for a way to incorporate current understanding of the species boundaries in the system through specification of species identities of a subset of population lineages. As a result, species boundaries and within-species lineage boundaries can be discriminated across the entire system, and species identities can be assigned to the remaining lineages of unknown affinities with quantified probabilities. In addition to the identification of species units in nature, the primary goal of species delimitation, the incorporation of a speciation model also allows us insights into the links between population and species-level processes. By explicitly accounting for restrictions in gene flow not only between, but also within, species, we also address the limits of genetic data for delimiting species. Specifically, while genetic data alone is not sufficient for accurate delimitation, when considered in conjunction with other information we are able to not only learn about species boundaries, but also about the tempo of the speciation process itself.
  2. Background

    Vestimentiferan tubeworms are some of the most recognizable fauna found at deep-sea cold seeps, isolated environments where hydrocarbon-rich fluids fuel biological communities. Several studies have investigated tubeworm population structure; however, much is still unknown about larval dispersal patterns at Gulf of Mexico (GoM) seeps. As such, researchers have applied microsatellite markers as a measure for documenting the transport of vestimentiferan individuals. In the present study, we investigate the utility of microsatellites to be cross-amplified within the escarpiid clade of seep vestimentiferans, by determining if loci originally developed for Escarpia spp. could be amplified in the GoM seep tubeworm, Seepiophila jonesi. Additionally, we determine if cross-amplified loci can reliably uncover the same signatures of high gene flow seen in a previous investigation of S. jonesi.

    Methods

    Seventy-seven S. jonesi individuals were collected from eight seep sites across the upper Louisiana slope (<1,000 m) in the GoM. Forty-eight microsatellite loci that were originally developed for Escarpia laminata (18 loci) and Escarpia southwardae (30 loci) were tested to determine if they were homologous and polymorphic in S. jonesi. Loci found to be both polymorphic and of high quality were used to test for significant population structuring in S. jonesi.

    Results

    Microsatellite pre-screening identified that 13 (27%) of the Escarpia loci were homologous and polymorphic in S. jonesi, revealing that microsatellites can be amplified within the escarpiid clade of vestimentiferans. Our findings uncovered low levels of heterozygosity and a lack of genetic differentiation amongst S. jonesi from various sites and regions, in line with previous investigations that employed species-specific polymorphic loci on S. jonesi individuals retrieved from both the same and different seep sites. The lack of genetic structure identified from these populations supports the presence of significant gene flow via larval dispersal in mixed oceanic currents.

    Discussion

    The ability to develop “universal” microsatellites reduces the costs associated with these analyses and allows researchers to track and investigate a wider array of taxa, which is particularly useful for organisms living at inaccessible locations such as the deep sea. Our study highlights that non-species-specific microsatellites can be amplified across large evolutionary distances and still yield findings similar to those from species-specific loci. Further, these results show that S. jonesi collected from various localities in the GoM represents a single panmictic population, suggesting that dispersal of lecithotrophic larvae by deep-sea currents is sufficient to homogenize populations. These data are consistent with the high levels of gene flow seen in Escarpia spp., which supports the view that differences in microhabitats of seep localities lead to variation in biogeography of separate species.

  3. Phylogenomic investigations of biodiversity facilitate the detection of fine-scale population genetic structure and the demographic histories of species and populations. However, determining whether or not the genetic divergence measured among populations reflects species-level differentiation remains a central challenge in species delimitation. One potential solution is to compare genetic divergence between putative new species with other closely related species, sometimes referred to as a reference-based taxonomy. To be described as a new species, a population should be at least as divergent as other species. Here, we develop a reference-based taxonomy for Horned Lizards (Phrynosoma; 17 species) using phylogenomic data (ddRADseq data) to provide a framework for delimiting species in the Greater Short-horned Lizard species complex (P. hernandesi). Previous species delimitation studies of this species complex have produced conflicting results, with morphological data suggesting that P. hernandesi consists of five species, whereas mitochondrial DNA supports anywhere from 1 to 10+ species. To help address this conflict, we first estimated a time-calibrated species tree for P. hernandesi and close relatives using SNP data. These results support the paraphyly of P. hernandesi; we recommend the recognition of two species to promote a taxonomy that is consistent with species monophyly. There is strong evidence for three populations within P. hernandesi, and demographic modeling and admixture analyses suggest that these populations are not reproductively isolated, which is consistent with previous morphological analyses that suggest hybridization could be common. Finally, we characterize the population-species boundary by quantifying levels of genetic divergence for all 18 Phrynosoma species. Genetic divergence measures for western and southern populations of P. hernandesi failed to exceed those of other Phrynosoma species, but the relatively small population size estimated for the northern population causes it to appear as a relatively divergent species. These comparisons underscore the difficulties associated with putting a reference-based approach to species delimitation into practice. Nevertheless, the reference-based approach offers a promising framework for the consistent assessment of biodiversity within clades of organisms with similar life histories and ecological traits.
  4. In cryptic amphibian complexes, there is a growing trend to equate high levels of genetic structure with hidden cryptic species diversity. Typically, phylogenetic structure and distance-based approaches are used to demonstrate the distinctness of clades and justify the recognition of new cryptic species. However, this approach does not account for gene flow, spatial, and environmental processes that can obfuscate phylogenetic inference and bias species delimitation. As a case study, we sequenced genome-wide exons and introns to evince the processes that underlie the diversification of Philippine Puddle Frogs, a group that is widespread, phenotypically conserved, and exhibits high levels of geographically based genetic structure. We showed that widely adopted tree- and distance-based approaches inferred up to 20 species, compared to genomic analyses that inferred an optimal number of five distinct genetic groups. Using a suite of clustering, admixture, and phylogenetic network analyses, we demonstrate extensive admixture among the five groups and elucidate two specific ways in which gene flow can cause overestimations of species diversity: 1) admixed populations can be inferred as distinct lineages characterized by long branches in phylograms; and 2) admixed lineages can appear to be genetically divergent, even from their parental populations, when simple measures of genetic distance are used. We demonstrate that the relationship between mitochondrial and genome-wide nuclear p-distances is decoupled in admixed clades, leading to erroneous estimates of genetic distances and, consequently, species diversity. Additionally, genetic distance was also biased by spatial and environmental processes. Overall, we showed that high levels of genetic diversity in Philippine Puddle Frogs predominantly comprise metapopulation lineages that arose through complex patterns of admixture, isolation-by-distance, and isolation-by-environment as opposed to species divergence. Our findings suggest that speciation may not be the major process underlying the high levels of hidden diversity observed in many taxonomic groups and that widely adopted tree- and distance-based methods overestimate species diversity in the presence of gene flow.
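
The p-distances mentioned in the Philippine Puddle Frog abstract above are the simplest of the genetic distance measures it refers to: the proportion of compared sites at which two aligned sequences differ. The sketch below is only an illustration. The function name and the choice to skip gap and ambiguity characters are assumptions, since conventions for missing data vary between programs.

```python
def p_distance(seq_a, seq_b, missing="-N?"):
    """Uncorrected p-distance between two aligned sequences.

    Sites where either sequence has a gap or ambiguity character
    (an assumed convention) are excluded from the comparison.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in missing or b in missing:
            continue  # skip sites with missing data in either sequence
        compared += 1
        if a != b:
            differences += 1

    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared
```

For example, `p_distance("ACGTACGT", "ACGTTCGA")` compares eight sites, finds two mismatches, and returns 0.25. Because the measure counts differences without regard to how they arose, admixture can inflate it in the way the abstract describes.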