

Title: Minimum sample sizes for population genomics: an empirical study from an Amazonian plant species
Abstract

High‐throughput DNA sequencing facilitates the analysis of large portions of the genome in nonmodel organisms, ensuring high accuracy of population genetic parameters. However, empirical studies evaluating the appropriate sample size for these kinds of studies are still scarce. In this study, we use double‐digest restriction‐associated DNA sequencing (ddRADseq) to recover thousands of single nucleotide polymorphisms (SNPs) for two physically isolated populations of Amphirrhox longifolia (Violaceae), a nonmodel plant species for which no reference genome is available. We used resampling techniques to construct simulated populations with a random subset of individuals and SNPs to determine how many individuals and biallelic markers should be sampled for accurate estimates of intra‐ and interpopulation genetic diversity. We identified 3646 and 4900 polymorphic SNPs for the two populations of A. longifolia, respectively. Our simulations show that, overall, increasing the sample size beyond eight individuals has little impact on estimates of genetic diversity within A. longifolia populations when 1000 or more SNPs are used. Our results also show that even at a very small sample size (i.e. two individuals), accurate estimates of FST can be obtained with a large number of SNPs (≥1500). These results highlight the potential of high‐throughput genomic sequencing approaches to address questions related to evolutionary biology in nonmodel organisms. Furthermore, our findings also provide insights into the optimization of sampling strategies in the era of population genomics.
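
The resampling idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the 0/1/2 genotype coding, the use of uncorrected expected heterozygosity (2pq) as the diversity statistic, and the simulated data are all assumptions made for illustration. The sketch repeatedly draws a random subset of individuals and SNPs without replacement and records how the estimate varies across replicates.

```python
import random
from statistics import mean, pstdev


def expected_heterozygosity(genotypes, individuals, snps):
    """Mean 2p(1-p) over the chosen SNPs, using only the chosen individuals.

    genotypes[i][j] is the alternate-allele count (0, 1 or 2) of
    individual i at biallelic SNP j (an assumed encoding).
    """
    n_alleles = 2 * len(individuals)
    per_snp = []
    for j in snps:
        alt = sum(genotypes[i][j] for i in individuals)
        p = alt / n_alleles
        per_snp.append(2 * p * (1 - p))
    return mean(per_snp)


def resample_he(genotypes, n_ind, n_snp, n_reps, rng):
    """Return n_reps estimates, each from a random subset of individuals and SNPs."""
    n_total_ind = len(genotypes)
    n_total_snp = len(genotypes[0])
    estimates = []
    for _ in range(n_reps):
        inds = rng.sample(range(n_total_ind), n_ind)   # without replacement
        snps = rng.sample(range(n_total_snp), n_snp)   # without replacement
        estimates.append(expected_heterozygosity(genotypes, inds, snps))
    return estimates


def simulate_genotypes(n_ind, n_snp, rng):
    """Hypothetical stand-in data: two independent allele draws per individual per SNP."""
    freqs = [rng.uniform(0.05, 0.95) for _ in range(n_snp)]
    return [
        [(rng.random() < p) + (rng.random() < p) for p in freqs]
        for _ in range(n_ind)
    ]


if __name__ == "__main__":
    rng = random.Random(42)
    data = simulate_genotypes(n_ind=30, n_snp=2000, rng=rng)
    for n_ind in (2, 4, 8, 16):
        est = resample_he(data, n_ind, n_snp=1000, n_reps=100, rng=rng)
        print(n_ind, round(mean(est), 4), round(pstdev(est), 4))
```

Comparing the mean and spread of the replicate estimates across subset sizes shows where additional individuals or markers stop changing the result. Note that uncorrected 2pq is biased downward in small samples; a small-sample correction factor of 2n/(2n-1) is commonly applied and could be added to the statistic.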

 
NSF-PAR ID:
10038645
Author(s) / Creator(s):
 ;  ;  ;  
Publisher / Repository:
Wiley-Blackwell
Date Published:
Journal Name:
Molecular Ecology Resources
Volume:
17
Issue:
6
ISSN:
1755-098X
Page Range / eLocation ID:
p. 1136-1147
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Monitoring genetic diversity in wild populations is a central goal of ecological and evolutionary genetics and is critical for conservation biology. However, genetic studies of nonmodel organisms generally lack access to species‐specific genotyping methods (e.g. array‐based genotyping) and must instead use sequencing‐based approaches. Although costs are decreasing, high‐coverage whole‐genome sequencing (WGS), which produces the highest confidence genotypes, remains expensive. More economical reduced representation sequencing approaches fail to capture much of the genome, which can hinder downstream inference. Low‐coverage WGS combined with imputation using a high‐confidence reference panel is a cost‐effective alternative, but the accuracy of genotyping using low‐coverage WGS and imputation in nonmodel populations is still largely uncharacterized. Here, we empirically tested the accuracy of low‐coverage sequencing (0.1–10×) and imputation in two natural populations, one with a large (n = 741) reference panel, rhesus macaques (Macaca mulatta), and one with a smaller (n = 68) reference panel, gelada monkeys (Theropithecus gelada). Using samples sequenced to coverage as low as 0.5×, we could impute genotypes at >95% of the sites in the reference panel with high accuracy (median r² ≥ 0.92). We show that low‐coverage imputed genotypes can reliably calculate genetic relatedness and population structure. Based on these data, we also provide best practices and recommendations for researchers who wish to deploy this approach in other populations, with all code available on GitHub (https://github.com/mwatowich/LoCSI‐for‐non‐model‐species). Our results endorse accurate and effective genotype imputation from low‐coverage sequencing, enabling the cost‐effective generation of population‐scale genetic datasets necessary for tackling many pressing challenges of wildlife conservation.

     
  2. Abstract

    Plant collections held by botanic gardens and arboreta are key components of ex situ conservation. Maintaining genetic diversity in such collections allows them to be used as resources for supplementing wild populations. However, most recommended minimum sample sizes for sufficient ex situ genetic diversity are based on microsatellite markers, and it remains unknown whether these sample sizes remain valid in light of more recently developed next‐generation sequencing (NGS) approaches. To address this knowledge gap, we examine how ex situ conservation status and sampling recommendations differ when derived from microsatellites and single nucleotide polymorphisms (SNPs) in garden and wild samples of two threatened oak species. For Quercus acerifolia, SNPs show lower ex situ representation of wild allelic diversity and slightly lower minimum sample size estimates than microsatellites, while results for each marker are largely similar for Q. boyntonii. The application of missing data filters tends to lead to higher ex situ representation, while the impact of different SNP calling approaches is dependent on the species being analyzed. Measures of population differentiation within species are broadly similar between markers, but larger numbers of SNP loci allow for greater resolution of population structure and clearer assignment of ex situ individuals to wild source populations. Our results offer guidance for future ex situ conservation assessments utilizing SNP data, such as the application of missing data filters and the usage of a reference genome, and illustrate that both microsatellites and SNPs remain viable options for botanic gardens and arboreta seeking to ensure the genetic diversity of their collections.

     
  3. Abstract

    Objectives

    Long‐tailed macaques (Macaca fascicularis) are widely distributed throughout the mainland and islands of Southeast Asia, making them a useful model for understanding the complex biogeographical history resulting from drastic changes in sea levels throughout the Pleistocene. Past studies based on mitochondrial genomes (mitogenomes) of long‐tailed macaque museum specimens have traced their colonization patterns throughout the archipelago, but mitogenomes trace only the maternal history. Here, our objectives were to trace phylogeographic patterns of long‐tailed macaques using low‐coverage nuclear DNA (nDNA) data from museum specimens.

    Methods

    We performed population genetic analyses and phylogenetic reconstruction on nuclear single nucleotide polymorphisms (SNPs) from shotgun sequencing of 75 long‐tailed macaque museum specimens from localities throughout Southeast Asia.

    Results

    We show that shotgun sequencing of museum specimens yields sufficient genome coverage (average ~1.7%) for reconstructing population relationships using SNP data. Contrary to expectations of divergent results between nuclear and mitochondrial genomes for a female philopatric species, phylogeographical patterns based on nuclear SNPs proved to be closely similar to those found using mitogenomes. In particular, population genetic analyses and phylogenetic reconstruction from the nDNA identify two major clades within M. fascicularis: Clade A includes all individuals from the mainland along with individuals from northern Sumatra, while Clade B consists of the remaining island‐living individuals, including those from southern Sumatra.

    Conclusions

    Overall, we demonstrate that low‐coverage sequencing of nDNA from museum specimens provides enough data for examining broad phylogeographic patterns, although greater genome coverage and sequencing depth would be needed to distinguish between very closely related populations, such as those throughout the Philippines.

     
  4. Abstract

    Wallace's riverine barrier hypothesis postulates that large rivers, such as the Amazon and its tributaries, reduce or prevent gene flow between populations on opposite banks, leading to allopatry and areas of species endemism occupying interfluvial regions. Several studies have shown that two major tributaries, Rio Branco and Rio Negro, are important barriers to gene flow for birds, amphibians and primates. No botanical studies have considered the potential role of the Rio Branco as a barrier, while a single botanical study has evaluated the Rio Negro as a barrier. We studied an Amazon shrub, Amphirrhox longifolia (A. St.‐Hil.) Spreng (Violaceae), as a model to test the riverine barrier hypothesis. Twenty‐six populations of A. longifolia were sampled on both banks of the Rio Branco and Rio Negro in the core Amazon Basin. Double‐digest RADseq was used to identify 8,010 unlinked SNP markers from the nuclear genome of 156 individuals. Data relating to population structure support the hypothesis that the Rio Negro acted as a significant genetic barrier for A. longifolia. On the other hand, no genetic differentiation was detected among populations spanning the narrower Rio Branco, which is a tributary of the Rio Negro. This study shows that the strength of riverine barriers for Amazon plants is dependent on the width of the river separating populations and species‐specific dispersal traits. Future studies of plants with contrasting life history traits will further improve our understanding of the landscape genetics and allopatric speciation history of Amazon plant diversity.

     
  5. Abstract

    Over the past few decades, there has been an explosion in the amount of publicly available sequencing data. This opens new opportunities for combining data sets to achieve unprecedented sample sizes, spatial coverage or temporal replication in population genomic studies. However, a common concern is that nonbiological differences between data sets may generate patterns of variation in the data that can confound real biological patterns, a problem known as batch effects. In this paper, we compare two batches of low‐coverage whole genome sequencing (lcWGS) data generated from the same populations of Atlantic cod (Gadus morhua). First, we show that with a “batch‐effect‐naive” bioinformatic pipeline, batch effects systematically biased our genetic diversity estimates, population structure inference and selection scans. We then demonstrate that these batch effects resulted from multiple technical differences between our data sets, including the sequencing chemistry (four‐channel vs. two‐channel), sequencing run, read type (single‐end vs. paired‐end), read length (125 vs. 150 bp), DNA degradation level (degraded vs. well preserved) and sequencing depth (0.8× vs. 0.3× on average). Lastly, we illustrate that a set of simple bioinformatic strategies (such as different read trimming and single nucleotide polymorphism filtering) can be used to detect batch effects in our data and substantially mitigate their impact. We conclude that combining data sets remains a powerful approach as long as batch effects are explicitly accounted for. We focus on lcWGS data in this paper, which may be particularly vulnerable to certain causes of batch effects, but many of our conclusions also apply to other sequencing strategies.

     