Abstract Motivation: Single-nucleotide variants (SNVs) are the most common variations in the human genome. Recently developed methods for SNV detection from single-cell DNA sequencing data, such as SCIΦ and scVILP, leverage the evolutionary history of the cells to overcome the technical errors associated with single-cell sequencing protocols. Despite being accurate, these methods are not scalable to the extensive genomic breadth of single-cell whole-genome (scWGS) and whole-exome sequencing (scWES) data. Results: Here, we report on a new scalable method, Phylovar, which extends the phylogeny-guided variant calling approach to sequencing datasets containing millions of loci. Through benchmarking on simulated datasets under different settings, we show that Phylovar outperforms SCIΦ in terms of running time while being more accurate than Monovar (which is not phylogeny-aware) in terms of SNV detection. Furthermore, we applied Phylovar to two real biological datasets: an scWES triple-negative breast cancer dataset consisting of 32 cells and 3375 loci, and an scWGS dataset of neuron cells from a normal human brain containing 16 cells and approximately 2.5 million loci. For the cancer dataset, Phylovar detected somatic SNVs with high or moderate functional impact that were also supported by a bulk sequencing dataset; for the neuron dataset, Phylovar identified 5745 SNVs with non-synonymous effects, some of which were associated with neurodegenerative diseases. Availability and implementation: Phylovar is implemented in Python and is publicly available at https://github.com/NakhlehLab/Phylovar.
Estimating Genome-Wide Phylogenies Using Probabilistic Topic Modeling
Abstract Methods for rapidly inferring the evolutionary history of species or populations with genome-wide data are progressing, but computational constraints still limit our abilities in this area. We developed an alignment-free method to infer genome-wide phylogenies and implemented it in the Python package TopicContml. The method uses probabilistic topic modeling (specifically, Latent Dirichlet Allocation) to extract topic frequencies from k-mers, which are derived from multilocus DNA sequences. These extracted frequencies then serve as an input for the program Contml in the PHYLIP package, which is used to generate a species tree. We evaluated the performance of TopicContml on simulated datasets with gaps and three biological datasets: 1) 14 DNA sequence loci from two Australian bird species distributed across nine populations, 2) 5162 loci from 80 mammal species, and 3) raw, unaligned, nonorthologous PacBio sequences from 12 bird species. We also assessed the uncertainty of the estimated relationships among clades using a bootstrap procedure. Our empirical results and simulated data suggest that our method is efficient and statistically robust.
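As a rough illustration of the first step described above, the sketch below turns grouped DNA sequences into a k-mer count matrix, the kind of "document by word" table that a topic model such as Latent Dirichlet Allocation takes as input. It is not TopicContml's code: the function names, the toy sequences, the choice of k = 3, and the decision to skip any k-mer containing a gap or ambiguous base are all assumptions made for this example. The LDA and Contml steps are not shown.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count overlapping k-mers over a group's sequences.

    Assumption for this sketch: k-mers containing anything other than
    A, C, G, or T (gaps, ambiguity codes) are skipped.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def count_matrix(groups, k):
    """Build a shared k-mer vocabulary and one count row per group."""
    per_group = {name: kmer_counts(seqs, k) for name, seqs in groups.items()}
    vocab = sorted(set().union(*per_group.values()))
    rows = {name: [c[kmer] for kmer in vocab] for name, c in per_group.items()}
    return vocab, rows


# Hypothetical toy input: two populations, one sequence containing a gap.
groups = {"pop_A": ["ACGTAC", "ACG-AC"], "pop_B": ["TTGTAC"]}
vocab, rows = count_matrix(groups, k=3)
print(vocab)
print(rows)
```

Each row can then be handed to a topic model, with one row per population or species playing the role of a document and each k-mer playing the role of a word.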
- Award ID(s): 2109989
- PAR ID: 10593923
- Publisher / Repository: Oxford University Press
- Date Published:
- Journal Name: Systematic Biology
- ISSN: 1063-5157
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
We developed a novel method for efficiently estimating time-varying selection coefficients from genome-wide ancient DNA data. In simulations, our method accurately recovers selective trajectories and is robust to misspecification of population size. We applied it to a large data set of ancient and present-day human genomes from Britain and identified seven loci with genome-wide significant evidence of selection in the past 4500 yr. Almost all of them can be related to increased vitamin D or calcium levels, suggesting strong selective pressure on these or related phenotypes. However, the strength of selection on individual loci varied substantially over time, suggesting that cultural or environmental factors moderated the genetic response. Of 28 complex anthropometric and metabolic traits, skin pigmentation was the only one with significant evidence of polygenic selection, further underscoring the importance of phenotypes related to vitamin D. Our approach illustrates the power of ancient DNA to characterize selection in human populations and illuminates the recent evolutionary history of Britain.
Abstract Whole‐genome sequencing data allow survey of variation from across the genome, reducing the constraint of balancing genome sub‐sampling with estimating recombination rates and linkage between sampled markers and target loci. As sequencing costs decrease, low‐coverage whole‐genome sequencing of pooled or indexed‐individual samples is commonly utilized to identify loci associated with phenotypes or environmental axes in non‐model organisms. There are, however, relatively few publicly available bioinformatic pipelines designed explicitly to analyse these types of data, and fewer still that process the raw sequencing data, provide useful metrics of quality control and then execute analyses. Here, we present an updated version of a bioinformatics pipeline called PoolParty2 that can effectively handle either pooled or indexed DNA samples and includes new features to improve computational efficiency. Using simulated data, we demonstrate the ability of our pipeline to recover segregating variants, estimate their allele frequencies accurately, and identify genomic regions harbouring loci under selection. Based on the simulated data set, we benchmark the efficacy of our pipeline with another bioinformatic suite, angsd, and illustrate the compatibility and complementarity of these suites using angsd to generate genotype likelihoods as input for identifying linkage outlier regions using alignment files and variants provided by PoolParty2. Finally, we apply our updated pipeline to an empirical dataset of low‐coverage whole genomic data from population samples of Columbia River steelhead trout (Oncorhynchus mykiss), results from which demonstrate the genomic impacts of decades of artificial selection in a prominent hatchery stock. Thus, we not only demonstrate the utility of PoolParty2 for genomic studies that combine sequencing data from multiple individuals, but also illustrate how it complements other bioinformatics resources such as angsd.
Abstract There is a growing focus on the role of DNA methylation in the ability of marine invertebrates to rapidly respond to changing environmental factors and anthropogenic impacts. However, genome‐wide DNA methylation studies in nonmodel organisms are currently hampered by a limited understanding of methodological biases. Here, we compare three methods for quantifying DNA methylation at single base‐pair resolution (whole genome bisulfite sequencing (WGBS), reduced representation bisulfite sequencing (RRBS), and methyl‐CpG binding domain bisulfite sequencing (MBDBS)) using multiple individuals from two reef‐building coral species with contrasting environmental sensitivity. All methods reveal substantially greater methylation in Montipora capitata (11.4%) than the more sensitive Pocillopora acuta (2.9%). The majority of CpG methylation in both species occurs in gene bodies and flanking regions. In both species, MBDBS has the greatest capacity for detecting CpGs in coding regions at our sequencing depth, but MBDBS may be influenced by intrasample methylation heterogeneity. RRBS yields robust information for specific loci albeit without enrichment of any particular genome feature and with significantly reduced genome coverage. Relative genome size strongly influences the number and location of CpGs detected by each method when sequencing depth is limited, illuminating nuances in cross‐species comparisons. As genome‐wide methylation differences, supported by data across bisulfite sequencing methods, may contribute to environmental sensitivity phenotypes in critical marine invertebrate taxa, these data provide a genomic resource for investigating the functional role of DNA methylation in environmental tolerance.
Ruane, Sara (Ed.) Abstract Comparisons of intraspecific genetic diversity across species can reveal the roles of geography, ecology, and life history in shaping biodiversity. The wide availability of mitochondrial DNA (mtDNA) sequences in open-access databases makes this marker practical for conducting analyses across several species in a common framework, but patterns may not be representative of overall species diversity. Here, we gather new and existing mtDNA sequences and genome-wide nuclear data (genotyping-by-sequencing; GBS) for 30 North American squamate species sampled in the Southeastern and Southwestern United States. We estimated mtDNA nucleotide diversity for 2 mtDNA genes, COI (22 species alignments; average 16 sequences) and cytb (22 species; average 58 sequences), as well as nuclear heterozygosity and nucleotide diversity from GBS data for 118 individuals (30 species; 4 individuals and 6,820 to 44,309 loci per species). We showed that nuclear genomic diversity estimates were highly consistent across individuals for some species, while other species showed large differences depending on the locality sampled. Range size was positively correlated with both cytb diversity (phylogenetically independent contrasts: R² = 0.31, P = 0.007) and GBS diversity (R² = 0.21; P = 0.006), while other predictors differed across the top models for each dataset. Mitochondrial and nuclear diversity estimates were not correlated within species, although sampling differences in the data available made these datasets difficult to compare. Further study of mtDNA and nuclear diversity sampled across species’ ranges is needed to evaluate the roles of geography and life history in structuring diversity across a variety of taxonomic groups.
