Fewer than half of individuals with a suspected Mendelian or monogenic condition receive a precise molecular diagnosis after comprehensive clinical genetic testing. Improvements in data quality and costs have heightened interest in using long-read sequencing (LRS) to streamline clinical genomic testing, but the absence of control data sets for variant filtering and prioritization has made tertiary analysis of LRS data challenging. To address this, the 1000 Genomes Project (1KGP) Oxford Nanopore Technologies Sequencing Consortium aims to generate LRS data from at least 800 of the 1KGP samples. Our goal is to use LRS to identify a broader spectrum of variation so we may improve our understanding of normal patterns of human variation. Here, we present data from analysis of the first 100 samples, representing all 5 superpopulations and 19 subpopulations. These samples, sequenced to an average depth of coverage of 37× and sequence read N50 of 54 kbp, have high concordance with previous studies for identifying single nucleotide and indel variants outside of homopolymer regions. Using multiple structural variant (SV) callers, we identify an average of 24,543 high-confidence SVs per genome, including shared and private SVs likely to disrupt gene function as well as pathogenic expansions within disease-associated repeats that were not detected using short reads. Evaluation of methylation signatures revealed expected patterns at known imprinted loci, samples with skewed X-inactivation patterns, and novel differentially methylated regions. All raw sequencing data, processed data, and summary statistics are publicly available, providing a valuable resource for the clinical genetics community to discover pathogenic SVs.
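The read-length statistic quoted above (sequence read N50 of 54 kbp) has a simple definition that is easy to compute from a list of read lengths. A minimal sketch (the toy lengths below are illustrative, not from the study's data):

```python
def read_n50(lengths):
    """N50: the length L such that reads of length >= L together
    account for at least half of the total sequenced bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Toy example: five reads totalling 100 kbp.
lengths = [10_000, 15_000, 20_000, 25_000, 30_000]
print(read_n50(lengths))  # 25000: the 30 kbp + 25 kbp reads cover >= 50 kbp
```

N50 is a weighted median of read length, so a handful of very long reads can raise it substantially even when most reads are short.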
Batch effects in population genomic studies with low‐coverage whole genome sequencing data: Causes, detection and mitigation
Over the past few decades, there has been an explosion in the amount of publicly available sequencing data. This opens new opportunities for combining data sets to achieve unprecedented sample sizes, spatial coverage or temporal replication in population genomic studies. However, a common concern is that nonbiological differences between data sets may generate patterns of variation in the data that can confound real biological patterns, a problem known as batch effects. In this paper, we compare two batches of low-coverage whole genome sequencing (lcWGS) data generated from the same populations of Atlantic cod (Gadus morhua). First, we show that with a “batch-effect-naive” bioinformatic pipeline, batch effects systematically biased our genetic diversity estimates, population structure inference and selection scans. We then demonstrate that these batch effects resulted from multiple technical differences between our data sets, including the sequencing chemistry (four-channel vs. two-channel), sequencing run, read type (single-end vs. paired-end), read length (125 vs. 150 bp), DNA degradation level (degraded vs. well preserved) and sequencing depth (0.8× vs. 0.3× on average). Lastly, we illustrate that a set of simple bioinformatic strategies (such as different read trimming and single nucleotide polymorphism filtering) can be used to detect batch effects in our data and substantially mitigate their impact. We conclude that combining data sets remains a powerful approach as long as batch effects are explicitly accounted for. We focus on lcWGS data in this paper, which may be particularly vulnerable to certain causes of batch effects, but many of our conclusions also apply to other sequencing strategies.
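One of the batch-effect detection strategies described above — flagging sites where between-batch differences exceed what sampling noise can explain — can be sketched with simulated allele frequencies. This is a toy illustration of the idea, not the paper's pipeline; the sample sizes, bias magnitude, and z-score cutoff are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 1,000 SNPs genotyped in two "batches" drawn from the same
# population (assumed setup for illustration).
n_snps, n1, n2 = 1000, 100, 100
true_f = rng.uniform(0.05, 0.95, n_snps)
f1 = rng.binomial(2 * n1, true_f) / (2 * n1)
f2 = rng.binomial(2 * n2, true_f) / (2 * n2)

# Inject a systematic artifact (e.g. chemistry-dependent reference bias)
# at 5% of sites in batch 2.
biased = rng.choice(n_snps, n_snps // 20, replace=False)
f2[biased] = np.clip(f2[biased] - 0.25, 0, 1)

# Flag sites whose between-batch frequency difference exceeds what
# binomial sampling noise alone would explain (|z| > 4).
se = np.sqrt(f1 * (1 - f1) / (2 * n1) + f2 * (1 - f2) / (2 * n2))
z = np.abs(f1 - f2) / np.maximum(se, 1e-6)
flagged = np.flatnonzero(z > 4)

print(len(flagged))
```

Most injected sites are caught while nearly all unbiased sites pass, which is the logic behind the SNP-filtering mitigation the abstract mentions: suspect sites can be excluded before downstream inference.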
- Award ID(s): 1756316
- PAR ID: 10317278
- Date Published:
- Journal Name: Molecular Ecology Resources
- ISSN: 1755-098X
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Costantini, Maria (Ed.) Abstract: Genetic data from nonmodel species can inform ecology and physiology, giving insight into a species' distribution and abundance as well as its responses to changing environments, all of which are important for species conservation and management. Moreover, reduced sequencing costs and improved long-read sequencing technology allow researchers to readily generate genomic resources for nonmodel species. Here, we apply Oxford Nanopore long-read sequencing and low-coverage (∼1×) whole-genome short-read sequencing (Illumina) to assemble a genome and examine the population genetics of an abundant tropical and subtropical fish, the hardhead silverside (Atherinomorus stipes). These fish are found in shallow coastal waters and are frequently included in ecological models because they serve as abundant prey for commercially and ecologically important species. Despite their importance in subtropical and tropical ecosystems, little is known about their population connectivity and genetic diversity. Our A. stipes genome assembly is about 1.2 Gb, with repetitive element content (∼47%), number of protein duplication events, and DNA methylation patterns comparable to other teleost fish species. Among five sampled populations spanning 43 km of South Florida and the Florida Keys, we find little population structure, suggesting high population connectivity.
Harris, Kelley (Ed.) Measuring the fitnesses of genetic variants is a fundamental objective in evolutionary biology. A standard approach for measuring microbial fitnesses in bulk involves labeling a library of genetic variants with unique sequence barcodes, competing the labeled strains in batch culture, and using deep sequencing to track changes in the barcode abundances over time. However, idiosyncratic properties of barcodes can induce nonuniform amplification or uneven sequencing coverage that causes some barcodes to be over- or under-represented in samples. This systematic bias can result in erroneous read count trajectories and misestimates of fitness. Here, we develop a computational method, named REBAR (Removing the Effects of Bias through Analysis of Residuals), for inferring the effects of barcode processing bias by leveraging the structure of systematic deviations in the data. We illustrate this approach by applying it to two independent data sets, and demonstrate that this method estimates and corrects for bias more accurately than standard proxies, such as GC-based corrections. REBAR mitigates bias and improves fitness estimates in high-throughput assays without introducing additional complexity to the experimental protocols, with potential applications in a range of experimental evolution and mutation screening contexts.
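The baseline fitness estimate that REBAR's residual analysis corrects can be sketched in a few lines: fitness is the slope of a barcode's log frequency (relative to a neutral reference) across timepoints. The counts below are toy numbers, not from the paper's data sets, and the least-squares slope is a standard estimator rather than REBAR itself:

```python
import numpy as np

# Toy barcode-count matrix: rows = barcodes, columns = timepoints.
counts = np.array([
    [1000, 1500, 2250],   # barcode growing ~1.5x per interval
    [1000, 1000, 1000],   # neutral reference barcode
    [1000,  667,  444],   # barcode shrinking ~1.5x per interval
], dtype=float)

freqs = counts / counts.sum(axis=0)      # barcode frequencies per timepoint
log_ratio = np.log(freqs / freqs[1])     # relative to the neutral barcode
t = np.arange(counts.shape[1])

# Relative fitness = least-squares slope of log frequency ratio vs. time.
slopes = np.polyfit(t, log_ratio.T, 1)[0]
print(np.round(slopes, 2))
```

Barcode-specific amplification bias perturbs individual entries of `counts`, which bends these log-ratio trajectories away from straight lines; REBAR's contribution is to model the structure of those residual deviations and remove it before the slope is interpreted as fitness.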
Abstract: Changes in telomere length are increasingly used to indicate species' response to environmental stress across diverse taxa. Despite this broad use, few studies have explored telomere length in plants. Thus, evaluation of new approaches for measuring telomeres in plants is needed. Rapid advances in sequencing approaches and bioinformatic tools now allow estimation of telomere content from whole-genome sequencing (WGS) data, a proxy for telomere length. While telomere content has been quantified extensively using quantitative polymerase chain reaction (qPCR) and WGS in humans, no study to date has compared the effectiveness of WGS in estimating telomere length in plants relative to qPCR approaches. In this study, we use 100 Populus clones re-sequenced using short-read Illumina sequencing to quantify telomere length, comparing three different bioinformatic approaches (Computel, K-seek and TRIP) in addition to qPCR. Overall, telomere length estimates varied across different bioinformatic approaches, but were highly correlated across methods for individual genotypes. A positive correlation was observed between WGS estimates and qPCR; however, Computel estimates exhibited the greatest correlation. Computel incorporates genome coverage into telomere length calculations, suggesting that genome coverage is likely important to telomere length quantification when using WGS data. Overall, telomere estimates from WGS provided greater precision and accuracy relative to qPCR. The findings suggest WGS is a promising approach for assessing telomere length and, as the field of telomere ecology evolves, may add value to assays of plant responses to biotic and abiotic environments, helping to accelerate plant breeding and conservation management.
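The core idea shared by tools like Computel and K-seek — counting telomeric repeats in raw reads and normalizing by the amount of sequence examined — can be illustrated with a toy example. This is a simplified sketch, not any of those tools' actual algorithms; the reads, the 90% purity cutoff, and the plant-type repeat unit used here are assumptions for illustration:

```python
# Toy reads; real pipelines operate on full FASTQ/BAM inputs.
reads = [
    "TTTAGGG" * 20,                 # purely telomeric read
    "ACGTACGTTTTAGGGACGT" * 5,      # mostly non-telomeric
    "ACGT" * 35,                    # non-telomeric
]

REPEAT = "TTTAGGG"  # plant-type (Arabidopsis-like) telomere repeat unit

def telomere_content(reads, repeat=REPEAT, min_fraction=0.9):
    """Fraction of sequenced bases in reads dominated by the telomere
    repeat -- a coverage-normalized proxy for telomere length."""
    total = sum(len(r) for r in reads)
    telo = 0
    for r in reads:
        hits = r.count(repeat) * len(repeat)
        if hits / len(r) >= min_fraction:
            telo += len(r)
    return telo / total

print(round(telomere_content(reads), 3))
```

Dividing by total sequenced bases is the normalization step the abstract highlights: without it, a deeper-sequenced sample would appear to have longer telomeres simply because more telomeric reads were observed.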
Abstract: Low-coverage whole-genome sequencing (WGS) is increasingly used for the study of evolution and ecology in both model and non-model organisms; however, effective application of low-coverage WGS data requires the implementation of probabilistic frameworks to account for the uncertainties in genotype likelihoods. Here, we present a probabilistic framework for using genotype likelihoods for standard population assignment applications. Additionally, we derive the Fisher information for allele frequency from genotype likelihoods and use that to describe a novel metric, the effective sample size, which figures heavily in assignment accuracy. We make these developments available for application through WGSassign, an open-source software package that is computationally efficient for working with whole-genome data. Using simulated and empirical data sets, we demonstrate the behaviour of our assignment method across a range of population structures, sample sizes and read depths. Through these results, we show that WGSassign can provide highly accurate assignment, even for samples with low average read depths (<0.01×) and among weakly differentiated populations. Our simulation results highlight the importance of equalizing the effective sample sizes among source populations in order to achieve accurate population assignment with low-coverage WGS data. We further provide study design recommendations for population assignment studies and discuss the broad utility of effective sample size for studies using low-coverage WGS data.
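The genotype-likelihood framework described above boils down to marginalizing over the unknown genotype at each site when scoring a sample against each candidate population's allele frequencies. A minimal sketch of that likelihood computation — toy numbers and a Hardy-Weinberg genotype prior, not WGSassign's implementation or file formats:

```python
import math

# Toy genotype likelihoods P(reads | genotype) for one low-coverage sample
# at 3 sites; genotypes coded as 0/1/2 copies of the alternate allele.
gls = [
    (0.70, 0.25, 0.05),
    (0.10, 0.50, 0.40),
    (0.05, 0.35, 0.60),
]

# Alternate-allele frequencies in two candidate source populations.
pop_freqs = {"popA": [0.1, 0.5, 0.9], "popB": [0.8, 0.2, 0.1]}

def assignment_loglik(gls, freqs):
    """Sum over sites of log P(reads | pop), marginalizing over the
    unknown genotype with a Hardy-Weinberg prior at frequency f."""
    ll = 0.0
    for (l0, l1, l2), f in zip(gls, freqs):
        prior = ((1 - f) ** 2, 2 * f * (1 - f), f ** 2)
        ll += math.log(l0 * prior[0] + l1 * prior[1] + l2 * prior[2])
    return ll

best = max(pop_freqs, key=lambda p: assignment_loglik(gls, pop_freqs[p]))
print(best)
```

Because the genotype is never called, low-depth uncertainty flows directly into the assignment score — which is why the abstract's effective sample size, rather than raw sample count, governs how sharply the likelihoods distinguish source populations.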