Title: On the causes, consequences, and avoidance of PCR duplicates: Towards a theory of library complexity
Abstract: Library preparation protocols for most sequencing technologies involve PCR amplification of the template DNA, which opens the possibility that a given template DNA molecule is sequenced multiple times. Reads arising from this phenomenon, known as PCR duplicates, inflate the cost of sequencing and can jeopardize the reliability of affected experiments. Despite the pervasiveness of this artefact, our understanding of its causes and of its impact on downstream statistical analyses remains essentially empirical. Here, we develop a general quantitative model of amplification distortions in sequencing data sets, which we leverage to investigate the factors controlling the occurrence of PCR duplicates. We show that the PCR duplicate rate is determined primarily by the ratio between library complexity and sequencing depth, and that amplification noise (including in its dependence on the number of PCR cycles) only plays a secondary role for this artefact. We confirm our predictions using new and published RAD‐seq libraries and provide a method to estimate library complexity and amplification noise in any data set containing PCR duplicates. We discuss how amplification‐related artefacts impact downstream analyses, and in particular genotyping accuracy. The proposed framework unites the numerous observations made on PCR duplicates and will be useful to experimenters of all sequencing technologies where DNA availability is a concern.
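The dependence of the duplicate rate on the ratio between sequencing depth and library complexity can be illustrated with a textbook occupancy approximation. The sketch below is not the model developed in the paper: it assumes reads are drawn uniformly with replacement from the unique template molecules and it ignores amplification noise entirely. The function name `expected_duplicate_rate` is hypothetical.

```python
import math


def expected_duplicate_rate(complexity: float, depth: float) -> float:
    """Expected fraction of reads that are PCR duplicates.

    Assumption (not from the paper): `depth` reads are sampled uniformly,
    with replacement, from `complexity` unique template molecules, and
    every molecule is equally amplified (no amplification noise).
    """
    if complexity <= 0 or depth <= 0:
        raise ValueError("complexity and depth must be positive")
    ratio = depth / complexity
    # Poisson approximation: a molecule is seen at least once with
    # probability 1 - exp(-ratio), so the expected number of unique
    # molecules observed is complexity * (1 - exp(-ratio)).
    expected_unique = complexity * -math.expm1(-ratio)
    # Every read beyond the first for a given molecule is a duplicate.
    return 1.0 - expected_unique / depth
```

Under these assumptions the result depends only on depth divided by complexity: sequencing a library at a depth equal to its complexity gives an expected duplicate rate of about 37%, whatever the absolute numbers are.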
Award ID(s):
1645087
PAR ID:
10473795
Publisher / Repository:
Wiley
Journal Name:
Molecular Ecology Resources
Volume:
23
Issue:
6
ISSN:
1755-098X
Page Range / eLocation ID:
1299 to 1318
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract: Many applications in molecular ecology require the ability to match specific DNA sequences from single‐ or mixed‐species samples with a diagnostic reference library. Widely used methods for DNA barcoding and metabarcoding employ PCR and amplicon sequencing to identify taxa based on target sequences, but the target‐specific enrichment capabilities of CRISPR‐Cas systems may offer advantages in some applications. We identified 54,837 CRISPR‐Cas guide RNAs that may be useful for enriching chloroplast DNA across phylogenetically diverse plant species. We tested a subset of 17 guide RNAs in vitro to enrich plant DNA strands ranging in size from diagnostic DNA barcodes of 1,428 bp to entire chloroplast genomes of 121,284 bp. We used an Oxford Nanopore sequencer to evaluate sequencing success based on both single‐ and mixed‐species samples, which yielded mean chloroplast sequence lengths of 2,530–11,367 bp, depending on the experiment. In comparison to mixed‐species experiments, single‐species experiments yielded more on‐target sequence reads and greater mean pairwise identity between contigs and the plant species' reference genomes. Nevertheless, the mixed‐species experiments yielded sufficient data to provide a ≥48‐fold increase in sequence length and better estimates of relative abundance for a commercially prepared mixture of plant species compared to DNA metabarcoding based on the chloroplast trnL‐P6 marker. Prior work developed CRISPR‐based enrichment protocols for long‐read sequencing, and our experiments pioneered their use for plant DNA barcoding and chloroplast assemblies that may have advantages over workflows that require PCR and short‐read sequencing. Future work would benefit from continuing to develop in vitro and in silico methods for CRISPR‐based analyses of mixed‐species samples, especially when the appropriate reference genomes for contig assembly cannot be known a priori.
  2. Abstract. Background: Modern plant breeding strategies rely on the intensive use of advanced genomic tools to expedite the development of improved crop varieties. Genomic DNA extraction from crop seeds eliminates the need to grow plants in contrast to fresh leaf tissue; however, it can still be a bottleneck due to the presence of stored compounds and the complexity of the matrix. The interaction of environmentally benign choline-based ionic liquids (ILs) with DNA offers an innovative approach to enhance the quality of extracted DNA from seeds. While prior IL-based plant DNA extraction workflows have primarily supported polymerase chain reaction (PCR) and quantitative PCR-based applications, their suitability for high-throughput sequencing (HTS) remained largely unexplored. This study explores the efficacy of an IL-assisted method for genomic DNA extraction from soybean (Glycine max) seeds, addressing the limited application of ILs in HTS. Results: The optimized DNA extraction method, utilizing 25% (w/v) choline formate, enabled the recovery of high-purity DNA with abundant fragment sizes > 20 kb, suitable for downstream applications including PCR, whole genome amplification (WGA), simple sequence repeat (SSR) amplification, and high-throughput Illumina sequencing. The IL-method was benchmarked against a silica-binding method using cetyltrimethylammonium bromide (CTAB) and sodium dodecyl sulfate (SDS) as lysis agents using a commercial plant DNA extraction kit in terms of DNA yield, purity, abundant DNA fragment size distribution, and integrity. In addition, DNA isolated from this method demonstrated successful PCR amplification of markers from both the nuclear and plastid genomes and yielded > 99% whole genome coverage with Illumina (PE150) sequencing reads. Conclusions: This is the first known instance of a whole genome sequence generated from DNA extracted with ILs.
These findings mark a significant milestone in establishing ILs as promising alternatives to conventional methods for seed DNA extraction, with potential utility in third generation (long-read) sequencing experiments.
  3. As transposon sequencing (TnSeq) assays have become prolific in the microbiology field, it is of interest to scrutinize their potential drawbacks. TnSeq data consist of millions of nucleotide sequence reads that are generated by PCR amplification of transposon-genomic junctions. Reads mapping to the junctions are enumerated thus providing information on the number of transposon insertion mutations in each individual gene. Here we explore the possibility that PCR amplification of transposon insertions in a TnSeq library skews the results by introducing bias into the detection and/or enumeration of insertions. We compared the detection and frequency of mapped insertions when altering the number of PCR cycles, and when including a nested PCR, in the enrichment step. Additionally, we present nCATRAs - a novel, amplification-free TnSeq method where the insertions are enriched via CRISPR/Cas9-targeted transposon cleavage and subsequent Oxford Nanopore MinION sequencing. nCATRAs achieved 54 and 23% enrichment of the transposons and transposon-genomic junctions, respectively, over background genomic DNA. These PCR-based and PCR-free experiments demonstrate that, overall, PCR amplification does not significantly bias the results of TnSeq insofar as insertions in the majority of genes represented in our library were similarly detected regardless of PCR cycle number and whether or not PCR amplification was employed. However, the detection of a small subset of genes which had been previously described as essential is sensitive to the number of PCR cycles. We conclude that PCR-based enrichment of transposon insertions in a TnSeq assay is reliable, but researchers interested in profiling putative essential genes should carefully weigh the number of amplification cycles employed in their library preparation protocols. 
In addition, nCATRAs is comparable to traditional PCR-based methods (Kendall’s correlation=0.896–0.897) although the latter remain superior owing to their accessibility and high sequencing depth. 
  4. Abstract. Background: There is a growing demand for fast and reliable plant biomolecular analyses. DNA extraction is the major bottleneck in plant nucleic acid-based applications especially due to the complexity of tissues in different plant species. Conventional methods for plant cell lysis and DNA extraction typically require extensive sample preparation processes and large quantities of sample and chemicals, elevated temperatures, and multiple sample transfer steps which pose challenges for high throughput applications. Results: In a prior investigation, an ionic liquid (IL)-based modified vortex-assisted matrix solid phase dispersion approach was developed using the model plant, Arabidopsis thaliana (L.) Heynh. Building upon this foundational study, the present study established a simple, rapid and efficient protocol for DNA extraction from milligram fragments of plant tissue representing a diverse range of taxa from the plant Tree of Life including 13 dicots and 4 monocots. Notably, the approach was successful in extracting DNA from a century old herbarium sample. The isolated DNA was of sufficient quality and quantity for sensitive molecular analyses such as qPCR. Two plant DNA barcoding markers, the plastid rbcL and nuclear ribosomal internal transcribed spacer (nrITS) regions were selected for DNA amplification and Sanger sequencing was conducted on PCR products of a representative dicot and monocot species. Successful qPCR amplification of the extracted DNA up to 3 weeks demonstrated that the DNA extracted using this approach remains stable at room temperature for an extended time period prior to downstream analysis. Conclusions: The method presented here is a rapid and simple approach enabling cell lysis and DNA extraction from 1.5 mg of plant tissue across a broad range of plant taxa. Additional purification prior to DNA amplification is not required due to the compatibility of the extraction solvents with qPCR.
The method has tremendous potential for applications in plant biology that require DNA, including barcoding methods for agriculture, conservation, ecology, evolution, and forensics.
  5. David, Lawrence A. (Ed.)
    ABSTRACT Shotgun metagenomic sequencing has transformed our understanding of microbial community ecology. However, preparing metagenomic libraries for high-throughput DNA sequencing remains a costly, labor-intensive, and time-consuming procedure, which in turn limits the utility of metagenomes. Several library preparation procedures have recently been developed to offset these costs, but it is unclear how these newer procedures compare to current standards in the field. In particular, it is not clear if all such procedures perform equally well across different types of microbial communities or if features of the biological samples being processed (e.g., DNA amount) impact the accuracy of the approach. To address these questions, we assessed how five different shotgun DNA sequence library preparation methods, including the commonly used Nextera Flex kit, perform when applied to metagenomic DNA. We measured each method’s ability to produce metagenomic data that accurately represent the underlying taxonomic and genetic diversity of the community. We performed these analyses across a range of microbial community types (e.g., soil, coral associated, and mouse gut associated) and input DNA amounts. We find that the type of community and amount of input DNA influence each method’s performance, indicating that careful consideration may be needed when selecting between methods, especially for low-complexity communities. However, the cost-effective preparation methods that we assessed are generally comparable to the current gold-standard Nextera DNA Flex kit for high-complexity communities. Overall, the results from this analysis will help expand and even facilitate access to metagenomic approaches in future studies. IMPORTANCE Metagenomic library preparation methods and sequencing technologies continue to advance rapidly, allowing researchers to characterize microbial communities in previously underexplored environmental samples and systems. 
However, widely accepted standardized library preparation methods can be cost-prohibitive. Newly available approaches may be less expensive, but their efficacy in comparison to standardized methods remains unknown. In this study, we compared five different metagenomic library preparation methods. We evaluated each method across a range of microbial communities varying in complexity and quantity of input DNA. Our findings demonstrate the importance of considering sample properties, including community type, composition, and DNA amount, when choosing the most appropriate metagenomic library preparation method. 