skip to main content

Search for: All records

Creators/Authors contains: "Rochette, Nicolas C."

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Abstract

    Library preparation protocols for most sequencing technologies involve PCR amplification of the template DNA, which open the possibility that a given template DNA molecule is sequenced multiple times. Reads arising from this phenomenon, known as PCR duplicates, inflate the cost of sequencing and can jeopardize the reliability of affected experiments. Despite the pervasiveness of this artefact, our understanding of its causes and of its impact on downstream statistical analyses remains essentially empirical. Here, we develop a general quantitative model of amplification distortions in sequencing data sets, which we leverage to investigate the factors controlling the occurrence of PCR duplicates. We show that the PCR duplicate rate is determined primarily by the ratio between library complexity and sequencing depth, and that amplification noise (including in its dependence on the number of PCR cycles) only plays a secondary role for this artefact. We confirm our predictions using new and published RAD‐seq libraries and provide a method to estimate library complexity and amplification noise in any data set containing PCR duplicates. We discuss how amplification‐related artefacts impact downstream analyses, and in particular genotyping accuracy. The proposed framework unites the numerous observations made on PCR duplicates and will be useful to experimenters of all sequencing technologies where DNA availability is a concern.

    more » « less
    Free, publicly-accessible full text available August 1, 2024
  2. Abstract

    Restriction‐site associated DNA sequencing (RADseq) has become a powerful and versatile tool in modern population genomics, enabling large‐scale evolutionary and genomic analyses in otherwise inaccessible biological systems. With its widespread use, different variants on the protocol have been developed to suit specific experimental needs. Researchers face the challenge of choosing the optimal molecular and sequencing protocols for their reduced representation experimental design, an often‐complicated process. Strategic errors can lead to biased data generation that has reduced power to answer biological questions. Here, we present RADinitio, simulation software for the selection and optimization of RADseq experiments via the generation of sequencing data that behave similarly to empirical sources. RADinitio provides an evolutionary simulation of populations, implementation of various RADseq protocols with customizable parameters, and thorough assessment of missing data. We test the efficacy of the software using different RAD protocols across several organisms, highlighting the importance of protocol selection on the magnitude and quality of data acquired. Additionally, we test the effects of RAD library preparation and sequencing on allelic dropout, observing that library preparation and sequencing often contributes more to missing alleles than population‐level variation.

    more » « less
  3. Abstract

    For half a century population genetics studies have put type II restriction endonucleases to work. Now, coupled with massively‐parallel, short‐read sequencing, the family of RAD protocols that wields these enzymes has generated vast genetic knowledge from the natural world. Here, we describe the first software natively capable of using paired‐end sequencing to derive short contigs from de novo RAD data. Stacks version 2 employs a de Bruijn graph assembler to build and connect contigs from forward and reverse reads for each de novo RAD locus, which it then uses as a reference for read alignments. The new architecture allows all the individuals in a metapopulation to be considered at the same time as each RAD locus is processed. This enables a Bayesian genotype caller to provide precise SNPs, and a robust algorithm to phase those SNPs into long haplotypes, generating RAD loci that are 400–800 bp in length. To prove its recall and precision, we tested the software with simulated data and compared reference‐aligned and de novo analyses of three empirical data sets. Our study shows that the latest version of Stacks is highly accurate and outperforms other software in assembling and genotyping paired‐end de novo data sets.

    more » « less