Title: Best Practices in Designing, Sequencing, and Identifying Random DNA Barcodes
Abstract Random DNA barcodes are a versatile tool for tracking cell lineages, with applications ranging from development to cancer to evolution. Here, we review and critically evaluate barcode designs as well as methods of barcode sequencing and initial processing of barcode data. We first demonstrate how various barcode design decisions affect data quality and propose a new design that balances all considerations that we are currently aware of. We then discuss various options for the preparation of barcode sequencing libraries, including inline indices and Unique Molecular Identifiers (UMIs). Finally, we test the performance of several established and new bioinformatic pipelines for the extraction of barcodes from raw sequencing reads and for error correction. We find that both alignment and regular expression-based approaches work well for barcode extraction, and that error-correction pipelines designed specifically for barcode data are superior to generic ones. Overall, this review will help researchers to approach their barcoding experiments in a deliberate and systematic way.
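The regular-expression approach to barcode extraction mentioned in the abstract can be illustrated with a minimal sketch. The flanking sequences and barcode length below are hypothetical placeholders, not the design proposed in the paper:

```python
import re

# Hypothetical design: these flanking constant regions and this barcode
# length are placeholders, not the design recommended in the article.
LEFT_FLANK = "ACGTACGT"
RIGHT_FLANK = "TGCATGCA"
BARCODE_LEN = 20

# Capture a fixed-length random barcode between the two known flanks.
BARCODE_RE = re.compile(
    re.escape(LEFT_FLANK)
    + "([ACGTN]{%d})" % BARCODE_LEN
    + re.escape(RIGHT_FLANK)
)

def extract_barcode(read):
    """Return the barcode if both flanks match exactly, else None."""
    m = BARCODE_RE.search(read)
    return m.group(1) if m else None

read = "TTT" + LEFT_FLANK + "A" * BARCODE_LEN + RIGHT_FLANK + "GG"
print(extract_barcode(read))  # prints the 20-base barcode
```

A real pipeline would additionally tolerate mismatches or indels in the flanks (e.g. via fuzzy matching or alignment), which is where the alignment-based extractors discussed in the paper come in.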
Award ID(s):
2109800
PAR ID:
10391941
Author(s) / Creator(s):
; ;
Publisher / Repository:
Springer Science + Business Media
Date Published:
Journal Name:
Journal of Molecular Evolution
Volume:
91
Issue:
3
ISSN:
0022-2844
Page Range / eLocation ID:
p. 263-280
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract A barcode is a finite multiset of intervals on the real line. Jaramillo-Rodriguez (2023) previously defined a map from the space of barcodes with a fixed number of bars to a set of multipermutations, which presented new combinatorial invariants on the space of barcodes. A partial order can be defined on these multipermutations, resulting in a class of posets known as combinatorial barcode lattices. In this paper, we provide a number of equivalent definitions for the combinatorial barcode lattice, show that its Möbius function is a restriction of the Möbius function of the symmetric group under the weak Bruhat order, and show its ground set is the Jordan-Hölder set of a labeled poset. Furthermore, we obtain formulas for the number of join-irreducible elements, the rank-generating function, and the number of maximal chains of combinatorial barcode lattices. Lastly, we make connections between intervals in the combinatorial barcode lattice and certain classes of matchings. 
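The map from a barcode to a multipermutation can be sketched as an endpoint-ordering construction (one natural reading; the precise map in Jaramillo-Rodriguez 2023 may differ in details such as tie-breaking):

```python
def barcode_to_word(bars):
    """Sort all interval endpoints and record which bar each belongs to;
    every bar label then appears exactly twice in the resulting word."""
    events = []
    for label, (birth, death) in enumerate(bars, start=1):
        events.append((birth, label))
        events.append((death, label))
    events.sort()
    return [label for _, label in events]

# Disjoint vs overlapping bars yield different multipermutations:
print(barcode_to_word([(0, 1), (2, 3)]))  # [1, 1, 2, 2]
print(barcode_to_word([(0, 2), (1, 3)]))  # [1, 2, 1, 2]
```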
  2. Harris, Kelley (Ed.)
    Measuring the fitnesses of genetic variants is a fundamental objective in evolutionary biology. A standard approach for measuring microbial fitnesses in bulk involves labeling a library of genetic variants with unique sequence barcodes, competing the labeled strains in batch culture, and using deep sequencing to track changes in the barcode abundances over time. However, idiosyncratic properties of barcodes can induce nonuniform amplification or uneven sequencing coverage that causes some barcodes to be over- or under-represented in samples. This systematic bias can result in erroneous read count trajectories and misestimates of fitness. Here, we develop a computational method, named REBAR (Removing the Effects of Bias through Analysis of Residuals), for inferring the effects of barcode processing bias by leveraging the structure of systematic deviations in the data. We illustrate this approach by applying it to two independent data sets, and demonstrate that this method estimates and corrects for bias more accurately than standard proxies, such as GC-based corrections. REBAR mitigates bias and improves fitness estimates in high-throughput assays without introducing additional complexity to the experimental protocols, with potential applications in a range of experimental evolution and mutation screening contexts. 
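The underlying idea of residual-based bias detection can be shown with a toy example (this is an illustration of the concept, not the REBAR algorithm itself): fit a per-barcode trajectory, then look for residual structure shared across barcodes, which independent noise would not produce.

```python
import numpy as np

rng = np.random.default_rng(0)
T = np.arange(5)                       # timepoints of the competition assay

# Toy data: true log-frequency trajectories are linear in time (slope =
# fitness) plus small noise, with a systematic distortion at one timepoint.
slopes = rng.normal(0.0, 0.1, size=50)
logf = slopes[:, None] * T[None, :] + rng.normal(0, 0.01, size=(50, 5))
logf[:, 2] += 0.3                      # shared bias at timepoint 2

# Fit a line per barcode; residuals that are structured ACROSS barcodes
# (rather than independent noise) flag sample-level processing bias.
coef = np.polyfit(T, logf.T, deg=1)    # shape (2, n_barcodes)
fitted = np.outer(coef[0], T) + coef[1][:, None]
mean_residual = (logf - fitted).mean(axis=0)
print(mean_residual.round(2))          # spikes at the biased timepoint
```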
  3. Abstract Background Third-generation single-molecule sequencing technologies can sequence long reads, which is advancing the frontiers of genomics research. However, their high error rates prohibit accurate and efficient downstream analysis. This difficulty has motivated the development of many long-read error correction tools, which tackle this problem through sampling redundancy and/or leveraging accurate short reads of the same biological samples. Existing studies that assess these tools use simulated data sets and are not sufficiently comprehensive in the range of software covered or the diversity of evaluation measures used. Results In this paper, we present a categorization and review of long-read error correction methods, and provide a comprehensive evaluation of the corresponding long-read error correction tools. Leveraging recent real sequencing data, we establish benchmark data sets and set up evaluation criteria for a comparative assessment that includes quality of error correction as well as run-time and memory usage. We study how trimming and long-read sequencing depth affect error correction in terms of length distribution and genome coverage post-correction, and the impact of error correction performance on an important application of long reads: genome assembly. We provide guidelines for practitioners for choosing among the available error correction tools and identify directions for future research. Conclusions Despite the high error rate of long reads, state-of-the-art correction tools can achieve high correction quality. When short reads are available, the best hybrid methods outperform non-hybrid methods in terms of correction quality and computing resource usage. When choosing tools, practitioners should be wary of the few correction tools that discard reads, and should check the effect of error correction on downstream analysis. Our evaluation code is available as open source at https://github.com/haowenz/LRECE .
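One hybrid strategy in the spirit of the tools this review covers can be sketched as k-mer spectrum correction (a minimal illustration, not any specific published tool): build a trusted k-mer set from accurate short reads, then substitute long-read bases so as to maximize trusted k-mer coverage.

```python
def kmer_set(reads, k=5):
    """Collect the trusted k-mers observed in accurate short reads."""
    s = set()
    for r in reads:
        for i in range(len(r) - k + 1):
            s.add(r[i:i + k])
    return s

def correct(read, trusted, k=5):
    """For each base, keep the substitution that maximizes the number of
    trusted k-mers covering that position (ties favor the original base)."""
    read = list(read)
    for j in range(len(read)):
        def score(base):
            read[j] = base
            lo, hi = max(0, j - k + 1), min(len(read) - k, j)
            return sum("".join(read[i:i + k]) in trusted
                       for i in range(lo, hi + 1))
        orig = read[j]
        read[j] = max("ACGT", key=lambda b: (score(b), b == orig))
    return "".join(read)

trusted = kmer_set(["ACGTACGTAC"])      # short-read k-mer spectrum
print(correct("ACGTATGTAC", trusted))   # single substitution error fixed
```

Real correctors handle indels, use k-mer multiplicity thresholds, and scale to billions of k-mers, but the trusted-spectrum principle is the same.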
  4. Abstract Genetic tools are increasingly used to identify and discriminate between species. One key transition in this process was the recognition of the potential of the ca. 658 bp fragment of the organelle cytochrome c oxidase I (COI) as a barcode region, which revolutionized animal bioidentification and led, among other things, to the instigation of the Barcode of Life Database (BOLD), currently containing barcodes from >7.9 million specimens. Following this discovery, suggestions for other organellar regions and markers, and the primers with which to amplify them, have been continuously proposed. Most recently, the field has taken the leap from PCR‐based generation of DNA references into shotgun sequencing‐based "genome skimming" alternatives, with the ultimate goal of assembling organellar reference genomes. Unfortunately, in genome skimming approaches, much of the nuclear genome (as much as 99% of the sequence data) is discarded, which is not only wasteful, but can also limit the power of discrimination at, or below, the species level. Here, we advocate that the full shotgun sequence data can be used to assign an identity (that we term for convenience its "DNA‐mark") for both voucher and query samples, without requiring any computationally intensive pretreatment (e.g. assembly) of reads. We argue that if reference databases are populated with such "DNA‐marks," it will enable future DNA‐based taxonomic identification to complement, or even replace, PCR of barcodes with genome skimming, and we discuss how such methodology ultimately could enable identification to population, or even individual, level.
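Assembly-free identification from raw shotgun reads can be sketched as a k-mer containment comparison (an illustration of the general idea; the names, toy sequences, and tiny k below are hypothetical, and the abstract does not prescribe this exact scoring):

```python
def kmers(reads, k=4):
    """All k-mers observed across a set of raw reads (no assembly)."""
    out = set()
    for r in reads:
        for i in range(len(r) - k + 1):
            out.add(r[i:i + k])
    return out

def assign_identity(query_reads, references, k=4):
    """Score each voucher by the fraction of the query's k-mers it
    contains (a containment index); return the best-scoring name."""
    q = kmers(query_reads, k)
    scores = {name: len(q & kmers(reads, k)) / len(q)
              for name, reads in references.items()}
    return max(scores, key=scores.get)

refs = {"sp_A": ["ACGTACGTTT"], "sp_B": ["GGCCGGCCAA"]}
print(assign_identity(["ACGTACG"], refs))  # sp_A
```

At realistic scale this is done with much larger k and sketching data structures rather than exact sets, but no read pretreatment such as assembly is needed, which is the point the authors make.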
  5. Abstract Summary With the advancement of high-throughput single-cell RNA-sequencing protocols, there has been a rapid increase in the tools available to perform an array of analyses on the resulting gene expression data. For example, there exist methods for pseudo-time series analysis, differential cell usage, cell-type detection, and RNA velocity in single cells. Most analysis pipelines validate their results using known marker genes (which are not widely available for all types of analysis) and simulated data from gene-count-level simulators. Typically, the impact of using different read-alignment or unique molecular identifier (UMI) deduplication methods has not been widely explored, and simulation-based assessments tend to start from an assumed count matrix, ignoring the effects that different approaches for resolving UMI counts from the raw read data may produce. Here, we present minnow, a comprehensive sequence-level droplet-based single-cell RNA-sequencing (dscRNA-seq) experiment simulation framework. Minnow accounts for important sequence-level characteristics of experimental scRNA-seq datasets and models effects such as polymerase chain reaction amplification, cellular barcode (CB) and UMI selection, and sequence fragmentation and sequencing. It also closely matches the gene-level ambiguity characteristics observed in real scRNA-seq experiments. Using minnow, we explore the performance of some common processing pipelines for producing gene-by-cell count matrices from droplet-based scRNA-seq data, demonstrate the effect that realistic levels of gene-level sequence ambiguity can have on accurate quantification, and show a typical use case of minnow in assessing the output generated by different quantification pipelines on the simulated experiment. Supplementary information Supplementary data are available at Bioinformatics online.
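The read structure a sequence-level dscRNA-seq simulator must model can be sketched in a few lines (a toy generator illustrating the CB + UMI + fragment + PCR-duplicate layout, not minnow itself; all lengths and parameters here are arbitrary):

```python
import random

random.seed(1)

def rand_seq(n):
    return "".join(random.choice("ACGT") for _ in range(n))

def simulate_reads(transcripts, n_cells=2, molecules_per_cell=3, pcr_copies=2):
    """Toy dscRNA-seq read generator: each read carries a cell barcode
    (CB), a UMI, and a cDNA fragment; PCR duplicates share all three,
    so UMI deduplication should recover the molecule count."""
    reads = []
    for _ in range(n_cells):
        cb = rand_seq(16)
        for _ in range(molecules_per_cell):
            umi = rand_seq(10)
            frag = random.choice(transcripts)[:50]
            reads += [cb + umi + frag] * pcr_copies
    return reads

reads = simulate_reads(["ATGGCGTTTACCGGA" * 5])
print(len(reads))                    # 12 reads (2 cells x 3 molecules x 2 copies)
print(len({r[:26] for r in reads}))  # 6 distinct CB+UMI pairs after dedup
```

What minnow adds on top of this skeleton, per the abstract, is realistic modeling of PCR amplification, barcode/UMI selection, fragmentation, sequencing error, and gene-level sequence ambiguity.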