Title: Rapid, robust plasmid verification by de novo assembly of short sequencing reads
Abstract: Plasmids are a foundational tool for basic and applied research across all subfields of biology. Increasingly, researchers in synthetic biology are relying on and developing massive libraries of plasmids as vectors for directed evolution, combinatorial gene circuit tests, and CRISPR multiplexing. Verification of plasmid sequences following synthesis is a crucial quality control step that creates a bottleneck in plasmid fabrication workflows. Crucially, researchers often elect to forgo the cumbersome verification step, potentially leading to reproducibility issues and, depending on the application, security issues. To facilitate plasmid verification and improve the quality and reproducibility of life science research, we developed a fast, simple, and open source pipeline for assembly and verification of plasmid sequences from Illumina reads. We demonstrate that our pipeline, which relies on de novo assembly, can also be used to detect contaminating sequences in plasmid samples. In addition to presenting our pipeline, we discuss the role of verification and quality control in the increasingly complex life science workflows ushered in by synthetic biology.
Award ID(s): 1934573
PAR ID: 10302166
Journal Name: Nucleic Acids Research
Volume: 48
Issue: 18
ISSN: 0305-1048
Page Range / eLocation ID: e106
Sponsoring Org: National Science Foundation
More Like this
  1. Mienda, Bashir Sajo (Ed.)
    Engineered plasmids have been workhorses of recombinant DNA technology for nearly half a century. Plasmids are used to clone DNA sequences encoding new genetic parts and to reprogram cells by combining these parts in new ways. Historically, many genetic parts on plasmids were copied and reused without routinely checking their DNA sequences. With the widespread use of high-throughput DNA sequencing technologies, we now know that plasmids often contain variants of common genetic parts that differ slightly from their canonical sequences. Because the exact provenance of a genetic part on a particular plasmid is usually unknown, it is difficult to determine whether these differences arose due to mutations during plasmid construction and propagation or due to intentional editing by researchers. In either case, it is important to understand how the sequence changes alter the properties of the genetic part. We analyzed the sequences of over 50,000 engineered plasmids using depositor metadata and a metric inspired by the natural language processing field. We detected 217 uncatalogued genetic part variants that were especially widespread or were likely the result of convergent evolution or engineering. Several of these uncatalogued variants are known mutants of plasmid origins of replication or antibiotic resistance genes that are missing from current annotation databases. However, most are uncharacterized, and 3/5 of the plasmids we analyzed contained at least one of the uncatalogued variants. Our results include a list of genetic parts to prioritize for refining engineered plasmid annotation pipelines, highlight widespread variants of parts that warrant further investigation to see whether they have altered characteristics, and suggest cases where unintentional evolution of plasmid parts may be affecting the reliability and reproducibility of science. 
  2. Despite the wide use of plasmids in research and clinical production, the need to verify plasmid sequences is a bottleneck that is too often underestimated in the manufacturing process. Although sequencing platforms continue to improve, the method and assembly pipeline chosen still influence the final plasmid assembly sequence. Furthermore, few dedicated tools exist for plasmid assembly, especially for de novo assembly. Here, we evaluated short-read, long-read, and hybrid (both short and long reads) de novo assembly pipelines across three replicates of a 24-plasmid library. Consistent with previous characterizations of each sequencing technology, short-read assemblies had issues resolving GC-rich regions, and long-read assemblies commonly had small insertions and deletions, especially in repetitive regions. The hybrid approach facilitated the most accurate, consistent assembly generation and identified mutations relative to the reference sequence. Although Sanger sequencing can be used to verify specific regions, some GC-rich and repetitive regions were difficult to resolve using any method, suggesting that easily sequenced genetic parts should be prioritized in the design of new genetic constructs. 
  3. Staphylococci can cause a wide array of infections that can be life threatening. These infections become more deadly when the isolates are antibiotic resistant and thus harder to treat. Many resistance determinants are plasmid-mediated; however, staphylococcal plasmids have not yet been fully characterized. In particular, plasmids and their contributions to antibiotic resistance have not been investigated within the Arab states, where antibiotic use is not universally regulated. Here, we characterized the putative plasmid content among 56 Staphylococcus aureus and 10 Staphylococcus haemolyticus clinical isolates from Alexandria, Egypt. Putative plasmid sequences were detected in over half of our collection. In total, we identified 72 putative plasmid sequences in 27 S. aureus and 1 S. haemolyticus isolates. While these isolates typically carried one or two plasmids, we identified one isolate, S. aureus AA53, with 11 putative plasmids. The plasmid sequences most frequently encoded a Rep_1, RepL, or PriCT_1 type replication protein. As expected, antibiotic resistance genes were widespread among the identified plasmid sequences. Related plasmids were identified amongst our clinical isolates; homologous plasmids present in multiple isolates clustered into 11 groups based upon sequence similarity. Plasmids from the same cluster often shared antibiotic resistance genes, including blaZ, which is associated with β-lactam resistance. Our analyses suggest that plasmids are a key factor in the pathology and epidemiology of S. aureus in Egypt. A better characterization of plasmids and the role they contribute to the success of Staphylococci as pathogens will guide the design of effective control strategies to limit their spread.
  4. Large language models (LLMs) have become a dominant and important tool for NLP researchers in a wide range of tasks. Today, many researchers use LLMs in synthetic data generation, task evaluation, fine-tuning, distillation, and other model-in-the-loop research workflows. However, challenges arise when using these models that stem from their scale, their closed source nature, and the lack of standardized tooling for these new and emerging workflows. The rapid rise to prominence of these models and these unique challenges has had immediate adverse impacts on open science and on the reproducibility of work that uses them. In this ACL 2024 theme track paper, we introduce DataDreamer, an open source Python library that allows researchers to write simple code to implement powerful LLM workflows. DataDreamer also helps researchers adhere to best practices that we propose to encourage open science and reproducibility. The library and documentation are available at: https://github.com/datadreamer-dev/DataDreamer. 
  5. Scientific data, its analysis, accuracy, completeness, and reproducibility play a vital role in advancing science and engineering. Open Science Chain (OSC) is a cyberinfrastructure platform built using the Hyperledger Fabric (HLF) blockchain technology to address issues related to data reproducibility and accountability in scientific research. OSC preserves the integrity of research datasets and enables different research groups to share datasets with the integrity information. Additionally, it enables quick verification of the exact datasets that were used for a particular published study and tracks their provenance. In this paper, we describe OSC’s command line utility that will preserve the integrity of research datasets from within the researchers’ environment or from remote systems such as HPC resources or campus clusters used for research. The Python-based command line utility can be seamlessly integrated within research workflows and provides an easy way to preserve the integrity of research data in the OSC blockchain platform.