skip to main content


Title: FrameD: framework for DNA-based data storage design, verification, and validation
Abstract Motivation

DNA-based data storage is a quickly growing field that hopes to harness the massive theoretical information density of DNA molecules to produce a competitive next-generation storage medium suitable for archival data. In recent years, many DNA-based storage system designs have been proposed. Given that no common infrastructure exists for simulating these storage systems, comparing many different designs along with many different error models is increasingly difficult. To address this challenge, we introduce FrameD, a simulation infrastructure for DNA storage systems that leverages the underlying modularity of DNA storage system designs to provide a framework to express different designs while being able to reuse common components.

Results

We demonstrate the utility of FrameD and the need for a common simulation platform using a case study. Our case study compares designs that utilize strand copies differently, some that align strand copies using multiple sequence alignment algorithms and others that do not. We found that the choice to include multiple sequence alignment in the pipeline is dependent on the error rate and the type of errors being injected and is not always beneficial. In addition to supporting a wide range of designs, FrameD provides the user with transparent parallelism to deal with a large number of reads from sequencing and the need for many fault injection iterations. We believe that FrameD fills a void in the tools publicly available to the DNA storage community by providing a modular and extensible framework with support for massive parallelism. As a result, it will help accelerate the design process of future DNA-based storage systems.

Availability and implementation

The source code for FrameD along with the data generated during the demonstration of FrameD is available in a public Github repository at https://github.com/dna-storage/framed, (https://dx.doi.org/10.5281/zenodo.7757762).

 
more » « less
Award ID(s):
1901324 2027655
NSF-PAR ID:
10468401
Author(s) / Creator(s):
; ; ; ; ; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Bioinformatics
Volume:
39
Issue:
10
ISSN:
1367-4811
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Summary: While alignment has been the dominant approach for determining homology prior to phylogenetic inference, alignment-free methods can simplify the analysis, especially when analyzing genome-wide data. Furthermore, alignment-free methods present the only option for emerging forms of data, such as genome skims, which do not permit assembly. Despite the appeal, alignment-free methods have not been competitive with alignment-based methods in terms of accuracy. One limitation of alignment-free methods is their reliance on simplified models of sequence evolution such as Jukes–Cantor. If we can estimate frequencies of base substitutions in an alignment-free setting, we can compute pairwise distances under more complex models. However, since the strand of DNA sequences is unknown for many forms of genome-wide data, which arguably present the best use case for alignment-free methods, the most complex models that one can use are the so-called no strand-bias models. We show how to calculate distances under a four-parameter no strand-bias model called TK4 without relying on alignments or assemblies. The main idea is to replace letters in the input sequences and recompute Jaccard indices between k-mer sets. However, on larger genomes, we also need to compute the number of k-mer mismatches after replacement due to random chance as opposed to homology. We show in simulation that alignment-free distances can be highly accurate when genomes evolve under the assumed models and study the accuracy on assembled and unassembled biological data.

    Availability and implementation

    Our software is available open source at https://github.com/nishatbristy007/NSB.

    Supplementary information

    Supplementary data are available at Bioinformatics Advances online.

     
    more » « less
  2. Alkan, Can (Ed.)
    Abstract Motivation

    Detection of structural variants (SVs) from the alignment of sample DNA reads to the reference genome is an important problem in understanding human diseases. Long reads that can span repeat regions, along with an accurate alignment of these long reads play an important role in identifying novel SVs. Long-read sequencers, such as nanopore sequencing, can address this problem by providing very long reads but with high error rates, making accurate alignment challenging. Many errors induced by nanopore sequencing have a bias because of the physics of the sequencing process and proper utilization of these error characteristics can play an important role in designing a robust aligner for SV detection problems. In this article, we design and evaluate HQAlign, an aligner for SV detection using nanopore sequenced reads. The key ideas of HQAlign include (i) using base-called nanopore reads along with the nanopore physics to improve alignments for SVs, (ii) incorporating SV-specific changes to the alignment pipeline, and (iii) adapting these into existing state-of-the-art long-read aligner pipeline, minimap2 (v2.24), for efficient alignments.

    Results

    We show that HQAlign captures about 4%–6% complementary SVs across different datasets, which are missed by minimap2 alignments while having a standalone performance at par with minimap2 for real nanopore reads data. For the common SV calls between HQAlign and minimap2, HQAlign improves the start and the end breakpoint accuracy by about 10%–50% for SVs across different datasets. Moreover, HQAlign improves the alignment rate to 89.35% from minimap2 85.64% for nanopore reads alignment to recent telomere-to-telomere CHM13 assembly, and it improves to 86.65% from 83.48% for nanopore reads alignment to GRCh37 human genome.

    Availability and implementation

    https://github.com/joshidhaivat/HQAlign.git.

     
    more » « less
  3. Abstract Motivation

    Recent advances in genomics and precision medicine have been made possible through the application of high throughput sequencing (HTS) to large collections of human genomes. Although HTS technologies have proven their use in cataloging human genome variation, computational analysis of the data they generate is still far from being perfect. The main limitation of Illumina and other popular sequencing technologies is their short read length relative to the lengths of (common) genomic repeats. Newer (single molecule sequencing – SMS) technologies such as Pacific Biosciences and Oxford Nanopore are producing longer reads, making it theoretically possible to overcome the difficulties imposed by repeat regions. Unfortunately, because of their high sequencing error rate, reads generated by these technologies are very difficult to work with and cannot be used in many of the standard downstream analysis pipelines. Note that it is not only difficult to find the correct mapping locations of such reads in a reference genome, but also to establish their correct alignment so as to differentiate sequencing errors from real genomic variants. Furthermore, especially since newer SMS instruments provide higher throughput, mapping and alignment need to be performed much faster than before, maintaining high sensitivity.

    Results

    We introduce lordFAST, a novel long-read mapper that is specifically designed to align reads generated by PacBio and potentially other SMS technologies to a reference. lordFAST not only has higher sensitivity than the available alternatives, it is also among the fastest and has a very low memory footprint.

    Availability and implementation

    lordFAST is implemented in C++ and supports multi-threading. The source code of lordFAST is available at https://github.com/vpc-ccg/lordfast.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  4. Classified as a complex big data analytics problem, DNA short read alignment serves as a major sequential bottleneck to massive amounts of data generated by next-generation sequencing platforms. With Von-Neumann computing architectures struggling to address such computationally-expensive and memory-intensive task today, Processing-in-Memory (PIM) platforms are gaining growing interests. In this paper, an energy-efficient and parallel PIM accelerator (AlignS) is proposed to execute DNA short read alignment based on an optimized and hardware-friendly alignment algorithm. We first develop AlignS platform that harnesses SOT-MRAM as computational memory and transforms it to a fundamental processing unit for short read alignment. Accordingly, we present a novel, customized, highly parallel read alignment algorithm that only seeks the proposed simple and parallel in-memory operations (i.e. comparisons and additions). AlignS is then optimized through a new correlated data partitioning and mapping methodology that allows local storage and processing of DNA sequence to fully exploit the algorithm-level's parallelism, and to accelerate both exact and inexact matches. The device-to-architecture co-simulation results show that AlignS improves the short read alignment throughput per Watt per mm^2 by ~12X compared to the ASIC accelerator. Compared to recent FM-index-based ReRAM platform, AlignS achieves 1.6X higher throughput per Watt. 
    more » « less
  5. Abstract Aim

    Lineage fusion (merging of two or more populations of a species resulting in a single panmictic group) is a special case of secondary contact. It has the potential to counteract diversification and speciation, or to facilitate it through creation of novel genotypes. Understanding the prevalence of lineage fusion in nature requires reliable detection of it, such that efficient summary statistics are needed. Here, we report on simulations that characterized the initial intensity and subsequent decay of signatures of past fusion for 17 summary statistics applicable to DNA sequence haplotype data.

    Location

    Global.

    Taxa

    Diploid out‐crossing species.

    Methods

    We considered a range of scenarios that could reveal the impacts of different combinations of read length versus number of loci (arrangement of DNA sequence data), and whether or not pre‐fusion populations experienced bottlenecks coinciding with their divergence (historical context of fusion). Post‐fusion gene pools were sampled along 10 successive time points representing increasing lag times following merging of sister populations, and summary statistic values were recalculated at each.

    Results

    Many summary statistics were able to detect signatures of complete merging of populations after a sampling lag time of 1.5Negenerations, but the most informative ones included two neutrality tests and four diversity metrics, withZnS(a linkage disequilibrium‐based neutrality test) being particularly powerful. Correlation was relatively low among the two neutrality tests and two of the diversity metrics. There were clear benefits of many short (200‐bp × 200) loci over a handful of long (4‐kb × 10) loci. Also, only the latter genetic dataset type showed impacts of bottlenecks during divergence upon the number of informative summary statistics.

    Main conclusions

    This work contributes to identifying cases of lineage fusion, and advances phylogeography by enabling more nuanced reconstructions of how individual species, or multiple members of an ecological community, responded to past environmental change.

     
    more » « less