skip to main content

Title: IMG-DNA: Approximate DNA Storage for Images
Deoxyribonucleic Acid (DNA) as a storage medium with high density and long-term preservation properties can satisfy the requirement of archival storage for rapidly increased digital volume. The read and write processes of DNA storage are error-prone. Images widely used in social media have the properties of fault tolerance which are well fitted to the DNA storage. However, prior work simply investigated the feasibility of DNA storage storing different types of data and simply store images in DNA storage, which did not fully investigate the fault-tolerant potential of images in the DNA storage. In this paper, we proposed a new image-based DNA system called IMG-DNA, which can efficiently store images in DNA storage with improved DNA storage robustness. First, a new DNA architecture is proposed to fit JPEG-based images and improve the image’s robustness in DNA storage. Moreover, barriers inserted in DNA sequences efficiently prevent error propagation in images of DNA storage. The experimental results indicate that the proposed IMG-DNA achieves much higher fault-tolerant than prior work.  more » « less
Award ID(s):
Author(s) / Creator(s):
Date Published:
Journal Name:
International Systems and Storage Conference (SYSTOR’21)
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. With the rapid increase of available digital data, we are searching for a storage media with high density and capability of long-term preservation. Deoxyribonucleic Acid (DNA) storage is identified as such a promising candidate, especially for archival storage systems. However, the encoding density (i.e., how many binary bits can be encoded into one nucleotide) and error handling are two major factors intertwined in DNA storage. Considering encoding density, theoretically, one nucleotide (i.e., A, T, G, or C) can encode two binary bits (upper bound). However, due to biochemical constraints and other necessary information associated with payload, currently the encoding densities of various DNA storage systems are much less than this upper bound. Additionally, all existing studies of DNA encoding schemes are based on static analysis and really lack the awareness of dynamically changed digital patterns. Therefore, the gap between the static encoding and dynamic binary patterns prevents achieving a higher encoding density for DNA storage systems. In this paper, we propose a new Digital Pattern-Aware DNA storage system, called DP-DNA, which can efficiently store digital data in the DNA storage with high encoding density. DP-DNA maintains a set of encoding codes and uses a digital pattern-aware code (DPAC) to analyze the patterns of a binary sequence for a DNA strand and selects an appropriate code for encoding the binary sequence to achieve a high encoding density. An additional encoding field is added to the DNA encoding format, which can distinguish the encoding scheme used for those DNA strands, and thus we can decode DNA data back to its original digital data. Moreover, to further improve the encoding density, a variable-length scheme is proposed to increase the feasibility of the code scheme with a high encoding density. Finally, the experimental results indicate that the proposed DP-DNA achieves up to 103.5% higher encoding densities than prior work. 
    more » « less
  2. INTRODUCTION Solving quantum many-body problems, such as finding ground states of quantum systems, has far-reaching consequences for physics, materials science, and chemistry. Classical computers have facilitated many profound advances in science and technology, but they often struggle to solve such problems. Scalable, fault-tolerant quantum computers will be able to solve a broad array of quantum problems but are unlikely to be available for years to come. Meanwhile, how can we best exploit our powerful classical computers to advance our understanding of complex quantum systems? Recently, classical machine learning (ML) techniques have been adapted to investigate problems in quantum many-body physics. So far, these approaches are mostly heuristic, reflecting the general paucity of rigorous theory in ML. Although they have been shown to be effective in some intermediate-size experiments, these methods are generally not backed by convincing theoretical arguments to ensure good performance. RATIONALE A central question is whether classical ML algorithms can provably outperform non-ML algorithms in challenging quantum many-body problems. We provide a concrete answer by devising and analyzing classical ML algorithms for predicting the properties of ground states of quantum systems. We prove that these ML algorithms can efficiently and accurately predict ground-state properties of gapped local Hamiltonians, after learning from data obtained by measuring other ground states in the same quantum phase of matter. Furthermore, under a widely accepted complexity-theoretic conjecture, we prove that no efficient classical algorithm that does not learn from data can achieve the same prediction guarantee. By generalizing from experimental data, ML algorithms can solve quantum many-body problems that could not be solved efficiently without access to experimental data. RESULTS We consider a family of gapped local quantum Hamiltonians, where the Hamiltonian H ( x ) depends smoothly on m parameters (denoted by x ). The ML algorithm learns from a set of training data consisting of sampled values of x , each accompanied by a classical representation of the ground state of H ( x ). These training data could be obtained from either classical simulations or quantum experiments. During the prediction phase, the ML algorithm predicts a classical representation of ground states for Hamiltonians different from those in the training data; ground-state properties can then be estimated using the predicted classical representation. Specifically, our classical ML algorithm predicts expectation values of products of local observables in the ground state, with a small error when averaged over the value of x . The run time of the algorithm and the amount of training data required both scale polynomially in m and linearly in the size of the quantum system. Our proof of this result builds on recent developments in quantum information theory, computational learning theory, and condensed matter theory. Furthermore, under the widely accepted conjecture that nondeterministic polynomial-time (NP)–complete problems cannot be solved in randomized polynomial time, we prove that no polynomial-time classical algorithm that does not learn from data can match the prediction performance achieved by the ML algorithm. In a related contribution using similar proof techniques, we show that classical ML algorithms can efficiently learn how to classify quantum phases of matter. In this scenario, the training data consist of classical representations of quantum states, where each state carries a label indicating whether it belongs to phase A or phase B . The ML algorithm then predicts the phase label for quantum states that were not encountered during training. The classical ML algorithm not only classifies phases accurately, but also constructs an explicit classifying function. Numerical experiments verify that our proposed ML algorithms work well in a variety of scenarios, including Rydberg atom systems, two-dimensional random Heisenberg models, symmetry-protected topological phases, and topologically ordered phases. CONCLUSION We have rigorously established that classical ML algorithms, informed by data collected in physical experiments, can effectively address some quantum many-body problems. These rigorous results boost our hopes that classical ML trained on experimental data can solve practical problems in chemistry and materials science that would be too hard to solve using classical processing alone. Our arguments build on the concept of a succinct classical representation of quantum states derived from randomized Pauli measurements. Although some quantum devices lack the local control needed to perform such measurements, we expect that other classical representations could be exploited by classical ML with similarly powerful results. How can we make use of accessible measurement data to predict properties reliably? Answering such questions will expand the reach of near-term quantum platforms. Classical algorithms for quantum many-body problems. Classical ML algorithms learn from training data, obtained from either classical simulations or quantum experiments. Then, the ML algorithm produces a classical representation for the ground state of a physical system that was not encountered during training. Classical algorithms that do not learn from data may require substantially longer computation time to achieve the same task. 
    more » « less
  3. Abstract Motivation

    DNA-based data storage is a quickly growing field that hopes to harness the massive theoretical information density of DNA molecules to produce a competitive next-generation storage medium suitable for archival data. In recent years, many DNA-based storage system designs have been proposed. Given that no common infrastructure exists for simulating these storage systems, comparing many different designs along with many different error models is increasingly difficult. To address this challenge, we introduce FrameD, a simulation infrastructure for DNA storage systems that leverages the underlying modularity of DNA storage system designs to provide a framework to express different designs while being able to reuse common components.


    We demonstrate the utility of FrameD and the need for a common simulation platform using a case study. Our case study compares designs that utilize strand copies differently, some that align strand copies using multiple sequence alignment algorithms and others that do not. We found that the choice to include multiple sequence alignment in the pipeline is dependent on the error rate and the type of errors being injected and is not always beneficial. In addition to supporting a wide range of designs, FrameD provides the user with transparent parallelism to deal with a large number of reads from sequencing and the need for many fault injection iterations. We believe that FrameD fills a void in the tools publicly available to the DNA storage community by providing a modular and extensible framework with support for massive parallelism. As a result, it will help accelerate the design process of future DNA-based storage systems.

    Availability and implementation

    The source code for FrameD along with the data generated during the demonstration of FrameD is available in a public Github repository at, (

    more » « less
  4. null (Ed.)
    Fault-tolerant virtual machine (VM) placement refers to the process of placing multiple copies of the same VM cloud application inside cloud data centers. The challenge is how to place required number of VM replicas while minimizing the number of physical machines (PMs) that store them, in order to save energy consumption of cloud data centers. We refer to it as fault-tolerant VM placement problem. In our previous work, we have proposed a greedy algorithm to solve this problem. In this paper, we compare it with an existing research that is based on well-known Welsh Powell Graph-Coloring Algorithm to place items into bins while considering the conflicts between items and items and items and bins. Via extensive simulations, we show that our greedy algorithm can turn off 40-50% more PMs than existing work and can place upto four times as many VM replicas as existing work, achieving much stronger fault-tolerance with less energy consumption. We also compare both algorithms with the optimal integer lin- ear programming (ILP)-based algorithm, which serves as the benchmark of the comparison. 
    more » « less
  5. Wang, H. (Ed.)
    Once upon a time databases were structured, one size fitted all and they resided on machines that were trustworthy and even when they failed, they simply crashed. This era has come and gone as eloquently stated by Stonebraker and Cetintemel [16]. We now have key-value stores, graph databases, text databases, and a myriad of unstructured data repositories. The database community has wholeheartedly accepted the fact that the same information might come in different formats, modes and representations. We also accept that data might not be ”clean” and that data might need to be ”cleaned” due to the diverse sources of information. However, we, as a database community still cling to our 20th century belief that databases always reside on trustworthy, honest servers. Although the database community has always considered fault-tolerance as an integral building block of data management (remember ”D” in ACID is for Durability), we still have trouble accepting the fact that not all failures are simply crash failures and might in fact involve malicious and non-trustworthy infrastructure. This notion has been challenged and abandoned by many other Computer Science communities, most notably the security and the distributed systems communities. The rise of the cloud computing paradigm as well as the rapid popularity of blockchains demand a rethinking of our na¨ıve, comfortable beliefs in an ideal benign infrastructure. In the cloud, clients store their sensitive data in remote servers owned and operated by cloud providers. The Security and Crypto Communities have made significant inroads to protect both data and access privacy from malicious untrusted storage providers using encryption and oblivious data stores. The Distributed Systems and the Systems Communities have developed consensus protocols to ensure the fault-tolerant maintenance of data residing on untrusted, malicious infrastructure. However, these solutions face significant scalability and performance challenges when incorporated in large scale data repositories. Novel database designs need to directly address the natural tension between performance, fault-tolerance and trustworthiness. This is a perfect setting for the database community to lead and guide. 
    more » « less