skip to main content

Title: Algorithms for reconstruction over single and multiple deletion channels
Recent advances in DNA sequencing technology and DNA storage systems have rekindled the interest in deletion channels. Multiple recent works have looked at variants of sequence reconstruction over a single and over multiple deletion channels, a notoriously difficult problem due to its highly combinatorial nature. Although works in theoretical computer science have provided algorithms which guarantee perfect reconstruction with multiple independent observations from the deletion channel, they are only applicable in the large blocklength regime and more restrictively, when the number of observations is also large. Indeed, with only a few observations, perfect reconstruction of the input sequence may not even be possible in most cases. In such situations, maximum likelihood (ML) and maximum aposteriori (MAP) estimates for the deletion channels are natural questions that arise and these have remained open to the best of our knowledge. In this work, we take steps to answer the two aforementioned questions. Specifically: 1. We show that solving for the ML estimate over the single deletion channel (which can be cast as a discrete optimization problem) is equivalent to solving its relaxation, a continuous optimization problem; 2. We exactly compute the symbolwise posterior distributions (under some assumptions on the priors) for both the more » single as well as multiple deletion channels. As part of our contributions, we also introduce tools to visualize and analyze error events, which we believe could be useful in other related problems concerning deletion channels. « less
Authors:
Award ID(s):
1705077 1740047
Publication Date:
NSF-PAR ID:
10273066
Journal Name:
IEEE transactions on information theory
Volume:
67
Issue:
6
Page Range or eLocation-ID:
3389-3410
ISSN:
1557-9654
Sponsoring Org:
National Science Foundation
More Like this
  1. The problem of reconstructing a sequence when observed through multiple looks over deletion channels occurs in “de novo” DNA sequencing. The DNA could be sequenced multiple times, yielding several “looks” of it, but each time the sequencer could be noisy with (independent) deletion impairments. The main goal of this paper is to develop reconstruction algorithms for a sequence observed through the lens of a fixed number of deletion channels. We use the probabilistic model of the deletion channels to develop both symbol-wise and sequence maximum likelihood decoding crite- ria, and algorithms motivated by them. Numerical evaluations demonstrate improvement in terms of edit distance error, over earlier algorithms.
  2. The problem of reconstructing a sequence when observed through multiple looks over deletion channels occurs in “de novo” DNA sequencing. The DNA could be sequenced multiple times, yielding several “looks” of it, but each time the sequencer could be noisy with (independent) deletion impairments. The main goal of this paper is to develop reconstruction algorithms for a sequence observed through the lens of a fixed number of deletion channels. We use the probabilistic model of the deletion channels to develop both symbol-wise and sequence maximum likelihood decoding criteria, and algorithms motivated by them. Numerical evaluations demonstrate improvement in terms of edit distance error, over earlier algorithms.
  3. The problem of reconstructing a sequence when observed through multiple looks over deletion channels occurs in “de novo” DNA sequencing. The DNA could be sequenced multiple times, yielding several “looks” of it, but each time the sequencer could be noisy with (independent) deletion impairments. The main goal of this paper is to develop reconstruction algorithms for a sequence observed through the lens of a fixed number of deletion channels. We use the probabilistic model of the deletion channels to develop both symbol-wise and sequence maximum likelihood decoding crite- ria, and algorithms motivated by them. Numerical evaluations demonstrate improvement in terms of edit distance error, over earlier algorithms.
  4. Potassium ion (K+) channels have been observed in diverse viruses that infect eukaryotic marine and freshwater algae. However, experimental evidence for functional K+ channels among these alga-infecting viruses has thus far been restricted to members of the family Phycodnaviridae, which are large, double-stranded DNA viruses within the phylum Nucleocytoviricota. Recent sequencing projects revealed that alga-infecting members of Mimiviridae, another family within this phylum, may also contain genes encoding K+ channels. Here we examine the structural features and the functional properties of putative K+ channels from four cultivated members of Mimiviridae. While all four proteins contain variations of the conserved selectivity filter sequence of K+ channels, structural prediction algorithms suggest that only two of them have the required number and position of two transmembrane domains that are present in all K+ channels. After in vitro translation and reconstitution of the four proteins in planar lipid bilayers, we confirmed that one of them, a 79 amino acid protein from the virus Tetraselmis virus 1 (TetV-1), forms a functional ion channel with a distinct selectivity for K+ over Na+ and a sensitivity to Ba2+. Thus, virus-encoded K+ channels are not limited to Phycodnaviridae but also occur in the members of Mimiviridae. The largemore »sequence diversity among the viral K+ channels implies multiple events of lateral gene transfer.« less
  5. Maximum likelihood (ML) and symbolwise maximum aposteriori (MAP) estimation for discrete input sequences play a central role in a number of applications that arise in communications, information and coding theory. Many instances of these problems are proven to be intractable, for example through reduction to NP-complete integer optimization problems. In this work, we prove that the ML estimation of a discrete input sequence (with no assumptions on the encoder/channel used) is equivalent to the solution of a continuous non-convex optimization problem, and that this formulation is closely related to the computation of symbolwise MAP estimates. This equivalence is particularly useful in situations where a function we term the expected likelihood is efficiently computable. In such situations, we give a ML heuristic and show numerics for sequence estimation over the deletion channel.