skip to main content


Title: On Maximum Likelihood Reconstruction over Multiple Deletion Channels
The problem of reconstructing a sequence when observed through multiple looks over deletion channels occurs in “de novo” DNA sequencing. The DNA could be sequenced multiple times, yielding several “looks” of it, but each time the sequencer could be noisy with (independent) deletion impairments. The main goal of this paper is to develop reconstruction algorithms for a sequence observed through the lens of a fixed number of deletion channels. We use the probabilistic model of the deletion channels to develop both symbol-wise and sequence maximum likelihood decoding criteria, and algorithms motivated by them. Numerical evaluations demonstrate improvement in terms of edit distance error, over earlier algorithms.  more » « less
Award ID(s):
1705077
NSF-PAR ID:
10058579
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
IEEE International Symposium on Information Theory (ISIT)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The problem of reconstructing a sequence when observed through multiple looks over deletion channels occurs in “de novo” DNA sequencing. The DNA could be sequenced multiple times, yielding several “looks” of it, but each time the sequencer could be noisy with (independent) deletion impairments. The main goal of this paper is to develop reconstruction algorithms for a sequence observed through the lens of a fixed number of deletion channels. We use the probabilistic model of the deletion channels to develop both symbol-wise and sequence maximum likelihood decoding crite- ria, and algorithms motivated by them. Numerical evaluations demonstrate improvement in terms of edit distance error, over earlier algorithms. 
    more » « less
  2. The problem of reconstructing a sequence when observed through multiple looks over deletion channels occurs in “de novo” DNA sequencing. The DNA could be sequenced multiple times, yielding several “looks” of it, but each time the sequencer could be noisy with (independent) deletion impairments. The main goal of this paper is to develop reconstruction algorithms for a sequence observed through the lens of a fixed number of deletion channels. We use the probabilistic model of the deletion channels to develop both symbol-wise and sequence maximum likelihood decoding crite- ria, and algorithms motivated by them. Numerical evaluations demonstrate improvement in terms of edit distance error, over earlier algorithms. 
    more » « less
  3. Recent advances in DNA sequencing technology and DNA storage systems have rekindled the interest in deletion channels. Multiple recent works have looked at variants of sequence reconstruction over a single and over multiple deletion channels, a notoriously difficult problem due to its highly combinatorial nature. Although works in theoretical computer science have provided algorithms which guarantee perfect reconstruction with multiple independent observations from the deletion channel, they are only applicable in the large blocklength regime and more restrictively, when the number of observations is also large. Indeed, with only a few observations, perfect reconstruction of the input sequence may not even be possible in most cases. In such situations, maximum likelihood (ML) and maximum aposteriori (MAP) estimates for the deletion channels are natural questions that arise and these have remained open to the best of our knowledge. In this work, we take steps to answer the two aforementioned questions. Specifically: 1. We show that solving for the ML estimate over the single deletion channel (which can be cast as a discrete optimization problem) is equivalent to solving its relaxation, a continuous optimization problem; 2. We exactly compute the symbolwise posterior distributions (under some assumptions on the priors) for both the single as well as multiple deletion channels. As part of our contributions, we also introduce tools to visualize and analyze error events, which we believe could be useful in other related problems concerning deletion channels. 
    more » « less
  4. null (Ed.)
    Potassium ion (K+) channels have been observed in diverse viruses that infect eukaryotic marine and freshwater algae. However, experimental evidence for functional K+ channels among these alga-infecting viruses has thus far been restricted to members of the family Phycodnaviridae, which are large, double-stranded DNA viruses within the phylum Nucleocytoviricota. Recent sequencing projects revealed that alga-infecting members of Mimiviridae, another family within this phylum, may also contain genes encoding K+ channels. Here we examine the structural features and the functional properties of putative K+ channels from four cultivated members of Mimiviridae. While all four proteins contain variations of the conserved selectivity filter sequence of K+ channels, structural prediction algorithms suggest that only two of them have the required number and position of two transmembrane domains that are present in all K+ channels. After in vitro translation and reconstitution of the four proteins in planar lipid bilayers, we confirmed that one of them, a 79 amino acid protein from the virus Tetraselmis virus 1 (TetV-1), forms a functional ion channel with a distinct selectivity for K+ over Na+ and a sensitivity to Ba2+. Thus, virus-encoded K+ channels are not limited to Phycodnaviridae but also occur in the members of Mimiviridae. The large sequence diversity among the viral K+ channels implies multiple events of lateral gene transfer. 
    more » « less
  5. String edit distances have been used for decades in applications ranging from spelling correction and web search suggestions to DNA analysis. Most string edit distances are variations of the Levenshtein distance and consider only single-character edits. In forensic applications polymorphic genetic markers such as short tandem repeats (STRs) are used. At these repetitive motifs the DNA copying errors consist of more than just single base differences. More often the phenomenon of “stutter” is observed, where the number of repeated units differs (by whole units) from the template. To adapt the Levenshtein distance to be suitable for forensic applications where DNA sequence similarity is of interest, a generalized string edit distance is defined that accommodates the addition or deletion of whole motifs in addition to single-nucleotide edits. A dynamic programming implementation is developed for computing this distance between sequences. The novelty of this algorithm is in handling the complex interactions that arise between multiple- and single-character edits. Forensic examples illustrate the purpose and use of the Restricted Forensic Levenshtein (RFL) distance measure, but applications extend to sequence alignment and string similarity in other biological areas, as well as dynamic programming algorithms more broadly. 
    more » « less