Motivated by mutation processes occurring in in vivo DNA-storage applications, a channel that mutates stored strings by duplicating substrings as well as substituting symbols is studied. Two models of such a channel are considered: one in which the substitutions occur only within the duplicated substrings, and one in which the location of substitutions is unrestricted. Both error-detecting and error-correcting codes are constructed, which can handle correctly any number of tandem duplications of a fixed length k, and at most a single substitution occurring at any time during the mutation process.
more »
« less
Single-Error Detection and Correction for Duplication and Substitution Channels
Motivated by mutation processes occurring in in-vivo DNA-storage applications, a channel that mutates stored strings by duplicating substrings as well as substituting symbols is studied. Two models of such a channel are considered: one in which the substitutions occur only within the duplicated substrings, and one in which the location of substitutions is unrestricted. Both error-detecting and error-correcting codes are constructed, which can handle correctly any number of tandem duplications of a fixed length k, and at most a single substitution occurring at any time during the mutation process.
more »
« less
- PAR ID:
- 10111325
- Date Published:
- Journal Name:
- 2019 IEEE International Symposium on Information Theory (ISIT)
- Page Range / eLocation ID:
- 300 to 304
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Abstract MotivationModern methods for computation-intensive tasks in sequence analysis (e.g. read mapping, sequence alignment, genome assembly, etc.) often first transform each sequence into a list of short, regular-length seeds so that compact data structures and efficient algorithms can be employed to handle the ever-growing large-scale data. Seeding methods using kmers (substrings of length k) have gained tremendous success in processing sequencing data with low mutation/error rates. However, they are much less effective for sequencing data with high error rates as kmers cannot tolerate errors. ResultsWe propose SubseqHash, a strategy that uses subsequences, rather than substrings, as seeds. Formally, SubseqHash maps a string of length n to its smallest subsequence of length k, k < n, according to a given order overall length-k strings. Finding the smallest subsequence of a string by enumeration is impractical as the number of subsequences grows exponentially. To overcome this barrier, we propose a novel algorithmic framework that consists of a specifically designed order (termed ABC order) and an algorithm that computes the minimized subsequence under an ABC order in polynomial time. We first show that the ABC order exhibits the desired property and the probability of hash collision using the ABC order is close to the Jaccard index. We then show that SubseqHash overwhelmingly outperforms the substring-based seeding methods in producing high-quality seed-matches for three critical applications: read mapping, sequence alignment, and overlap detection. SubseqHash presents a major algorithmic breakthrough for tackling the high error rates and we expect it to be widely adapted for long-reads analysis. Availability and implementationSubseqHash is freely available at https://github.com/Shao-Group/subseqhash.more » « less
-
Abstract Following a duplication, the resulting paralogs tend to diverge. While mutation and natural selection can accelerate this process, they can also slow it. Here, we quantify the paralog homogenization that is caused by point mutations and interlocus gene conversion (IGC). Among 164 duplicated teleost genes, the median percentage of postduplication codon substitutions that arise from IGC rather than point mutation is estimated to be between 7% and 8%. By differentiating between the nonsynonymous codon substitutions that homogenize the protein sequences of paralogs and the nonhomogenizing nonsynonymous substitutions, we estimate the homogenizing nonsynonymous rates to be higher for 163 of the 164 teleost data sets as well as for all 14 data sets of duplicated yeast ribosomal protein-coding genes that we consider. For all 14 yeast data sets, the estimated homogenizing nonsynonymous rates exceed the synonymous rates.more » « less
-
Mitochondrial mutations in Caenorhabditis elegans show signatures of oxidative damage and an AT-biasSurtees, J A (Ed.)Abstract Rapid mutation rates are typical of mitochondrial genomes (mtDNAs) in animals, but it is not clear why. The difficulty of obtaining measurements of mtDNA mutation that are not biased by natural selection has stymied efforts to distinguish between competing hypotheses about the causes of high mtDNA mutation rates. Several studies which have measured mtDNA mutations in nematodes have yielded small datasets with conflicting conclusions about the relative abundance of different substitution classes (i.e., the mutation spectrum). We therefore leveraged Duplex Sequencing, a high-fidelity DNA sequencing technique, to characterize de novo mtDNA mutations in Caenorhabditis elegans. This approach detected nearly an order of magnitude more mtDNA mutations than documented in any previous nematode mutation study. Despite an existing extreme AT bias in the C. elegans mtDNA (75.6% AT), we found that a significant majority of mutations increase genomic AT content. Compared to some prior studies in nematodes and other animals, the mutation spectrum reported here contains an abundance of CG→AT transversions, supporting the hypothesis that oxidative damage may be a driver of mtDNA mutations in nematodes. Furthermore, we found an excess of G→T and C→T changes on the coding DNA strand relative to the template strand, consistent with increased exposure to oxidative damage. Analysis of the distribution of mutations across the mtDNA revealed significant variation among protein-coding genes and as well as among neighboring nucleotides. This high-resolution view of mitochondrial mutations in C. elegans highlights the value of this system for understanding relationships among oxidative damage, replication error, and mtDNA mutation.more » « less
-
In this paper, we propose a generalized millimeter-Wave (mmWave) reconfigurable antenna multiple-input multiple-output (RA-MIMO) architecture that takes advantage of lens antennas. The considered antennas can generate multiple independent beams simultaneously using a single RF chain. This property, together with RA-MIMO, is used to combat small-scale fading and shadowing in mmWave bands. To this end, first, we derive a channel matrix for RA-MIMO. Then, we use rate-one space-time block codes (STBCs), together with phase-shifters at the receive reconfigurable antennas, to suppress the effect of small-scale fading. We consider two kinds of phase shifters: i) ideal which is error-free and ii) digital which adds quantization error. The goal of phase-shifters is to convert a complex-valued channel matrix into real-valued. Hence, it is possible to use rate-one STBCs for any dimension of RA-MIMO. We investigate diversity gain and derive an upper bound for symbol error rate in cases of ideal and digital phase-shifters. We show that RA-MIMO achieves the full-diversity gain with ideal phase-shifters and the full-diversity gain for digital phase-shifters when the number of quantization bits is higher than one. We investigate RA-MIMO in the presence of shadowing. Our analysis demonstrates that, by increasing the dimension of RA-MIMO, the outage probability decreases which means the effect of shadowing decreases. Numerical results verify our theoretical derivations.more » « less
An official website of the United States government

