Abstract Background: Given a sequencing read, the broad goal of read mapping is to find the location(s) in the reference genome that have a “similar sequence”. Traditionally, “similar sequence” was defined as having a high alignment score, and read mappers were viewed as heuristic solutions to this well-defined problem. For sketch-based mappers, however, there has been no problem formulation capturing what problem an exact sketch-based mapping algorithm should solve. Moreover, there is no sketch-based method that can find all possible mapping positions for a read above a certain score threshold. Results: In this paper, we formulate the problem of read mapping at the level of sequence sketches. We give an exact dynamic programming algorithm that finds all hits above a given similarity threshold. It runs in $$\mathcal{O}(|t| + |p| + \ell^2)$$ time and $$\mathcal{O}(\ell \log \ell)$$ space, where |t| is the number of $$k$$-mers inside the sketch of the reference, |p| is the number of $$k$$-mers inside the read's sketch, and $$\ell$$ is the number of times that $$k$$-mers from the pattern sketch occur in the sketch of the text. We evaluate our algorithm's performance in mapping long reads to the T2T assembly of human chromosome Y, where ampliconic regions make it desirable to find all good mapping positions. For an equivalent level of precision as minimap2, the recall of our algorithm is 0.88, compared to only 0.76 for minimap2.
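The complexity statement above is easiest to read once the quantity $$\ell$$ is concrete. The sketch below is not the paper's dynamic program; it only shows, under the assumption that a sketch is simply a list of k-mer hashes, how the $$\ell$$ pattern-in-text hits that drive the $$\mathcal{O}(|t| + |p| + \ell^2)$$ bound could be collected. All names are illustrative.

```python
from collections import defaultdict

def collect_hits(text_sketch, pattern_sketch):
    """Collect all positions where a k-mer from the pattern sketch occurs
    in the text sketch.  len(result) is the quantity ell that appears in
    the paper's O(|t| + |p| + ell^2) running time."""
    # Index the text sketch: k-mer hash -> positions within the sketch.
    index = defaultdict(list)
    for pos, kmer in enumerate(text_sketch):
        index[kmer].append(pos)
    # For every k-mer of the pattern sketch, record its occurrences.
    hits = []
    for p_pos, kmer in enumerate(pattern_sketch):
        for t_pos in index.get(kmer, []):
            hits.append((p_pos, t_pos))
    return hits

# Toy usage: sketches are just lists of k-mer hashes here.
text_sketch = [11, 42, 7, 42, 19, 7]
pattern_sketch = [42, 7, 99]
print(collect_hits(text_sketch, pattern_sketch))
# [(0, 1), (0, 3), (1, 2), (1, 5)]  ->  ell = 4
```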
Identification of all-against-all protein–protein interactions based on deep hash learning
Abstract Background: Protein–protein interaction (PPI) is vital for life processes, disease treatment, and drug discovery. The computational prediction of PPI is relatively inexpensive and efficient when compared to traditional wet-lab experiments. Given a new protein, one may wish to find whether the protein has any PPI relationship with other existing proteins. Current computational PPI prediction methods usually compare the new protein to existing proteins one by one in a pairwise manner, which is time consuming. Results: In this work, we propose a more efficient model, called deep hash learning protein-and-protein interaction (DHL-PPI), to predict all-against-all PPI relationships in a database of proteins. First, DHL-PPI encodes a protein sequence into a binary hash code based on deep features extracted from the protein sequence using deep learning techniques. This encoding scheme enables us to turn the PPI discrimination problem into a much simpler searching problem. The binary hash code for a protein sequence can be regarded as a number. Thus, in the pre-screening stage of DHL-PPI, the string matching problem of comparing a protein sequence against a database with M proteins can be transformed into a much simpler problem: finding a number inside a sorted array of length M. This pre-screening process narrows down the search to a much smaller set of candidate proteins for further confirmation. As a final step, DHL-PPI uses the Hamming distance to verify the final PPI relationship. Conclusions: The experimental results confirmed that DHL-PPI is feasible and effective. Using a dataset with strictly negative PPI examples from four species, DHL-PPI is shown to be superior or competitive when compared to the other state-of-the-art methods in terms of precision, recall, and F1 score. Furthermore, in the prediction stage, the proposed DHL-PPI reduces the time complexity from $$O(M^2)$$ to $$O(M\log M)$$ for performing an all-against-all PPI prediction on a database with M proteins. With the proposed approach, a protein database can be preprocessed and stored for later search using the proposed encoding scheme. This provides a more efficient way to cope with the rapidly increasing volume of protein datasets.
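The two-stage search described above can be made concrete with a short sketch. This is not the paper's implementation: the learned encoder is omitted, and the code length, toy hash values, and Hamming-distance threshold are assumptions. It only shows the mechanics of treating a binary hash code as a number, binary-searching a sorted code array of length M (the pre-screening), and confirming candidates with the Hamming distance (the verification).

```python
import bisect

def hamming(a, b):
    # Hamming distance between two equal-length binary codes stored as ints.
    return bin(a ^ b).count("1")

def find_interactions(query_code, sorted_codes, protein_ids, max_dist):
    """Pre-screening: binary-search the sorted code array for the query's
    code (O(log M)).  Verification: confirm each candidate with the
    Hamming distance (the threshold is an assumption for illustration)."""
    lo = bisect.bisect_left(sorted_codes, query_code)
    hi = bisect.bisect_right(sorted_codes, query_code)
    return [protein_ids[i] for i in range(lo, hi)
            if hamming(query_code, sorted_codes[i]) <= max_dist]

# Toy database: 8-bit hash codes (in practice produced by the deep encoder),
# kept sorted together with their protein identifiers.
pairs = sorted(zip([0b10101100, 0b01100001, 0b10101100], ["P1", "P2", "P3"]))
codes = [c for c, _ in pairs]
ids = [p for _, p in pairs]
print(find_interactions(0b10101100, codes, ids, max_dist=2))  # ['P1', 'P3']
```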
- Award ID(s):
- 1816005
- PAR ID:
- 10368623
- Publisher / Repository:
- Springer Science + Business Media
- Date Published:
- Journal Name:
- BMC Bioinformatics
- Volume:
- 23
- Issue:
- 1
- ISSN:
- 1471-2105
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
Abstract In this paper, we focus on constructing unique-decodable and list-decodable codes for the recently studied (t, e)-composite-asymmetric error-correcting codes ((t, e)-CAECCs). Let $$\mathcal{X}$$ be an $$m \times n$$ binary matrix in which each row has Hamming weight w. If at most t rows of $$\mathcal{X}$$ contain errors, and in each erroneous row there are at most e occurrences of $$1 \rightarrow 0$$ errors, we say that a (t, e)-composite-asymmetric error occurs in $$\mathcal{X}$$. For general values of m, n, w, t, and e, we propose new constructions of (t, e)-CAECCs with redundancy at most $$(t-1)\log(m) + O(1)$$, where O(1) is independent of the code length m. In particular, this yields a class of (2, e)-CAECCs that are optimal in terms of redundancy. When m is a prime power, the redundancy can be further reduced to $$(t-1)\log(m) - O(\log(m))$$. To further increase the code size, we introduce a combinatorial object called a weak $$B_e$$-set. When $$e = w$$, we present an efficient encoding and decoding method for our codes. Finally, we explore potential improvements by relaxing the requirement of unique decoding to list-decoding. We show that when the list size is t! or an exponential function of t, there exist list-decodable (t, e)-CAECCs with constant redundancy. When the list size is two, we construct list-decodable (3, 2)-CAECCs with redundancy $$\log(m) + O(1)$$.
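As a point of reference for the error model used above (not a construction from the paper), the short check below tests whether a binary matrix Y could have arisen from X through a (t, e)-composite-asymmetric error; the function name and the toy matrices are made up for illustration.

```python
def is_te_composite_asymmetric_error(X, Y, t, e):
    """Check whether binary matrix Y could arise from X through a
    (t, e)-composite-asymmetric error: at most t rows differ, and each
    differing row contains at most e occurrences of 1 -> 0 flips
    (and no 0 -> 1 flips)."""
    bad_rows = 0
    for row_x, row_y in zip(X, Y):
        flips_10 = sum(1 for x, y in zip(row_x, row_y) if x == 1 and y == 0)
        flips_01 = sum(1 for x, y in zip(row_x, row_y) if x == 0 and y == 1)
        if flips_01 > 0 or flips_10 > e:
            return False          # the asymmetric model forbids 0 -> 1 errors
        if flips_10 > 0:
            bad_rows += 1
    return bad_rows <= t

# Toy example: one row with a single 1 -> 0 flip is a (1, 1)-CAE.
X = [[1, 1, 0, 1],
     [0, 1, 1, 0]]
Y = [[1, 0, 0, 1],
     [0, 1, 1, 0]]
print(is_te_composite_asymmetric_error(X, Y, t=1, e=1))  # True
```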
Abstract Matrix reduction is the standard procedure for computing the persistent homology of a filtered simplicial complex with m simplices. Its output is a particular decomposition of the total boundary matrix, from which the persistence diagrams and generating cycles are derived. Persistence diagrams are known to vary continuously with respect to their input, motivating the study of their computation for time-varying filtered complexes. Computing persistence dynamically can be reduced to maintaining a valid decomposition under adjacent transpositions in the filtration order. Since there are $$O(m^2)$$ such transpositions, this maintenance procedure exhibits limited scalability and is often too fine for many applications. We propose a coarser strategy for maintaining the decomposition over a 1-parameter family of filtrations. By reduction to a particular longest common subsequence problem, we show that the minimal number of decomposition updates d can be found in $$O(m \log \log m)$$ time and O(m) space, and that the corresponding sequence of permutations, which we call a schedule, can be constructed in $$O(dm \log m)$$ time. We also show that, in expectation, the storage needed to employ this strategy is actually sublinear in m. Exploiting this connection, we show experimentally that the decrease in operations to compute diagrams across a family of filtrations is proportional to the difference between the expected quadratic number of states and the proposed sublinear coarsening. Applications to video data, dynamic metric space data, and multiparameter persistence are also presented.
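The abstract's reduction hinges on a longest-common-subsequence computation between two filtration orders. Since both orders are permutations of the same m simplices, that LCS is equivalent to a longest increasing subsequence, and the textbook O(m log m) patience-sorting routine below is a rough stand-in for it; the paper's O(m log log m) bound and its exact notion of the update count d are not reproduced here, so treat min_updates as an illustrative proxy only.

```python
from bisect import bisect_left

def lis_length(seq):
    """Length of a longest increasing subsequence via patience sorting,
    O(m log m) for a sequence of length m."""
    tails = []                      # tails[i] = smallest tail of an
    for x in seq:                   # increasing subsequence of length i + 1
        i = bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)

def min_updates(order_a, order_b):
    """Rough proxy for how many simplices must move between two filtration
    orders: those outside a longest common subsequence.  For two
    permutations of the same simplices, LCS(a, b) equals the LIS of b
    rewritten in a's positions."""
    pos_in_a = {s: i for i, s in enumerate(order_a)}
    return len(order_a) - lis_length([pos_in_a[s] for s in order_b])

# Toy example with four simplices labelled by strings.
print(min_updates(["a", "b", "c", "d"], ["b", "a", "c", "d"]))  # 1
```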
Abstract In a Merlin–Arthur proof system, the proof verifier (Arthur) accepts valid proofs (from Merlin) with probability 1, and rejects invalid proofs with probability arbitrarily close to 1. The running time of such a system is defined to be the length of Merlin's proof plus the running time of Arthur. We provide new Merlin–Arthur proof systems for some key problems in fine-grained complexity. In several cases our proof systems have optimal running time. Our main results include:
- Certifying that a list of n integers has no 3-SUM solution can be done in Merlin–Arthur time $$\tilde{O}(n)$$. Previously, Carmosino et al. [ITCS 2016] showed that the problem has a nondeterministic algorithm running in $$\tilde{O}(n^{1.5})$$ time (that is, there is a proof system with proofs of length $$\tilde{O}(n^{1.5})$$ and a deterministic verifier running in $$\tilde{O}(n^{1.5})$$ time).
- Counting the number of k-cliques with total edge weight equal to zero in an n-node graph can be done in Merlin–Arthur time $${\tilde{O}}(n^{\lceil k/2\rceil })$$ (where $$k\ge 3$$). For odd k, this bound can be further improved for sparse graphs: for example, counting the number of zero-weight triangles in an m-edge graph can be done in Merlin–Arthur time $${\tilde{O}}(m)$$. Previous Merlin–Arthur protocols by Williams [CCC'16] and Björklund and Kaski [PODC'16] could only count k-cliques in unweighted graphs, and had worse running times for small k.
- Computing the All-Pairs Shortest Distances matrix for an n-node graph can be done in Merlin–Arthur time $$\tilde{O}(n^2)$$. Note this is optimal, as the matrix can have $$\Omega (n^2)$$ nonzero entries in general. Previously, Carmosino et al. [ITCS 2016] showed that this problem has an $$\tilde{O}(n^{2.94})$$ nondeterministic time algorithm.
- Certifying that an n-variable k-CNF is unsatisfiable can be done in Merlin–Arthur time $$2^{n/2 - n/O(k)}$$. We also observe an algebrization barrier for the previous $$2^{n/2}\cdot \textrm{poly}(n)$$-time Merlin–Arthur protocol of R. Williams [CCC'16] for $$\#$$SAT: in particular, his protocol algebrizes, and we observe there is no algebrizing protocol for k-UNSAT running in $$2^{n/2}/n^{\omega (1)}$$ time. Therefore we have to exploit non-algebrizing properties to obtain our new protocol.
- Certifying that a Quantified Boolean Formula is true can be done in Merlin–Arthur time $$2^{4n/5}\cdot \textrm{poly}(n)$$. Previously, the only nontrivial result known along these lines was an Arthur–Merlin–Arthur protocol (where Merlin's proof depends on some of Arthur's coins) running in $$2^{2n/3}\cdot \textrm{poly}(n)$$ time.

Due to the centrality of these problems in fine-grained complexity, our results have consequences for many other problems of interest. For example, our work implies that certifying there is no Subset Sum solution to n integers can be done in Merlin–Arthur time $$2^{n/3}\cdot \textrm{poly}(n)$$, improving on the previous best protocol by Nederlof [IPL 2017] which took $$2^{0.49991n}\cdot \textrm{poly}(n)$$ time.
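None of the protocols above are reproduced here, but the "short proof plus fast randomized verifier" pattern they share can be illustrated with the classic Freivalds check: Merlin supplies a claimed matrix product as the proof, and Arthur verifies it in quadratic time with random vectors, accepting correct proofs with probability 1 and rejecting incorrect ones with high probability. This is only an analogy for the framework, not a result from the paper.

```python
import random

def freivalds_verify(A, B, C, rounds=20):
    """Toy Merlin-Arthur flavoured check (not a protocol from the paper):
    Merlin claims C = A * B; Arthur verifies in O(n^2) randomized time per
    round by testing A @ (B @ r) == C @ r for random 0/1 vectors r.
    A correct proof is always accepted; a wrong one is rejected with
    probability at least 1 - 2**(-rounds)."""
    n = len(A)

    def matvec(M, v):
        return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

    for _ in range(rounds):
        r = [random.randint(0, 1) for _ in range(n)]
        if matvec(A, matvec(B, r)) != matvec(C, r):
            return False          # caught an incorrect claim
    return True

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
good_C = [[2, 1], [4, 3]]
bad_C = [[2, 1], [4, 4]]
print(freivalds_verify(A, B, good_C))  # True
print(freivalds_verify(A, B, bad_C))   # False (with overwhelming probability)
```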
Abstract It is natural to generalize the online $$k$$-Server problem by allowing each request to specify not only a point p, but also a subset S of servers that may serve it. To date, only a few special cases of this problem have been studied. The objective of the work presented in this paper has been to more systematically explore this generalization in the case of uniform and star metrics. For uniform metrics, the problem is equivalent to a generalization of Paging in which each request specifies not only a page p, but also a subset S of cache slots, and is satisfied by having a copy of p in some slot in S. We call this problem Slot-Heterogenous Paging. In realistic settings only certain subsets of cache slots or servers would appear in requests. Therefore we parameterize the problem by specifying a family $${\mathcal{S}}\subseteq 2^{[k]}$$ of requestable slot sets, and we establish bounds on the competitive ratio as a function of the cache size k and the family $${\mathcal{S}}$$:
- If all request sets are allowed ($${\mathcal{S}}=2^{[k]}\setminus \{\emptyset \}$$), the optimal deterministic and randomized competitive ratios are exponentially worse than for standard Paging ($${\mathcal{S}}=\{[k]\}$$).
- As a function of $$|{\mathcal{S}}|$$ and k, the optimal deterministic ratio is polynomial: at most $$O(k^2|{\mathcal{S}}|)$$ and at least $$\Omega (\sqrt{|{\mathcal{S}}|})$$.
- For any laminar family $${\mathcal{S}}$$ of height h, the optimal ratios are O(hk) (deterministic) and $$O(h^2\log k)$$ (randomized).
- The special case of laminar $${\mathcal{S}}$$ that we call All-or-One Paging extends standard Paging by allowing each request to specify a specific slot to put the requested page in. The optimal deterministic ratio for weighted All-or-One Paging is $$\Theta (k)$$. Offline All-or-One Paging is $$\mathbb{N}\mathbb{P}$$-hard.

Some results for the laminar case are shown via a reduction to the generalization of Paging in which each request specifies a set $$P$$ of pages, and is satisfied by fetching any page from $$P$$ into the cache. The optimal ratios for the latter problem (with a laminar family of height h) are at most hk (deterministic) and $$hH_k$$ (randomized).
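To make the request model above concrete, the toy simulation below plays out Slot-Heterogenous Paging requests of the form (page p, allowed slot set S). The eviction rule is an arbitrary placeholder rather than any competitive algorithm from the paper; the example only illustrates what counts as a hit and what counts as a fetch in this model.

```python
def serve_requests(k, requests):
    """Simulate the Slot-Heterogenous Paging model: a cache with k slots,
    each request (p, S) asking that page p sit in some slot of S.
    The eviction rule here (overwrite an allowed slot, preferring an
    empty one) is an arbitrary placeholder; the returned value is the
    number of page fetches, the cost measure of the model."""
    slots = [None] * k
    fetches = 0
    for page, allowed in requests:
        if any(slots[s] == page for s in allowed):
            continue                      # hit: page already in an allowed slot
        fetches += 1                      # miss: fetch the page
        empty = sorted(s for s in allowed if slots[s] is None)
        target = empty[0] if empty else min(allowed)
        slots[target] = page
    return fetches

# Three slots; the last request restricts page 'x' to slot 0 only
# (an All-or-One style request).
reqs = [("x", {0, 1, 2}), ("y", {1}), ("x", {0})]
print(serve_requests(3, reqs))  # 2 fetches
```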