Protein–protein interaction (PPI) is vital for life processes, disease treatment, and drug discovery. The computational prediction of PPI is relatively inexpensive and efficient when compared to traditional wet-lab experiments. Given a new protein, one may wish to find whether the protein has any PPI relationship with other existing proteins. Current computational PPI prediction methods usually compare the new protein with existing proteins one by one in a pairwise manner, which is time consuming.
In this work, we propose a more efficient model, called deep hash learning protein-and-protein interaction (DHL-PPI), to predict all-against-all PPI relationships in a database of proteins. First, DHL-PPI encodes a protein sequence into a binary hash code based on deep features extracted from the protein sequences using deep learning techniques. This encoding scheme enables us to turn the PPI discrimination problem into a much simpler searching problem. The binary hash code for a protein sequence can be regarded as a number. Thus, in the pre-screening stage of DHL-PPI, the string matching problem of comparing a protein sequence against a database with
The experimental results confirmed that DHL-PPI is feasible and effective. Using a dataset with strictly negative PPI examples of four species, DHL-PPI is shown to be superior or competitive when compared to the other state-of-the-art methods in terms of precision, recall or F1 score. Furthermore, in the prediction stage, the proposed DHL-PPI reduced the time complexity from
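The pre-screening idea described in this abstract, treating each binary hash code as a number so that comparison against a database becomes a search, can be illustrated with a minimal sketch. Everything in it is an assumption for illustration: the 8-bit codes are made up rather than produced by a trained network, and the helper names (`code_to_int`, `build_index`, `lookup`) and the exact-match lookup via binary search are hypothetical, not the authors' implementation.

```python
from bisect import bisect_left

def code_to_int(bits):
    """Interpret a sequence of 0/1 values as an unsigned integer (most significant bit first)."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value

def build_index(codes):
    """Sort (integer code, protein id) pairs once so later lookups can use binary search."""
    return sorted((code_to_int(bits), pid) for pid, bits in codes.items())

def lookup(index, bits):
    """Return the ids of all stored proteins whose code equals `bits`."""
    key = code_to_int(bits)
    # (key, "") sorts before any (key, non-empty id), so this finds the first entry with this code.
    pos = bisect_left(index, (key, ""))
    hits = []
    while pos < len(index) and index[pos][0] == key:
        hits.append(index[pos][1])
        pos += 1
    return hits

# Made-up 8-bit codes (assumption); a real system would derive them from learned deep features.
codes = {
    "P1": [0, 1, 1, 0, 1, 0, 0, 1],
    "P2": [1, 1, 0, 0, 0, 0, 1, 1],
    "P3": [0, 1, 1, 0, 1, 0, 0, 1],
}
index = build_index(codes)
matches = lookup(index, [0, 1, 1, 0, 1, 0, 0, 1])
```

In this sketch the index is sorted once, after which each query costs a binary search plus the number of hits, instead of one comparison per stored protein.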
- Award ID(s): 1816005
- NSF-PAR ID: 10368623
- Publisher / Repository: Springer Science + Business Media
- Date Published:
- Journal Name: BMC Bioinformatics
- Volume: 23
- Issue: 1
- ISSN: 1471-2105
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Abstract Sequence mappability is an important task in genome resequencing. In the (k, m)-mappability problem, for a given sequence T of length n, the goal is to compute a table whose i-th entry is the number of indices $$j \ne i$$ such that the length-m substrings of T starting at positions i and j have at most k mismatches. Previous works on this problem focused on heuristics computing a rough approximation of the result or on the case of $$k=1$$. We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that, for $$k=O(1)$$, works in $$O(n)$$ space and, with high probability, in $$O(n \cdot \min \{m^k,\log ^k n\})$$ time. Our algorithm requires a careful adaptation of the k-errata trees of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. Our technique can also be applied to solve the all-pairs Hamming distance problem introduced by Crochemore et al. [WABI 2017]. We further develop $$O(n^2)$$-time algorithms to compute all (k, m)-mappability tables for a fixed m and all $$k\in \{0,\ldots ,m\}$$ or a fixed k and all $$m\in \{k,\ldots ,n\}$$. Finally, we show that, for $$k,m = \Theta (\log n)$$, the (k, m)-mappability problem cannot be solved in strongly subquadratic time unless the Strong Exponential Time Hypothesis fails. This is an improved and extended version of a paper presented at SPIRE 2018.
-
Abstract Matrix reduction is the standard procedure for computing the persistent homology of a filtered simplicial complex with m simplices. Its output is a particular decomposition of the total boundary matrix, from which the persistence diagrams and generating cycles are derived. Persistence diagrams are known to vary continuously with respect to their input, motivating the study of their computation for time-varying filtered complexes. Computing persistence dynamically can be reduced to maintaining a valid decomposition under adjacent transpositions in the filtration order. Since there are $$O(m^2)$$ such transpositions, this maintenance procedure exhibits limited scalability and is often too fine for many applications. We propose a coarser strategy for maintaining the decomposition over a 1-parameter family of filtrations. By reduction to a particular longest common subsequence problem, we show that the minimal number of decomposition updates d can be found in $$O(m \log \log m)$$ time and O(m) space, and that the corresponding sequence of permutations, which we call a schedule, can be constructed in $$O(d m \log m)$$ time. We also show that, in expectation, the storage needed to employ this strategy is actually sublinear in m. Exploiting this connection, we show experimentally that the decrease in operations to compute diagrams across a family of filtrations is proportional to the difference between the expected quadratic number of states and the proposed sublinear coarsening. Applications to video data, dynamic metric space data, and multiparameter persistence are also presented.
-
Abstract We continue the program of proving circuit lower bounds via circuit satisfiability algorithms. So far, this program has yielded several concrete results, proving that functions in $\mathsf {Quasi}\text {-}\mathsf {NP} = \mathsf {NTIME}[n^{(\log n)^{O(1)}}]$ and other complexity classes do not have small circuits (in the worst case and/or on average) from various circuit classes $\mathcal { C}$, by showing that $\mathcal { C}$ admits non-trivial satisfiability and/or #SAT algorithms which beat exhaustive search by a minor amount. In this paper, we present a new strong lower bound consequence of having a non-trivial #SAT algorithm for a circuit class ${\mathcal C}$. Say that a symmetric Boolean function $f(x_1,\ldots ,x_n)$ is sparse if it outputs 1 on O(1) values of ${\sum }_{i} x_{i}$. We show that for every sparse f, and for all “typical” $\mathcal { C}$, faster #SAT algorithms for $\mathcal { C}$ circuits imply lower bounds against the circuit class $f \circ \mathcal { C}$, which may be stronger than $\mathcal { C}$ itself. In particular:
#SAT algorithms for $n^k$-size $\mathcal { C}$-circuits running in $2^n/n^k$ time (for all k) imply NEXP does not have $(f \circ \mathcal { C})$-circuits of polynomial size.
#SAT algorithms for $2^{n^{{\varepsilon }}}$-size $\mathcal { C}$-circuits running in $2^{n-n^{{\varepsilon }}}$ time (for some ε > 0) imply Quasi-NP does not have $(f \circ \mathcal { C})$-circuits of polynomial size.
Applying #SAT algorithms from the literature, one immediate corollary of our results is that Quasi-NP does not have $\mathsf {EMAJ} \circ \mathsf {ACC}^0 \circ \mathsf {THR}$ circuits of polynomial size, where $\mathsf {EMAJ}$ is the “exact majority” function, improving previous lower bounds against $\mathsf {ACC}^0$ [Williams JACM’14] and $\mathsf {ACC}^0 \circ \mathsf {THR}$ [Williams STOC’14], [Murray-Williams STOC’18]. This is the first nontrivial lower bound against such a circuit class.
-
Abstract In a Merlin–Arthur proof system, the proof verifier (Arthur) accepts valid proofs (from Merlin) with probability 1, and rejects invalid proofs with probability arbitrarily close to 1. The running time of such a system is defined to be the length of Merlin’s proof plus the running time of Arthur. We provide new Merlin–Arthur proof systems for some key problems in fine-grained complexity. In several cases our proof systems have optimal running time. Our main results include:
Certifying that a list of n integers has no 3-SUM solution can be done in Merlin–Arthur time $$\tilde{O}(n)$$. Previously, Carmosino et al. [ITCS 2016] showed that the problem has a nondeterministic algorithm running in $$\tilde{O}(n^{1.5})$$ time (that is, there is a proof system with proofs of length $$\tilde{O}(n^{1.5})$$ and a deterministic verifier running in $$\tilde{O}(n^{1.5})$$ time).
Counting the number of k-cliques with total edge weight equal to zero in an n-node graph can be done in Merlin–Arthur time $${\tilde{O}}(n^{\lceil k/2\rceil })$$ (where $$k\ge 3$$). For odd k, this bound can be further improved for sparse graphs: for example, counting the number of zero-weight triangles in an m-edge graph can be done in Merlin–Arthur time $${\tilde{O}}(m)$$. Previous Merlin–Arthur protocols by Williams [CCC’16] and Björklund and Kaski [PODC’16] could only count k-cliques in unweighted graphs, and had worse running times for small k.
Computing the All-Pairs Shortest Distances matrix for an n-node graph can be done in Merlin–Arthur time $$\tilde{O}(n^2)$$. Note this is optimal, as the matrix can have $$\Omega (n^2)$$ nonzero entries in general. Previously, Carmosino et al. [ITCS 2016] showed that this problem has an $$\tilde{O}(n^{2.94})$$ nondeterministic time algorithm.
Certifying that an n-variable k-CNF is unsatisfiable can be done in Merlin–Arthur time $$2^{n/2 - n/O(k)}$$. We also observe an algebrization barrier for the previous $$2^{n/2}\cdot \textrm{poly}(n)$$-time Merlin–Arthur protocol of R. Williams [CCC’16] for $$\#$$SAT: in particular, his protocol algebrizes, and we observe there is no algebrizing protocol for k-UNSAT running in $$2^{n/2}/n^{\omega (1)}$$ time. Therefore we have to exploit non-algebrizing properties to obtain our new protocol.
Certifying a Quantified Boolean Formula is true can be done in Merlin–Arthur time $$2^{4n/5}\cdot \textrm{poly}(n)$$. Previously, the only nontrivial result known along these lines was an Arthur–Merlin–Arthur protocol (where Merlin’s proof depends on some of Arthur’s coins) running in $$2^{2n/3}\cdot \textrm{poly}(n)$$ time.
[…] n integers can be done in Merlin–Arthur time $$2^{n/3}\cdot \textrm{poly}(n)$$, improving on the previous best protocol by Nederlof [IPL 2017], which took $$2^{0.49991n}\cdot \textrm{poly}(n)$$ time.
-
Abstract Let $$\textbf{p}$$ be a configuration of n points in $$\mathbb R^d$$ for some n and some $$d \ge 2$$. Each pair of points defines an edge, which has a Euclidean length in the configuration. A path is an ordered sequence of the points, and a loop is a path that begins and ends at the same point. A path or loop, as a sequence of edges, also has a Euclidean length, which is simply the sum of its Euclidean edge lengths. We are interested in reconstructing $$\textbf{p}$$ given a set of edge, path and loop lengths. In particular, we consider the unlabeled setting where the lengths are given simply as a set of real numbers, and are not labeled with the combinatorial data describing which paths or loops gave rise to these lengths. In this paper, we study the question of when $$\textbf{p}$$ will be uniquely determined (up to an unknowable Euclidean transform) from some given set of path or loop lengths through an exhaustive trilateration process. Such a process has already been used for the simpler problem of reconstruction using unlabeled edge lengths. This paper also provides a complete proof that this process must work in that edge-setting when given a sufficiently rich set of edge measurements and assuming that $$\textbf{p}$$ is generic.
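
The length definitions in the last abstract above (an edge length is the Euclidean distance between two points, and a path or loop length is the sum of its edge lengths) can be made concrete with a short sketch. The sample configuration `p`, the helper names `edge_length` and `path_length`, and the chosen vertex sequences are assumptions for illustration and do not come from the paper.

```python
import math

def edge_length(p, i, j):
    """Euclidean distance between points i and j of the configuration p."""
    return math.dist(p[i], p[j])

def path_length(p, path):
    """Sum of Euclidean edge lengths along consecutive points of `path`.

    A loop is simply a path whose first and last vertex coincide.
    """
    return sum(edge_length(p, a, b) for a, b in zip(path, path[1:]))

# Hypothetical configuration of n = 3 points in R^2 (a 3-4-5 right triangle).
p = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)]

edge = edge_length(p, 0, 2)            # single edge between points 0 and 2
path = path_length(p, [0, 1, 2])       # path visiting 0, then 1, then 2
loop = path_length(p, [0, 1, 2, 0])    # loop that returns to its start

# Unlabeled setting: only the numbers survive, not the vertex sequences behind them.
unlabeled = sorted([edge, path, loop])
```

The final list mirrors the unlabeled setting: a reconstruction procedure would see only these real numbers and would have to infer which edges, paths, or loops produced them.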