skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Locality-sensitive bucketing functions for the edit distance
Abstract BackgroundMany bioinformatics applications involve bucketing a set of sequences where each sequence is allowed to be assigned into multiple buckets. To achieve both high sensitivity and precision, bucketing methods are desired to assign similar sequences into the same bucket while assigning dissimilar sequences into distinct buckets. Existingk-mer-based bucketing methods have been efficient in processing sequencing data with low error rates, but encounter much reduced sensitivity on data with high error rates. Locality-sensitive hashing (LSH) schemes are able to mitigate this issue through tolerating the edits in similar sequences, but state-of-the-art methods still have large gaps. ResultsIn this paper, we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) into a subset of buckets, is defined to be$$(d_1, d_2)$$ ( d 1 , d 2 ) -sensitive if any two sequences within an edit distance of$$d_1$$ d 1 are mapped into at least one shared bucket, and any two sequences with distance at least$$d_2$$ d 2 are mapped into disjoint subsets of buckets. We construct locality-sensitive bucketing (LSB) functions with a variety of values of$$(d_1,d_2)$$ ( d 1 , d 2 ) and analyze their efficiency with respect to the total number of buckets needed as well as the number of buckets that a specific sequence is mapped to. We also prove lower bounds of these two parameters in different settings and show that some of our constructed LSB functions are optimal. ConclusionThese results lay the theoretical foundations for their practical use in analyzing sequences with high error rates while also providing insights for the hardness of designing ungapped LSH functions.  more » « less
Award ID(s):
2145171 2019797
PAR ID:
10434491
Author(s) / Creator(s):
;
Publisher / Repository:
Springer Science + Business Media
Date Published:
Journal Name:
Algorithms for Molecular Biology
Volume:
18
Issue:
1
ISSN:
1748-7188
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Boucher, Christina; Rahmann, Sven (Ed.)
    Many bioinformatics applications involve bucketing a set of sequences where each sequence is allowed to be assigned into multiple buckets. To achieve both high sensitivity and precision, bucketing methods are desired to assign similar sequences into the same bucket while assigning dissimilar sequences into distinct buckets. Existing k-mer-based bucketing methods have been efficient in processing sequencing data with low error rate, but encounter much reduced sensitivity on data with high error rate. Locality-sensitive hashing (LSH) schemes are able to mitigate this issue through tolerating the edits in similar sequences, but state-of-the-art methods still have large gaps. Here we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) into a subset of buckets, is defined to be (d₁, d₂)-sensitive if any two sequences within an edit distance of d₁ are mapped into at least one shared bucket, and any two sequences with distance at least d₂ are mapped into disjoint subsets of buckets. We construct locality-sensitive bucketing (LSB) functions with a variety of values of (d₁,d₂) and analyze their efficiency with respect to the total number of buckets needed as well as the number of buckets that a specific sequence is mapped to. We also prove lower bounds of these two parameters in different settings and show that some of our constructed LSB functions are optimal. These results provide theoretical foundations for their practical use in analyzing sequences with high error rate while also providing insights for the hardness of designing ungapped LSH functions. 
    more » « less
  2. Abstract The elliptic flow$$(v_2)$$ ( v 2 ) of$${\textrm{D}}^{0}$$ D 0 mesons from beauty-hadron decays (non-prompt$${\textrm{D}}^{0})$$ D 0 ) was measured in midcentral (30–50%) Pb–Pb collisions at a centre-of-mass energy per nucleon pair$$\sqrt{s_{\textrm{NN}}} = 5.02$$ s NN = 5.02  TeV with the ALICE detector at the LHC. The$${\textrm{D}}^{0}$$ D 0 mesons were reconstructed at midrapidity$$(|y|<0.8)$$ ( | y | < 0.8 ) from their hadronic decay$$\mathrm {D^0 \rightarrow K^-\uppi ^+}$$ D 0 K - π + , in the transverse momentum interval$$2< p_{\textrm{T}} < 12$$ 2 < p T < 12  GeV/c. The result indicates a positive$$v_2$$ v 2 for non-prompt$${{\textrm{D}}^{0}}$$ D 0 mesons with a significance of 2.7$$\sigma $$ σ . The non-prompt$${{\textrm{D}}^{0}}$$ D 0 -meson$$v_2$$ v 2 is lower than that of prompt non-strange D mesons with 3.2$$\sigma $$ σ significance in$$2< p_\textrm{T} < 8~\textrm{GeV}/c$$ 2 < p T < 8 GeV / c , and compatible with the$$v_2$$ v 2 of beauty-decay electrons. Theoretical calculations of beauty-quark transport in a hydrodynamically expanding medium describe the measurement within uncertainties. 
    more » « less
  3. Abstract Let$$\mathbb {F}_q^d$$ F q d be thed-dimensional vector space over the finite field withqelements. For a subset$$E\subseteq \mathbb {F}_q^d$$ E F q d and a fixed nonzero$$t\in \mathbb {F}_q$$ t F q , let$$\mathcal {H}_t(E)=\{h_y: y\in E\}$$ H t ( E ) = { h y : y E } , where$$h_y:E\rightarrow \{0,1\}$$ h y : E { 0 , 1 } is the indicator function of the set$$\{x\in E: x\cdot y=t\}$$ { x E : x · y = t } . Two of the authors, with Maxwell Sun, showed in the case$$d=3$$ d = 3 that if$$|E|\ge Cq^{\frac{11}{4}}$$ | E | C q 11 4 andqis sufficiently large, then the VC-dimension of$$\mathcal {H}_t(E)$$ H t ( E ) is 3. In this paper, we generalize the result to arbitrary dimension by showing that the VC-dimension of$$\mathcal {H}_t(E)$$ H t ( E ) isdwhenever$$E\subseteq \mathbb {F}_q^d$$ E F q d with$$|E|\ge C_d q^{d-\frac{1}{d-1}}$$ | E | C d q d - 1 d - 1
    more » « less
  4. Abstract We perform path-integral molecular dynamics (PIMD), ring-polymer MD (RPMD), and classical MD simulations of H$$_2$$ 2 O and D$$_2$$ 2 O using the q-TIP4P/F water model over a wide range of temperatures and pressures. The density$$\rho (T)$$ ρ ( T ) , isothermal compressibility$$\kappa _T(T)$$ κ T ( T ) , and self-diffusion coefficientsD(T) of H$$_2$$ 2 O and D$$_2$$ 2 O are in excellent agreement with available experimental data; the isobaric heat capacity$$C_P(T)$$ C P ( T ) obtained from PIMD and MD simulations agree qualitatively well with the experiments. Some of these thermodynamic properties exhibit anomalous maxima upon isobaric cooling, consistent with recent experiments and with the possibility that H$$_2$$ 2 O and D$$_2$$ 2 O exhibit a liquid-liquid critical point (LLCP) at low temperatures and positive pressures. The data from PIMD/MD for H$$_2$$ 2 O and D$$_2$$ 2 O can be fitted remarkably well using the Two-State-Equation-of-State (TSEOS). Using the TSEOS, we estimate that the LLCP for q-TIP4P/F H$$_2$$ 2 O, from PIMD simulations, is located at$$P_c = 167 \pm 9$$ P c = 167 ± 9  MPa,$$T_c = 159 \pm 6$$ T c = 159 ± 6  K, and$$\rho _c = 1.02 \pm 0.01$$ ρ c = 1.02 ± 0.01  g/cm$$^3$$ 3 . Isotope substitution effects are important; the LLCP location in q-TIP4P/F D$$_2$$ 2 O is estimated to be$$P_c = 176 \pm 4$$ P c = 176 ± 4  MPa,$$T_c = 177 \pm 2$$ T c = 177 ± 2  K, and$$\rho _c = 1.13 \pm 0.01$$ ρ c = 1.13 ± 0.01  g/cm$$^3$$ 3 . Interestingly, for the water model studied, differences in the LLCP location from PIMD and MD simulations suggest that nuclear quantum effects (i.e., atoms delocalization) play an important role in the thermodynamics of water around the LLCP (from the MD simulations of q-TIP4P/F water,$$P_c = 203 \pm 4$$ P c = 203 ± 4  MPa,$$T_c = 175 \pm 2$$ T c = 175 ± 2  K, and$$\rho _c = 1.03 \pm 0.01$$ ρ c = 1.03 ± 0.01  g/cm$$^3$$ 3 ). Overall, our results strongly support the LLPT scenario to explain water anomalous behavior, independently of the fundamental differences between classical MD and PIMD techniques. The reported values of$$T_c$$ T c for D$$_2$$ 2 O and, particularly, H$$_2$$ 2 O suggest that improved water models are needed for the study of supercooled water. 
    more » « less
  5. Abstract Let$$(h_I)$$ ( h I ) denote the standard Haar system on [0, 1], indexed by$$I\in \mathcal {D}$$ I D , the set of dyadic intervals and$$h_I\otimes h_J$$ h I h J denote the tensor product$$(s,t)\mapsto h_I(s) h_J(t)$$ ( s , t ) h I ( s ) h J ( t ) ,$$I,J\in \mathcal {D}$$ I , J D . We consider a class of two-parameter function spaces which are completions of the linear span$$\mathcal {V}(\delta ^2)$$ V ( δ 2 ) of$$h_I\otimes h_J$$ h I h J ,$$I,J\in \mathcal {D}$$ I , J D . This class contains all the spaces of the formX(Y), whereXandYare either the Lebesgue spaces$$L^p[0,1]$$ L p [ 0 , 1 ] or the Hardy spaces$$H^p[0,1]$$ H p [ 0 , 1 ] ,$$1\le p < \infty $$ 1 p < . We say that$$D:X(Y)\rightarrow X(Y)$$ D : X ( Y ) X ( Y ) is a Haar multiplier if$$D(h_I\otimes h_J) = d_{I,J} h_I\otimes h_J$$ D ( h I h J ) = d I , J h I h J , where$$d_{I,J}\in \mathbb {R}$$ d I , J R , and ask which more elementary operators factor throughD. A decisive role is played by theCapon projection$$\mathcal {C}:\mathcal {V}(\delta ^2)\rightarrow \mathcal {V}(\delta ^2)$$ C : V ( δ 2 ) V ( δ 2 ) given by$$\mathcal {C} h_I\otimes h_J = h_I\otimes h_J$$ C h I h J = h I h J if$$|I|\le |J|$$ | I | | J | , and$$\mathcal {C} h_I\otimes h_J = 0$$ C h I h J = 0 if$$|I| > |J|$$ | I | > | J | , as our main result highlights: Given any bounded Haar multiplier$$D:X(Y)\rightarrow X(Y)$$ D : X ( Y ) X ( Y ) , there exist$$\lambda ,\mu \in \mathbb {R}$$ λ , μ R such that$$\begin{aligned} \lambda \mathcal {C} + \mu ({{\,\textrm{Id}\,}}-\mathcal {C})\text { approximately 1-projectionally factors through }D, \end{aligned}$$ λ C + μ ( Id - C ) approximately 1-projectionally factors through D , i.e., for all$$\eta > 0$$ η > 0 , there exist bounded operatorsA, Bso thatABis the identity operator$${{\,\textrm{Id}\,}}$$ Id ,$$\Vert A\Vert \cdot \Vert B\Vert = 1$$ A · B = 1 and$$\Vert \lambda \mathcal {C} + \mu ({{\,\textrm{Id}\,}}-\mathcal {C}) - ADB\Vert < \eta $$ λ C + μ ( Id - C ) - A D B < η . Additionally, if$$\mathcal {C}$$ C is unbounded onX(Y), then$$\lambda = \mu $$ λ = μ and then$${{\,\textrm{Id}\,}}$$ Id either factors throughDor$${{\,\textrm{Id}\,}}-D$$ Id - D
    more » « less