

Title: nPoRe: n-polymer realigner for improved pileup-based variant calling
Abstract

Despite recent improvements in nanopore basecalling accuracy, germline variant calling of small insertions and deletions (INDELs) remains poor. Although precision and recall for single nucleotide polymorphisms (SNPs) now exceed 99.5%, INDEL recall remains below 80% for standard R9.4.1 flow cells. We show that read phasing and realignment can recover a significant portion of false negative INDELs. In particular, we extend Needleman-Wunsch affine gap alignment by introducing new gap penalties for more accurately aligning repeated $$n$$-polymer sequences such as homopolymers ($$n=1$$) and tandem repeats ($$2 \le n \le 6$$). At the same precision, haplotype phasing improves INDEL recall from 63.76% to 70.66%, and nPoRe realignment improves it further to 73.04%.
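
To make the realignment idea concrete, below is a minimal Python sketch of affine-gap Needleman-Wunsch (Gotoh) scoring in which the gap-open penalty is discounted inside detected n-polymer repeats. It illustrates the general technique only: the penalty constants, the `in_repeat` heuristic, and all names are our assumptions, not nPoRe's learned penalties or API.

    # Minimal affine-gap Needleman-Wunsch (Gotoh) sketch with repeat-aware
    # gap penalties. Penalty values are hypothetical constants, not nPoRe's
    # learned scores; a cheaper gap-open inside repeats mimics the paper's idea.

    def in_repeat(seq, i, max_n=6, min_copies=3):
        """True if seq[i] starts an n-polymer run (unit length 1..max_n)
        of at least min_copies consecutive copies. Simple heuristic."""
        for n in range(1, max_n + 1):
            unit = seq[i:i + n]
            if len(unit) < n:
                break
            copies, j = 1, i + n
            while seq[j:j + n] == unit:
                copies, j = copies + 1, j + n
            if copies >= min_copies:
                return True
        return False

    def gap_open(ref, i, repeat_cost=2, default_cost=6):
        """Opening a gap inside a repeat is cheaper: INDELs are likelier there."""
        return repeat_cost if in_repeat(ref, i) else default_cost

    def align_cost(ref, read, mismatch=4, extend=1):
        """Minimum total alignment penalty of read vs ref (0 = perfect match)."""
        INF = float("inf")
        R, Q = len(ref), len(read)
        M = [[INF] * (Q + 1) for _ in range(R + 1)]  # match/mismatch state
        X = [[INF] * (Q + 1) for _ in range(R + 1)]  # gap in read (deletion)
        Y = [[INF] * (Q + 1) for _ in range(R + 1)]  # gap in ref (insertion)
        M[0][0] = 0
        for i in range(1, R + 1):
            X[i][0] = gap_open(ref, 0) + (i - 1) * extend
        for j in range(1, Q + 1):
            Y[0][j] = gap_open(ref, 0) + (j - 1) * extend
        for i in range(1, R + 1):
            go = gap_open(ref, i - 1)
            for j in range(1, Q + 1):
                sub = 0 if ref[i - 1] == read[j - 1] else mismatch
                M[i][j] = sub + min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
                X[i][j] = min(M[i - 1][j] + go, X[i - 1][j] + extend)
                Y[i][j] = min(M[i][j - 1] + go, Y[i][j - 1] + extend)
        return min(M[R][Q], X[R][Q], Y[R][Q])

    # One fewer A in the read: the discounted in-repeat gap-open lets the DP
    # place the deletion inside the homopolymer run instead of forcing mismatches.
    print(align_cost("GATTAAAAAC", "GATTAAAAC"))  # penalty 2 (repeat gap-open)

Discounting gap opens inside repeats biases the dynamic program toward placing INDELs where sequencing errors and true variants both concentrate, which is the intuition behind the paper's n-polymer gap penalties.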

 
NSF-PAR ID: 10403258
Author(s) / Creator(s):
Publisher / Repository: Springer Science + Business Media
Date Published:
Journal Name: BMC Bioinformatics
Volume: 24
Issue: 1
ISSN: 1471-2105
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract

    A classical parking function of length $$n$$ is a list of positive integers $$(a_1, a_2, \ldots, a_n)$$ whose nondecreasing rearrangement $$b_1 \le b_2 \le \cdots \le b_n$$ satisfies $$b_i \le i$$. The convex hull of all parking functions of length $$n$$ is an $$n$$-dimensional polytope in $${\mathbb{R}}^n$$, which we refer to as the classical parking function polytope. Its geometric properties have been explored in Amanbayeva and Wang (Enumer Combin Appl 2(2):Paper No. S2R10, 10, 2022) in response to a question posed by Stanley (Amer Math Mon 127(6):563–571, 2020). We generalize this family of polytopes by studying the geometric properties of the convex hull of $${\textbf{x}}$$-parking functions for $${\textbf{x}} = (a, b, \dots, b)$$, which we refer to as $${\textbf{x}}$$-parking function polytopes. We explore connections between these $${\textbf{x}}$$-parking function polytopes, the Pitman–Stanley polytope, and the partial permutahedra of Heuer and Striker (SIAM J Discrete Math 36(4):2863–2888, 2022). In particular, we establish a closed-form expression for the volume of $${\textbf{x}}$$-parking function polytopes. This allows us to answer a conjecture of Behrend et al. (2022) and also obtain a new closed-form expression for the volume of the convex hull of classical parking functions as a corollary.
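
    The defining condition is easy to check directly; a minimal Python sketch (the function name is ours):

        def is_parking_function(a):
            """Sort the list and require b_i <= i (1-indexed) for the
            nondecreasing rearrangement (b_1, ..., b_n)."""
            return all(b <= i for i, b in enumerate(sorted(a), start=1))

        print(is_parking_function([2, 1, 1]))  # True: sorted (1, 1, 2) meets b_i <= i
        print(is_parking_function([3, 1, 3]))  # False: sorted (1, 3, 3) has b_2 = 3 > 2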

     
  2. Abstract

    A well-known open problem of Meir and Moser asks if the squares of sidelength $$1/n$$ for $$n \ge 2$$ can be packed perfectly into a rectangle of area $$\sum_{n=2}^\infty n^{-2} = \pi^2/6 - 1$$. In this paper we show that for any $$1/2 < t < 1$$, and any $$n_0$$ that is sufficiently large depending on $$t$$, the squares of sidelength $$n^{-t}$$ for $$n \ge n_0$$ can be packed perfectly into a square of area $$\sum_{n=n_0}^\infty n^{-2t}$$. This was previously known (if one packs a rectangle instead of a square) for $$1/2 < t \le 2/3$$ (in which case one can take $$n_0 = 1$$).
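
    The stated areas can be sanity-checked numerically; a small Python sketch with arbitrarily chosen truncation points and parameter values:

        import math

        # Total area of the squares of sidelength 1/n for n >= 2: pi^2/6 - 1.
        print(sum(n ** -2 for n in range(2, 10 ** 6)), math.pi ** 2 / 6 - 1)

        # Area of the target square in the theorem's regime, e.g. t = 0.75, n0 = 100.
        t, n0 = 0.75, 100
        print(sum(n ** (-2 * t) for n in range(n0, 10 ** 6)))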

     
  3. Abstract

    We prove that $${{\,\textrm{poly}\,}}(t) \cdot n^{1/D}$$-depth local random quantum circuits with two-qudit nearest-neighbor gates on a $$D$$-dimensional lattice with $$n$$ qudits are approximate $$t$$-designs in various measures. These include the “monomial” measure, meaning that the monomials of a random circuit from this family have expectation close to the value that would result from the Haar measure. Previously, the best bound was $${{\,\textrm{poly}\,}}(t) \cdot n$$ due to Brandão–Harrow–Horodecki (Commun Math Phys 346(2):397–434, 2016) for $$D=1$$. We also improve the “scrambling” and “decoupling” bounds for spatially local random circuits due to Brown and Fawzi (Scrambling speed of random quantum circuits, 2012). One consequence of our result is that assuming the polynomial hierarchy ($${{\,\mathrm{\textsf{PH}}\,}}$$) is infinite and that certain counting problems are $$\#{\textsf{P}}$$-hard “on average”, sampling within total variation distance from these circuits is hard for classical computers. Previously, exact sampling from the outputs of even constant-depth quantum circuits was known to be hard for classical computers under these assumptions. However, the standard strategy for extending this hardness result to approximate sampling requires the quantum circuits to have a property called “anti-concentration”, meaning roughly that the output has near-maximal entropy. Unitary 2-designs have the desired anti-concentration property. Our result improves the required depth for this level of anti-concentration from linear depth to a sub-linear value, depending on the geometry of the interactions. This is relevant to a recent experiment by the Google Quantum AI group to perform such a sampling task with 53 qubits on a two-dimensional lattice (Arute et al. in Nature 574(7779):505–510, 2019; Boixo et al. in Nat Phys 14(6):595–600, 2018) (and related experiments by USTC), and confirms their conjecture that $$O(\sqrt{n})$$ depth suffices for anti-concentration. The proof is based on a previous construction of $$t$$-designs by Brandão et al. (2016), an analysis of how approximate designs behave under composition, and an extension of the quasi-orthogonality of permutation operators developed by Brandão et al. (2016). Different versions of the approximate design condition correspond to different norms, and part of our contribution is to introduce the norm corresponding to anti-concentration and to establish equivalence between these various norms for low-depth circuits. For random circuits with long-range gates, we use different methods to show that anti-concentration happens at circuit size $$O(n \ln^2 n)$$, corresponding to depth $$O(\ln^3 n)$$. We also show a lower bound of $$\Omega(n \ln n)$$ for the size of such circuits in this case. We also prove that anti-concentration is possible in depth $$O(\ln n \ln \ln n)$$ (size $$O(n \ln n \ln \ln n)$$) using a different model.

     
  4. Abstract

    We consider the problem of covering multiple submodular constraints. Given a finite ground set $$N$$, a weight function $$w: N \rightarrow \mathbb{R}_+$$, $$r$$ monotone submodular functions $$f_1, f_2, \ldots, f_r$$ over $$N$$ and requirements $$k_1, k_2, \ldots, k_r$$, the goal is to find a minimum weight subset $$S \subseteq N$$ such that $$f_i(S) \ge k_i$$ for $$1 \le i \le r$$. We refer to this problem as Multi-Submod-Cover; it was recently considered by Har-Peled and Jones (Few cuts meet many point sets. CoRR, arXiv:1808.03260, 2018), who were motivated by an application in geometry. Even with $$r=1$$, Multi-Submod-Cover generalizes the well-known Submodular Set Cover problem (Submod-SC), and it can also be easily reduced to Submod-SC. A simple greedy algorithm gives an $$O(\log(kr))$$ approximation where $$k = \sum_i k_i$$, and this ratio cannot be improved in the general case. In this paper, motivated by several concrete applications, we consider two ways to improve upon the approximation given by the greedy algorithm. First, we give a bicriteria approximation algorithm for Multi-Submod-Cover that covers each constraint to within a factor of $$(1-1/e-\varepsilon)$$ while incurring an approximation of $$O(\frac{1}{\varepsilon}\log r)$$ in the cost. Second, we consider the special case when each $$f_i$$ is obtained from a truncated coverage function and obtain an algorithm that generalizes previous work on partial set cover (Partial-SC), covering integer programs (CIPs) and multiple vertex cover constraints (Bera et al., Theoret Comput Sci 555:2–8, 2014). Both these algorithms are based on mathematical programming relaxations that avoid the limitations of the greedy algorithm. We demonstrate the implications of our algorithms and related ideas to several applications ranging from geometric covering problems to clustering with outliers. Our work highlights the utility of the high-level model and the lens of submodularity in addressing this class of covering problems.
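
    The greedy baseline that the abstract improves upon can be sketched for the truncated-coverage special case. This is an illustrative reconstruction under assumed data structures, not the authors' algorithm or code:

        # Greedy sketch for Multi-Submod-Cover with truncated coverage functions
        # f_i(S) = min(k_i, #{e in S covering constraint i}): repeatedly pick the
        # element with the best residual-coverage-per-weight ratio.

        def greedy_multi_cover(weights, covers, k):
            """weights: {element: weight}; covers: {element: set of constraints};
            k: {constraint: required coverage}. Returns the chosen elements."""
            need = dict(k)                     # residual demand per constraint
            chosen, remaining = [], set(weights)
            while any(r > 0 for r in need.values()):
                def gain(e):
                    return sum(1 for c in covers[e] if need.get(c, 0) > 0)
                best = max(remaining, key=lambda e: gain(e) / weights[e])
                if gain(best) == 0:
                    raise ValueError("demands cannot be met by remaining elements")
                for c in covers[best]:
                    if need.get(c, 0) > 0:
                        need[c] -= 1
                chosen.append(best)
                remaining.remove(best)
            return chosen

        # Two unit-weight elements each covering one constraint beat one heavy
        # element covering both: greedy selects "u" and "v" (in either order).
        print(greedy_multi_cover(
            weights={"u": 1.0, "v": 1.0, "w": 3.0},
            covers={"u": {1}, "v": {2}, "w": {1, 2}},
            k={1: 1, 2: 1},
        ))

    The paper's two algorithms replace this combinatorial greedy rule with mathematical programming relaxations, which is what allows the improved bicriteria and truncated-coverage guarantees described above.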

     
  5. Abstract

    A long-standing problem in mathematical physics is the rigorous derivation of the incompressible Euler equation from Newtonian mechanics. Recently, Han-Kwan and Iacobelli (Proc Am Math Soc 149:3045–3061, 2021) showed that in the monokinetic regime, one can directly obtain the Euler equation from a system of $$N$$ particles interacting in $${\mathbb{T}}^d$$, $$d \ge 2$$, via Newton's second law through a supercritical mean-field limit. Namely, the coupling constant $$\lambda$$ in front of the pair potential, which is Coulombic, scales like $$N^{-\theta}$$ for some $$\theta \in (0,1)$$, in contrast to the usual mean-field scaling $$\lambda \sim N^{-1}$$. Assuming $$\theta \in (1-\frac{2}{d(d+1)}, 1)$$, they showed that the empirical measure of the system is effectively described by the solution to the Euler equation as $$N \rightarrow \infty$$. Han-Kwan and Iacobelli asked if their range for $$\theta$$ was optimal. We answer this question in the negative by showing the validity of the incompressible Euler equation in the limit $$N \rightarrow \infty$$ for $$\theta \in (1-\frac{2}{d}, 1)$$. Our proof is based on Serfaty's modulated-energy method, but compared to that of Han-Kwan and Iacobelli, crucially uses an improved “renormalized commutator” estimate to obtain the larger range for $$\theta$$. Additionally, we show that for $$\theta \le 1-\frac{2}{d}$$, one cannot, in general, expect convergence in the modulated energy notion of distance.

     