Abstract Background: Given a sequencing read, the broad goal of read mapping is to find the location(s) in the reference genome that have a “similar sequence”. Traditionally, “similar sequence” was defined as having a high alignment score, and read mappers were viewed as heuristic solutions to this well-defined problem. For sketch-based mappers, however, there has been no problem formulation capturing what problem an exact sketch-based mapping algorithm should solve. Moreover, no sketch-based method can find all possible mapping positions for a read above a certain score threshold. Results: In this paper, we formulate the problem of read mapping at the level of sequence sketches. We give an exact dynamic programming algorithm that finds all hits above a given similarity threshold. It runs in $$\mathcal{O}(|t| + |p| + \ell^2)$$ time and $$\mathcal{O}(\ell \log \ell)$$ space, where |t| is the number of $$k$$-mers inside the sketch of the reference, |p| is the number of $$k$$-mers inside the read’s sketch, and $$\ell$$ is the number of times that $$k$$-mers from the pattern sketch occur in the sketch of the text. We evaluate our algorithm’s performance in mapping long reads to the T2T assembly of human chromosome Y, where ampliconic regions make it desirable to find all good mapping positions. At a level of precision equivalent to minimap2, the recall of our algorithm is 0.88, compared to 0.76 for minimap2.
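The abstract does not reproduce the paper's exact dynamic program; as a toy illustration of the problem it formalizes (report every reference window whose sketch similarity with the read meets a threshold, rather than only a best hit), here is a brute-force sketch. The function names and the use of full k-mer sets in place of a real subsampled sketch (e.g., minimizers) are illustrative assumptions, not the paper's algorithm.

```python
# Toy illustration of the sketch-based mapping task (NOT the paper's
# O(|t|+|p|+l^2) dynamic program): find every reference window whose
# k-mer-sketch overlap with the read's sketch meets a threshold.

def kmer_sketch(seq, k):
    """All k-mers of seq; a real sketch would subsample, e.g. minimizers."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def map_all_hits(reference, read, k=3, threshold=0.5):
    """Return (start, score) for every window scoring >= threshold,
    where score = fraction of the read's sketch found in the window."""
    pattern = kmer_sketch(read, k)
    hits = []
    w = len(read)
    for start in range(len(reference) - w + 1):
        window = kmer_sketch(reference[start:start + w], k)
        score = len(pattern & window) / len(pattern)
        if score >= threshold:
            hits.append((start, score))
    return hits
```

Because every qualifying window is reported, a sequence placed twice in the reference yields two hits, which is the behavior the abstract argues is needed in ampliconic regions.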
Rocket-crane algorithm for the Feedback Arc Set problem
Understanding information flow in the brain can be facilitated by arranging neurons in the fly connectome to form a maximally “feedforward” structure. This task is naturally formulated as the Minimum Feedback Arc Set (MFAS) problem, which is NP-hard and especially challenging for large-scale graphs. To address this, we propose the Rocket-Crane algorithm, an efficient two-phase method for solving MFAS. In the first phase, we develop a continuous-space optimization method that rapidly generates high-quality solutions. In the second phase, we refine these solutions through advanced exploration techniques that integrate randomized and heuristic strategies to effectively escape local minima. Extensive experiments demonstrate that Rocket-Crane outperforms state-of-the-art methods in terms of solution quality, scalability, and computational efficiency. On the primary benchmark, the fly connectome, our method achieved a feedforward arc set with a total forward weight of 35,459,266 (about $$85\%$$), the highest among all competing methods. The algorithm is open-source and available on GitHub.
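The abstract does not detail either phase of Rocket-Crane. As a minimal illustration of the MFAS objective itself, the sketch below scores a vertex ordering by its forward weight (total weight of edges pointing from earlier to later positions; the feedback arcs are the rest) and improves it with adjacent-swap local search. This is a generic heuristic, not the paper's method, and all names are hypothetical.

```python
# Minimal illustration of the MFAS objective (not the Rocket-Crane method):
# maximize the forward weight of a vertex ordering, equivalently minimize
# the weight of feedback arcs, via greedy adjacent-swap local search.

def forward_weight(order, edges):
    """Total weight of edges (u, v, w) with u placed before v in order."""
    pos = {v: i for i, v in enumerate(order)}
    return sum(w for u, v, w in edges if pos[u] < pos[v])

def adjacent_swap_search(order, edges):
    """Swap adjacent vertices while doing so raises the forward weight."""
    order = list(order)
    improved = True
    while improved:
        improved = False
        for i in range(len(order) - 1):
            trial = order[:]
            trial[i], trial[i + 1] = trial[i + 1], trial[i]
            if forward_weight(trial, edges) > forward_weight(order, edges):
                order = trial
                improved = True
    return order
```

Adjacent swaps alone can get stuck in local optima, which is exactly the failure mode the abstract's second phase (randomized and heuristic escape strategies) is designed to address.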
- PAR ID: 10655187
- Publisher / Repository: Springer
- Date Published:
- Journal Name: Social Network Analysis and Mining
- Volume: 15
- Issue: 1
- ISSN: 1869-5469
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Abstract The Li-Stephens (LS) haplotype copying model forms the basis of a number of important statistical inference procedures in genetics. LS is a probabilistic generative model which supposes that a sampled chromosome is an imperfect mosaic of other chromosomes found in a population. In the frequentist setting which is the focus of this paper, the output of LS is a “copying path” through chromosome space. The behavior of LS depends crucially on two user-specified parameters, $$\theta$$ and $$\rho$$, which are respectively interpreted as the rates of mutation and recombination. However, because LS is not based on a realistic model of ancestry, the precise connection between these parameters and the biological phenomena they represent is unclear. Here, we offer an alternative perspective, which considers $$\theta$$ and $$\rho$$ as tuning parameters, and seeks to understand their impact on the LS output. We derive an algorithm which, for a given dataset, efficiently partitions the $$(\theta, \rho)$$ plane into regions where the output of the algorithm is constant, thereby enumerating all possible solutions to the LS model in one go. We extend this approach to the “diploid LS” model commonly used for phasing. We demonstrate the usefulness of our method by studying the effects of changing $$\theta$$ and $$\rho$$ when using LS for common bioinformatic tasks. Our findings indicate that using the conventional (i.e., population-scaled) values for $$\theta$$ and $$\rho$$ produces near-optimal results for imputation, but may systematically inflate switch error in the case of phasing diploid genotypes.
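The abstract treats $$\theta$$ and $$\rho$$ as the two knobs of LS; a simplified haploid Viterbi sketch below shows where they enter (as mismatch and haplotype-switch penalties, respectively). The concrete probabilities 1−θ, θ, 1−ρ, ρ/n are toy stand-ins, not the calibrated population-genetic parameterization, and the function name is hypothetical.

```python
import math

# Simplified haploid Li-Stephens Viterbi (illustrative only): the copying
# path is the most probable sequence of panel haplotypes to copy from, with
# theta penalizing mismatches and rho penalizing switches between haplotypes.

def ls_copying_path(panel, target, theta=0.01, rho=0.05):
    """panel: list of equal-length haplotype strings; target: string.
    Returns the Viterbi copying path as a list of panel indices."""
    n, L = len(panel), len(target)
    emit = lambda h, j: math.log(1 - theta) if panel[h][j] == target[j] else math.log(theta)
    stay, switch = math.log(1 - rho), math.log(rho / n)
    score = [emit(h, 0) for h in range(n)]
    back = []
    for j in range(1, L):
        best_prev = max(range(n), key=lambda h: score[h])
        new, ptr = [], []
        for h in range(n):
            # Either keep copying haplotype h, or switch in from the best one.
            if score[h] + stay >= score[best_prev] + switch:
                new.append(score[h] + stay + emit(h, j)); ptr.append(h)
            else:
                new.append(score[best_prev] + switch + emit(h, j)); ptr.append(best_prev)
        score, back = new, back + [ptr]
    path = [max(range(n), key=lambda h: score[h])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

Raising $$\rho$$ makes switches cheaper and the mosaic finer; raising $$\theta$$ tolerates mismatches and coarsens it, which is the tuning-parameter view the abstract takes.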
-
Abstract We consider the problem of covering multiple submodular constraints. Given a finite ground set N, a weight function $$w: N \rightarrow \mathbb{R}_+$$, r monotone submodular functions $$f_1, f_2, \ldots, f_r$$ over N, and requirements $$k_1, k_2, \ldots, k_r$$, the goal is to find a minimum-weight subset $$S \subseteq N$$ such that $$f_i(S) \ge k_i$$ for $$1 \le i \le r$$. We refer to this problem as Multi-Submod-Cover; it was recently considered by Har-Peled and Jones (Few cuts meet many point sets, CoRR abs/1808.03260, 2018), who were motivated by an application in geometry. Even with $$r=1$$, Multi-Submod-Cover generalizes the well-known Submodular Set Cover problem (Submod-SC), and it can also be easily reduced to Submod-SC. A simple greedy algorithm gives an $$O(\log(kr))$$ approximation where $$k = \sum_i k_i$$, and this ratio cannot be improved in the general case. In this paper, motivated by several concrete applications, we consider two ways to improve upon the approximation given by the greedy algorithm. First, we give a bicriteria approximation algorithm for Multi-Submod-Cover that covers each constraint to within a factor of $$(1-1/e-\varepsilon)$$ while incurring an approximation of $$O(\frac{1}{\varepsilon}\log r)$$ in the cost. Second, we consider the special case when each $$f_i$$ is obtained from a truncated coverage function and obtain an algorithm that generalizes previous work on partial set cover (Partial-SC), covering integer programs (CIPs), and multiple vertex cover constraints (Bera et al., Theoret Comput Sci 555:2–8, 2014). Both these algorithms are based on mathematical programming relaxations that avoid the limitations of the greedy algorithm. We demonstrate the implications of our algorithms and related ideas for several applications ranging from geometric covering problems to clustering with outliers. Our work highlights the utility of the high-level model and the lens of submodularity in addressing this class of covering problems.
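The greedy baseline the abstract mentions can be sketched for the truncated-coverage special case: repeatedly pick the element with the best marginal coverage gain per unit weight until every $$f_i(S) = \min(k_i, |S\text{'s coverage of constraint } i|)$$ reaches $$k_i$$. The data layout and names below are illustrative assumptions; this is the baseline the paper improves upon, not its LP-based algorithms.

```python
# Greedy sketch for Multi-Submod-Cover with truncated coverage functions
# (the simple O(log(kr))-approximate baseline, not the paper's bicriteria
# or LP-rounding algorithms).

def greedy_multi_cover(universe_sets, weights, constraints):
    """universe_sets: dict elem -> set of items it covers.
    constraints: list of (item_subset, k_i); constraint i is satisfied
    once at least k_i items of item_subset are covered by the chosen set.
    Returns chosen elements, picked by marginal gain per unit weight."""
    chosen, covered = [], set()

    def deficit():
        return sum(max(0, k - len(covered & sub)) for sub, k in constraints)

    while deficit() > 0:
        def gain(e):
            new = covered | universe_sets[e]
            return sum(min(k, len(new & sub)) - min(k, len(covered & sub))
                       for sub, k in constraints)
        best = max((e for e in universe_sets if e not in chosen and gain(e) > 0),
                   key=lambda e: gain(e) / weights[e], default=None)
        if best is None:
            raise ValueError("constraints not satisfiable")
        chosen.append(best)
        covered |= universe_sets[best]
    return chosen
```

Truncating each coverage count at $$k_i$$ is what makes each $$f_i$$ monotone submodular, so the classical greedy analysis for Submod-SC applies.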
-
Agricola, Ilka; Bögelein, Verena (Ed.) Abstract Murphy and the second author showed that a generic closed Riemannian manifold has no totally geodesic submanifolds, provided the ambient space is at least four dimensional. Lytchak and Petrunin established a similar result in dimension 3. For the higher dimensional result, the “generic set” is open and dense in the $$C^{q}$$-topology for any $$q \ge 2$$. In Lytchak and Petrunin’s work, the “generic set” is a dense $$G_{\delta}$$ in the $$C^{q}$$-topology for any $$q \ge 2$$. Here we show that the set of such metrics on a compact 3-manifold actually contains a set that is open and dense in the $$C^{q}$$-topology, provided $$q \ge 3$$.
-
Abstract The shear viscosity $$\eta$$ of a quark–gluon plasma in equilibrium can be calculated analytically using multiple methods or numerically using the Green–Kubo relation. It has been realized, and we confirm here, that the Chapman–Enskog method agrees well with the Green–Kubo result for both isotropic and anisotropic two-body scatterings. We then apply the Chapman–Enskog method to study the shear viscosity of the parton matter from a multi-phase transport model. In particular, we study the parton matter in the center cell of central and midcentral Au + Au collisions at 200A GeV and Pb + Pb collisions at 2760A GeV, which is assumed to be a plasma in thermal equilibrium but only partial chemical equilibrium. As a result of using a constant Debye mass or cross section $$\sigma$$ for parton scatterings, the $$\eta/s$$ ratio increases with time (as the effective temperature decreases), contrary to the trend preferred by Bayesian analyses of the experimental data or pQCD results that use temperature-dependent Debye masses. At $$\sigma = 3$$ mb, which enables the transport model to approximately reproduce the elliptic flow data of the bulk matter, the average $$\eta/s$$ of the parton matter in partial equilibrium is found to be very small, between one and two times $$1/(4\pi)$$.
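The Green–Kubo relation mentioned in the abstract computes $$\eta = \frac{V}{k_B T}\int_0^\infty \langle P_{xy}(0)P_{xy}(t)\rangle\,dt$$ from the shear-stress autocorrelation. As a toy numerical sketch, assume a model autocorrelation $$C(t) = C_0 e^{-t/\tau}$$ (an assumption for illustration, not transport-model output), whose integral is $$C_0\tau$$, so $$\eta = V C_0 \tau / (k_B T)$$ in closed form to check the quadrature against.

```python
import math

# Toy Green-Kubo estimate of shear viscosity:
#   eta = V/(k_B*T) * integral_0^inf <P_xy(0) P_xy(t)> dt
# with an assumed exponentially decaying autocorrelation C(t) = C0*exp(-t/tau),
# so the analytic answer is eta = V*C0*tau/(k_B*T).

def green_kubo_eta(corr, dt, volume, kT):
    """Trapezoidal Green-Kubo integral over sampled autocorrelation values."""
    integral = sum((corr[i] + corr[i + 1]) * dt / 2 for i in range(len(corr) - 1))
    return volume / kT * integral

C0, tau, dt = 2.0, 0.5, 0.001
corr = [C0 * math.exp(-i * dt / tau) for i in range(20000)]  # sample to 40*tau
eta = green_kubo_eta(corr, dt, volume=1.0, kT=1.0)  # analytic: V*C0*tau/kT = 1.0
```

In a transport simulation the autocorrelation comes from measured stress-tensor fluctuations rather than a closed form, which is why the abstract contrasts this numerical route with the analytic Chapman–Enskog calculation.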