skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Fast and exact quantification of motif occurrences in biological sequences
Abstract Background Identification of motifs and quantification of their occurrences are important for the study of genetic diseases, gene evolution, transcription sites, and other biological mechanisms. Exact formulae for estimating count distributions of motifs under Markovian assumptions have high computational complexity and are impractical to be used on large motif sets. Approximated formulae, e.g. based on compound Poisson, are faster, but reliable p value calculation remains challenging. Here, we introduce ‘motif_prob’, a fast implementation of an exact formula for motif count distribution through progressive approximation with arbitrary precision. Our implementation speeds up the exact calculation, usually impractical, making it feasible and posit to substitute currently employed heuristics. Results We implement motif_prob in both Perl and C+ + languages, using an efficient error-bound iterative process for the exact formula, providing comparison with state-of-the-art tools (e.g. MoSDi) in terms of precision, run time benchmarks, along with a real-world use case on bacterial motif characterization. Our software is able to process a million of motifs (13–31 bases) over genome lengths of 5 million bases within the minute on a regular laptop, and the run times for both the Perl and C+ + code are several orders of magnitude smaller (50–1000× faster) than MoSDi, even when using their fast compound Poisson approximation (60–120× faster). In the real-world use cases, we first show the consistency of motif_prob with MoSDi, and then how the p-value quantification is crucial for enrichment quantification when bacteria have different GC content, using motifs found in antimicrobial resistance genes. The software and the code sources are available under the MIT license at https://github.com/DataIntellSystLab/motif_prob . Conclusions The motif_prob software is a multi-platform and efficient open source solution for calculating exact frequency distributions of motifs. It can be integrated with motif discovery/characterization tools for quantifying enrichment and deviation from expected frequency ranges with exact p values, without loss in data processing efficiency.  more » « less
Award ID(s):
2013998
PAR ID:
10301294
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
BMC Bioinformatics
Volume:
22
Issue:
1
ISSN:
1471-2105
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Woodruff, David P. (Ed.)
    Matroids are a fundamental object of study in combinatorial optimization. Three closely related and important problems involving matroids are maximizing the size of the union of $$k$$ independent sets (that is, \emph{$$k$$-fold matroid union}), computing $$k$$ disjoint bases (a.k.a.\ \emph{matroid base packing}), and covering the elements by $$k$$ bases (a.k.a.\ \emph{matroid base covering}). These problems generalize naturally to integral and real-valued capacities on the elements. This work develops faster exact and/or approximation problems for these and some other closely related problems such as optimal reinforcement and matroid membership. We obtain improved running times both for general matroids in the independence oracle model and for the graphic matroid. The main thrust of our improvements comes from developing a faster and unifying \emph{push-relabel} algorithm for the integer-capacitated versions of these problems, building on previous work by [FM12]. We then build on this algorithm in two directions. First we develop a faster augmenting path subroutine for $$k$$-fold matroid union that, when appended to an approximation version of the push-relabel algorithm, gives a faster exact algorithm for some parameters of $$k$$. In particular we obtain a subquadratic-query running time in the uncapacitated setting for the three basic problems listed above. We also obtain faster approximation algorithms for these problems with real-valued capacities by reducing to small integral capacities via randomized rounding. To this end, we develop a new randomized rounding technique for base covering problems in matroids that may also be of independent interest. 
    more » « less
  2. Building on previous work in computer generated jazz solos using probabilistic grammars, this paper describes research extending the capabilities of the current learning process and grammar representation used in the Impro-Visor educational music software with the concepts of motifs and motif patterns. An approach has been developed using clustering, best match search techniques, and probabilistic grammar rules to identify motifs and incorporate them into computer generated solos. The abilities of this technique are further expanded through the use of motif patterns. Motif patterns are used to induce coherence in generated solos by learning the patterns in which motifs were used in a given set of transcriptions. This approach is implemented as a feature of the Impro- Visor software. 
    more » « less
  3. Building on previous work in computer generated jazz solos using probabilistic grammars, this paper describes research extending the capabilities of the current learning process and grammar representation used in the Impro-Visor educational music software with the concepts of motifs and motif patterns. An approach has been developed using clustering, best match search techniques, and probabilistic grammar rules to identify motifs and incorporate them into computer generated solos. The abilities of this technique are further expanded through the use of motif patterns. Motif patterns are used to induce coherence in generated solos by learning the patterns in which motifs were used in a given set of transcriptions. This approach is implemented as a feature of the Impro- Visor software. 
    more » « less
  4. This paper presents a black-box framework for accelerating packing optimization solvers. Our method applies to packing linear programming problems and a family of convex programming problems with linear constraints. The framework is designed for high-dimensional problems, for which the number of variables n is much larger than the number of measurements m. Given an [Formula: see text] problem, we construct a smaller [Formula: see text] problem, whose solution we use to find an approximation to the optimal solution. Our framework can accelerate both exact and approximate solvers. If the solver being accelerated produces an α-approximation, then we produce a [Formula: see text]-approximation of the optimal solution to the original problem. We present worst-case guarantees on run time and empirically demonstrate speedups of two orders of magnitude. 
    more » « less
  5. We survey recent work on approximation algorithms for computing degree-constrained subgraphs in graphs and their applications in combinatorial scientific computing. The problems we consider include maximization versions of cardinality matching, edge-weighted matching, vertex-weighted matching and edge-weighted $$b$$ -matching, and minimization versions of weighted edge cover and $$b$$ -edge cover. Exact algorithms for these problems are impractical for massive graphs with several millions of edges. For each problem we discuss theoretical foundations, the design of several linear or near-linear time approximation algorithms, their implementations on serial and parallel computers, and applications. Our focus is on practical algorithms that yield good performance on modern computer architectures with multiple threads and interconnected processors. We also include information about the software available for these problems. 
    more » « less