Abstract Motivation RNA secondary structure prediction is widely used to understand RNA function. Recently, there has been a shift away from the classical minimum free energy methods to partition functionbased methods that account for folding ensembles and can therefore estimate structure and base pair probabilities. However, the classical partition function algorithm scales cubically with sequence length, and is therefore prohibitively slow for long sequences. This slowness is even more severe than cubictime free energy minimization due to a substantially larger constant factor in runtime. Results Inspired by the success of our recent LinearFold algorithm that predicts the approximate minimum free energy structure in linear time, we design a similar lineartime heuristic algorithm, LinearPartition, to approximate the partition function and basepairing probabilities, which is shown to be orders of magnitude faster than Vienna RNAfold and CONTRAfold (e.g. 2.5 days versus 1.3 min on a sequence with length 32 753 nt). More interestingly, the resulting basepairing probabilities are even better correlated with the groundtruth structures. LinearPartition also leads to a small accuracy improvement when used for downstream structure prediction on families with the longest length sequences (16S and 23S rRNAs), as well as a substantial improvement on longdistance base pairs (500+ nt apart).more »
LinearFold: lineartime approximate RNA folding by 5'to3' dynamic programming and beam search
Abstract Motivation Predicting the secondary structure of an ribonucleic acid (RNA) sequence is useful in many applications. Existing algorithms [based on dynamic programming] suffer from a major limitation: their runtimes scale cubically with the RNA length, and this slowness limits their use in genomewide applications. Results We present a novel alternative O(n3)time dynamic programming algorithm for RNA folding that is amenable to heuristics that make it run in O(n) time and O(n) space, while producing a highquality approximation to the optimal solution. Inspired by incremental parsing for contextfree grammars in computational linguistics, our alternative dynamic programming algorithm scans the sequence in a lefttoright (5′to3′) direction rather than in a bottomup fashion, which allows us to employ the effective beam pruning heuristic. Our work, though inexact, is the first RNA folding algorithm to achieve linear runtime (and linear space) without imposing constraints on the output structure. Surprisingly, our approximate search results in even higher overall accuracy on a diverse database of sequences with known structures. More interestingly, it leads to significantly more accurate predictions on the longest sequence families in that database (16S and 23S Ribosomal RNAs), as well as improved accuracies for longrange base pairs (500+ nucleotides apart), both of more »
 Award ID(s):
 1817231
 Publication Date:
 NSFPAR ID:
 10391388
 Journal Name:
 Bioinformatics
 Volume:
 35
 Issue:
 14
 Page Range or eLocationID:
 i295 to i304
 ISSN:
 13674803
 Sponsoring Org:
 National Science Foundation
More Like this


Abstract Motivation It is well known that the integration among different datasources is reliable because of its potential of unveiling new functionalities of the genomic expressions, which might be dormant in a singlesource analysis. Moreover, different studies have justified the more powerful analyses of multiplatform data. Toward this, in this study, we consider the circadian genes’ omics profile, such as copy number changes and RNAsequence data along with their survival response. We develop a Bayesian structural equation modeling coupled with linear regressions and log normal accelerated failuretime regression to integrate the information between these two platforms to predict the survival of the subjects. We place conjugate priors on the regression parameters and derive the Gibbs sampler using the conditional distributions of them. Results Our extensive simulation study shows that the integrative model provides a better fit to the data than its closest competitor. The analyses of glioblastoma cancer data and the breast cancer data from TCGA, the largest genomics and transcriptomics database, support our findings. Availability and implementation The developed method is wrapped in R package available at https://github.com/MAITYA02/semmcmc. Supplementary information Supplementary data are available at Bioinformatics online.

Graphs model realworld circumstances in many applications where they may constantly change to capture the dynamic behavior of the phenomena. Topological persistence which provides a set of birth and death pairs for the topological features is one instrument for analyzing such changing graph data. However, standard persistent homology defined over a growing space cannot always capture such a dynamic process unless shrinking with deletions is also allowed. Hence, zigzag persistence which incorporates both insertions and deletions of simplices is more appropriate in such a setting. Unlike standard persistence which admits nearly lineartime algorithms for graphs, such results for the zigzag version improving the general O(m^ω) time complexity are not known, where ω < 2.37286 is the matrix multiplication exponent. In this paper, we propose algorithms for zigzag persistence on graphs which run in nearlinear time. Specifically, given a filtration with m additions and deletions on a graph with n vertices and edges, the algorithm for 0dimension runs in O(mlog² n+mlog m) time and the algorithm for 1dimension runs in O(mlog⁴ n) time. The algorithm for 0dimension draws upon another algorithm designed originally for pairing critical points of Morse functions on 2manifolds. The algorithm for 1dimension pairs a negative edge withmore »

Abstract Background Alignmentfree methods for sequence comparisons have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures to construct phylogenetic trees. Recently, the average common substring measure, ACS , and its k mismatch counterpart, ACS k , have been shown to produce results as effective as multiplesequence alignment based methods for reconstruction of phylogeny trees. Since computing ACS k takes O ( n log k n ) time and hence impractical for large datasets, multiple heuristics that can approximate ACS k have been introduced. Results In this paper, we present a novel lineartime heuristic to approximate ACS k , which is faster than computing the exact ACS k while being closer to the exact ACS k values compared to previously published lineartime greedy heuristics. Using four real datasets, containing both DNA and protein sequences, we evaluate our algorithm in terms of accuracy, runtime and demonstrate its applicability for phylogeny reconstruction. Our algorithm provides better accuracy than previously published heuristic methods, while being comparable in its applications to phylogeny reconstruction. Conclusions Our method produces a better approximation for ACS k and is applicable for the alignmentfree comparison of biological sequences at highly competitive speed. The algorithmmore »

We present a general framework of designing efficient dynamic approximate algorithms for optimization on undirected graphs. In particular, we develop a technique that, given any problem that admits a certain notion of vertex sparsifiers, gives data structures that maintain approximate solutions in sublinear update and query time. We illustrate the applicability of our paradigm to the following problems. (1) A fullydynamic algorithm that approximates allpair maximumflows/minimumcuts up to a nearly logarithmic factor in $\tilde{O}(n^{2/3})$ amortized time against an oblivious adversary, and $\tilde{O}(m^{3/4})$ time against an adaptive adversary. (2) An incremental data structure that maintains $O(1)$approximate shortest path in $n^{o(1)}$ time per operation, as well as fully dynamic approximate allpair shortest path and transshipment in $\tilde{O}(n^{2/3+o(1)})$ amortized time per operation. (3) A fullydynamic algorithm that approximates allpair effective resistance up to an $(1+\eps)$ factor in $\tilde{O}(n^{2/3+o(1)} \epsilon^{O(1)})$ amortized update time per operation. The key tool behind result (1) is the dynamic maintenance of an algorithmic construction due to Madry [FOCS' 10], which partitions a graph into a collection of simpler graph structures (known as jtrees) and approximately captures the cutflow and metric structure of the graph. The $O(1)$approximation guarantee of (2) is by adapting the distance oracles by [ThorupZwick JACM `05].more »