skip to main content


Title: RNA design via structure-aware multifrontier ensemble optimization
Abstract Motivation

RNA design is the search for a sequence or set of sequences that will fold to desired structure, also known as the inverse problem of RNA folding. However, the sequences designed by existing algorithms often suffer from low ensemble stability, which worsens for long sequence design. Additionally, for many methods only a small number of sequences satisfying the MFE criterion can be found by each run of design. These drawbacks limit their use cases.

Results

We propose an innovative optimization paradigm, SAMFEO, which optimizes ensemble objectives (equilibrium probability or ensemble defect) by iterative search and yields a very large number of successfully designed RNA sequences as byproducts. We develop a search method which leverages structure level and ensemble level information at different stages of the optimization: initialization, sampling, mutation, and updating. Our work, while being less complicated than others, is the first algorithm that is able to design thousands of RNA sequences for the puzzles from the Eterna100 benchmark. In addition, our algorithm solves the most Eterna100 puzzles among all the general optimization based methods in our study. The only baseline solving more puzzles than our work is dependent on handcrafted heuristics designed for a specific folding model. Surprisingly, our approach shows superiority on designing long sequences for structures adapted from the database of 16S Ribosomal RNAs.

Availability and implementation

Our source code and data used in this article is available at https://github.com/shanry/SAMFEO.

 
more » « less
Award ID(s):
2009071
NSF-PAR ID:
10427270
Author(s) / Creator(s):
; ; ; ; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Bioinformatics
Volume:
39
Issue:
Supplement_1
ISSN:
1367-4803
Page Range / eLocation ID:
p. i563-i571
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Motivation

    Predicting the secondary structure of an ribonucleic acid (RNA) sequence is useful in many applications. Existing algorithms [based on dynamic programming] suffer from a major limitation: their runtimes scale cubically with the RNA length, and this slowness limits their use in genome-wide applications.

    Results

    We present a novel alternative O(n3)-time dynamic programming algorithm for RNA folding that is amenable to heuristics that make it run in O(n) time and O(n) space, while producing a high-quality approximation to the optimal solution. Inspired by incremental parsing for context-free grammars in computational linguistics, our alternative dynamic programming algorithm scans the sequence in a left-to-right (5′-to-3′) direction rather than in a bottom-up fashion, which allows us to employ the effective beam pruning heuristic. Our work, though inexact, is the first RNA folding algorithm to achieve linear runtime (and linear space) without imposing constraints on the output structure. Surprisingly, our approximate search results in even higher overall accuracy on a diverse database of sequences with known structures. More interestingly, it leads to significantly more accurate predictions on the longest sequence families in that database (16S and 23S Ribosomal RNAs), as well as improved accuracies for long-range base pairs (500+ nucleotides apart), both of which are well known to be challenging for the current models.

    Availability and implementation

    Our source code is available at https://github.com/LinearFold/LinearFold, and our webserver is at http://linearfold.org (sequence limit: 100 000nt).

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  2. Abstract Motivation RNA secondary structure prediction is widely used to understand RNA function. Recently, there has been a shift away from the classical minimum free energy methods to partition function-based methods that account for folding ensembles and can therefore estimate structure and base pair probabilities. However, the classical partition function algorithm scales cubically with sequence length, and is therefore prohibitively slow for long sequences. This slowness is even more severe than cubic-time free energy minimization due to a substantially larger constant factor in runtime. Results Inspired by the success of our recent LinearFold algorithm that predicts the approximate minimum free energy structure in linear time, we design a similar linear-time heuristic algorithm, LinearPartition, to approximate the partition function and base-pairing probabilities, which is shown to be orders of magnitude faster than Vienna RNAfold and CONTRAfold (e.g. 2.5 days versus 1.3 min on a sequence with length 32 753 nt). More interestingly, the resulting base-pairing probabilities are even better correlated with the ground-truth structures. LinearPartition also leads to a small accuracy improvement when used for downstream structure prediction on families with the longest length sequences (16S and 23S rRNAs), as well as a substantial improvement on long-distance base pairs (500+ nt apart). Availability and implementation Code: http://github.com/LinearFold/LinearPartition; Server: http://linearfold.org/partition. Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less
  3. Abstract

    Riboswitches are conserved structural ribonucleic acid (RNA) sensors that are mainly found to regulate a large number of genes/operons in bacteria. Presently, >50 bacterial riboswitch classes have been discovered, but only the thiamine pyrophosphate riboswitch class is detected in a few eukaryotes like fungi, plants and algae. One of the most important challenges in riboswitch research is to discover existing riboswitch classes in eukaryotes and to understand the evolution of bacterial riboswitches. However, traditional search methods for riboswitch detection have failed to detect eukaryotic riboswitches besides just one class and any distant structural homologs of riboswitches. We developed a novel approach based on inverse RNA folding that attempts to find sequences that match the shape of the target structure with minimal sequence conservation based on key nucleotides that interact directly with the ligand. Then, to support our matched candidates, we expanded the results into a covariance model representing similar sequences preserving the structure. Our method transforms a structure-based search into a sequence-based search that considers the conservation of secondary structure shape and ligand-binding residues. This method enables us to identify a potential structural candidate in fungi that could be the distant homolog of bacterial purine riboswitches. Further, phylogenomic analysis and evolutionary distribution of this structural candidate indicate that the most likely point of origin of this structural candidate in these organisms is associated with the loss of traditional purine riboswitches. The computational approach could be applicable to other domains and problems in RNA research.

     
    more » « less
  4. This data set for the manuscript entitled "Design of Peptides that Fold and Self-Assemble on Graphite" includes all files needed to run and analyze the simulations described in the this manuscript in the molecular dynamics software NAMD, as well as the output of the simulations. The files are organized into directories corresponding to the figures of the main text and supporting information. They include molecular model structure files (NAMD psf or Amber prmtop format), force field parameter files (in CHARMM format), initial atomic coordinates (pdb format), NAMD configuration files, Colvars configuration files, NAMD log files, and NAMD output including restart files (in binary NAMD format) and trajectories in dcd format (downsampled to 10 ns per frame). Analysis is controlled by shell scripts (Bash-compatible) that call VMD Tcl scripts or python scripts. These scripts and their output are also included.

    Version: 2.0

    Changes versus version 1.0 are the addition of the free energy of folding, adsorption, and pairing calculations (Sim_Figure-7) and shifting of the figure numbers to accommodate this addition.


    Conventions Used in These Files
    ===============================

    Structure Files
    ----------------
    - graph_*.psf or sol_*.psf (original NAMD (XPLOR?) format psf file including atom details (type, charge, mass), as well as definitions of bonds, angles, dihedrals, and impropers for each dipeptide.)

    - graph_*.pdb or sol_*.pdb (initial coordinates before equilibration)
    - repart_*.psf (same as the above psf files, but the masses of non-water hydrogen atoms have been repartitioned by VMD script repartitionMass.tcl)
    - freeTop_*.pdb (same as the above pdb files, but the carbons of the lower graphene layer have been placed at a single z value and marked for restraints in NAMD)
    - amber_*.prmtop (combined topology and parameter files for Amber force field simulations)
    - repart_amber_*.prmtop (same as the above prmtop files, but the masses of non-water hydrogen atoms have been repartitioned by ParmEd)

    Force Field Parameters
    ----------------------
    CHARMM format parameter files:
    - par_all36m_prot.prm (CHARMM36m FF for proteins)
    - par_all36_cgenff_no_nbfix.prm (CGenFF v4.4 for graphene) The NBFIX parameters are commented out since they are only needed for aromatic halogens and we use only the CG2R61 type for graphene.
    - toppar_water_ions_prot_cgenff.str (CHARMM water and ions with NBFIX parameters needed for protein and CGenFF included and others commented out)

    Template NAMD Configuration Files
    ---------------------------------
    These contain the most commonly used simulation parameters. They are called by the other NAMD configuration files (which are in the namd/ subdirectory):
    - template_min.namd (minimization)
    - template_eq.namd (NPT equilibration with lower graphene fixed)
    - template_abf.namd (for adaptive biasing force)

    Minimization
    -------------
    - namd/min_*.0.namd

    Equilibration
    -------------
    - namd/eq_*.0.namd

    Adaptive biasing force calculations
    -----------------------------------
    - namd/eabfZRest7_graph_chp1404.0.namd
    - namd/eabfZRest7_graph_chp1404.1.namd (continuation of eabfZRest7_graph_chp1404.0.namd)

    Log Files
    ---------
    For each NAMD configuration file given in the last two sections, there is a log file with the same prefix, which gives the text output of NAMD. For instance, the output of namd/eabfZRest7_graph_chp1404.0.namd is eabfZRest7_graph_chp1404.0.log.

    Simulation Output
    -----------------
    The simulation output files (which match the names of the NAMD configuration files) are in the output/ directory. Files with the extensions .coor, .vel, and .xsc are coordinates in NAMD binary format, velocities in NAMD binary format, and extended system information (including cell size) in text format. Files with the extension .dcd give the trajectory of the atomic coorinates over time (and also include system cell information). Due to storage limitations, large DCD files have been omitted or replaced with new DCD files having the prefix stride50_ including only every 50 frames. The time between frames in these files is 50 * 50000 steps/frame * 4 fs/step = 10 ns. The system cell trajectory is also included for the NPT runs are output/eq_*.xst.

    Scripts
    -------
    Files with the .sh extension can be found throughout. These usually provide the highest level control for submission of simulations and analysis. Look to these as a guide to what is happening. If there are scripts with step1_*.sh and step2_*.sh, they are intended to be run in order, with step1_*.sh first.


    CONTENTS
    ========

    The directory contents are as follows. The directories Sim_Figure-1 and Sim_Figure-8 include README.txt files that describe the files and naming conventions used throughout this data set.

    Sim_Figure-1: Simulations of N-acetylated C-amidated amino acids (Ac-X-NHMe) at the graphite–water interface.

    Sim_Figure-2: Simulations of different peptide designs (including acyclic, disulfide cyclized, and N-to-C cyclized) at the graphite–water interface.

    Sim_Figure-3: MM-GBSA calculations of different peptide sequences for a folded conformation and 5 misfolded/unfolded conformations.

    Sim_Figure-4: Simulation of four peptide molecules with the sequence cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) at the graphite–water interface at 370 K.

    Sim_Figure-5: Simulation of four peptide molecules with the sequence cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) at the graphite–water interface at 295 K.

    Sim_Figure-5_replica: Temperature replica exchange molecular dynamics simulations for the peptide cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) with 20 replicas for temperatures from 295 to 454 K.

    Sim_Figure-6: Simulation of the peptide molecule cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) in free solution (no graphite).

    Sim_Figure-7: Free energy calculations for folding, adsorption, and pairing for the peptide CHP1404 (sequence: cyc(GTGSGTG-GPGG-GCGTGTG-SGPG)). For folding, we calculate the PMF as function of RMSD by replica-exchange umbrella sampling (in the subdirectory Folding_CHP1404_Graphene/). We make the same calculation in solution, which required 3 seperate replica-exchange umbrella sampling calculations (in the subdirectory Folding_CHP1404_Solution/). Both PMF of RMSD calculations for the scrambled peptide are in Folding_scram1404/. For adsorption, calculation of the PMF for the orientational restraints and the calculation of the PMF along z (the distance between the graphene sheet and the center of mass of the peptide) are in Adsorption_CHP1404/ and Adsorption_scram1404/. The actual calculation of the free energy is done by a shell script ("doRestraintEnergyError.sh") in the 1_free_energy/ subsubdirectory. Processing of the PMFs must be done first in the 0_pmf/ subsubdirectory. Finally, files for free energy calculations of pair formation for CHP1404 are found in the Pair/ subdirectory.

    Sim_Figure-8: Simulation of four peptide molecules with the sequence cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) where the peptides are far above the graphene–water interface in the initial configuration.

    Sim_Figure-9: Two replicates of a simulation of nine peptide molecules with the sequence cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) at the graphite–water interface at 370 K.

    Sim_Figure-9_scrambled: Two replicates of a simulation of nine peptide molecules with the control sequence cyc(GGTPTTGGGGGGSGGPSGTGGC) at the graphite–water interface at 370 K.

    Sim_Figure-10: Adaptive biasing for calculation of the free energy of the folded peptide as a function of the angle between its long axis and the zigzag directions of the underlying graphene sheet.

     

    This material is based upon work supported by the US National Science Foundation under grant no. DMR-1945589. A majority of the computing for this project was performed on the Beocat Research Cluster at Kansas State University, which is funded in part by NSF grants CHE-1726332, CNS-1006860, EPS-1006860, and EPS-0919443. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562, through allocation BIO200030. 
    more » « less
  5. Ribonucleic acid (RNA) is a fundamental biological molecule that is essential to all living organisms, performing a versatile array of cellular tasks. The function of many RNA molecules is strongly related to the structure it adopts. As a result, great effort is being dedicated to the design of efficient algorithms that solve the “folding problem”—given a sequence of nucleotides, return a probable list of base pairs, referred to as the secondary structure prediction. Early algorithms largely rely on finding the structure with minimum free energy. However, the predictions rely on effective simplified free energy models that may not correctly identify the correct structure as the one with the lowest free energy. In light of this, new, data-driven approaches that not only consider free energy, but also use machine learning techniques to learn motifs are also investigated and recently been shown to outperform free energy–based algorithms on several experimental data sets. In this work, we introduce the new ExpertRNA algorithm that provides a modular framework that can easily incorporate an arbitrary number of rewards (free energy or nonparametric/data driven) and secondary structure prediction algorithms. We argue that this capability of ExpertRNA has the potential to balance out different strengths and weaknesses of state-of-the-art folding tools. We test ExpertRNA on several RNA sequence-structure data sets, and we compare the performance of ExpertRNA against a state-of-the-art folding algorithm. We find that ExpertRNA produces, on average, more accurate predictions of nonpseudoknotted secondary structures than the structure prediction algorithm used, thus validating the promise of the approach. Summary of Contribution: ExpertRNA is a new algorithm inspired by a biological problem. It is applied to solve the problem of secondary structure prediction for RNA molecules given an input sequence. The computational contribution is given by the design of a multibranch, multiexpert rollout algorithm that enables the use of several state-of-the-art approaches as base heuristics and allowing several experts to evaluate partial candidate solutions generated, thus avoiding assuming the reward being optimized by an RNA molecule when folding. Our implementation allows for the effective use of parallel computational resources as well as to control the size of the rollout tree as the algorithm progresses. The problem of RNA secondary structure prediction is of primary importance within the biology field because the molecule structure is strongly related to its functionality. Whereas the contribution of the paper is in the algorithm, the importance of the application makes ExpertRNA a showcase of the relevance of computationally efficient algorithms in supporting scientific discovery. 
    more » « less