Abstract MotivationRNA design is the search for a sequence or set of sequences that will fold to desired structure, also known as the inverse problem of RNA folding. However, the sequences designed by existing algorithms often suffer from low ensemble stability, which worsens for long sequence design. Additionally, for many methods only a small number of sequences satisfying the MFE criterion can be found by each run of design. These drawbacks limit their use cases. ResultsWe propose an innovative optimization paradigm, SAMFEO, which optimizes ensemble objectives (equilibrium probability or ensemble defect) by iterative search and yields a very large number of successfully designed RNA sequences as byproducts. We develop a search method which leverages structure level and ensemble level information at different stages of the optimization: initialization, sampling, mutation, and updating. Our work, while being less complicated than others, is the first algorithm that is able to design thousands of RNA sequences for the puzzles from the Eterna100 benchmark. In addition, our algorithm solves the most Eterna100 puzzles among all the general optimization based methods in our study. The only baseline solving more puzzles than our work is dependent on handcrafted heuristics designed for a specific folding model. Surprisingly, our approach shows superiority on designing long sequences for structures adapted from the database of 16S Ribosomal RNAs. Availability and implementationOur source code and data used in this article is available at https://github.com/shanry/SAMFEO.
more »
« less
EnsembleDesign: messenger RNA design minimizing ensemble free energy via probabilistic lattice parsing
Abstract MotivationThe task of designing optimized messenger RNA (mRNA) sequences has received much attention in recent years, thanks to breakthroughs in mRNA vaccines during the COVID-19 pandemic. Because most previous work aimed to minimize the minimum free energy (MFE) of the mRNA in order to improve stability and protein expression, which only considers one particular structure per mRNA sequence, millions of alternative conformations in equilibrium are neglected. More importantly, we prefer an mRNA to populate multiple stable structures and be flexible among them during translation when the ribosome unwinds it. ResultsTherefore, we consider a new objective to minimize the ensemble free energy of an mRNA, which includes all possible structures in its Boltzmann ensemble. However, this new problem is much harder to solve than the original MFE optimization. To address the increased complexity of this problem, we introduce EnsembleDesign, a novel algorithm that employs continuous relaxation to optimize the expected ensemble free energy over a distribution of candidate sequences. EnsembleDesign extends both the lattice representation of the design space and the dynamic programming algorithm from LinearDesign to their probabilistic counterparts. Our algorithm consistently outperforms LinearDesign in terms of ensemble free energy, especially on long sequences. Interestingly, as byproducts, our designs also enjoy lower average unpaired probabilities (which correlates with degradation) and flatter Boltzmann ensembles (more flexibility between conformations). Availability and implementationOur code is available on: https://github.com/LinearFold/EnsembleDesign.
more »
« less
- PAR ID:
- 10673080
- Publisher / Repository:
- Oxford University Press
- Date Published:
- Journal Name:
- Bioinformatics
- Volume:
- 41
- Issue:
- Supplement_1
- ISSN:
- 1367-4803
- Page Range / eLocation ID:
- i391 to i400
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Belazzougui, Djamal; Ouangraoua, Aïda (Ed.)The problem of designing an RNA sequence v that encodes for a given target protein w plays an important role in messenger RNA (mRNA) vaccine design. Due to codon degeneracy, there exist exponentially many RNA sequences for a single target protein. These candidate RNA sequences may adopt different secondary structure conformations with varying minimum free energy (MFE), affecting their thermodynamic stability and consequently mRNA half-life. In addition, species-specific codon usage bias, as measured by the codon adaptation index (CAI), also plays an essential role in translation efficiency. While previous works have focused on optimizing either MFE or CAI, more recent works have shown the merits of optimizing both objectives. Importantly, there is a trade-off between MFE and CAI, i.e. optimizing one objective is at the expense of the other. Here, we formulate the Pareto Optimal RNA Design problem, seeking the set of Pareto optimal solutions for which no other solution exists that is better in terms of both MFE and CAI. We introduce DERNA (DEsign RNA), which uses the weighted sum method to enumerate the Pareto front by optimizing convex combinations of both objectives. DERNA uses dynamic programming to solve each convex combination in O(|w|³) time and O(|w|²) space. Compared to a previous approach that only optimizes MFE, we show on a benchmark dataset that DERNA obtains solutions with identical MFE but superior CAI. Additionally, we show that DERNA matches the performance in terms of solution quality of LinearDesign, a recent approach that similarly seeks to balance MFE and CAI. Finally, we demonstrate our method’s potential for mRNA vaccine design using SARS-CoV-2 spike as the target protein.more » « less
-
The variational implicit-solvent model (VISM) is an efficient approach to biomolecular interactions, where electrostatic interactions are crucial. The total VISM free energy of a dielectric boundary (i.e. solute–solvent interface) consists of the interfacial energy, solute–solvent interaction energy and dielectric electrostatic energy. The last part is the maximum value of the classical and concave Poisson–Boltzmann (PB) energy functional of electrostatic potentials, with the maximizer being the equilibrium electrostatic potential governed by the PB equation. For the consistency of energy minimization and computational stability, here we propose alternatively to minimize the convex Legendre-transformed Poisson–Boltzmann (LTPB) electrostatic energy functional of all dielectric displacements constrained by Gauss’ Law in the solute region. Both integrable and discrete solute charge densities are treated, and the duality of the LTPB and PB functionals is established. A penalty method is designed for the constrained minimization of the LTPB functional. In application to biomolecular interactions, we minimize the total VISM free energy iteratively, while in each step of such iteration, minimize the LTPB energy. Convergence of such a min–min algorithm is shown. Our numerical results on the solvation of a single ion indicate that the LTPB performs better than the PB formulation, providing possibilities for efficient biomolecular simulations.more » « less
-
Seki, Shinnosuke; Stewart, Jaimie Marie (Ed.)Molecular programmers and nanostructure engineers use domain-level design to abstract away messy DNA/RNA sequence, chemical and geometric details. Such domain-level abstractions are enforced by sequence design principles and provide a key principle that allows scaling up of complex multistranded DNA/RNA programs and structures. Determining the most favoured secondary structure, or Minimum Free Energy (MFE), of a set of strands, is typically studied at the sequence level but has seen limited domain-level work. We analyse the computational complexity of MFE for multistranded systems in a simple setting were we allow only 1 or 2 domains per strand. On the one hand, with 2-domain strands, we find that the MFE decision problem is NP-complete, even without pseudoknots, and requires exponential time algorithms assuming SAT does. On the other hand, in the simplest case of 1-domain strands there are efficient MFE algorithms for various binding modes. However, even in this single-domain case, MFE is P-hard for promiscuous binding, where one domain may bind to multiple as experimentally used by Nikitin [Nat Chem., 2023], which in turn implies that strands consisting of a single domain efficiently implement arbitrary Boolean circuits.more » « less
-
Abstract MotivationMillions of protein sequences have been generated by numerous genome and transcriptome sequencing projects. However, experimentally determining the function of the proteins is still a time consuming, low-throughput, and expensive process, leading to a large protein sequence-function gap. Therefore, it is important to develop computational methods to accurately predict protein function to fill the gap. Even though many methods have been developed to use protein sequences as input to predict function, much fewer methods leverage protein structures in protein function prediction because there was lack of accurate protein structures for most proteins until recently. ResultsWe developed TransFun—a method using a transformer-based protein language model and 3D-equivariant graph neural networks to distill information from both protein sequences and structures to predict protein function. It extracts feature embeddings from protein sequences using a pre-trained protein language model (ESM) via transfer learning and combines them with 3D structures of proteins predicted by AlphaFold2 through equivariant graph neural networks. Benchmarked on the CAFA3 test dataset and a new test dataset, TransFun outperforms several state-of-the-art methods, indicating that the language model and 3D-equivariant graph neural networks are effective methods to leverage protein sequences and structures to improve protein function prediction. Combining TransFun predictions and sequence similarity-based predictions can further increase prediction accuracy. Availability and implementationThe source code of TransFun is available at https://github.com/jianlin-cheng/TransFun.more » « less
An official website of the United States government

