Title: Integrated mRNA sequence optimization using deep learning
Abstract

The coronavirus disease 2019 pandemic has catalyzed the rapid development of mRNA vaccines, yet how to optimize the mRNA sequence of an exogenous gene, such as severe acute respiratory syndrome coronavirus 2 spike, to fit human cells remains a critical challenge. A new algorithm, iDRO (integrated deep-learning-based mRNA optimization), is developed to optimize multiple components of an mRNA sequence based on the given amino acid sequence of the target protein. Considering the biological constraints, we divided iDRO into two steps: open reading frame (ORF) optimization, and 5′ untranslated region (UTR) and 3′UTR generation. In ORF optimization, BiLSTM-CRF (bidirectional long short-term memory with conditional random field) is employed to determine the codon for each amino acid. In UTR generation, RNA-Bart (bidirectional auto-regressive transformer) is proposed to output the corresponding UTR. The results show that the optimized sequences of exogenous genes acquired the pattern of human endogenous gene sequences. In experimental validation, the mRNA sequence optimized by our method shows higher protein expression than the sequence produced by the conventional method. To the best of our knowledge, this is the first study to introduce deep-learning methods to integrated mRNA sequence optimization, and these results may contribute to the development of mRNA therapeutics.
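To make the ORF-optimization step concrete, the sketch below shows only the problem setting: every amino acid maps to a set of synonymous codons, and some rule has to pick one per position. It is not iDRO's BiLSTM-CRF. The rule here is a plain per-codon weight lookup, and `choose_codons`, `translate`, and the `weights` argument are hypothetical names introduced for illustration. The only data used is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Invert the table: amino acid -> list of synonymous codons.
SYNONYMS = {}
for codon, aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def choose_codons(protein, weights=None):
    """Pick one codon per amino acid using a hypothetical weight table.

    `weights` maps codon -> score (an assumption, e.g. usage frequencies
    supplied by the caller). Unlisted codons score 0.0; ties go to the
    first codon in table order.
    """
    weights = weights or {}
    return "".join(
        max(SYNONYMS[aa], key=lambda c: weights.get(c, 0.0)) for aa in protein
    )


def translate(orf):
    """Translate an RNA open reading frame back to amino acids."""
    return "".join(CODON_TO_AA[orf[i:i + 3]] for i in range(0, len(orf), 3))


peptide = "MKWVTF"  # arbitrary toy peptide, not taken from the paper
orf = choose_codons(peptide, weights={"AAG": 1.0})
print(orf, translate(orf) == peptide)
```

Whatever rule chooses the codons, the encoded protein must not change, so translating the result back is a useful sanity check for any codon-selection method.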

 
NSF-PAR ID:
10391397
Publisher / Repository:
Oxford University Press
Journal Name:
Briefings in Bioinformatics
Volume:
24
Issue:
1
ISSN:
1467-5463
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract BACKGROUND

    Calmodulin (CaM) is an essential protein in cellular activity and plays important roles in many processes in insect development. RNA interference (RNAi) has been hypothesized to be a promising method for pest control. CaM is a good candidate RNAi target. However, the sequence and function of CaM in Nilaparvata lugens are unknown. Furthermore, double‐stranded RNA (dsRNA) targeting the CaM gene for pest control is still unavailable.

    RESULTS

    In the present study, two alternatively spliced variants of CaM transcripts, designated NlCaM1 and NlCaM2, were cloned from N. lugens. The two cDNA sequences exhibited 100% identity to each other in the open reading frame (ORF), and only differed in the 3′ untranslated region (UTR). NlCaM mRNA, including NlCaM1 and NlCaM2, was detectable in all developmental stages and tissues of N. lugens, with significantly increased expression in the salivary glands. Knockdown of NlCaM expression by RNAi with different dsRNAs led to an inability to molt properly, increased mortality, which ranged from 49.7 to 92.5%, impacted development of the ovaries and led to female infertility. There were no significant reductions in the transcript levels of vitellogenin and its receptor or in the total vitellogenin protein level relative to the control group. However, a significant reduction in vitellogenin protein was detected in ovaries injected with dsNlCaM. In addition, a specific dsRNA of NlCaM for control of N. lugens was designed and tested.

    CONCLUSION

    NlCaM plays important roles mainly in nymph development and uptake of vitellogenin by ovaries in vitellogenesis in N. lugens. dsRNA derived from the less conserved 3′‐UTR of NlCaM shows great potential for RNAi‐based N. lugens management. © 2018 Society of Chemical Industry

     
  2. Simon, Anne E. (Ed.)
    ABSTRACT Regardless of the general model of translation in eukaryotic cells, a number of studies suggested that many mRNAs encode multiple proteins. Leaky scanning, which supplies ribosomes to downstream open reading frames (ORFs) by readthrough of upstream ORFs, has great potential to translate polycistronic mRNAs. However, the mRNA elements controlling leaky scanning and their biological relevance have rarely been elucidated, with exceptions such as the Kozak sequence. Here, we have analyzed the strategy of a plant RNA virus to translate three movement proteins from a single RNA molecule through leaky scanning. The in planta and in vitro results indicate that the significantly shorter 5′ untranslated region (UTR) of the most upstream ORF promotes leaky scanning, potentially fine-tuning the translation efficiency of the three proteins in a single RNA molecule to optimize viral propagation. Our results suggest that the remarkably short length of the leader sequence, like the Kozak sequence, is a translational regulatory element with a biologically important role, as previous studies have shown biochemically. IMPORTANCE Potexvirus, a group of plant viruses, infect a variety of crops, including cultivated crops. It has been thought that the three transition proteins that are essential for the cell-to-cell transfer of potexviruses are translated from two subgenomic RNAs, sgRNA1 and sgRNA2. However, sgRNA2 has not been clearly detected. In this study, we have shown that sgRNA1, but not sgRNA2, is the major translation template for the three movement proteins. In addition, we determined the transcription start site of sgRNA1 in flexiviruses and found that the efficiency of leaky scanning caused by the short 5′ UTR of sgRNA1, a widely conserved feature, regulates the translation of the three movement proteins. When we tested the infection of viruses with mutations introduced into the length of the 5′ UTR, we found that the movement efficiency of the virus was affected. Our results provide important additional information on the protein translation strategy of flexiviruses, including Potexvirus, and provide a basis for research on their control as well as the need to reevaluate the short 5′ UTR as a translational regulatory element with an important role in vivo.
  3. Abstract The coronavirus disease 2019 (COVID-19) is a highly contagious and fatal disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). In general, the diagnostic tests for COVID-19 are based on the detection of nucleic acid, antibodies, and protein. Among different analytes, the gold standard of the COVID-19 test is the viral nucleic acid detection performed by the quantitative reverse transcription polymerase chain reaction (qRT-PCR) method. However, the gold standard test is time-consuming and requires expensive instrumentation, as well as trained personnel. Herein, we report an ultrasensitive electrochemical biosensor based on zinc sulfide/graphene (ZnS/graphene) nanocomposite for rapid and direct nucleic acid detection of SARS-CoV-2. We demonstrated a simple one-step route for manufacturing ZnS/graphene by employing an ultrafast (90 s) microwave-based non-equilibrium heating approach. The biosensor assay involves the hybridization of target DNA or RNA samples with probes that are immersed into a redox active electrolyte, which are detectable by electrochemical measurements. In this study, we have performed the tests for synthetic DNA samples and SARS-CoV-2 standard samples. Experimental results revealed that the proposed biosensor could detect low concentrations of all the different SARS-CoV-2 samples, using gene sequences such as S, ORF 1a, and ORF 1b as targets. This microwave-synthesized ZnS/graphene-based biosensor could be reliably used as an on-site, real-time, and rapid diagnostic test for COVID-19.
  4. Abstract Background

    Optimization of DNA and protein sequences based on machine learning models is becoming a powerful tool for molecular design. Activation maximization offers a simple design strategy for differentiable models: one-hot encoded sequences are first approximated by a continuous representation, which is then iteratively optimized with respect to the predictor oracle by gradient ascent. While elegant, the current version of the method suffers from vanishing gradients and may cause predictor pathologies leading to poor convergence.

    Results

    Here, we introduce Fast SeqProp, an improved activation maximization method that combines straight-through approximation with normalization across the parameters of the input sequence distribution. Fast SeqProp overcomes bottlenecks in earlier methods arising from input parameters becoming skewed during optimization. Compared to prior methods, Fast SeqProp results in up to 100-fold faster convergence while also finding improved fitness optima for many applications. We demonstrate Fast SeqProp’s capabilities by designing DNA and protein sequences for six deep learning predictors, including a protein structure predictor.

    Conclusions

    Fast SeqProp offers a reliable and efficient method for general-purpose sequence optimization through a differentiable fitness predictor. As demonstrated on a variety of deep learning models, the method is widely applicable, and can incorporate various regularization techniques to maintain confidence in the sequence designs. As a design tool, Fast SeqProp may aid in the development of novel molecules, drug therapies and vaccines.
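The basic activation-maximization loop described in the Background can be sketched as follows. This is the plain softmax-relaxation baseline, not Fast SeqProp: it omits the straight-through approximation and the normalization the abstract describes. The predictor is a toy linear position-weight score chosen so the gradient can be written by hand, and `oracle`, `maximize`, and the example weights are hypothetical.

```python
import math

ALPHABET = "ACGU"


def softmax(row):
    """Numerically stable softmax over one position's logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def oracle(probs, weights):
    """Toy differentiable predictor: sum over positions of sum_j p_j * w_j."""
    return sum(
        p * w for prow, wrow in zip(probs, weights) for p, w in zip(prow, wrow)
    )


def maximize(weights, steps=200, lr=1.0):
    """Gradient ascent on per-position logits of a relaxed sequence."""
    logits = [[0.0] * len(ALPHABET) for _ in weights]
    for _ in range(steps):
        for zrow, wrow in zip(logits, weights):
            p = softmax(zrow)
            mean_w = sum(pi * wi for pi, wi in zip(p, wrow))
            for j in range(len(zrow)):
                # d(score)/d(logit_j) = p_j * (w_j - sum_k p_k * w_k)
                zrow[j] += lr * p[j] * (wrow[j] - mean_w)
    # Discretize: take the highest-logit letter at each position.
    return "".join(
        ALPHABET[max(range(len(ALPHABET)), key=row.__getitem__)] for row in logits
    )


# Arbitrary example weights (rows = positions, columns = A, C, G, U).
toy_weights = [
    [0.1, 0.9, 0.0, 0.2],
    [0.5, 0.1, 0.7, 0.3],
    [0.0, 0.2, 0.1, 0.8],
]
print(maximize(toy_weights))
```

With a linear score the optimum is simply the best letter at each position, which makes the toy easy to check; the difficulties the abstract mentions arise with deep, nonlinear predictors.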

     
  5. The 5′ untranslated region (UTR) sequence of eukaryotic mRNAs may contain upstream open reading frames (uORFs), which can regulate translation of the main ORF (mORF). The current model of translational regulation by uORFs posits that when a ribosome scans an mRNA and encounters a uORF, translation of that uORF can prevent ribosomes from reaching the mORF and cause decreased mORF translation. In this study, we first observed that rare variants in the 5′ UTR dysregulate maize (Zea mays L.) protein abundance. Upon further investigation, we found that rare variants near the start codon of uORFs can repress or derepress mORF translation, causing allelic changes in protein abundance. This finding holds for common variants as well, and common variants that modify uORF start codons also contribute disproportionately to metabolic and whole-plant phenotypes, suggesting that translational regulation by uORFs serves an adaptive function. These results provide evidence for the mechanisms by which natural sequence variation modulates gene expression, and ultimately, phenotype.
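As a rough illustration of what a uORF is at the sequence level, the sketch below scans a 5′ UTR for AUG start codons and reports those followed by an in-frame stop codon within the UTR. It is a simplification and not the study's pipeline: it ignores start-codon context, non-AUG starts, and uORFs that run past the end of the UTR. `find_uorfs` and the example sequence are hypothetical.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def find_uorfs(utr5):
    """Return (start, end) index pairs for AUG...stop frames inside a 5' UTR.

    `end` is exclusive and includes the stop codon. DNA input is accepted
    and converted to RNA letters.
    """
    seq = utr5.upper().replace("T", "U")
    uorfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "AUG":
            continue
        # Walk codon by codon in the frame set by this AUG.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                uorfs.append((start, pos + 3))
                break
    return uorfs


example_utr = "GGAUGAAAUAAGG"  # made-up sequence with one AUG-AAA-UAA uORF
print(find_uorfs(example_utr))
```

A variant that creates or destroys one of these AUG codons changes the list returned, which is the kind of sequence change the abstract links to altered mORF translation.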