Title: Generative and interpretable machine learning for aptamer design and analysis of in vitro sequence selection
Selection protocols such as SELEX, where molecules are selected over multiple rounds for their ability to bind to a target of interest, are popular methods for obtaining binders for diagnostic and therapeutic purposes. We show that Restricted Boltzmann Machines (RBMs), an unsupervised two-layer neural network architecture, can successfully be trained on sequence ensembles from single rounds of SELEX experiments for thrombin aptamers. RBMs assign scores to sequences that can be directly related to their fitnesses estimated through experimental enrichment ratios. Hence, RBMs trained from sequence data at a given round can be used to predict the effects of selection at later rounds. Moreover, the parameters of the trained RBMs are interpretable and identify functional features contributing most to sequence fitness. To exploit the generative capabilities of RBMs, we introduce two different training protocols: one taking into account sequence counts, capable of identifying the few best binders, and another based on unique sequences only, generating more diverse binders. We then use the RBM models to generate novel aptamers with putative disruptive mutations or good binding properties, and validate the generated sequences with gel shift assay experiments. Finally, we compare the RBM’s performance with different supervised learning approaches that include random forests and several deep neural network architectures.
Award ID(s):
2155095
NSF-PAR ID:
10402199
Author(s) / Creator(s):
; ; ; ; ; ; ; ;
Editor(s):
Li, Jinyan
Date Published:
Journal Name:
PLOS Computational Biology
Volume:
18
Issue:
9
ISSN:
1553-7358
Page Range / eLocation ID:
e1010561
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Antibodies are important biomolecules that are often designed to recognize target antigens. However, they are expensive to produce and their relatively large size prevents their transport across lipid membranes. An alternative to antibodies is aptamers, short ([Formula: see text] bp) oligonucleotides (and amino acid sequences) with specific secondary and tertiary structures that govern their affinity to specific target molecules. Aptamers are typically generated via solid phase oligonucleotide synthesis before selection and amplification through Systematic Evolution of Ligands by EXponential enrichment (SELEX), a process based on competitive binding that enriches the population of certain strands while removing unwanted sequences, yielding aptamers with high specificity and affinity to a target molecule. Mathematical analyses of SELEX have been formulated in the mass action limit, which assumes large system sizes and/or high aptamer and target molecule concentrations. In this paper, we develop a fully discrete stochastic model of SELEX. While converging to a mass-action model in the large system-size limit, our stochastic model allows us to study statistical quantities when the system size is small, such as the probability of losing the best-binding aptamer during each round of selection. Specifically, we find that optimal SELEX protocols in the stochastic model differ from those predicted by a deterministic model.

     
  2. Abstract

    With the growing importance of the field of RNA biology, undergraduates need to perform RNA‐related research. Systematic evolution of ligands by exponential enrichment (SELEX) has become an important method in RNA biology. The principles of SELEX were applied to a semester‐long course‐based undergraduate research experience (CURE) in which two rounds of in vivo functional selection of regions of a viral RNA were performed. As the labwork had an unknown outcome, students indicated that they were excited by the work and became invested in the experience. By completing two rounds of SELEX, the students repeated molecular methods (e.g., RNA extraction, RT‐PCR, agarose gel electrophoresis, DNA purification, cloning, and sequence analysis) and reported that repetition reinforced their learning and helped them build confidence in their lab abilities. Students also appreciated that they did not learn a “technique‐per‐week” without context, but rather they understood why certain methods were used for certain molecular tasks. Results from a 19‐question multiple‐choice assessment indicated increased comprehension of theory underlying methods performed. Details regarding experimental methods and timeline, and assessment and attitudinal results from three student cohorts, are described herein.

     
  3. Ribozymes are RNA molecules that catalyze biochemical reactions. Self-cleaving ribozymes are a common naturally occurring class of ribozymes that catalyze site-specific cleavage of their own phosphodiester backbone. In addition to their natural functions, self-cleaving ribozymes have been used to engineer control of gene expression because they can be designed to alter RNA processing and stability. However, the rational design of ribozyme activity remains challenging, and many ribozyme-based systems are engineered or improved by random mutagenesis and selection (in vitro evolution). Improving a ribozyme-based system often requires several mutations to achieve the desired function, but extensive pairwise and higher-order epistasis prevents the simple prediction of the effect of multiple mutations that rational design requires. Recently, high-throughput sequencing-based approaches have produced data sets on the effects of numerous mutations in different ribozymes (RNA fitness landscapes). Here we used such high-throughput experimental data from variants of the CPEB3 self-cleaving ribozyme to train a predictive model through machine learning approaches. We trained models using either a random forest or long short-term memory (LSTM) recurrent neural network approach. We found that models trained on a comprehensive set of pairwise mutant data could predict active sequences at higher mutational distances, but the correlation between predicted and experimentally observed self-cleavage activity decreased with increasing mutational distance. Adding sequences with increasingly higher numbers of mutations to the training data improved the correlation at increasing mutational distances. Systematically reducing the size of the training data set suggests that a wide distribution of ribozyme activity may be the key to accurate predictions.
Because the model predictions are based only on sequence and activity data, the results demonstrate that this machine learning approach allows readily obtainable experimental data to be used for RNA design efforts even for RNA molecules with unknown structures. The accurate prediction of RNA functions will enable a more comprehensive understanding of RNA fitness landscapes for studying evolution and for guiding RNA-based engineering efforts. 
  4. Rapid and accurate diagnosis of various biomarkers associated with medical conditions, including early detection of viruses and bacteria with highly sensitive biosensors, is currently a research priority. An aptamer is a chemically derived recognition molecule capable of detecting and binding small molecules with high specificity. Its fast preparation time, cost effectiveness, ease of modification, and stability at high temperature and pH are some of the advantages it has over traditional detection methods such as High Performance Liquid Chromatography (HPLC), Enzyme-linked Immunosorbent Assay (ELISA), and Polymerase Chain Reaction (PCR). Higher sensitivity and selectivity can further be achieved by coupling aptamers with nanomaterials; these conjugates, called “aptasensors”, are receiving greater attention in early diagnosis and therapy. This review highlights the selection protocol of aptamers based on traditional Systematic Evolution of Ligands by EXponential enrichment (SELEX) and the various types of modified SELEX. We further identify both the advantages and drawbacks associated with the modified versions of SELEX. Furthermore, we describe the current advances in aptasensor development and review the quality of signal types, which depend on surface area and other specific properties of the selected nanomaterials.
  5. Existing rate adaptation protocols have advocated training to establish the relationship between channel conditions and the optimum modulation and coding scheme. However, in-field operation inherently involves scenarios that the rate adaptation mechanism has not yet encountered. Frequently, protocols are optimally tuned for indoor environments but, when taken outdoors, perform poorly. Namely, the decision structure formed by offline training lacks the ability to adapt to a new situation on the fly. The changing wireless environment calls for a rate adaptation scheme that can quickly infer the channel type and adjust accordingly. Typical SNR-based rate adaptation schemes do not capture the nuance of the performance variable in different channel types. In this paper, we propose a novel scheme that allows SNR-based rate selection algorithms to be trained online in the environment in which they are operating. Inspired by the idea that, to do well, an athlete must train for the type of athletic event and environment in which they are competing, we propose FIT, an on-the-fly, in-situ training mechanism for SNR-based protocols. To do so, we first propose the FIT framework, which addresses the challenges of making rate decisions given the unpredictable fluctuation and lack of repeatability of real wireless channels. To distinguish between channel types in the training, we then characterize wireless channels according to the link-layer performance and introduce a novel, computationally efficient channel performance manifold matching technique to infer the channel type given a sequence of throughput measurements for various link-level parameters. To evaluate our methods, we implement rate selection that uses FIT for training alongside channel performance manifold matching.
We then perform extensive experiments on emulated and in-field wireless channels to evaluate the online learning process, showing that the rate decision structure can be updated as channel conditions change using existing traffic flows. The experiments are performed over multiple frequency bands. The proposed FIT framework can achieve large throughput gains compared to traditional SNR-based protocols (8X) and offline-training-based methods (1.3X), particularly in dynamic wireless propagation environments that lack appropriate training.