<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Predicting RNA Mutation Effects through Machine Learning of High-throughput Ribozyme Experiments (Student Abstract).</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10311321</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence - Student Abstract and Poster Program</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>J. Kitzhaber</author><author>A. Trapp</author><author>J. Beck</author><author>E. Serra</author><author>F. Spezzano</author><author>E. Hayden</author><author>J. Roberts</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[The ability to study "gain of function" mutations has important implications for identifying and mitigating risks to public health and national security associated with viral infections. Numerous respiratory viruses of concern have RNA genomes (e.g., SARS and flu). These RNA genomes fold into complex structures that perform several critical functions for viruses. However, our ability to predict the functional consequence of mutations in RNA structures continues to limit our ability to predict gain of function mutations caused by altered or novel RNA structures. Biological research in this area is also limited by the considerable risk of direct experimental work with viruses. Here we used small functional RNA molecules (ribozymes) as a model system of RNA structure and function. We used combinatorial DNA synthesis to generate all possible single mutations and pairs of mutations and used high-throughput sequencing to evaluate the functional consequence of each single- and double-mutant sequence. We used this data to train a Long Short-Term Memory model. This model was also used to predict the function of sequences found in the genomes of mammals with three mutations, which were not in our training set. We found a strong prediction correlation in all of our experiments.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>Unlike human genomes that are made of DNA, many viruses have genomes made of a molecule called RNA (ribonucleic acid). One important property of RNA molecules is that they can form various complex shapes that are very different from the well-known double helix of DNA. For viruses with RNA genomes, these RNA shapes can perform important functions, such as the binding of virus or human proteins that enable replication and infection. RNA molecules are formed from long chains of small molecular building blocks called nucleotides. The shape, also called structure, formed by a specific RNA molecule depends on the order of the connected nucleotides, which we call the RNA sequence. Changing the RNA sequence changes the shape, which changes the function. Random mutations that occur during virus replication change the sequence, which can in turn change the shape and function of the RNA. Some of these mutations, termed "gain of function", would make the virus better at replicating or infecting, which makes them more harmful. Unfortunately, our inability to predict which sequence changes will change RNA functions limits our ability to predict gain of function mutations. The ability to predict gain of function mutations in RNA could help us prepare for and respond to viral epidemics and pandemics.</p><p>Self-cleaving ribozymes are small functional RNA molecules that can be found in the genomes of all living organisms. Ribozymes are a good model of sequence-to-function relationships in RNA because changing the sequence of a ribozyme can change the function, and this can be studied safely and easily in the lab. "Gain of function mutations" enhance the ribozyme function, but the ribozymes are not infectious or harmful to humans. In addition, it is easy to make changes to the RNA sequence in the lab. 
Recent experimental advancements have made it possible to study millions of nucleotide changes to a ribozyme sequence all at once, providing rich data sets for learning and predicting which nucleotide changes result in gain of function mutations in this model RNA system.</p><p>Here we use such high-throughput sequence-function data from a ribozyme called CPEB3. This ribozyme was originally found in the human genome but now is known to have been present in the ancient genomes of the earliest mammals. The ribozyme is found in the genomes of all mammals but with some differences in the nucleotide sequences. This provides a good model system for predicting which nucleotide changes in mammals enhanced ribozyme function. Toward this goal, we first generated all the possible single nucleotide changes to the most ancient CPEB3 ribozyme and all the possible pairs of nucleotide changes, and evaluated the ribozyme function for all these sequences (&#8764;20,000 sequences). We used a portion of this data as a training set for a Long Short-Term Memory (LSTM) model. We used this model to predict the function of the withheld portion of the data and then to predict the function of sequences with three mutations that were not in our training data but are found in the genomes of other mammals. We analyzed the accuracy of our predictive model using the correlation between the predicted ribozyme activity and the lab-determined activity.</p><p>Previous work on functional prediction either predicts mutated sequences with the same number of mutations as in the training data <ref type="bibr">(Zhang et al. 2020;</ref> <ref type="bibr">Schmidt and Smolke 2021;</ref> <ref type="bibr">Calonaci et al. 2020)</ref> or relies on structural homology to predict, i.e., estimating how similar the sequence to be predicted is to the input sequence. To the best of our knowledge, this is the first time that a model trained on single and double mutations has been shown to effectively predict sequences with a larger number of mutations. 
This is possible because our LSTM model does not use homology but learns specific small patterns in the sequence that can also generalize to sequences with a higher number of mutations.</p></div>
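<div xmlns="http://www.tei-c.org/ns/1.0"><p>The library of single and double mutants described above can be illustrated with a minimal sketch. It is not the authors' combinatorial DNA synthesis protocol, only an enumeration of the same sequence space. The wild-type string below is a hypothetical placeholder, not the CPEB3 sequence, and its length of 69 nucleotides is an assumption chosen because it reproduces the per-replicate counts reported in the Methods; the function names are likewise illustrative.</p>
<p><code lang="python"><![CDATA[
```python
from itertools import combinations, product

NUCLEOTIDES = "ACGU"


def single_mutants(seq):
    """Yield every sequence that differs from seq at exactly one position."""
    for i, ref in enumerate(seq):
        for alt in NUCLEOTIDES:
            if alt != ref:
                yield seq[:i] + alt + seq[i + 1:]


def double_mutants(seq):
    """Yield every sequence that differs from seq at exactly two positions."""
    for i, j in combinations(range(len(seq)), 2):
        alts_i = [n for n in NUCLEOTIDES if n != seq[i]]
        alts_j = [n for n in NUCLEOTIDES if n != seq[j]]
        for a, b in product(alts_i, alts_j):
            yield seq[:i] + a + seq[i + 1:j] + b + seq[j + 1:]


# Hypothetical 69-nt placeholder; NOT the real CPEB3 ribozyme sequence.
wild_type = ("ACGU" * 18)[:69]

singles = list(single_mutants(wild_type))
doubles = list(double_mutants(wild_type))
print(len(singles), len(doubles))
```
]]></code></p>
<p>For a sequence of length n, this yields 3n single mutants and 9n(n-1)/2 double mutants. With n = 69 that is 207 and 21,114 sequences, respectively.</p></div>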
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Methods</head><p>Data Generation. The nucleotide changes to the CPEB3 ribozyme were made and studied through molecular biology techniques as follows. The RNA molecules were made in the lab through a process called in vitro transcription, where a protein is used to make multiple RNA copies of DNA "templates". Ribozymes are unique RNA sequences that cleave their own sequence while they are being made (self-cleaving). The DNA templates used in this study were actually a collection of numerous different DNA sequences, and the RNA that was made included many copies of every possible individual and pair of nucleotide changes of the CPEB3 ribozyme. Some of these nucleotide changes were expected to enhance the ribozyme function, such that more of the molecules will cleave while they are being made.</p><p>Other nucleotide changes were expected to diminish the ribozyme activity, such that fewer of these molecules will cleave while being made. All the different RNA sequences were made simultaneously so that the amount that they cleave could be directly compared. The amount cleaved was determined by sequencing all the RNA molecules (RNA-seq). This sequencing data reports on both the nucleotide changes that were present in a specific molecule and whether or not that specific molecule was cleaved. Because each nucleotide change was observed multiple times, counting the number of cleaved and not cleaved molecules was used to determine the function of each sequence. Specifically, the function was defined as the fraction cleaved, F = (count cleaved) / (count total).</p><p>The lab experiments were replicated three times because the molecular biology methods used involve several stochastic processes. Each replicate was used separately to train and test the prediction model and contains 207 sequences with one mutation and 21,114 sequences with two mutations.</p><p>Prediction Model. 
We built a machine learning model to predict how nucleotide changes will change the ribozyme function (fraction cleaved). Each dataset contains nucleotide sequences and the associated fraction cleaved determined experimentally, which we used as ground truth. Because the fraction cleaved is a real number in [0,1], we addressed the problem as a regression task. Given that the input is a sequence of nucleotides, we used a Long Short-Term Memory model to predict the fraction cleaved. We processed the input sequence of nucleotides as a text sequence. We used the root mean square error (RMSE) as the loss function. We used the R² score, RMSE, and Pearson correlation coefficient (PCC) to measure the performance of the proposed model.</p></div>
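<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a minimal sketch of the quantities named in the Methods, the following computes the fraction cleaved from read counts and then the three evaluation metrics (RMSE, R², and PCC) using their standard textbook definitions. The read counts and predicted values are made-up illustrative numbers, not data from this study, and the function names are hypothetical; the LSTM itself is not reproduced here.</p>
<p><code lang="python"><![CDATA[
```python
import math


def fraction_cleaved(count_cleaved, count_total):
    """F = count cleaved / count total, a value in [0, 1]."""
    if count_total <= 0:
        raise ValueError("count_total must be positive")
    return count_cleaved / count_total


def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    sq = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(sq / len(y_true))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def pearson(y_true, y_pred):
    """Pearson correlation coefficient (PCC)."""
    mean_t = sum(y_true) / len(y_true)
    mean_p = sum(y_pred) / len(y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


# Hypothetical (cleaved, total) read counts for four variants.
counts = [(80, 100), (10, 100), (45, 90), (30, 120)]
measured = [fraction_cleaved(c, t) for c, t in counts]

# Hypothetical model predictions for the same four variants.
predicted = [0.7, 0.2, 0.5, 0.3]

print("RMSE:", rmse(measured, predicted))
print("R2:  ", r2_score(measured, predicted))
print("PCC: ", pearson(measured, predicted))
```
]]></code></p>
<p>Note that PCC measures only linear association, whereas R² also penalizes systematic offsets between predicted and measured values, so a high PCC can coexist with a noticeably lower R².</p></div>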
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Experiments and Results</head><p>We conducted two types of experiments. In the first experiment (Experiment 1), we trained and predicted on individual and pairs of nucleotide mutations. We performed 10-fold cross-validation for each of the Replicates. Results are shown in Table <ref type="table">1</ref> (first three rows). In the second experiment (Experiment 2), we wanted to see if a model learned on one and two mutations was able to predict the function of sequences with a higher number of mutations (three in the considered case). Hence, we used all the Replicate data as the training set and, as the test set, sequences with three mutations found in the genomes of mammals, which were not in our training set (518 sequences). Results are shown in Table 2 (second three rows). We found that there was a strong prediction correlation in all models. For both Experiments 1 and 2, the Pearson correlation coefficient on Replicates 2 and 3 was &gt;0.9, and the R² ranged from 0.64 to 0.84. These results indicate that tractable high-throughput experiments in the lab can be used to determine the effect of combinations of mutations found to evolve in nature. Results were slightly worse on Replicate 1. Because of variation in the total number of copies for all mutational variants within each replicate, the average number of identified copies per variant is different. Additionally, because this difference is not uniformly distributed across all mutational variants, some variants' counts may be less accurate due to low copy volume. Consequently, we should not expect that each replicate will be of equal utility for downstream predictions. Overall, these preliminary results are promising for predicting "gain of function" mutations in other settings, such as RNA viruses, which could guide RNA vaccine production.</p></div></body>
		</text>
</TEI>
