<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Investigating Active Learning and Meta-Learning for Iterative Peptide Design</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>01/25/2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10231564</idno>
					<idno type="doi">10.1021/acs.jcim.0c00946</idno>
					<title level='j'>Journal of Chemical Information and Modeling</title>
<idno>1549-9596</idno>
<biblScope unit="volume">61</biblScope>
<biblScope unit="issue">1</biblScope>					

					<author>Rainier Barrett</author><author>Andrew D. White</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Often the development of novel functional peptides is not amenable to high throughput or purely computational screening methods. Peptides must be synthesized one at a time in a process that does not generate large amounts of data. One way this method can be improved is by ensuring that each experiment provides the best improvement in both peptide properties and predictive modeling accuracy. Here, we study the effectiveness of active learning, optimizing experiment order, and meta-learning, transferring knowledge between contexts, to reduce the number of experiments necessary to build a predictive model. We present a multitask benchmark database of peptides designed to advance these methods for experimental design. Each task is a binary classification of peptides represented as a sequence string. We find neither active learning method tested to be better than random choice. The meta-learning method Reptile was found to improve the average accuracy across data sets. Combining meta-learning with active learning offers inconsistent benefits.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#9632; INTRODUCTION</head><p>Although great strides have been made in predictive computational modeling where large data sets exist, the use of computational modeling where data is scarce and/or expensive has been limited. Data expense and scarcity is the norm in materials design. <ref type="bibr">1</ref> Computer-aided design often relies on physics-based predictive methods, <ref type="bibr">2,</ref><ref type="bibr">3</ref> which require no data, but cannot predict complex properties like activity of a drug or refractive index of a thin film. Even if physics-based modeling is used, there is no clear mechanism to improve predictions as data is gathered in the course of testing. Thus, human intuition is often the state-of-the-art method for choosing which new peptides to test when there are small amounts of data.</p><p>Peptides are a popular target for biomaterials design. <ref type="bibr">[4]</ref><ref type="bibr">[5]</ref><ref type="bibr">[6]</ref><ref type="bibr">[7]</ref><ref type="bibr">[8]</ref><ref type="bibr">[9]</ref> Unlike most synthetic polymers, the amino acid sequence of a desired peptide can be controlled at the monomer level to form specific sequences. <ref type="bibr">10</ref> Being biologically derived polymers, peptides are generally biocompatible. They are relatively easy to synthesize and some undergo self-assembly to form ordered structure spontaneously. <ref type="bibr">11,</ref><ref type="bibr">12</ref> Creating or coating an object with functional peptides is an appealing way to create biocompatible materials with various applications in medicine, <ref type="bibr">13,</ref><ref type="bibr">14</ref> drug delivery, <ref type="bibr">15,</ref><ref type="bibr">16</ref> and more. 
<ref type="bibr">6,</ref><ref type="bibr">7,</ref><ref type="bibr">12,</ref><ref type="bibr">17,</ref><ref type="bibr">18</ref> Here, we apply active learning to binary classification of peptides across a variety of binary tasks like predicting solubility or activity against bacteria. We examine two standard active learning methods: query by committee (QBC) <ref type="bibr">19</ref> and uncertainty minimization <ref type="bibr">20</ref> with supervised learning of a deep convolutional neural network. We also examine meta-learning to see if there is a benefit in transferring knowledge from one task to another. We only allow our active learners to choose from the data set of labeled peptides (rather than generate new sequences), so that we can accurately assess the selection during training. However, these methods are ultimately evaluated not based on the peptide chosen but the accuracy of the resulting trained model. The rationale is that there are many competing design constraints in peptide design (e.g., synthetic feasibility, cost, bioavailability, etc.), and thus, it is better to have an accurate model than a finite set of examples proposed to be active.</p><p>Computer-aided design of bioactive peptides is an established field of research with a variety of approaches, spanning from quantitative structure-activity relationship (QSAR) modeling to machine learning. For example, Franco et al. <ref type="bibr">21</ref> used a genetic algorithm to optimize an antimicrobial peptide derived from the guava plant, and Hancock et al. <ref type="bibr">22</ref> used a QSAR model to find antibiofilm peptides. For further information on peptide design principles and modeling efforts, see Torres et al., <ref type="bibr">23</ref> Fjell et al., <ref type="bibr">24</ref> and Gromski et al. 
<ref type="bibr">25</ref> The goal of this work is not to compete with these methods but to assess active learning and meta-learning as potential ways to improve iterative discovery of peptides in this setting.</p><p>Active learning has a history as an extension to design of experiments, which is about choosing the optimal experiments to do with limited resources. Our concern is a sequence of experiments where the results of the previous experiments influence our decision of the next, whereas optimal design of experiments is about choosing the best experiment prior to beginning and assumes a linear model. Active learning is this process of choosing the next experiment optimally. <ref type="bibr">26</ref> It is sometimes called optimal experimental design, <ref type="bibr">27</ref> targeted experiment design, <ref type="bibr">28</ref> sequential design of hypotheses, <ref type="bibr">29</ref> optimal learning, <ref type="bibr">30</ref> and artificial intelligence scientific discovery <ref type="bibr">31</ref> depending on the goal and problem context.</p><p>Within this framework, there are a variety of approaches depending on the form of the task model and utility function. If the task model is probabilistic, like above, utility functions which maximize information gain, <ref type="bibr">27,</ref><ref type="bibr">32</ref> reduce model uncertainty, <ref type="bibr">33</ref> maximize expected model change, <ref type="bibr">26</ref> or reduce model variance <ref type="bibr">34</ref> can be chosen. 
Within variance reduction methods, there are so-called A-optimal, <ref type="bibr">34</ref> D-optimal, <ref type="bibr">34,</ref><ref type="bibr">35</ref> and E-optimal <ref type="bibr">36</ref> approaches which minimize the covariance matrix according to different assumptions. If the task model is nonparametric, Bayesian approaches are well suited. <ref type="bibr">28</ref> One still has a choice of utility function (also known as acquisition function) and can, for example, maximize the expected model improvement at each experiment. <ref type="bibr">35,</ref><ref type="bibr">37</ref> Bayesian approaches can work with recent deep learning methods through Bayesian convolutional neural networks <ref type="bibr">38</ref> or through the connection between dropout and neural network uncertainty. <ref type="bibr">39</ref> If the task model is not stochastic but there is flexibility in parameter choice or multiple models to choose from, then there are a variety of pooled or consensus active learning approaches. Query By Committee (QBC) is an active learning approach that maintains a committee of models and chooses the next experiment based on where the committee disagrees most. <ref type="bibr">40</ref> This "disagreement" is quantified using one of the various utility functions described above, for each committee member, and combining the results in some way, for example, using summation or taking the mean. There are also active learning pool methods for specific model types, for example, k-nearest neighbor, <ref type="bibr">41</ref> logistic regression, <ref type="bibr">42,</ref><ref type="bibr">43</ref> and linear regression with noise. <ref type="bibr">44</ref> One can also treat the choice of model as a probability distribution, and then, the pool of models can be viewed as a stochastic or Bayesian model to use any of the previous utility functions.</p><p>There are active learning methods that are independent of the task model and instead find a characteristic set of data. 
<ref type="bibr">45</ref> This can be done by clustering the data, optimizing a function which describes local variance, <ref type="bibr">46</ref> finding regions of high uncertainty, <ref type="bibr">47</ref> or by finding a characteristic subset of data called a "core set" via submodular function optimization. <ref type="bibr">41,</ref><ref type="bibr">48</ref> These methods are closely related to semisupervised learning, which tries to use unlabeled data to improve a model when labeled data is sparse. <ref type="bibr">49,</ref><ref type="bibr">50</ref> Another closely related topic to active learning is Bayesian optimization or global optimization of black box functions. There the goal is to optimize a function while evaluating it a minimum number of times. To connect this to active learning described above, view the "expensive function" as the experiment. A surrogate stochastic model is constructed, often through bootstrapping or nonparametric models, and that surrogate model is used within an active learning framework to reduce the number of function evaluations. <ref type="bibr">51</ref> Then, active learning is applied so that each function evaluation improves the maximum function value and/or understanding of the model. This method can be equivalent to the variance reduction techniques discussed above if the same utility functions are chosen. <ref type="bibr">52</ref> More sophisticated active learning methods can be used with the surrogate model, including Gaussian process regression and complex portfolios of acquisition (utility) functions. <ref type="bibr">53,</ref><ref type="bibr">54</ref> A common critique of Bayesian optimization is that it typically scales between O(n<hi rend="superscript">2</hi>) and O(n<hi rend="superscript">3</hi>), depending on approximations made, and struggles with high-dimensional surrogate models. However, this is not applicable here, since our goal is learning in systems with dozens of experiments, not thousands. 
A recent overview on the application of Bayesian optimization to materials design can be found in Frazier and Wang. <ref type="bibr">55</ref> An early example of active learning in chemistry can be found with van de Walle and Ceder, <ref type="bibr">56</ref> where a phase diagram was explored using variance minimization active learning on cluster expansions. Active learning works well, in general, with choosing cluster expansions via variance minimization. <ref type="bibr">57,</ref><ref type="bibr">58</ref> More recently, Lookman et al. <ref type="bibr">59</ref> explored elastic properties with ab initio calculations using a Bayesian optimization technique (Efficient Global Optimization) to optimize the ratio of three metals. The authors applied the same method to design new piezoelectrics <ref type="bibr">60</ref> with a four-dimensional design space. Gopakumar et al. <ref type="bibr">30</ref> also showed how active learning methods that balance exploration and exploitation can work well on two- and three-dimensional systems. This active learning approach to finding compositions with Bayesian optimization is quickly gaining popularity in the informatics community for low-dimensional systems. <ref type="bibr">[61]</ref><ref type="bibr">[62]</ref><ref type="bibr">[63]</ref> Kim et al. <ref type="bibr">64</ref> explored active learning methods to find polymers with a specific glass transition temperature. This is a high-dimensional system because the polymers are represented with a variety of descriptors. It was made tractable by treating a list of 731 possible polymers as known a priori. At each step, the model evaluates all 731 possible polymers. An earlier example using a fixed molecule set can also be found in Warmuth et al., <ref type="bibr">65</ref> where candidate drug molecules were selected from a vendor catalogue with a variance minimization active learning algorithm.</p><p>Recently, Tallorin et al. 
<ref type="bibr">66</ref> explored the use of Bayesian optimization for de novo design of peptide substrates. Their work is similar in that both this work and theirs used a sequence model with the goal of minimizing the number of experiments required to train a model. The differences are that they were modeling with regression, allowed for complete choice in sequence space, did not have a goal of creating accurate models, and did not use a deep learning model but a Na&#239;ve Bayes classifier. Their work is an experimentally validated demonstration that intelligently designing experiments with predictive models can reduce the required number of peptides that need to be synthesized and tested.</p><p>Another topic explored in this work is meta-learning. Meta-learning is a technique for improving few-shot learning across multiple tasks. The goal, in our nomenclature above, is to make the task model depend on hyperparameters &#958; that are trained to work well across multiple tasks. Then, on a new task, using &#958;&#770;, few new examples are required to improve performance. Active learning and meta-learning are connected because both are concerned with maximizing the value of data. As the goal is to minimize the number of experiments required for new systems, we find it natural to consider this method on our data set. Examples of meta-learning for accelerating task models can be found in transfer learning, <ref type="bibr">67</ref> few-shot learning, <ref type="bibr">[68]</ref><ref type="bibr">[69]</ref><ref type="bibr">[70]</ref> and automated machine learning. <ref type="bibr">71</ref> One application that has recently connected active learning and meta-learning is in model-free reinforcement learning. <ref type="bibr">72,</ref><ref type="bibr">73</ref> Pang et al. <ref type="bibr">74</ref> and Fang et al. 
<ref type="bibr">75</ref> have also explored the connection between active learning and meta-reinforcement learning.</p></div>
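<div xmlns="http://www.tei-c.org/ns/1.0"><p>The query-by-committee idea described in this section can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the implementation used in this work: it assumes each committee member returns a positive-class probability for every candidate peptide, measures disagreement as the variance of those probabilities across members (one of several possible utility functions), and selects the not-yet-labeled candidate with the largest disagreement. The names committee_disagreement and query_by_committee are hypothetical.</p>
<ab><![CDATA[
```python
from statistics import pvariance


def committee_disagreement(committee_probs):
    """Return one disagreement score per candidate.

    committee_probs[m][j] is the positive-class probability that committee
    member m assigns to candidate j (an assumed input format).
    Disagreement is taken here to be the population variance across members.
    """
    n_candidates = len(committee_probs[0])
    return [
        pvariance([member[j] for member in committee_probs])
        for j in range(n_candidates)
    ]


def query_by_committee(committee_probs, already_labeled):
    """Pick the index of the unlabeled candidate the committee disagrees on most."""
    scores = committee_disagreement(committee_probs)
    unlabeled = [j for j in range(len(scores)) if j not in already_labeled]
    if not unlabeled:
        raise ValueError("no unlabeled candidates remain")
    # max() returns the first index among ties, so selection is deterministic.
    return max(unlabeled, key=lambda j: scores[j])


if __name__ == "__main__":
    # Three committee members scoring four candidate peptides (made-up numbers).
    probs = [
        [0.9, 0.2, 0.5, 0.1],
        [0.8, 0.9, 0.5, 0.2],
        [0.9, 0.1, 0.6, 0.1],
    ]
    # Candidate 3 has already been labeled, so it is excluded from selection.
    print(query_by_committee(probs, already_labeled={3}))
```
]]></ab>
<p>Using argmax here corresponds to the classical formulation; a sampling-based variant would instead draw a candidate with probability proportional to its disagreement score.</p></div>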
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#9632; METHODS</head><p>Functional peptide data sets were prepared by mapping each amino acid to a one-hot encoded vector of dimension [N &#215; 20], where N is the length of the peptide. Five past published databases were used to generate training data sets in this work: the Antimicrobial Peptide Database (APD), <ref type="bibr">76</ref> the collection of protein surface fragments from White et al., <ref type="bibr">77</ref> the PROSO II database, <ref type="bibr">78</ref> library hits with activity against TULA-2 protein, <ref type="bibr">79</ref> and library hits with activity against SHP-2. <ref type="bibr">80</ref> The APD contains peptides flagged with a variety of activities, such as antibacterial, antifungal, antiviral, anticancer, and hemolytic activities. The PROSO II database contains peptides and proteins categorized as soluble or insoluble. The TULA-2 and SHP-2 libraries contain fixed-width peptides optimized for binding to a specific target. Eight data sets were chosen from the APD: (1) "antibacterial", (2) "anticancer", (3) "antifungal", (4) "anti-HIV", (5) "anti-MRSA", (6) "antiparasital", (7) "antiviral", and (8) "hemolytic". All sequences above length 200 were excluded. This wide variety of tasks represents a range of modeling goals in peptide research. Table <ref type="table">1</ref> shows the negative data sets used for training each classifier. Here, "human" refers to the aforementioned collection of human protein surface fragments.</p><p>Active learning is typically formulated as an optimization problem. <ref type="bibr">26</ref> Consider N observation pairs (x<hi rend="subscript">i</hi>, y<hi rend="subscript">i</hi>) of features and labels, respectively, with i &#8712; [0, 1, ..., N] indicating the order of observation. Assume that y<hi rend="subscript">i</hi> is a class label, and x<hi rend="subscript">i</hi> is a feature vector of real numbers. 
We have a task model, <formula notation="TeX">P_{\theta_i}(y \mid x)</formula>, that assigns a probability to each class label for a feature x and is defined by parameters &#952;<hi rend="subscript">i</hi>, which are updated after each new observation. In this work, P<hi rend="subscript">&#952;</hi> is a deep-learning convolutional neural network. &#952;<hi rend="subscript">i</hi> is updated according to some training procedure after a new (x<hi rend="subscript">i</hi>, y<hi rend="subscript">i</hi>) pair is observed. In active learning, we choose x<hi rend="subscript">i+1</hi> from our fixed data set of x, y pairs according to <formula notation="TeX">x_{i+1} = \operatorname{arg\,max}_{x_j} A_{\psi}\left[ P_{\theta_i}(y_j \mid x_j),\, x_j \right] \quad (1)</formula> where A<hi rend="subscript">&#968;</hi>(&#8226;) is a functional of the task model and possibly x. The j subscript only indicates that for a given x<hi rend="subscript">j</hi> we must use the corresponding y<hi rend="subscript">j</hi> to evaluate the functional. A<hi rend="subscript">&#968;</hi>(&#8226;) can be defined by parameters &#968;, although it is normally fixed. For example, A<hi rend="subscript">&#968;</hi>(&#8226;) could be the most uncertain point, which gives <formula notation="TeX">x_{i+1} = \operatorname{arg\,max}_{x_j} \left[ 1 - P_{\theta_i}(\hat{y} \mid x_j) \right] \quad (2)</formula> where &#375; is the most likely class label for x under <formula notation="TeX">P_{\theta_i}</formula>. <ref type="bibr">20</ref> A<hi rend="subscript">&#968;</hi>(&#8226;) is called the acquisition function or utility function depending on the problem setting. <ref type="bibr">26</ref></p><p>Here, the task model is a deep convolutional neural network classifier, necessitating negative training examples as well as positive. One corresponding negative training data set was generated for each positive data set. Two types of negative data were generated: scrambled data with amino acid distributions identical to the "soluble" data set but length distributions identical to the corresponding positive data set, and samples from data sets which are expected to have no intersection with positive examples (Table <ref type="table">1</ref>). These are expected to be rather challenging negative examples, since scrambled peptides likely have many physical properties of the positive examples. Generally, the nonintersecting data sets are also naturally occurring peptides that likely are biologically relevant. 
To generate the scrambled negative data set, a number of peptides were generated randomly, with lengths sampled from the same length range as their respective positive set, and residues sampled from the frequency distribution of the soluble data set.</p><p>The nonintersecting examples are shown in Table <ref type="table">1</ref> along with rationale in the caption. To balance classes, the negative data sets were sampled down to be the same size as the corresponding set of positive examples.</p><p>Peptide sequences are encoded as one-hot vectors of dimension N &#215; 20 (with N the length of the peptide), where the nth row has a 1 at the index of the residue at position n within the alphabet of amino acids: [A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V]. Activity was encoded as a one-hot label vector of length 2, where [1, 0] indicates a positive label and [0, 1] indicates a negative label.</p><p>The task model is a convolutional neural network whose structure was partially based on that of a model from our past work in peptide modeling. <ref type="bibr">81</ref> The model structure is shown schematically in Figure <ref type="figure">1</ref>. The first layer of the neural network is a convolutional layer with a weight matrix of dimension [W &#215; A &#215; K], where W and K conceptually represent peptide "motif widths" and number of "motif classes," respectively, and A is the length of the amino acid alphabet considered (20 for the naturally occurring peptides used here). The next layer is a max pool layer, which captures which "motif class" is most likely in the input peptide by pooling across the peptide's length dimension, leaving a 1 &#215; K vector. The output of the max pool layer is concatenated with the relative frequency of each amino acid in the input peptide (1 &#215; 20 vector) and standardized (0-1) sequence length. This is then fed to three fully connected (FC) layers with the ReLU activation function and then to one FC layer with softmax activation for classification, ensuring that the outgoing vector of size 2 adds up to 1, since this vector is meant to represent the likelihood of assignment to the positive or negative classes. The final output is compared with the true label vector (classification) or the true activity (regression), and loss is calculated as the cross-entropy between the two. The minimization algorithm used during training was TensorFlow's <ref type="bibr">82</ref> Adam optimizer with parameters recommended in Kingma and Ba <ref type="bibr">83</ref> and a learning rate of 0.001. This architecture is relatively simple compared to state-of-the-art deep learning for sequence prediction tasks of peptides. Recently, Veltri et al. <ref type="bibr">84</ref> developed a deep learning model for predicting antimicrobial activity. As our goal is to explore active and meta-learning, we use a simpler architecture. 
Namely, we do not use embeddings, and we use smaller and fewer filters and fully connected layers after features instead of a recurrent neural network.</p><p>The method of uncertainty minimization is designed to favor exploration to maximize information gained per training iteration of a model. This is achieved by choosing a new training point based on some measure of the uncertainty of the machine learning model used. In this sense, it is well aligned with an eventual goal of automated experiment selection because it can minimize the number of necessary experiments to characterize a property space well. This, in turn, would lead to a reduction in operating costs and time spent performing experiments.</p><p>For uncertainty minimization training, the model described previously was used, with W and K both chosen to be 6. Before training, model weights are randomly initialized. The model uncertainty is calculated as the variance of the output vector of the neural network for each peptide. One peptide is then sampled, with probabilities of selection for each peptide weighted by their respective variances under the current model parameters, i.e., p(1 - p). This differs from eq 1, which uses argmax rather than sampling. The chosen peptide is then used to train the neural network for one training iteration; then, an additional 16 training iterations are performed with batch size 16, sampling uniformly from all previously observed peptides for training input. Thus, in the first iteration, the first point is used for 50 training steps; then, in the second, the first and second points are each used an expected 12.5 times, etc. This process is repeated for 50 training epochs for one training run. To gather statistics, 30 training runs of 50 epochs each were performed for each data set. 
We also investigated model performance after 100 training runs of 10 epochs to evaluate the few-shot learning capability of the various combinations of active and meta-learning methods used here.</p><p>The QBC method employed in this work uses the same approach as uncertainty minimization, but instead of a single task model that is trained and used for selection, a committee of 10 models is used. The nine additional models are all structured in the same way as the model used in uncertainty minimization but use different hyperparameters. They differ in the dimensions of the weight matrix used in the convolution layer, having all combinations of 3 &#8804; W &#8804; 5 and 3 &#8804; K &#8804; 5 along with the W = 6, K = 6 model used in uncertainty minimization. In QBC, input data are passed to each committee member (task model), and each one produces a prediction. The average variance among all models is used as the sampling weight for selecting a new training point, and training is performed in the same way as for uncertainty minimization, with the same number of epochs and training runs.</p><p>To compare these active learning methods against a base case, we use two control training methods with the same model as used for the uncertainty minimization method. The first control is a "baseline" Adam training, where the model trains on all peptides for 5000 steps with a batch size of 32. The second is a more direct comparison to the active learning methods called "random", where peptides are chosen randomly, and the model is trained in the same way as in the active learning methods (batch size 16, 16 iterations, 50 epochs).</p><p>The meta-learning method used in this work is Reptile from Nichol et al. <ref type="bibr">85</ref> It is related to the work by Finn et al., <ref type="bibr">86</ref> called model agnostic meta-learning. 
In our notation, the goal of meta-learning is to optimize the initial parameters of the task model, &#952;<hi rend="subscript">0</hi>, to work well given a sampled task &#964;. P(&#964;) is taken to be uniform across our data sets. &#952;<hi rend="subscript">0</hi> is optimized by Adam optimization of a meta-objective function <formula notation="TeX">\min_{\theta_0} \mathbb{E}_{\tau \sim P(\tau)} \left[ \mathcal{L}_{\tau}\left( U^{J}_{\tau,\psi}(\theta_0) \right) \right] \quad (3)</formula> where &#964; is the data set corresponding to a set (x, y), &#968; are the parameters defining the active learning method A<hi rend="subscript">&#968;</hi>(&#8226;), and &#952;<hi rend="subscript">0</hi> is the given initial task model parameters. <formula notation="TeX">\mathcal{L}_{\tau}</formula> is the usual loss function, and U is a stand-in for doing J steps of active learning training with A<hi rend="subscript">&#968;</hi>(&#8226;). The gradient of this meta-objective requires a Hessian, but Reptile approximates this with a Taylor expansion. In this work, J = 16, meaning we train with active learning on 16 peptides each time with a batch size of 16. Here, 2500 meta-learning iterations were done, and then, early stopping was used to prevent overfitting. This was done on an 80% split of the left-out data set, and then, results were reported on the 20% remaining sequences of the left-out data set.</p><p>AUC values and accuracies are reported on withheld data, which was 20% of the data set size. Training curves are shown in Figures <ref type="figure">2</ref> and <ref type="figure">4</ref>, and exact final values for withheld set accuracy and ROC AUC values are reported in the Supporting Information in Tables <ref type="table">S1-S4</ref>. During meta-learning, the location of active/inactive labels (first or second index) was swapped between tasks to prevent overfitting to active being in one position or another. This gives minimal zero-shot accuracy but helps illustrate the rate of training, as shown in Figures <ref type="figure">2</ref> and <ref type="figure">4</ref>. The minimized loss function was cross entropy between label probabilities and true labels. Error bars and individual traces differ because of differences in data splits and in the random seeds used for initial parameters.</p></div>
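<div xmlns="http://www.tei-c.org/ns/1.0"><p>The encoding and selection steps described in the Methods can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used in this work: it uses plain Python lists in place of tensors, takes the positive-class probability p of each candidate as given, and writes the Reptile outer step in its simplest interpolation form, whereas the text states that Adam was used for the meta-optimization. The names one_hot, selection_weights, sample_next, reptile_update, and meta_step are hypothetical.</p>
<ab><![CDATA[
```python
import random

# Amino acid alphabet in the order listed in the text.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def one_hot(peptide):
    """Encode a peptide string as an N x 20 nested list of 0/1 values."""
    rows = []
    for residue in peptide:
        row = [0] * len(AMINO_ACIDS)
        row[AMINO_ACIDS.index(residue)] = 1  # raises ValueError for unknown residues
        rows.append(row)
    return rows


def selection_weights(positive_probs):
    """Normalize the per-peptide variance p(1 - p) into sampling probabilities."""
    raw = [p * (1.0 - p) for p in positive_probs]
    total = sum(raw)
    if total == 0.0:
        # Model is fully certain everywhere: fall back to uniform sampling
        # (a choice made for this sketch, not taken from the text).
        return [1.0 / len(raw)] * len(raw)
    return [w / total for w in raw]


def sample_next(positive_probs, rng):
    """Sample one candidate index, weighted by p(1 - p) rather than using argmax."""
    weights = selection_weights(positive_probs)
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def reptile_update(theta0, theta_adapted, meta_step):
    """Move the initial parameters toward the parameters obtained after J inner steps."""
    return [t0 + meta_step * (ta - t0) for t0, ta in zip(theta0, theta_adapted)]


if __name__ == "__main__":
    rng = random.Random(0)
    print(one_hot("AR")[1][:3])
    print(selection_weights([0.5, 0.9, 0.99]))
    print(sample_next([0.5, 0.9, 0.99], rng))
    print(reptile_update([0.0, 2.0], [1.0, 0.0], meta_step=0.5))
```
]]></ab>
<p>Sampling in proportion to p(1 - p) keeps some chance of selecting confidently classified peptides, which is the difference from the argmax rule of eq 1 noted in the text.</p></div>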
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#9632; RESULTS AND DISCUSSION</head><p>Figure <ref type="figure">2</ref> shows the active learning, meta-learning, and baseline results across the 12 data sets for the uncertainty minimization active learning strategy. The baseline results (gray) show that the convolutional neural network provides reasonable results across the range of modeling tasks despite its simple architecture. The subpanels are arranged in increasing order of training set size, which were 65, 87, 90, 119, 120, 150, 183, 253, 891, 2079, 2880, 7028, respectively. Overall, training set size does not correlate to learning performance. The solubility task showed the worst overall performance despite being the largest data set. Solubility classification is challenging because many of the training examples are folded proteins, requiring long-range sequence correlations to model properly. The length range of the solubility set is large, ranging from 19 to 198 amino acids. However, the ranges of the antifungal (2-143) and antibacterial (2-183) sets are similar, so this is not likely the source of difficulty. Indeed, solubility prediction is known to be a difficult task, with state-of-the-art models achieving reported accuracies between 0.52 and 0.77. <ref type="bibr">87,</ref><ref type="bibr">88</ref> Our simple convolutional neural network has an accuracy of 0.59, which is within this range.</p><p>Figure <ref type="figure">2</ref> also shows the comparison of choosing peptides randomly and choosing peptides for which the model has maximum uncertainty (uncertainty minimization). Uncertainty minimization (blue line) is not better than choosing randomly (dashed green), in general. 
It is sometimes worse and sometimes better in the long term, but for few-shot learning of up to 10 training examples, it is not significantly different from random choice (Figure <ref type="figure">3</ref>).</p><p>To assess the effect of meta-learning on reducing the experiment number, it was evaluated both alone and in combination with active learning. Results from combining meta-learning with uncertainty minimization are shown in Figure <ref type="figure">2</ref>. Meta-learning consistently improves initial gains in accuracy, except in the soluble, SHP-2, and TULA-2 tasks. This is likely because these tasks are quite different from the other nine, so feature reuse is less important. Note that meta-learning results were controlled for label correlation due to data set overlap by randomly swapping labels (negatives become positive and vice versa) between meta-training trials. Thus, although many negative data sets overlap (as do the positive sets in the cases of anticancer and antibacterial), meta-learning is unable to overfit to the labels of any one set.</p><p>Combining uncertainty minimization with meta-learning only sometimes increases accuracy over random choice but significantly improves few-shot accuracy on average after 10 training examples (Figure <ref type="figure">3</ref>). In some data sets, final accuracy approaches or even exceeds baseline levels with fewer than 25 examples, whereas the baseline is trained on all data. This demonstrates the advantage of using meta-learning.</p><p>Receiver operating characteristic (ROC) curves provide an accounting of the balance between type I and type II errors. This is important for peptide activity because, due to the large design space, false positives are more detrimental to a model's usefulness. The area under curve (AUC) of a ROC curve gives a scalar representing the quality of the ROC curve. 
Note that here we enforced balanced classes (same number of positive/negative examples in the training set). This ensures that accuracy and ROC are good measures of performance, while other metrics become necessary in cases with unbalanced training sets. The ROC AUCs are reported in the Supporting Information in Tables <ref type="table">S1</ref> and <ref type="table">S2</ref> and Figure <ref type="figure">S1</ref>. As observed in Figures <ref type="figure">2</ref> and <ref type="figure">3</ref>, there is no significant gain to be had in using uncertainty minimization active learning. Also, 50 examples are not enough to match the baseline models without meta-learning. The variance in this work arises because there are many ways to choose 50 peptides from the data sets, the tasks of the data sets are disparate, and some data sets are small. Notable exceptions are the SHP-2 and soluble data sets, which show poor performance with limited examples relative to baseline models. In particular, SHP-2 seems to require the full data set to achieve good accuracy. This may be due to the importance of motifs in this data set <ref type="bibr">89</ref> and lack of feature reuse between data sets due to its unique task (binding affinity with a specific enzyme).</p><p>The QBC training results are reported in Figure <ref type="figure">4</ref>. QBC does not consistently provide better performance than random choice or uncertainty minimization. QBC significantly improves with meta-learning. As shown in Figure <ref type="figure">3</ref>, QBC seems to have the best few-shot and final performance when combined with meta-learning, although uncertainty minimization is as good for few-shot only. The Supporting Information contains tabulated AUCs and accuracies in Tables <ref type="table">S1-S8</ref>.</p><p>Overall, meta-learning usually improves accuracy when combined with active learning methods across these data sets, as shown in Figure <ref type="figure">3</ref>. 
Recent analysis of meta-learning has shown mixed results across tasks. Raghu et al. <ref type="bibr">90</ref> showed that model-agnostic meta-learning methods like Reptile only learn to reuse features across tasks. To assess how feature reuse applies to these data sets, we performed a basic sensitivity analysis in Figure <ref type="figure">5</ref>, which gives insight into the various features used in the modeling. Figure <ref type="figure">5</ref> shows the features for the baseline model on the antibacterial data set and the zero-shot features for meta-learning. Note that all motif frequencies are 1.0 since these are the meta-learned parameters, not a realization of training. The results show that meta-learning does not have the same features found in the baseline model, although some of the important amino acids are shared (N, C, H). Some of the amino acids within the motifs are shared, but the motifs are not identical.</p><p>To ensure that our conclusions about meta-learning and QBC being preferred for peptide design are robust, we explored four alternative model choices. These are shown in Figure <ref type="figure">6</ref> for only the antibacterial task (although meta-learning traces were trained on all but antibacterial). The second subplot is an ablation of the motif convolutions, i.e., using only amino acid count vectors for training. Without the motifs, accuracy is roughly the same, and only the combination of QBC with meta-learning is advantageous over random choice. Training with only motifs (convolutions) and no amino acid counts reduces performance, in general, and entirely removes the benefits of meta-learning, active learning, and the combination of the two. Using the original model structure but switching from a cross-entropy loss to absolute error loss reduces accuracy for all methods. The last plot shows what happens if we remove the label swapping during meta-learning, which is used to reduce overfitting to "active" peptides. Zero-shot performance is improved, due to good correlation of activity in peptides across tasks. Meta-learning is preferred in this setting, when it is known that the class labels of active vs inactive can be reused across tasks. These results indicate that the benefits of meta-learning paired with active learning can be sensitive to the model structure used. In particular, if the chosen loss function is poor, few-shot accuracy improvement may be impossible. The benefits of meta-learning combined with active learning seem to depend on the neural network features, as evidenced by the difference between the "No Motifs" and "No Counts" subplots of Figure <ref type="figure">6</ref>.</p><p>Figure <ref type="figure">6</ref> caption (partial): "No Label Swap" means that labels were not swapped during meta-learning so that zero-shot accuracy is maximized. These results show that meta-learning is not always better but is consistently a good choice for few-shot learning.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Finally, to evaluate whether beta calibration interacted favorably with these methods, the uncertainty minimization trials were repeated with beta calibration of the network output. In this case, beta calibration fitting is done separately after each training iteration, i.e., once per choice of new training point(s). The intent of beta calibration is to ensure that the scores output by the neural network reflect the actual probabilities that a given point belongs to the positive class. However, beta calibration did not have an appreciable effect on the accuracy or training efficiency (Figure <ref type="figure">S2</ref>).</p><p>We assess statistical significance in Figure <ref type="figure">3</ref>. The box-and-whisker plots show average accuracy values for the 10-example and 50-example training runs described in the Methods section. 
We performed the Wilcoxon signed-rank test on the 12 training accuracy values for each method, comparing all possible pairs of the six methods explored here and obtaining a p-value for each pair. Asterisks are shown in Figure <ref type="figure">3</ref> for pairs with p &#8804; 0.05. Pairs with no asterisk were not significantly different. Exact p-values for accuracy can be found in the Supporting Information in Tables <ref type="table">S7</ref> and <ref type="table">S8</ref>. We performed a similar analysis for the average AUC values for each method, also found in the Supporting Information in Figure <ref type="figure">S1</ref> with p-values in Tables <ref type="table">S5</ref> and <ref type="table">S6</ref>. Figure <ref type="figure">3</ref> shows that, after 50 training examples, nearly all of the methods explored here are not significantly different on average from random choice (with or without meta-learning). In the few-shot case, with only 10 training examples, the differences between methods are more pronounced. In this case, combining meta-learning with either uncertainty minimization or QBC yields significantly better average accuracy than either active learning method alone. This supports the hypothesis that meta-learning can enhance the accuracy of few-shot learning across multiple tasks, even for a relatively simple convolutional neural network.</p></div>
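<div xmlns="http://www.tei-c.org/ns/1.0"><p>The paired significance test described above can be sketched as follows. This is an illustrative, self-contained implementation of an exact two-sided Wilcoxon signed-rank test, not the code used in this work; the source does not state which implementation or which variant (exact or normal approximation) was used. The function name <code>wilcoxon_signed_rank</code>, the two method names, and the accuracy values are hypothetical placeholders, not results from this study.</p>
<p><![CDATA[
```python
from itertools import product


def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (w_plus, p_value). Zero differences are dropped and tied absolute
    differences receive average ranks. The p-value enumerates every sign
    assignment, which is feasible for small n such as 12 data sets.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    # Store doubled ranks so that averaged tie ranks stay exact integers.
    ranks2 = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = (i + 1) + (j + 1)  # equals 2 * average rank
        i = j + 1
    total2 = sum(ranks2)
    w2_obs = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    dev_obs = abs(2 * w2_obs - total2)
    # Under the null hypothesis each difference is positive or negative
    # with equal probability, so all 2**n sign patterns are equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w2 = sum(r for r, s in zip(ranks2, signs) if s)
        if abs(2 * w2 - total2) >= dev_obs:
            extreme += 1
    return w2_obs / 2, extreme / 2 ** n


# Made-up placeholder accuracies for two hypothetical methods on 12 tasks.
method_a = [0.71, 0.64, 0.80, 0.58, 0.77, 0.69, 0.83, 0.62, 0.74, 0.55, 0.66, 0.79]
method_b = [0.65, 0.66, 0.72, 0.57, 0.70, 0.61, 0.78, 0.63, 0.69, 0.52, 0.60, 0.75]
w_plus, p_value = wilcoxon_signed_rank(method_a, method_b)
significant = p_value <= 0.05  # the threshold used for asterisks in Figure 3
```
]]></p>
<p>Repeating this call for every pair of the six methods gives the full set of pairwise p-values. In practice a library routine such as <code>scipy.stats.wilcoxon</code> performs the same test; the enumeration here is written out only to make the procedure explicit.</p></div>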
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#9632; CONCLUSIONS</head><p>This work has presented a data set of 12 different peptide classification tasks and explored active learning and meta-learning strategies for predicting peptide activity on them. The simple deep convolutional neural network used here offers greater than 85% accuracy across all tasks except the solubility task, which is known to be difficult and on which its accuracy was within the range reported for state-of-the-art models. We expect that more complex models with attention, more layers, and long-range interactions in sequence space could improve the accuracy. The two active learning strategies explored here were not found to provide significant improvements over sampling peptides randomly. Meta-learning was found to improve few-shot accuracy with 10 training examples only when combined with active learning. Here, every meta-learning method performed significantly better on average than uncertainty minimization with no meta-learning. Both meta-learning with uncertainty minimization and meta-learning with QBC were significantly better than either active learning strategy alone. In contrast, after 50 training examples, only meta-learning with QBC was found to significantly increase average accuracy over random choice, indicating that the benefits of meta-learning are short-lived but well suited to a data-scarce context. These conclusions also hold in the zero-shot learning setting, but model ablation shows that they are dependent on loss choice and model features. 
This work provides a new peptide multitask data set and benchmark results for standard active learning and meta-learning methods and shows that meta-learning could have significant benefits for experimental settings where data is scarce.</p></div>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_1"><p>https://dx.doi.org/10.1021/acs.jcim.0c00946 J. Chem. Inf. Model. 2021, 61, 95-105</p></note>
		</body>
		</text>
</TEI>
