<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Visualizing and Annotating Protein Sequences using A Deep Neural Network</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>11/01/2020</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10291891</idno>
					<idno type="doi">10.1109/IEEECONF51394.2020.9443364</idno>
					<title level='j'>Visualizing and Annotating Protein Sequences using A Deep Neural Network</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Zhengqiao Zhao</author><author>Gail Rosen</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[It is critical for biological studies to annotate amino acid sequences and understand how proteins function. Protein function is important to medical research in the health industry (e.g., drug discovery). With the advancement of deep learning, accurate protein annotation models have been developed for alignment free protein annotation. In this paper, we develop a deep learning model with an attention mechanism that can predict Gene Ontology labels given a protein sequence input. We believe this model can produce accurate predictions as well as maintain good interpretability. We further show how the model can be interpreted by examining and visualizing the intermediate layer output in our deep neural network.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Understanding protein function at the molecular level has great implications for the biomedical and pharmaceutical industries <ref type="bibr">[1]</ref>. For example, protein annotation facilitates the development of novel tools for disease prevention, diagnosis, and treatment <ref type="bibr">[2]</ref>. Researchers can design experiments to characterize the function of a protein (for example, an assay that measures the execution of a given molecular function and shows whether the protein serves as an agent in such executions) <ref type="bibr">[3]</ref>. Mapping the diversity and full space of the protein universe would be valuable, and the number of genomic sequences collected is growing exponentially due to recent advances in sequencing technology <ref type="bibr">[4]</ref>. However, experimental methods cannot efficiently annotate protein sequences at large scale.</p><p>Researchers have shown that knowledge of the biological role of common proteins in one organism can often be transferred to other organisms <ref type="bibr">[5]</ref>. Therefore, the Gene Ontology Consortium proposed a dynamic collection of controlled vocabulary to describe the functions of proteins. This Gene Ontology (GO) database lays a foundation for the computational analysis of large-scale molecular biology and genetics experiments in biomedical research <ref type="bibr">[6]</ref>.</p><p>GO has three major branches, namely Biological Process (BP), Molecular Function (MF), and Cellular Component (CC), which describe the function of proteins from different aspects. GO is hierarchical: "children" terms denote more specific functions than their "parent" terms. In addition, individual terms can have not only multiple descendants but also multiple parents <ref type="bibr">[7]</ref>. This controlled vocabulary of protein functions has enabled computational methods for functional annotation. 
The experimentally determined functions of annotated protein sequences can then be used to infer the functions of protein sequences that have not yet been characterized. One of the challenges in GO term prediction is that it is essentially a multi-task learning problem, as the model must predict the presence of multiple GO terms simultaneously.</p><p>Recent advances in deep learning have led to major successes in the Computer Vision (CV) and Natural Language Processing (NLP) fields. In this paper, we propose a deep learning model to predict the GO terms of protein sequences. We show that such a model can outperform baseline methods and can be interpreted by extracting the output of intermediate layers. We show that our model can learn sequential information from input proteins, and we demonstrate multiple methods for model interpretation and data visualization. The rest of this paper is organized as follows. In Section II, we give an overview of related research. Section III discusses our proposed model in detail. Experiments are presented in Section IV, followed by our results and discussion in Section V. In Section VI, we draw our conclusions.</p></div>
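The multi-label nature of GO prediction described above can be illustrated with a small sketch (toy term identifiers, not real GO accessions, and not code from the paper): because a term may have several parents and an annotation implies all of its ancestors, each protein maps to a binary target vector over the term vocabulary.

```python
# Toy sketch of GO's DAG structure and multi-label targets.
# Term ids below are illustrative placeholders, not real GO accessions.
parents = {                       # child -> parents (a DAG, not a tree)
    "GO:binding": ["GO:mf_root"],
    "GO:ion_binding": ["GO:binding"],
    "GO:metal_ion_binding": ["GO:ion_binding"],
    "GO:mf_root": [],
}

def ancestors(term):
    """All terms implied by `term` (term itself included)."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents[t])
    return seen

terms = sorted(parents)           # fixed column order for the target vector

def target_vector(annotations):
    """Binary multi-label target: 1 for every annotated or implied term."""
    implied = set().union(*(ancestors(a) for a in annotations))
    return [1 if t in implied else 0 for t in terms]
```

A model trained on such vectors predicts all terms simultaneously, which is what makes GO annotation a multi-task problem.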
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. RELATED WORK</head><p>One of the most effective ways to computationally determine the functions of an unknown protein is to find the most similar sequence among reference sequences with experimental functional annotations and use its functions to annotate the query sequence. Similarity comparison programs such as the Basic Local Alignment Search Tool (BLAST) <ref type="bibr">[8]</ref> can identify biologically relevant sequence similarities. Many computational methods have been proposed based on sequence similarity <ref type="bibr">[1]</ref>, <ref type="bibr">[9]</ref>. However, the similarity search process involves alignment, which is computationally expensive. In addition, similarity search does not generalize well to distantly related sequences. Therefore, the prediction of novel protein sequences can be challenging for similarity-based methods <ref type="bibr">[10]</ref>.</p><p>Machine learning based methods, on the other hand, generalize better when predicting the function of remotely related proteins and of homologous proteins with distinct functions <ref type="bibr">[11]</ref>. Function prediction performance has been further improved by deep learning based models in recent years <ref type="bibr">[2]</ref>, <ref type="bibr">[12]</ref>, <ref type="bibr">[13]</ref>. However, the proposed models mainly use convolutional neural networks (CNN) instead of recurrent neural networks (RNN), which are better suited to capturing sequential order information <ref type="bibr">[14]</ref>. Furthermore, recurrent neural networks, especially long short-term memory (LSTM) networks, with an attention mechanism <ref type="bibr">[15]</ref> have demonstrated better performance and model interpretability not only in the NLP field <ref type="bibr">[16]</ref>-<ref type="bibr">[18]</ref> but also in the bioinformatics field <ref type="bibr">[19]</ref>, <ref type="bibr">[20]</ref>. 
Therefore, we believe it is beneficial to incorporate a recurrent neural network with an attention mechanism for protein function prediction.</p><p>The contributions of our work are three-fold: 1) we develop a deep learning model with an attention mechanism that can predict Gene Ontology labels given a protein sequence input; 2) we show that this model produces more accurate predictions by comparing it against other baseline methods; 3) we show how the model can be interpreted by examining and visualizing the intermediate layer outputs of our deep neural network.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. THE PROPOSED MODEL</head><p>Our model consists of several convolutional residual blocks, a bidirectional LSTM network with an attention mechanism <ref type="bibr">[15]</ref>, and a hierarchical dense layer described in <ref type="bibr">[12]</ref>. The convolutional residual network (ResNet) is designed to capture local patterns using convolutional filters and to avoid vanishing gradients through shortcut connections. Therefore, we place it at the beginning of our model to learn lower-level patterns directly from the input data. A bidirectional LSTM network is then used to learn high-level features and sequential information from the output of the previous step. One of the key components of our proposed model is the feed-forward attention mechanism, which is inspired by previous work in the natural language processing field <ref type="bibr">[15]</ref>, <ref type="bibr">[18]</ref>. Previous work has shown that the attention mechanism can give the model better interpretability in addition to accurate predictions. Therefore, we add this mechanism on top of the bidirectional LSTM layer to obtain a dense representation of the input sequence. Finally, a hierarchical dense layer described in <ref type="bibr">[12]</ref> performs the final prediction. The advantage of such a hierarchical dense layer is that it captures the hierarchical structure of GO and ensures that a GO term will be predicted if one of its "child" terms is predicted. We have applied a deep neural network with a similar architecture to 16S rRNA datasets for phenotype prediction and demonstrated that such a deep learning architecture can learn informative regions from 16S rRNA reads that are predictive of phenotype <ref type="bibr">[20]</ref>.</p><p>Our proposed model is described in Fig. <ref type="figure">1</ref>. The input is a one-hot coded protein sequence of length T. 
Since there are 20 different amino acids in our input sequences <ref type="bibr">[21]</ref>, a one-hot coded sequence is a T &#215; 20 dimensional matrix. The input is fed to a 1-dimensional convolutional block with a window size of W and N c output channels, followed by 4 ResNet layers with the same configuration. The resulting output is a T &#215; N c dimensional matrix, which is processed by a bidirectional LSTM layer with N h hidden nodes; the outputs from both directions are concatenated. Therefore, the hidden state output, H, is a T &#215; 2N h dimensional matrix. The attention layer is a time-distributed dense layer with 1 hidden unit, and it assigns an attention score to each position in the input along the sequence axis. The attention scores are passed through a softmax function to generate attention weights, so that the weights sum to 1. The attention weight vector, &#945;, is 1 &#215; T dimensional. Then, the attention weights and the hidden state output, H, are used to compute a sequence embedding vector, E = &#945;H, as shown in <ref type="formula">(1)</ref>, where E is a 1 &#215; 2N h dimensional vector.</p><p>Finally, a hierarchical dense layer produces the final GO term predictions. Since one sequence can have multiple GO term annotations, we need a prediction layer that can predict different GO terms simultaneously. Here we use the hierarchical dense layer proposed in DeepGO <ref type="bibr">[12]</ref>. In the hierarchical dense layer, each dense node, which outputs a scalar, corresponds to a GO term. We use a sigmoid activation function after each dense node and binary cross-entropy as the cost function.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. EXPERIMENTS</head></div>
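The architecture described in Section III can be sketched as follows. This is a minimal illustrative PyTorch implementation with assumed layer sizes (N c = 128, W = 3, N h = 64, one convolution per residual block) and a flat output layer standing in for DeepGO's hierarchical dense layer; it is not the authors' code.

```python
# Illustrative sketch of the Conv/ResNet -> BiLSTM -> attention -> dense pipeline.
# Layer sizes and the flat output head are assumptions, not the paper's exact model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock1d(nn.Module):
    """Simplified 1-D residual block: one conv plus a shortcut connection."""
    def __init__(self, channels, window):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, window, padding=window // 2)

    def forward(self, x):
        return F.relu(x + self.conv(x))  # shortcut helps avoid vanishing gradients

class AttentionGOModel(nn.Module):
    def __init__(self, n_amino=20, n_c=128, window=3, n_h=64, n_terms=589):
        super().__init__()
        self.conv_in = nn.Conv1d(n_amino, n_c, window, padding=window // 2)
        self.resnet = nn.Sequential(*[ResBlock1d(n_c, window) for _ in range(4)])
        self.lstm = nn.LSTM(n_c, n_h, batch_first=True, bidirectional=True)
        self.att = nn.Linear(2 * n_h, 1)        # time-distributed dense, 1 unit
        self.out = nn.Linear(2 * n_h, n_terms)  # flat stand-in for hierarchical layer

    def forward(self, x):                        # x: (batch, T, 20) one-hot
        h = F.relu(self.conv_in(x.transpose(1, 2)))
        h = self.resnet(h).transpose(1, 2)       # (batch, T, n_c)
        H, _ = self.lstm(h)                      # (batch, T, 2*n_h)
        scores = self.att(H).squeeze(-1)         # one score per position
        alpha = torch.softmax(scores, dim=1)     # attention weights sum to 1
        E = torch.bmm(alpha.unsqueeze(1), H).squeeze(1)  # E = alpha @ H, as in Eq. (1)
        return torch.sigmoid(self.out(E)), alpha

probs, alpha = AttentionGOModel()(torch.zeros(2, 100, 20))
```

Returning `alpha` alongside the predictions is what enables the attention-weight visualizations discussed in Section V.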
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Baseline Methods</head><p>Kulmanov et al. proposed a deep learning model, DeepGO, that predicts protein function from sequence and outperforms the similarity-based baseline method, BLAST <ref type="bibr">[12]</ref>. Their model can take only the protein sequence as input and predict GO terms (referred to as DeepGOSeq in the original paper). In addition, the model can be trained to take both a protein sequence and a protein-protein interaction (PPI) network embedding vector as inputs for GO term prediction. However, retrieving the PPI information of an input sequence requires similarity searches. In this paper, we focus on developing a model without similarity search steps. Therefore, we compare our method with the DeepGOSeq model, which takes only a protein sequence as input. BLAST, which relies on sequence similarity comparison, is the other baseline method we consider in this paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Dataset and Evaluation Metric</head><p>SwissProt provides manually curated protein sequences with Gene Ontology annotations <ref type="bibr">[22]</ref>. Kulmanov et al. used it as their experimental dataset, which contains 60,710 proteins annotated with 27,760 classes (19,181 in BP, 6,221 in MF, and 2,358 in CC) <ref type="bibr">[12]</ref>. They further filtered out very specific GO terms with only a small number of annotations and selected the top 932 terms for BP, 589 terms for MF, and 436 terms for the CC ontology to train their models <ref type="bibr">[12]</ref>, which results in three sets of training and testing datasets labeled by the three GO term branches. The authors have released the processed dataset 1 . We use their filtered dataset to develop our model and for further visualization.</p><p>The performance of our model is evaluated by the protein-centric maximum F-measure, F max , which is widely used for Gene Ontology prediction evaluation <ref type="bibr">[1]</ref>, <ref type="bibr">[2]</ref>, <ref type="bibr">[12]</ref>. We adopt the F-measure computation defined in DeepGO <ref type="bibr">[12]</ref>. The outputs of both our model and the DeepGO model are vectors of decimal values between 0 and 1. To determine the predicted GO terms, a threshold is needed, and the protein-centric maximum F-measure finds the threshold that maximizes the F-measure of the resulting GO term predictions. To be specific, Kulmanov et al. compute the F max measure using the following formulas:</p><p>1) The precision and recall of an individual sequence, i, can be evaluated by (<ref type="formula">2</ref>) and (<ref type="formula">3</ref>), i.e., pr i (t) = |P i (t) &#8745; T i | / |P i (t)| and rc i (t) = |P i (t) &#8745; T i | / |T i |, where P i (t) is the set of predicted GO terms for protein i determined by a threshold t, and T i is the set of annotated GO terms for protein i (ground truth). 
Note that one sequence can be annotated by multiple GO terms; therefore, precision and recall are calculated per sequence.</p><p>2) The average precision and recall over the testing proteins can be calculated by (<ref type="formula">4</ref>) and (<ref type="formula">5</ref>), i.e., AvgPr(t) = (1/m(t)) &#8721; i pr i (t) and AvgRc(t) = (1/n) &#8721; i rc i (t), where m(t) is the total number of proteins for which the model predicts at least one term at threshold t and n is the number of all testing proteins.</p><p>3) Finally, the protein-centric maximum F-measure, F max , is calculated by choosing the threshold t that maximizes the average F-measure, as shown in <ref type="formula">(6)</ref>: F max = max t 2 &#183; AvgPr(t) &#183; AvgRc(t) / (AvgPr(t) + AvgRc(t)).</p><p>1 The DeepGO experimental dataset is available at <ref type="url">https://github.com/bio-ontology-research-group/deepgo</ref> <ref type="bibr">[12]</ref>.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Experimental Setup</head><p>The experimental dataset has been split into a training and a test set by Kulmanov et al. We further split the training set into an 80% training set and a 20% validation set. We train three different models to predict GO terms associated with the three main branches (BP, MF, CC), respectively, similar to the DeepGO models <ref type="bibr">[12]</ref>. The best set of parameters for each model is selected to maximize F max on the validation set through a grid search over the combinations of parameters listed in Table <ref type="table">I</ref>. The best parameters for the three models are largely consistent: N c = 128, W = 3, D = 0.1, and HD = Yes. However, the optimal N h is 64 for the BP and MF models and 128 for the CC model.</p></div>
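The protein-centric F max computation in Eqs. (2)-(6) can be sketched as follows; the function and variable names are illustrative, not taken from the DeepGO codebase, and every protein is assumed to have at least one ground-truth term.

```python
# Illustrative sketch of protein-centric F_max (Eqs. (2)-(6)).
# Assumes each protein has at least one ground-truth GO term.
import numpy as np

def f_max(scores, truth, go_terms, thresholds=np.arange(0.01, 1.0, 0.01)):
    """scores: (n, k) predicted probabilities in [0, 1];
    truth: list of n sets of annotated GO terms;
    go_terms: list of k term ids aligned with the score columns."""
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for i in range(len(truth)):
            P_i = {go_terms[j] for j in np.flatnonzero(scores[i] >= t)}
            tp = len(P_i & truth[i])
            if P_i:                        # m(t): proteins with at least one prediction
                precisions.append(tp / len(P_i))      # Eq. (2)
            recalls.append(tp / len(truth[i]))        # Eq. (3)
        if not precisions:
            continue
        avg_pr = sum(precisions) / len(precisions)    # Eq. (4), averaged over m(t)
        avg_rc = sum(recalls) / len(recalls)          # Eq. (5), averaged over n
        if avg_pr + avg_rc > 0:
            best = max(best, 2 * avg_pr * avg_rc / (avg_pr + avg_rc))  # Eq. (6)
    return best
```

Note the asymmetry in the averages: precision is averaged only over proteins with at least one prediction at threshold t, while recall is averaged over all n testing proteins.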
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. RESULTS AND DISCUSSION</head><p>The models are trained with the best parameters selected in the Experimental Setup section and compared with the F max values reported in the DeepGO paper <ref type="bibr">[12]</ref>. We also implement the DeepGOSeq model ourselves and evaluate our own implementation on the experimental dataset. The performance is shown in Table <ref type="table">II</ref>. From Table <ref type="table">II</ref>, we can see that BLAST has the best performance for BP-related GO term prediction. However, it does not perform well on the MF and CC prediction tasks. The performance of the DeepGOSeq model reported in the original paper notably outperforms BLAST on the CC prediction task, although it is slightly worse than BLAST on the BP and MF tasks. Our implementation of the DeepGOSeq model outperforms BLAST on both the MF and CC tasks, with some performance drop on BP. Lastly, our proposed model produces better or comparable Gene Ontology term prediction performance: it is comparable to BLAST on the BP task, and it also stands out on the MF and CC prediction tasks compared with the DeepGOSeq model. In addition to producing accurate GO predictions, the intermediate layer outputs of our model, namely the attention weights, &#945;, and the sequence embedding vector, E, can facilitate data visualization and model interpretation. First, our model can convert a raw amino acid sequence into a meaningful numerical embedding vector, E, that encodes protein function signals. To understand how our model transforms the protein sequences in numerical space, we extract the embedding vectors, E, for all testing sequences and reduce their dimensionality to two dimensions using t-SNE <ref type="bibr">[23]</ref> for visualization. Figs. <ref type="figure">2</ref> and <ref type="figure">3</ref> show the testing sequence embedding vectors generated by the MF and CC models, respectively. 
We can observe several clusters that might correspond to biologically meaningful functions. To interpret the clusters, we perform k-means <ref type="bibr">[24]</ref>, with k = 5 clusters, to assign cluster labels to sequences. We then find the GO term that dominates each cluster and color all testing sequences annotated with that dominant GO term in a distinctive color. Sequences that are not associated with the selected dominant GO terms are colored gray. In Figs. <ref type="figure">2</ref> and <ref type="figure">3</ref>, each point is a sequence, colored according to its dominant GO term. From these figures, we can see that the model learns information from the amino acid sequences and can embed sequences associated with the same functions close together. To be specific, in Fig. <ref type="figure">2</ref>, we show that some sequences have both catalytic activity and binding labels (in green), which contributes to the overlap between those two clusters. Sequences with transporter activity form their own cluster (in red), and sequences with one specific catalytic activity related GO term (transferase activity, transferring phosphorus-containing groups) form the purple cluster. In Fig. <ref type="figure">3</ref>, clusters are formed based on organelle. Proteins in the mitochondria form a small cluster (in green). There is a symbiogenesis theory hypothesizing that all mitochondria derive from a common ancestral organelle that originated from the integration of an endosymbiotic alphaproteobacterium into a host cell related to Asgard Archaea <ref type="bibr">[25]</ref>. Our proposed model picks up this information by embedding mitochondrial protein sequences close together in the embedding space. Membrane-associated sequences, on the other hand, are more widely spread, which implies that membrane-related protein sequences also participate in other functions. We further explore the attention weights extracted from our model. 
We focus on three sodC gene sequences from three different fungi, namely, Aspergillus fumigatus Af293 (SODC ASPFU), Aspergillus niger CBS 513.88 (SODC ASPNC), and Candida glabrata CBS 138 (SODC CANGA). According to UniProt <ref type="bibr">[22]</ref>, the sodC gene product destroys toxic radicals that are normally produced within cells. Fig. <ref type="figure">4</ref> shows the attention weights for these three sequences. The protein sequences SODC ASPFU (in blue) and SODC ASPNC (in orange) are highly similar (15 amino acid differences out of 154 amino acids), whereas the SODC CANGA (in green) sequence has more amino acid variation compared with the other two. In the figure, we also label the amino acids of the three sequences at high-attention positions (from left to right: SODC ASPFU, SODC ASPNC, and SODC CANGA). From the figure, we can see that similar sequences have similar attention weights, whereas different amino acid configurations can result in different attention weights. For example, at position 116, both SODC ASPFU (in blue) and SODC ASPNC (in orange) have high attention weights, while the model does not pay high attention to that position in the SODC CANGA (in green) sequence. Only SODC CANGA (in green) has an S at that position, in contrast to the amino acid T shared by SODC ASPFU (in blue) and SODC ASPNC (in orange).</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. CONCLUSION</head><p>This paper presents a deep learning based Gene Ontology prediction model that combines a convolutional neural network and a recurrent neural network with a feed-forward attention mechanism. We evaluate our new approach on an experimental dataset and demonstrate that our method performs comparably to or outperforms the baseline methods on the different GO branch prediction tasks. We then show that the outputs of intermediate layers can be used to interpret the model. We demonstrate that sequences with the same Gene Ontology annotation are clustered together in the embedding space by the model. 
Furthermore, with the help of the attention weights produced by the model, users can explore the sequential information space to identify predictive regions in the sequences. As a future direction, we suggest predicting other protein labels, such as protein families <ref type="bibr">[26]</ref>, with this framework.</p></div>
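The visualization recipe of Section V (t-SNE on the embedding vectors E, k-means clustering, and coloring by each cluster's dominant GO term) can be sketched as follows, assuming scikit-learn and illustrative array shapes; this is not the authors' pipeline.

```python
# Illustrative sketch of the embedding-visualization recipe: t-SNE for 2-D
# coordinates, k-means cluster labels, and the dominant GO term per cluster.
# Shapes and names are assumptions, not the paper's code.
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

def dominant_terms(E, annotations, k=5, seed=0):
    """E: (n, d) sequence embeddings; annotations: list of n sets of GO term ids.
    Returns 2-D t-SNE coordinates, cluster labels, and each cluster's dominant term."""
    coords = TSNE(n_components=2, random_state=seed).fit_transform(E)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(E)
    dom = {}
    for c in range(k):
        counts = Counter(t for i in np.flatnonzero(labels == c)
                         for t in annotations[i])
        dom[c] = counts.most_common(1)[0][0] if counts else None
    return coords, labels, dom
```

Each point in `coords` can then be colored by `dom[labels[i]]` when the sequence carries that term, and gray otherwise, reproducing the coloring scheme described for Figs. 2 and 3.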
		</body>
		</text>
</TEI>
