<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Accurately predicting anticancer peptide using an ensemble of heterogeneously trained classifiers</title></titleStmt>
			<publicationStmt>
				<publisher>Elsevier</publisher>
				<date>01/01/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10533995</idno>
					<idno type="doi">10.1016/j.imu.2023.101348</idno>
					<title level='j'>Informatics in Medicine Unlocked</title>
<idno>2352-9148</idno>
<biblScope unit="volume">42</biblScope>
<biblScope unit="issue">C</biblScope>					

					<author>Sayed Mehedi Azim</author><author>Noor_Hossain Nuri Sabab</author><author>Iman Noshadi</author><author>Hamid Alinejad-Rokny</author><author>Alok Sharma</author><author>Swakkhar Shatabda</author><author>Iman Dehzangi</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[The use of therapeutic peptides for the treatment of cancer has received tremendous attention in recent years. Anticancer peptides (ACPs) are considered new anticancer drugs which have several advantages over chemistry-based drugs including high specificity, strong tumor penetration capacity, and low toxicity level for normal cells. Due to the rise of experimentally verified bioactive peptides, several in silico approaches became imperative for the investigation of the characteristics of ACPs. In this paper, we proposed a new machine learning tool named iACP-RF that uses a combination of several sequence-based features and an ensemble of three heterogeneously trained Random Forest classifiers to accurately predict anticancer peptides. Experimental results show that our proposed model achieves an accuracy of 75.9% which outperforms other state-of-the-art methods by a significant margin. We also achieve 0.52, 75.6%, and 76.2% in terms of Matthews Correlation Coefficient (MCC), Sensitivity, and Specificity, respectively. iACP-RF as a standalone tool and its source code are publicly available at: https://github.com/MLBC-lab/iACP-RF.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Cancer is considered a genetic disease because it develops through changes in the genes that control cell function, especially how cells grow and divide. According to the World Health Organization (WHO), in 2020 alone, 10 million people died prematurely due to cancer worldwide, which accounts for nearly one in six deaths. Due to the lack of accurate and non-invasive markers, the detection of cancer is often biased and not always correct <ref type="bibr">[1,</ref><ref type="bibr">2]</ref>. Advancements in proteomics and genomics have recently led to the discovery of peptide-based biomarkers, which have improved the detection of cancer at an early stage <ref type="bibr">[3]</ref>. After diagnosing cancer, the next step is its treatment.</p><p>As of yet, chemotherapy, radiation therapy, hormonal therapy, and surgery are the conventional treatments available for treating cancer, the Food and Drug Administration (FDA) and more than 500 are under clinical trials <ref type="bibr">[10]</ref><ref type="bibr">[11]</ref><ref type="bibr">[12]</ref><ref type="bibr">[13]</ref><ref type="bibr">[14]</ref><ref type="bibr">[15]</ref><ref type="bibr">[16]</ref><ref type="bibr">[17]</ref>.</p><p>The term anticancer peptides (ACPs) refers to small peptides that are selectively toxic toward cancer cells. They represent a promising class of therapeutic agents as synthetic peptide-based drugs and vaccines due to their inherent high penetration and selectivity, as well as ease of modification. It was shown that affinity, stability, and selectivity for the elimination of cancer cells can be improved by designing ACPs <ref type="bibr">[18]</ref>. Amino acid residues influence the anticancer properties by relying on cationic, hydrophobic, and amphiphilic properties with helical structures to drive cell permeability. 
Cationic amino acid residues like lysine, arginine, and histidine can particularly penetrate and disrupt cancer cell membranes to induce cytotoxicity. On the other hand, anionic amino acids like glutamic and aspartic acids provide antiproliferative activity against cancer cells <ref type="bibr">[19]</ref>. Hydrophobic amino acid residues like phenylalanine, tryptophan, and tyrosine also exert cytotoxic activities <ref type="bibr">[20]</ref>. In addition, the cationic and hydrophobic amino acids that form the secondary structure of ACPs play a vital role in the peptide-cancer cell membrane interaction that leads to cancer cell disruption and necrosis <ref type="bibr">[21]</ref>.</p><p>ACPs are small peptides (5-50 amino acids) and are cationic by nature. In general, they possess mostly &#120572;-helix based secondary structures (e.g. LL37, BMAP-27, BMAP-28, and Cecropin A). Some also fold into &#120573;-sheets (e.g. Lactoferrin, Defensins, etc.) or adopt an extended linear structure, like Tritrpticin and Indolicidin <ref type="bibr">[22,</ref><ref type="bibr">23]</ref>. Cancer cells differ from normal cells: they have a larger surface area due to a higher number of microvilli, a negatively charged cell membrane, and higher membrane fluidity. ACPs can also act through mitochondrial membrane lysis (apoptosis), by recruiting other immune cells, by inhibiting the angiogenesis pathway, or by activating essential proteins that ultimately lyse cancer cells <ref type="bibr">[24]</ref><ref type="bibr">[25]</ref><ref type="bibr">[26]</ref><ref type="bibr">[27]</ref><ref type="bibr">[28]</ref>.</p><p>Accurate prediction of ACPs is essential for exploring the mechanisms of action of novel therapeutic ACPs and for their development. Experimental processes to conduct different tasks in biology are time-consuming, labor-intensive, and expensive. 
Hence, there is a demand for fast and accurate computational tools. Many machine learning-based models have been developed to tackle different biological problems. Studies for predicting miRNA-disease associations have also benefited from computational approaches like machine learning by outperforming existing works <ref type="bibr">[29]</ref><ref type="bibr">[30]</ref><ref type="bibr">[31]</ref>. Previously, various sequence-based computational methods were proposed for the prediction of ACPs. Among them, the most notable are AntiCP, iACP, ACPP, iACP-GAEnsC, MLACP, SAP, TargetACP, ACPred, ACP-DL, ACPred-FL, PTPD, Hajisharifi et al.'s method, Li and Wang's method, ACPred-Fuse and PEPred-Suite <ref type="bibr">[29,</ref><ref type="bibr">30,</ref><ref type="bibr">[32]</ref><ref type="bibr">[33]</ref><ref type="bibr">[34]</ref><ref type="bibr">[35]</ref><ref type="bibr">[36]</ref><ref type="bibr">[37]</ref><ref type="bibr">[38]</ref><ref type="bibr">[39]</ref><ref type="bibr">[40]</ref><ref type="bibr">[41]</ref><ref type="bibr">[42]</ref><ref type="bibr">[43]</ref>.</p><p>In one of the early studies, Tyagi et al. <ref type="bibr">[32]</ref> used a Support Vector Machine (SVM) to predict ACPs. Although they reported high specificity, their result was poor in terms of sensitivity. To develop iACP, Chen et al. used rigorous cross-validation while optimizing the g-gap dipeptide components to predict ACPs <ref type="bibr">[33]</ref>, whereas Akbar et al. used genetic algorithm-based ensemble classifiers to tackle this problem <ref type="bibr">[35]</ref>. Later on, Manavalan et al. investigated the performance of SVM compared to the Random Forest (RF) classifier on the Tyagi-B dataset. They showed that RF demonstrates better results than SVM for this problem <ref type="bibr">[29]</ref>. In a similar study, Schaduangrat et al. used SVM and RF together to tackle this problem and achieved promising results <ref type="bibr">[16]</ref>. Akbar et al. 
introduced cACP for ACP prediction, applying features like quasi-sequence order, conjoint triad, and the Geary autocorrelation descriptor, along with traditional ML methods such as SVM, RF, and KNN. They also utilized SVM for developing cACP-2LFS to predict ACPs and later proposed cACP-DeepGram, a Deep Neural Network approach using word embedding features, for accurate ACP classification <ref type="bibr">[44]</ref><ref type="bibr">[45]</ref><ref type="bibr">[46]</ref>.</p><p>More recently, Yi et al. used a long short-term memory (LSTM) model to predict ACPs and demonstrated promising results. In a different study, Wei et al. proposed a new adaptive feature representation strategy that learns the most representative features for different peptide types and used RF as their classification technique to solve this problem <ref type="bibr">[30]</ref>. Although they were able to predict different peptide types simultaneously, their results were not satisfactory compared to other methods. Subsequently, Rao et al. fused class and probabilistic features with handcrafted sequential features and showed that combinations of diverse and heterogeneous features have a more discriminative ability to predict ACPs <ref type="bibr">[43]</ref>. A comprehensive review of computational approaches proposed to predict ACPs is presented in <ref type="bibr">[16]</ref>.</p><p>Despite all these efforts, ACP prediction performance remains limited. The main limitation of existing methods is their restricted ability to classify ACPs accurately. Although the existing works show high accuracy, they lag behind in terms of sensitivity. In addition, ensemble machine learning models combined with heterogeneous sets of features have not been explored adequately to tackle this problem. 
Although sequence-based feature extraction techniques like K-mer, k-Gapped K-mer, and Binary Profile features showed promising results as standalone techniques, no study has combined all three feature groups for the prediction of ACPs.</p><p>To close this gap, we propose a new ensemble of heterogeneously trained classifiers called iACP-RF to accurately predict anticancer peptides. To build this model, we use three effective feature extraction techniques, namely K-mer, Binary profile feature, and k-Gapped K-mer. We utilize two variants of Gapped K-mer, which are 1-Gapped Di-Mono and 1-Gapped Mono-Di. We then feed these three feature sets into three different Random Forest (RF) classifiers and use majority voting to combine them and predict anticancer peptides. iACP-RF, as an ensemble of heterogeneous RF classifiers that are trained using different sets of features, demonstrates better results than the state-of-the-art methods found in the literature for predicting anticancer peptides. The key contributions of this research are as follows:</p><p>&#8226; Proposing a novel architecture to classify anticancer peptides (ACPs).</p><p>&#8226; Outperforming existing models for predicting ACPs.</p><p>&#8226; Proposing a new ensemble of heterogeneous classifiers using Random Forest as the base classifier.</p><p>&#8226; Investigating different sets of attributes for feature extraction to build our proposed machine learning model iACP-RF.</p><p>&#8226; Building our model as a standalone tool, namely iACP-RF, which is publicly available at <ref type="url">https://github.com/MLBC-lab/iACP-RF</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Materials and methods</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Dataset</head><p>Here we use the dataset from the iACP-FSCM study, which its authors have made publicly available. They assembled it from the datasets used in previous studies of anticancer peptides, such as ACP-DL, ACPP, ACPred-FL, AntiCP, and iACP <ref type="bibr">[33,</ref><ref type="bibr">34,</ref><ref type="bibr">38,</ref><ref type="bibr">39,</ref><ref type="bibr">47,</ref><ref type="bibr">48]</ref>. This dataset contains ACPs with a length between 4 and 50 residues. They divided the dataset into two, namely the main and alternate datasets, and divided each further into training and independent test sets. In each case, we train our model on the training set of the main or alternate dataset and test it on the corresponding independent test set.</p><p>The alternate dataset consists of 776 experimentally validated positive peptides and 776 negative peptides in its training set. It also contains 194 positive peptides and 194 negative peptides in the validation set used to verify the model's performance. The main dataset contains ACPs with both anticancer and antimicrobial properties. This dataset consists of 689 positive anticancer peptides and 689 negative peptides. It also contains 172 positive peptides and 172 negative peptides in its validation set. Using these datasets enables us to directly compare our results with the state-of-the-art models found in the literature. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Feature encoding</head><p>Feature extraction is a key step in building any ML model. To accurately distinguish ACPs from non-ACPs and to develop an effective computational tool, it is crucial to extract informative features that carry significant discriminatory information about the peptide sequences. In this paper, we use sequence-based k-mer composition, gapped k-mer composition, and binary profile features to represent the sequences. These features are explained in more detail in the following sections.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.1.">K-mer composition</head><p>K-mers are all the possible consecutive subsequences of length k obtained from a peptide sequence, and the k-mer composition records how often each possible k-mer occurs in the sequence. A sequence of length n contains &#119899; - &#119896; + 1 k-mer positions. To compute the k-mer composition, the frequency of each k-mer in a particular sequence is counted and then divided by the whole sequence length to normalize the result. This can be formulated as: <formula notation="TeX">\mathrm{kmer}(s) = \frac{1}{n} \sum_{i=1}^{n-k+1} \mathrm{match}\left(\mathrm{peptide}[i:i+k-1],\, s\right) \quad (1)</formula></p><p>where n denotes the number of residues in the sequence, s represents a k-mer of length k, and peptide[i:i+k-1] denotes the substring of k residues starting at index i. The function match can be presented as the following formula: <formula notation="TeX">\mathrm{match}(a, b) = \begin{cases} 1, \text{ if } a = b \\ 0, \text{ otherwise} \end{cases}</formula></p><p>For instance, let us consider a peptide sequence consisting of the twenty amino acids A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y.</p><p>For the value of k = 1, we get 20 k-mers {'A','C','D','E','F','G','H','I','K', 'L','M','N','P','Q', 'R','S','T','V','W','Y'}, and using formula (1) we can represent the sequence "ACDEFGHIKLMNPQRSTVWY" as the feature vector [1/20, 1/20, 1/20, 1/20, 1/20, 1/20, 1/20, 1/20, 1/20, 1/20, 1/20, 1/20, 1/20, 1/20, 1/20, 1/20, 1/20, 1/20, 1/20, 1/20].</p></div>
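<div xmlns="http://www.tei-c.org/ns/1.0"><p>The normalized k-mer composition described in this section can be illustrated with a minimal Python sketch. It is not the PyFeat implementation used in the study; the function name kmer_composition, the alphabetical ordering of the k-mers, and the choice to skip windows that contain non-standard residues are assumptions made for illustration only.</p>
<p><![CDATA[
```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_composition(peptide, k):
    """Return the normalized k-mer frequency vector of a peptide.

    Each count is divided by the full sequence length n, as described
    in the text (not by the number of windows, n - k + 1).
    """
    n = len(peptide)
    # Enumerate all 20**k possible k-mers in a fixed alphabetical order.
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    # Slide a window of length k over the n - k + 1 start positions.
    for i in range(n - k + 1):
        window = peptide[i:i + k]
        if window in counts:  # assumption: skip non-standard residues
            counts[window] += 1
    return [counts[kmer] / n for kmer in kmers]


# The worked example from the text: every residue occurs once in 20.
vector = kmer_composition("ACDEFGHIKLMNPQRSTVWY", 1)
```
]]></p>
<p>For k = 1, 2, and 3 the vectors have 20, 400, and 8000 entries, respectively, which matches the MonoMer, DiMer, and TriMer feature groups used later in the paper.</p></div>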
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.2.">K-gapped Mono-Di and Di-Mono composition</head><p>A full k-mer is a contiguous subsequence of k letters; for example, AAGT is a full 4-mer. By contrast, a k-gapped Mono-Di is a three-residue pattern with a gap of k positions after the first amino acid, whereas a k-gapped Di-Mono is a three-residue pattern with a gap of k positions after the second amino acid. These features are calculated as the normalized frequencies of such three-residue patterns with a single gap. The 1-gapped Mono-Di has the form X_XX, where X is any of the amino acids A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y, and the 1-gapped Di-Mono has the form XX_X.</p><p>When k-gap is equal to n, then 20 &#215; (20 &#215; 20) &#215; n features will be generated for protein sequences. For 1-gapped Mono-Di having k-gap = 1, a total of 20 &#215; (20 &#215; 20) &#215; 1 = 8000 features is extracted, and the features are the number of A_AA, A_AC, A_AD, A_AE, A_AF, A_AG, A_AH, A_AI, A_AK, A_AL, A_AM, A_AN, A_AP, A_AQ, A_AR, A_AS, A_AT, A_AV, A_AW, A_AY, A_CA, A_CC, . . . 
, Y_YA, Y_YC, Y_YD, Y_YE, Y_YF, Y_YG, Y_YH, Y_YI, Y_YK, Y_YL, Y_YM, Y_YN, Y_YP, Y_YQ, Y_YR, Y_YS, Y_YT, Y_YV, Y_YW, and Y_YY that are present in the whole peptide sequence.</p><p>For 1-gapped Di-Mono having k-gap = 1, a total of 20 &#215; (20 &#215; 20) &#215; 1 = 8000 features is extracted, and the features are the number of AA_A, AA_C, AA_D, AA_E, AA_F, AA_G, AA_H, AA_I, AA_K, AA_L, AA_M, AA_N, AA_P, AA_Q, AA_R, AA_S, AA_T, AA_V, AA_W, AA_Y, . . . , YY_A, YY_C, YY_D, YY_E, YY_F, YY_G, YY_H, YY_I, YY_K, YY_L, YY_M, YY_N, YY_P, YY_Q, YY_R, YY_S, YY_T, YY_V, YY_W, and YY_Y that are present in the whole peptide sequence.</p><p>We extracted these features using PyFeat, a toolkit implemented in Python for extracting various features from proteins, DNAs, and RNAs <ref type="bibr">[49]</ref>. Table <ref type="table">1</ref> summarizes the features extracted with each feature extraction technique.</p></div>
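<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two gapped patterns can be sketched in a few lines of Python. This is an illustrative sketch rather than the PyFeat code used in the study: the function name gapped_composition, the pattern labels, and the normalization by sequence length are assumptions, since the text only states that the frequencies are normalized.</p>
<p><![CDATA[
```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def gapped_composition(peptide, pattern, gap=1):
    """Normalized counts of gapped three-residue patterns.

    pattern="mono-di" counts X_XX (gap after the first residue);
    pattern="di-mono" counts XX_X (gap after the second residue).
    """
    if pattern not in ("mono-di", "di-mono"):
        raise ValueError("pattern must be 'mono-di' or 'di-mono'")
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]
    counts = dict.fromkeys(keys, 0)
    n = len(peptide)
    if n == 0:
        return [0.0] * len(keys)
    span = 3 + gap  # residues covered by one window, gap included
    for i in range(n - span + 1):
        window = peptide[i:i + span]
        if pattern == "mono-di":
            key = window[0] + window[gap + 1:]
        else:
            key = window[:2] + window[gap + 2]
        if key in counts:  # assumption: skip non-standard residues
            counts[key] += 1
    # Assumption: normalize by sequence length, as for the k-mer features.
    return [counts[key] / n for key in keys]


mono_di = gapped_composition("ACDE", "mono-di")  # one pattern: A_DE
di_mono = gapped_composition("ACDE", "di-mono")  # one pattern: AC_E
```
]]></p>
<p>Each call returns 8000 values for a single gap size, in line with the 20 &#215; (20 &#215; 20) &#215; 1 count given above.</p></div>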
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.3.">Binary profile feature</head><p>The binary profile feature (BPF) is a straightforward yet effective technique for predicting different functionalities from multi-omics data <ref type="bibr">[38,</ref><ref type="bibr">50]</ref>. We generate a binary profile for each peptide by representing each amino acid as a 20-dimensional one-hot vector. For instance, Cysteine (C) can be written as the 20-dimensional one-hot vector [0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]. A sequence of length M can be represented by a vector of dimensions M &#215; 20. As the maximum length of the peptide sequences in our datasets is 50 residues, we get 50 &#215; 20, i.e., 1000 features for each peptide sequence. We padded the peptides that are shorter than 50 amino acids with the dummy amino acid "X". We encoded this dummy amino acid with the [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0] vector. In this way, we make sure that all encoded sequences have equal length and that no redundant information is added.</p></div>
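<div xmlns="http://www.tei-c.org/ns/1.0"><p>The one-hot encoding with zero padding can be sketched as follows. The function name binary_profile and the row-by-row flattening of the 50 &#215; 20 matrix into a single vector are assumptions for illustration; the text specifies only the vector size and the all-zero encoding of the dummy residue.</p>
<p><![CDATA[
```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MAX_LENGTH = 50  # longest peptide in the datasets described in the text


def binary_profile(peptide, max_length=MAX_LENGTH):
    """One-hot encode a peptide into a flat vector of max_length * 20 bits."""
    if len(peptide) > max_length:
        raise ValueError("peptide is longer than max_length")
    # Pad with the dummy residue 'X', which maps to the all-zero vector.
    padded = peptide.ljust(max_length, "X")
    features = []
    for residue in padded:
        one_hot = [0] * len(AMINO_ACIDS)
        position = AMINO_ACIDS.find(residue)
        if position != -1:  # 'X' (or any unknown residue) stays all zeros
            one_hot[position] = 1
        features.extend(one_hot)
    return features


encoded = binary_profile("C")
```
]]></p>
<p>Because padding positions contribute only zeros, the number of ones in the vector equals the number of standard residues in the peptide.</p></div>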
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Classifier</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.1.">Random forest</head><p>Random forest is a meta-classifier that consists of a number of decision tree classifiers (referred to as base learners) trained on various subsamples of the training data, which are generated based on the concept of bagging, to solve regression and classification problems. It was first proposed in <ref type="bibr">[51]</ref>. In bagging, the available training data is randomly sampled with replacement (bootstrap sampling) to generate different subsamples from the original data. Random Forest estimates the outcome by averaging the predictions of its base learners. RF has been widely used in similar studies and obtained promising results <ref type="bibr">[16,</ref><ref type="bibr">[29]</ref><ref type="bibr">[30]</ref><ref type="bibr">[31]</ref>.</p></div>
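<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two ingredients mentioned here, bootstrap sampling and averaging over base learners, can be illustrated with a small standard-library sketch. The helper names bootstrap_sample and average_vote are hypothetical, and the 0.5 decision threshold is an assumption; a real Random Forest additionally trains a decision tree on each bootstrap sample.</p>
<p><![CDATA[
```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows uniformly at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def average_vote(probabilities):
    """Average per-tree positive-class probabilities, then threshold at 0.5."""
    mean = sum(probabilities) / len(probabilities)
    return 1 if mean > 0.5 else 0


rng = random.Random(0)  # fixed seed so the sketch is reproducible
training_rows = list(range(10))  # stand-ins for ten training samples
subsample = bootstrap_sample(training_rows, rng)
decision = average_vote([0.9, 0.6, 0.2])  # three hypothetical trees
```
]]></p>
<p>A bootstrap sample has the same size as the original data but typically contains repeated items and omits others, which is what makes the base learners differ from one another.</p></div>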
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.2.">Ensemble classifier</head><p>Machine learning has been widely used to tackle different problems in biological science including genomics, proteomics, microarrays, systems biology, evolution, and text mining <ref type="bibr">[52]</ref><ref type="bibr">[53]</ref><ref type="bibr">[54]</ref><ref type="bibr">[55]</ref>. Among different ML approaches, ensemble classifiers are considered among the most effective. Ensemble learning trains multiple classifiers and combines their predictions into a single classifier. In general, the ensemble is expected to perform better than any of its members, provided that the members make uncorrelated errors on the target data sets <ref type="bibr">[56]</ref>. Ensemble models were originally designed to reduce variance, that is, the change in a model's performance when it is fitted to a different set of data, which in turn improves performance. An ideal machine learning model has both low variance and low bias, and the two affect one another. Previous studies show that some ensemble techniques reduce both the bias and the variance components of the error, consequently improving prediction performance <ref type="bibr">[57,</ref><ref type="bibr">58]</ref>. Ensemble classifiers have also been shown to be effective in enhancing prediction performance for different problems in bioinformatics <ref type="bibr">[59]</ref><ref type="bibr">[60]</ref><ref type="bibr">[61]</ref><ref type="bibr">[62]</ref><ref type="bibr">[63]</ref><ref type="bibr">[64]</ref>.</p><p>Various types of models have been used to predict ACPs accurately. We investigate the effectiveness of several classifiers using single feature extraction methods (k-mer, Binary Profile Feature, and k-gapped Mono-Di and Di-Mono features, separately) to predict ACPs. 
We observed satisfactory results in some cases, with better sensitivity or specificity, as shown in Tables <ref type="table">2</ref>, <ref type="table">3</ref>, <ref type="table">4</ref>, and <ref type="table">5</ref>. However, the achieved results are biased toward negative samples, which suggests that combining models is the natural next step.</p><p>In this paper, we use an ensemble of three Random Forest (RF) classifiers, which are trained heterogeneously using different feature sets, to predict ACPs. We aggregate the outputs of these classifiers using majority voting. We feed the MonoMer, DiMer, and TriMer features to the first RF model, the Binary profile feature to the second RF, and a combination of K-mers, 1-gapped Mono-Di, and 1-gapped Di-Mono to the third RF as input feature vectors. To build our model, we also studied several popular classification techniques such as Linear Regression (LR), K-Nearest Neighbor (KNN), Na&#239;ve Bayes (NB), Decision Tree (DT), and Support Vector Machines (SVM), which are widely used for similar problems and have attained promising results <ref type="bibr">[65]</ref>. However, among all these classifiers, an ensemble of RF classifiers attained the best results and significantly outperformed the other classifiers. We investigated different numbers of base learners for our RF classifiers and obtained the best results using 150, 250, and 400 base learners for our three RF models. Since our dataset is relatively small, we experimented with unpruned trees, where the nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples, which is two in our experiment. 
The remaining hyperparameters were kept at their default values: criterion = 'gini', max_depth = None, min_samples_split = 2, min_samples_leaf = 1, min_weight_fraction_leaf = 0.0, max_features = 'sqrt', max_leaf_nodes = None, min_impurity_decrease = 0.0, bootstrap = True, oob_score = False, n_jobs = None, random_state = None, verbose = 0, warm_start = False, class_weight = None, ccp_alpha = 0.0, max_samples = None. The general architecture of iACP-RF is shown in Fig. <ref type="figure">1</ref>.</p></div>
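<div xmlns="http://www.tei-c.org/ns/1.0"><p>The majority-voting step that combines the three RF models can be sketched independently of any particular library. The function name majority_vote and the example prediction lists are hypothetical; in practice the three lists would be the label predictions of RF1, RF2, and RF3 on the same test peptides.</p>
<p><![CDATA[
```python
from collections import Counter


def majority_vote(*model_predictions):
    """Combine per-model label lists by majority vote, sample by sample."""
    lengths = {len(p) for p in model_predictions}
    if len(lengths) != 1:
        raise ValueError("all models must predict the same number of samples")
    combined = []
    for votes in zip(*model_predictions):
        # With three voters and two classes a tie cannot occur.
        label, _ = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined


# Hypothetical outputs of RF1 (K-mer), RF2 (BPF) and RF3 (K-mer + gapped).
rf1 = [1, 0, 1, 1]
rf2 = [1, 0, 0, 1]
rf3 = [0, 0, 1, 1]
final = majority_vote(rf1, rf2, rf3)
```
]]></p>
<p>The hyperparameter names listed above match those of scikit-learn's RandomForestClassifier, so the base models could presumably be created with calls such as RandomForestClassifier(n_estimators=150); which tree count belongs to which feature set is not stated in the text.</p></div>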
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Result analysis</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Evaluation metrics</head><p>Evaluating the performance of a prediction method is crucial for assessing its reliability and generality with respect to the experimental dataset. To evaluate the performance of our model and to compare our results with previous studies, we use different measurements including sensitivity (Sn), specificity (Sp), accuracy (Ac), and Matthews correlation coefficient (MCC). These measurements are calculated using the following equations: <formula notation="TeX">Sn = \frac{tp}{tp + fn}, \quad Sp = \frac{tn}{tn + fp}, \quad Ac = \frac{tp + tn}{tp + tn + fp + fn}, \quad MCC = \frac{tp \times tn - fp \times fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}</formula></p><p>where tp represents the total true positive predictions, tn represents the total true negative predictions, fp represents the false positive predictions, and fn represents the false negative predictions.</p></div>
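<div xmlns="http://www.tei-c.org/ns/1.0"><p>The four measures can be computed directly from the confusion-matrix counts using their standard definitions, as in the sketch below. The function name classification_metrics is hypothetical, and the example counts are not reported in the paper: they are one set of counts for a balanced test set of 172 positive and 172 negative peptides that reproduces the main-dataset figures quoted in the abstract.</p>
<p><![CDATA[
```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return sensitivity, specificity, accuracy and MCC from raw counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return sensitivity, specificity, accuracy, mcc


# Illustrative counts for a balanced 172/172 test set (assumed, not reported).
sn, sp, acc, mcc = classification_metrics(tp=130, tn=131, fp=41, fn=42)
```
]]></p>
<p>With these counts the sketch gives a sensitivity of about 75.6%, a specificity of about 76.2%, an accuracy of about 75.9%, and an MCC of about 0.52.</p></div>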
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Comparison with different ML approaches to build our ensemble classifier</head><p>First, to build our ensemble model, we compare different machine learning classifiers including SVM, LR, DT, NB, KNN, and RF. We further study several other classifiers including AdaBoost (AB), Rotation Forest (RoF), and Gradient Boosting Trees (GBT). The results achieved for the best combinations of different classifiers are presented in Table <ref type="table">6</ref>. As shown in the table, we obtain the best results using an ensemble of heterogeneous RF classifiers. Hence, we use this classifier to build iACP-RF.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Investigating the impact of each classifier used to build iACP-RF</head><p>To investigate the performance of our proposed model, it was first crucial to find out how the individual Random Forest models perform using the given feature sets. The separate models showed impressive results using different types of features in terms of accuracy, sensitivity, specificity, and MCC scores, as shown in Table <ref type="table">7</ref>. However, as the table shows, their performance is not consistent across metrics. Also, none of the individual RF models obtains better results than the combination of all three classifiers.</p><p>As shown in Table <ref type="table">7</ref>, RF trained on K-mers shows consistent results. However, it falls short in specificity on the alternate dataset. RF trained on K-gapped Di-Mono achieves underwhelming specificity on both the main and alternate datasets despite achieving outstanding sensitivity. It scored 83.7% and 95.4% on the main and alternate datasets, respectively, an increase of 11.1 and 7.8 percentage points in sensitivity compared to iACP-FSCM, which is considered the state-of-the-art ACP predictor. Similarly, RF trained using K-gapped Mono-Di achieves high true positive rates of 82.0% and 96.4% on the main and alternate datasets, respectively, which are increases of 9.4 and 8.8 percentage points compared to iACP-FSCM. However, it achieves poor specificity scores. The promising results obtained for each individual classifier demonstrate the effectiveness of our proposed features and employed classifiers to tackle this problem. Using K-mer and Gapped K-mer (24,420 features) achieves an increase of 2.4 percentage points on the main dataset in terms of sensitivity compared to iACP-FSCM. 
In addition, using the Binary Profile Feature (1000 features) we obtained an increase of 6.5 percentage points in sensitivity on the main dataset compared to iACP-FSCM.</p><p>By studying different feature sets for separate models, we demonstrate that single independent models are not able to achieve consistent results across all the evaluation metrics used in this study. By using an ensemble of heterogeneously trained Random Forest models, we achieve consistent performance for both the main and alternate datasets, with 75.6% and 89.2% in terms of sensitivity and 76.2% and 96.9% in terms of specificity, respectively. Our MCC scores also outperform iACP-FSCM by 0.04, 0.03, 0.06, and 0.09, respectively, as shown in Table <ref type="table">8</ref>.</p><p>These results show that not only are our proposed features and employed classifiers able to achieve promising results for this problem, but our proposed ensemble of heterogeneously trained classifiers can also enhance the prediction performance with respect to all metrics reported in this study compared to previous studies found in the literature.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Comparison with other state-of-the-art approaches</head><p>We then compare the results achieved by iACP-RF on both the main and alternate datasets with the state-of-the-art methods found in the literature for predicting anticancer peptides. The results of this comparison are presented in Table <ref type="table">9</ref>. As shown in this table, iACP-RF significantly outperforms iACP-FSCM, which is the most recent and accurate ACP predictor, on the alternate dataset. iACP-RF also achieves promising results on the main dataset, especially in terms of sensitivity, compared to iACP-FSCM. iACP-RF demonstrates enhancements of 4.2, 1.6, and 6.7 percentage points in accuracy, sensitivity, and specificity, respectively, and of 0.09 in MCC over iACP-FSCM on the alternate dataset.</p><p>Although, in general, iACP-FSCM demonstrates better results on the main dataset than our proposed model, as shown in Table <ref type="table">9</ref>, iACP-RF achieved 75.6% sensitivity compared to 72.6% for iACP-FSCM. This shows that our model is better at identifying actual ACPs. Considering that the main aim of this study is better performance in predicting positive samples, iACP-RF can be considered the better model for this purpose. Our receiver operating characteristic (ROC) curves in Fig. <ref type="figure">3</ref> show that our model predicts the positive instances well, with an area under the curve (AUC) of 0.85 on the main dataset and 0.96 on the alternate dataset. Fig. <ref type="figure">2</ref> shows the confusion matrix for the testing data. As shown in this figure, iACP-RF can be recognized as a model of good precision.</p><p>Note that although iACP and ACPred achieve better sensitivity than our model, they perform very poorly on negative samples, which in turn results in low specificity and, consequently, very poor MCC. 
This result is mainly related to the datasets used to build those models and to training that was significantly biased toward positive samples. In general, considering the significantly better MCC of our model compared to these two models, we can infer that iACP-RF is more accurate than both for predicting ACPs. Although ACPred-Fuse was reported to exceed the other existing models <ref type="bibr">[43]</ref>, iACP-RF outperforms it in all the metrics on both the alternate and main datasets by a significant margin.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.">Performance of proposed model on external dataset</head><p>To further investigate the effectiveness of our proposed method for predicting ACPs, we tested our model on three external datasets used in recent studies <ref type="bibr">[66]</ref><ref type="bibr">[67]</ref><ref type="bibr">[68]</ref>. Table <ref type="table">10</ref> shows the experimental results using the datasets collected from various studies. Our proposed method shows stable prediction performance in all the datasets including the main and alternate datasets used in this study using 5-fold cross-validation.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Discussion</head><p>In recent years, peptide-based therapy has emerged as a novel and promising strategy for the treatment of cancer. Compared to conventional approaches, it has several advantages, such as high target specificity, low toxicity, good efficacy, ease of synthesis and modification, and being less immunogenic when combined with recombinant antibodies. Because it is challenging to discover ACPs from protein sequence data using experimental methods, the rapid advancement of efficient computational methods is important.</p><p>In this paper, we proposed a novel prediction method named iACP-RF to accurately predict anticancer peptides. Although the performance of individual features showed promising results with single classifiers, the results were imbalanced. However, empirical studies show that using ensemble models reduces both bias and variance to improve prediction accuracy. Thus, we experimented with a combination of several sequence-based features, namely K-mer, BPF, 1-Gapped Di-Mono, and 1-Gapped Mono-Di, and achieved better outcomes compared to existing methods. Among the different feature combinations studied, the best performance was achieved by feeding K-mer, BPF, and K-mer + 1-Gapped Di-Mono + 1-Gapped Mono-Di into the heterogeneously trained base classifiers RF1, RF2, and RF3, respectively, and combining them using majority voting.</p><p>Even so, a critical challenge in the machine learning pipeline when working with a small amount of data is that the model can overfit the training data and be biased toward the dominant class. In this study, we used a balanced dataset consisting of the same number of samples in the positive and negative classes, which helps in getting a balanced prediction for both classes. 
The high and consistent performance of the model on two independent test sets and three external datasets, along with 5-fold cross-validation, suggests that the model generalizes well and avoids overfitting.</p><p>Despite the merits of our proposed method, it has several limitations. First, tuning the parameters to get optimal performance requires more data. Since the dataset we worked with contains a limited number of samples, tuning the parameters optimally was not feasible. Second, finding the optimal number of classifiers to ensemble is critical, and there is no conventional way to find the optimal number of base learners. Finally, the commonly used evaluation metrics for binary classifiers can be too specific. To address these limitations, we aim to build an explainable machine learning pipeline for predicting anticancer peptides in the future.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>Anticancer peptides play a crucial role in the study of anticancer drugs and the treatment of cancer. Targeting cancer cells is essential in the treatment of cancer. However, a lack of "guiding missiles" to target such cells leads to less effective treatment progress. Peptide properties can be used both in molecularly targeted drugs and as "guiding missiles" to inhibit cell proliferation or eradicate cancer cells completely. In this paper, we proposed an ensemble of heterogeneously trained Random Forest models for predicting ACPs using a combination of several sequence-based features, namely K-mer, Binary profile feature, 1-Gapped Di-Mono, and 1-Gapped Mono-Di. The iACP-RF tool outperforms existing methods by a significant margin for all the metrics on the alternate dataset and shows an enhancement of 3 percentage points in sensitivity on the main dataset. On the alternate dataset, we outperform iACP-FSCM in accuracy, sensitivity, specificity, and MCC by margins of 4.2, 1.6, and 6.7 percentage points and 0.09, respectively. Our results demonstrate the effectiveness of iACP-RF in predicting anticancer peptides compared to previously proposed models found in the literature. iACP-RF as a standalone predictor and all its source code are publicly available at: <ref type="url">https://github.com/MLBC-lab/iACP-RF</ref>.</p></div></body>
		</text>
</TEI>
