<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>DHUpredET: A comparative computational approach for identification of dihydrouridine modification sites in RNA sequence</title></titleStmt>
			<publicationStmt>
				<publisher>Elsevier</publisher>
				<date>07/01/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10616039</idno>
					<idno type="doi">10.1016/j.ab.2025.115828</idno>
					<title level='j'>Analytical Biochemistry</title>
<idno type="issn">0003-2697</idno>
<biblScope unit="volume">702</biblScope>
<biblScope unit="issue">C</biblScope>					

					<author>Md Fahim Sultan</author><author>Tasmin Karim</author><author>Md Shazzad Hossain Shaon</author><author>Sayed Mehedi Azim</author><author>Iman Dehzangi</author><author>Mst Shapna Akter</author><author>Sobhy M Ibrahim</author><author>Md Mamun Ali</author><author>Kawsar Ahmed</author><author>Francis M Bui</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Laboratory-based detection of D sites is laborious and expensive. In this study, we developed effective machine learning models employing efficient feature encoding methods to identify D sites. Initially, we explored various state-of-the-art feature encoding approaches and 30 machine learning techniques for each and selected the top eight models based on their independent testing and cross-validation outcomes. Finally, we developed DHUpredET using the extra tree classifier method for predicting DHU sites. The DHUpredET model demonstrated balanced performance across all evaluation criteria, outperforming state-of-the-art models by 8 % and 14 % in terms of accuracy and sensitivity, respectively, on an independent test set. Further analysis revealed that the model achieved higher accuracy with position-specific two-nucleotide (PS2) features, leading us to conclude that PS2 features are best suited for the DHUpredET model. Therefore, our proposed model emerges as the most favorable choice for predicting D sites. In addition, we conducted an in-depth analysis of local features and identified a particularly significant attribute, PS2_299, with a feature score of 0.035. This tool holds immense promise as an advantageous instrument for accelerating the discovery of D modification sites, which contributes to many targeted therapeutics and to understanding RNA structure.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>RNA modification refers to a series of chemical changes made to initial ribonucleic acid (RNA) transcripts, which result in the synthesis of mature RNA molecules that are essential for various biological activities <ref type="bibr">[1]</ref>. At present, more than 300 types of RNA modifications have been discovered <ref type="bibr">[1]</ref><ref type="bibr">[2]</ref><ref type="bibr">[3]</ref><ref type="bibr">[4]</ref><ref type="bibr">[5]</ref>. Furthermore, the landscape of RNA modification is strikingly similar across all three domains of life: eukaryotes, bacteria, and archaea. Despite their great evolutionary distances, these domains share numerous modifications and modifying enzymes <ref type="bibr">[1,</ref><ref type="bibr">6,</ref><ref type="bibr">7]</ref>. Dihydrouridine (DHU), denoted by the letter "D" in RNA sequences and having the chemical formula C₉H₁₄N₂O₆, is a significant modification identified in numerous forms of RNA, including transfer RNA (tRNA), messenger RNA (mRNA), and small nucleolar RNA (snoRNA), in different organisms. This modification has sparked enthusiasm because of its prevalent presence and probable functional importance in RNA molecules <ref type="bibr">[8]</ref><ref type="bibr">[9]</ref><ref type="bibr">[10]</ref>. Dihydrouridine synthesis enzymes catalyze the reduction of uridine's (U) C5-C6 double bond, leading to the formation of dihydrouridine <ref type="bibr">[6]</ref>. This modification can be observed at the variable position of the anticodon helix in tRNA molecules and poses major health consequences, including the development of lung cancer, Alzheimer's disease, and Huntington's disease <ref type="bibr">[11]</ref><ref type="bibr">[12]</ref><ref type="bibr">[13]</ref>. Therefore, the identification of D sites is crucial for understanding RNA molecules' structure, function, and regulatory activities, as well as their prospective uses in diagnostics and therapies <ref type="bibr">[14]</ref>.</p><p>Fig. <ref type="figure">2</ref>. Workflow of the current study: dataset collection from previous studies; feature extraction with four kinds of descriptors covering nine feature encoding methods; application of the various models and the selection process; overall comparison of the models; and selection of the optimal feature and best-fit models for the study.</p><p>Experimental methods for recognizing D sites, such as mass spectrometry or antibody-based assays, can be complicated, time-consuming, and expensive. Computational approaches, by contrast, provide a high-throughput alternative, enabling experts to evaluate massive RNA sequence databases rapidly and effectively. Advanced computational models, especially those using machine learning, can achieve high levels of accuracy in predicting D sites. They can detect subtle patterns and features in RNA sequences that might be missed by traditional methods.</p><p>During the past few years, several machine learning-based methods have been proposed to predict D sites. Fig. <ref type="figure">1</ref> presents a fishbone diagram of the existing machine learning methods proposed to predict D sites. In 2019, Xu et al. proposed the iRNAD framework based on a support vector machine (SVM) model with various features and achieved promising results <ref type="bibr">[14]</ref>. However, the authors used jackknife cross-validation (CV) despite its shortcomings. One such drawback is that, especially with small sample sizes, jackknife CV tends to produce biased results. 
K-fold cross-validation might have been a more suitable method given the authors' small dataset. Later, Dou et al. developed iRNAD_XGBOOST with extreme gradient boosting (XGB) on different feature selection approaches <ref type="bibr">[15]</ref>. However, employing a versatile feature extraction approach is more advantageous than relying on a small set of feature extraction methods. In 2022, Zhu et al. introduced a Random Forest (RF)-based machine learning approach for D site identification <ref type="bibr">[16]</ref>. The authors used an oversampling method, which can introduce bias because it artificially inflates the minority class, potentially leading to overfitting and poor generalization to unseen data. At the same time, Suleman et al. proposed the DHU-pred method with an RF model <ref type="bibr">[17]</ref>. The small dimensions of hidden layers might lead to underfitting, particularly in sequences with intricate patterns or those that need to preserve previous information. Later, the iDHU-Ensem method was proposed, where Suleman et al. included an artificial neural network (ANN), K-nearest neighbor (KNN), SVM, and decision tree (DT), and obtained a better result <ref type="bibr">[18]</ref>. At the same time, Yu et al. developed the D-pred model with a local self-attention layer and a convolutional neural network (CNN) <ref type="bibr">[9]</ref>. Self-attention techniques, particularly local self-attention, might prove computationally expensive, especially when dealing with complex sequences. Most recently, Roshid et al. developed a stacking-based machine learning model called Stack-DHUpred, where the authors used probabilistic values from the baseline models and obtained a better outcome compared to their previous studies <ref type="bibr">[19]</ref>. However, there is still room for improvement in performance.</p><p>In this study, we executed an in-depth analysis employing various machine learning techniques. 
We also used nine feature encoding methods, including state-of-the-art methods from the natural language processing field, biologically feasible physical and chemical properties, nucleic acid-based compositional features, and residue composition data. Through this extensive analysis, we sought to uncover the most informative features and optimal machine learning algorithms for predicting D sites in RNA sequences based on various types of evaluation metrics. As a result, we propose DHUpredET, a machine learning-based method that significantly outperforms previous studies found in the literature. Notably, it excels at leveraging position-specific two-nucleotide (PS2) features with appropriate execution. DHUpredET identifies the positive class with greater than 85 % accuracy and the negative class with more than 82 % accuracy, producing a balanced outcome.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Materials and methods</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Overall description of the methodology</head><p>In this study, an open-source dataset was employed to train and test the proposed method, DHUpredET. Nine feature extraction methods from four descriptor groups were employed to extract relevant information and features as input for the machine learning algorithms. To find the best-fit machine learning models, thirty classification algorithms were applied, and eight models were ultimately chosen based on their performance for further analysis. The overall workflow of the study is presented in Fig. <ref type="figure">2</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Dataset descriptions</head><p>In this study, we used the dataset that was introduced in Ref. <ref type="bibr">[19]</ref>. This dataset contains a total of 805 positive samples from different species, such as Homo sapiens, Mus musculus, Saccharomyces cerevisiae, and Escherichia coli <ref type="bibr">[3,</ref><ref type="bibr">10,</ref><ref type="bibr">20]</ref>. Thereafter, the authors used the cluster database at high identity with tolerance (CD-HIT) method <ref type="bibr">[21]</ref> with a 90 % identity threshold to reduce redundancy among the positive samples. As a result, the number of positive samples was reduced to 305. They also collected the same number of negative samples (305) from different studies <ref type="bibr">[22]</ref><ref type="bibr">[23]</ref><ref type="bibr">[24]</ref>. Finally, they used 244 positive and 244 negative samples as the training set, while an additional set of 61 positive and 61 negative samples was reserved as the testing set. Table <ref type="table">1</ref> summarizes the datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Feature extraction</head><p>Feature extraction is a vital step for the performance of machine learning models because it transforms raw data into a more appropriate representation that allows for successful learning and prediction. In this study, we used various feature extraction techniques to determine the most effective attributes for D site prediction. These techniques included NLP-based methods such as bidirectional encoder representations from transformers (BERT) <ref type="bibr">[25]</ref><ref type="bibr">[26]</ref><ref type="bibr">[27]</ref><ref type="bibr">[28]</ref>, FastText word embedding (FastText) <ref type="bibr">[29,</ref><ref type="bibr">30]</ref>, latent semantic analysis (LSA) <ref type="bibr">[31]</ref><ref type="bibr">[32]</ref><ref type="bibr">[33]</ref>, and document-to-vector (Doc2Vec) <ref type="bibr">[34]</ref><ref type="bibr">[35]</ref><ref type="bibr">[36]</ref>. Furthermore, physicochemical characteristics such as electron-ion interaction pseudopotentials of trinucleotides (PseEIIP) <ref type="bibr">[37,</ref><ref type="bibr">38]</ref> and dinucleotide physicochemical properties (DPCP) <ref type="bibr">[39]</ref> were also deployed. The study further utilized residue composition-based RNA binary (Binary) <ref type="bibr">[40,</ref><ref type="bibr">41]</ref> and PS2 <ref type="bibr">[42,</ref><ref type="bibr">43]</ref> features, in addition to nucleic acid composition-based features including Z-curve phase-specific mononucleotides (Z-curve 9-bit) <ref type="bibr">[44]</ref>. These strategies are intended to improve the accuracy and efficacy of D site detection. In the following subsections, all these methods are described in brief.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.1.">NLP-based feature extractions</head><p>BERT: BERT is well known in the field of machine learning as an encoder-based transformer model, and its pre-training approach has received considerable attention and recognition. It encodes contextual information by analyzing words bidirectionally, meaning that it considers both left and right contexts. It tokenizes input text, can be fine-tuned for specific tasks, and supplies contextualized word representations, allowing it to excel at various text-based tasks <ref type="bibr">[25]</ref><ref type="bibr">[26]</ref><ref type="bibr">[27]</ref><ref type="bibr">[28]</ref>. The formula of the BERT feature can be expressed as: </p><p>In equations ( <ref type="formula">1</ref>) and ( <ref type="formula">2</ref>), Ma is the vector of the masked word in the sentence, W is the vector associated with a word present in the input sentence, A is the parameter of the approach, O_m signifies the output embedding for the masked word m, W·O_m is the dot product computed between the word vector and the output embedding vector, and P(m | W, A) refers to the probability of the word m, conditioned on both W and A.</p></div>
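The masked-word probability P(m | W, A) described around equations (1) and (2) amounts to a softmax over the dot products W·O_m. A minimal NumPy sketch, with a toy vocabulary and a hypothetical output embedding (not the paper's trained model):

```python
import numpy as np

def masked_word_probability(word_vectors, output_embedding):
    """P(m | W, A) = softmax(W . O_m): the probability of each candidate
    word m for the masked slot, from the dot product of its vector with
    the output embedding of the mask position."""
    logits = word_vectors @ output_embedding   # one dot product per word
    logits -= logits.max()                     # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

# Toy vocabulary of 4 words, embedding size 3 (illustrative values only).
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
O_m = np.array([2.0, 0.5, -1.0])               # output embedding for the mask
p = masked_word_probability(W, O_m)
print(p)
```

The softmax normalization guarantees that the scores over the vocabulary form a proper probability distribution.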
<div xmlns="http://www.tei-c.org/ns/1.0"><head>FastText:</head><p>FastText is an NLP model developed by Facebook AI Research that excels at word embedding and text classification tasks. FastText describes language by decomposing symbols into n-grams. For instance, the pattern "XYZM" can be split into smaller n-grams such as "XY," "YZ," "ZM," "XYZ," and "YZM". The n-grams preserve sub-word information, which allows the model to encode words that share similar structure consistently <ref type="bibr">[29,</ref><ref type="bibr">30]</ref>. The equation can be stated as:</p><p>In equation ( <ref type="formula">3</ref>), F denotes the loss function. It averages the negative logarithm of the predicted probability distribution P over a sequence of n-gram features denoted as x_n, each of which is represented by a word embedding a_n. The look-up matrix of word embeddings is denoted by L, and T represents the linear output transformation. The softmax function is applied to the output. LSA: LSA is one of the most popular NLP-based embedding systems in the artificial intelligence field. It works by initially representing a collection of text documents as a matrix, with rows representing distinct terms (words) and columns representing documents. Then, it performs singular value decomposition (SVD) on the matrix. LSA reduces the matrix's dimensionality by retaining just the top k singular values and accompanying singular vectors, resulting in a lower-dimensional semantic space. Each term and document is represented as a vector in this semantic space, which captures latent semantic associations <ref type="bibr">[31]</ref><ref type="bibr">[32]</ref><ref type="bibr">[33]</ref>. The mathematical form of the LSA feature is:</p><p>where U contains the left singular vectors, representing the relationships between terms in the reduced-dimensional semantic space. 
V^T is the transpose of matrix V, representing the associations among documents in the reduced-dimensional semantic space. Doc2Vec: Doc2Vec is a sophisticated tool for creating document embeddings that capture the semantic content of documents, rendering it suitable for various text analysis purposes. It was introduced by researchers at Google and is based on the distributed memory (DM) and distributed bag of words (DBOW) architectures. It operates by training a neural network to capture the wider context of a document, whereas word2vec analyzes only the surrounding words of a particular word. Each document is allocated a distinctive vector in a high-dimensional space that captures its semantic meaning <ref type="bibr">[34]</ref><ref type="bibr">[35]</ref><ref type="bibr">[36]</ref>. The formula can be stated as:</p><p>Here, D is the document vector, N is the number of words, a represents the vectors of each word in the document, V refers to the resulting vector, and the document representation is obtained by a pooling operation over the word vectors.</p></div>
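The LSA procedure described above (SVD of a term-document matrix, then rank-k truncation) can be sketched with NumPy; the tiny term-document matrix here is illustrative, not the paper's data:

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
X = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 1.0],
              [1.0, 0.0, 2.0]])

# Full SVD: X = U S V^T.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# LSA truncation: keep only the top-k singular values and vectors.
k = 2
X_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # rank-k semantic approximation

# Document vectors in the reduced semantic space: columns of S_k V_k^T.
doc_vectors = np.diag(S[:k]) @ Vt[:k, :]
print(doc_vectors.shape)                       # k dimensions per document
```

Each document column is thereby compressed to k latent-semantic coordinates, which serve as its feature vector.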
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.2.">Physicochemical-based feature extractions</head><p>DPCP: DPCP captures the structural and chemical properties of neighboring nucleotide pairs inside RNA sequences. This approach sheds light on RNA's stability, flexibility, and functional qualities by considering different aspects, including base stacking, hydrogen bonding, and nucleotide interactions <ref type="bibr">[39]</ref>. The formula is referred to as:</p><p>where D_ij is the combined property of the nucleotides, P_i represents the physicochemical value of the ith nucleotide base in the dinucleotide pair, Q_j is the property value of the jth base, and f is a function that combines the properties of the two bases into the overall property. PseEIIP: This feature introduces modifications to the original EIIP values to improve their efficacy in identifying nucleotide sequences. The actual formula for determining PseEIIP values varies based on the individual alterations and characteristics used in the computation. PseEIIP values are often calculated by combining the EIIP values of the bases (A = 0.1260, T/U = 0.1335, G = 0.0806, C = 0.1340). These values are based on each base's electron-ion interaction potential. EIIP values are useful for a variety of statistical inquiries, such as sequence alignment, motif finding, and identifying functional components in nucleic acid sequences <ref type="bibr">[37,</ref><ref type="bibr">38]</ref>. The formula is represented as:</p><p>In equation ( <ref type="formula">7</ref>), n is the total number of nucleotides, E_i is the electron-ion value of the ith position, and N_i is the frequency of the ith nucleotide.</p><p>In equation ( <ref type="formula">8</ref>), I_i represents the weight assigned to each nucleotide base or feature in the sequence, E_i is the electron-ion value of the ith position, and n is the total number of nucleotides.</p></div>
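A common formulation of PseEIIP builds a 64-dimensional vector in which each trinucleotide contributes the sum of its three bases' EIIP values weighted by its frequency in the sequence. A minimal sketch under that assumption, using the EIIP values quoted in the text (the function name and example sequence are illustrative):

```python
from itertools import product

# EIIP values quoted in the text for each RNA base.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "U": 0.1335}

def pse_eiip(seq):
    """64-dim PseEIIP vector: for each trinucleotide t, the sum of its
    three bases' EIIP values times t's frequency in the sequence."""
    trinucs = ["".join(p) for p in product("ACGU", repeat=3)]
    counts = {t: 0 for t in trinucs}
    for i in range(len(seq) - 2):
        counts[seq[i:i + 3]] += 1
    total = max(len(seq) - 2, 1)          # number of overlapping trinucleotides
    return [sum(EIIP[b] for b in t) * counts[t] / total for t in trinucs]

vec = pse_eiip("AUGCUUAGGCA")
print(len(vec))   # 64 trinucleotide features
```

For example, `pse_eiip("AAAA")` has frequency 1 for AAA, so its first component is 3 × 0.1260 = 0.378.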
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.3.">Residue composition-based feature extractions</head><p>Binary Profile: For this feature, each nucleotide in the sequence is assigned a 4-dimensional one-hot array with just one non-zero component. This mapping generates a matrix where each row represents a sequence position and each column represents a specific nucleotide <ref type="bibr">[40,</ref><ref type="bibr">41]</ref>.</p><p>PS2: The PS2 matrix is commonly 16 &#215; 16, with each row and column representing one of the available dinucleotides (16 in total: AA, AC, AG, AU, CA, CC, CG, CU, GA, GC, GG, GU, UA, UC, UG, and UU). The value in each cell of the matrix represents the frequency or likelihood of the associated dinucleotide pair occurring at that location in the RNA sequence <ref type="bibr">[42,</ref><ref type="bibr">43]</ref>. The equation can be represented as:</p><p>where M_ij represents the frequency, N-1 is the total number of dinucleotide positions, and count(d_ij, i) is the number of occurrences of the dinucleotide d_ij at position i.</p></div>
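A common way to realize a position-specific two-nucleotide encoding is to give each of the N-1 overlapping dinucleotides a 16-bit one-hot block; the sketch below assumes that formulation (the function name and example sequence are illustrative, not the paper's exact implementation):

```python
DINUCS = [a + b for a in "ACGU" for b in "ACGU"]   # the 16 dinucleotides

def ps2_encode(seq):
    """PS2-style encoding: each of the N-1 overlapping dinucleotides
    becomes a 16-bit one-hot block, giving 16 * (N - 1) features."""
    features = []
    for i in range(len(seq) - 1):
        block = [0] * 16
        block[DINUCS.index(seq[i:i + 2])] = 1      # mark the observed pair
        features.extend(block)
    return features

vec = ps2_encode("GAUC")
print(len(vec))   # 3 dinucleotide positions x 16 bits = 48 features
```

Because exactly one bit per block is set, the vector records which dinucleotide occurs at which position, which is the positional specificity the text describes.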
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.4.">Nucleic acid composition-based feature extractions</head><p>Z-curve 9-bit: Each nucleotide base (A = 100000000, C = 010000000, G = 001000000, and U = 000100000) in the Z-curve 9-bit representation is translated to a 9-bit binary vector that encodes features like local stacking energy, hydrogen bonding, and base-pairing preferences. These features are determined using the positional connections of neighboring nucleotides <ref type="bibr">[44]</ref>.</p></div>
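The per-base 9-bit codes quoted above can be concatenated into a sequence-level feature vector; a minimal sketch (the helper name and example sequence are illustrative):

```python
# 9-bit codes quoted in the text for each base.
ZCURVE_9BIT = {
    "A": "100000000",
    "C": "010000000",
    "G": "001000000",
    "U": "000100000",
}

def zcurve_9bit(seq):
    """Concatenate the 9-bit code of every base: 9 * N binary features."""
    return [int(bit) for base in seq for bit in ZCURVE_9BIT[base]]

vec = zcurve_9bit("AUGC")
print(len(vec))   # 4 bases x 9 bits = 36 features
```

Each base contributes exactly one set bit, so the total number of ones equals the sequence length.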
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.">Machine learning models construction</head><p>In this study, different machine learning models were used to predict D sites. After extensive testing with different hyperparameters, eight models stood out for their outstanding performance. These include AdaBoost (ADB), label propagation (LP), quadratic discriminant analysis (QDA), extreme gradient boosting (XGB), Decision Tree (DT), Random Forest (RF), Decision Jungle (DJ), and an extra tree classifier-based approach (ET), referred to as DHUpredET when paired with the PS2 feature extraction procedure, as it performed best <ref type="bibr">[45]</ref><ref type="bibr">[46]</ref><ref type="bibr">[47]</ref><ref type="bibr">[48]</ref><ref type="bibr">[49]</ref><ref type="bibr">[50]</ref><ref type="bibr">[51]</ref><ref type="bibr">[52]</ref><ref type="bibr">[53]</ref><ref type="bibr">[54]</ref>. These models were chosen based on their ability to reliably predict D sites, assessed through an exhaustive analysis and comparison of their performance indicators. All the hyperparameters used for these classifiers are listed in Table <ref type="table">2</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.1.">DHUpredET model development procedure</head><p>This study used a wide range of algorithms, including ten classical approaches, ten voting-based meta-learning methods, and ten stack-based meta-learning strategies, which are listed in the supplementary materials. We chose eight models for their reasonable scores across numerous assessment measures. It was observed that while other models exhibited some improvement in accuracy, they displayed a tendency towards bias in positive class identification. As a result, we selected the extra tree classifier, which outperformed all other classifiers using PS2 features.</p><p>The extra tree classifier is a decision tree-based ensemble learning approach that integrates many trees to increase prediction accuracy and resilience. In this design, the model is configured to employ 100 decision trees (n_estimators = 100), which ensures a varied range of classifiers and reduces overfitting. The random_state option is set to 10, guaranteeing that the findings are reproducible over several runs. With max_depth = None, the decision trees can grow until all leaves are pure or a node contains fewer than four samples (min_samples_split = 4), with at least one sample required at each leaf (min_samples_leaf = 1). The max_features option is set to 'auto', which for classifiers corresponds to evaluating a random subset of the features when determining the optimal split at each node. Bootstrap sampling is deactivated (bootstrap = False), meaning that the full dataset is used to build each tree. The class_weight argument is set to None, which means that all classes are weighted equally. The criterion option is set to 'gini', which specifies the criterion for splitting nodes. Finally, setting n_jobs = None means that the model runs on a single processor. Fig. <ref type="figure">3</ref> demonstrates the overall procedure of the DHUpredET model.</p></div>
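The configuration described above can be sketched with scikit-learn. The synthetic dataset and split sizes are stand-ins (the real model trains on PS2 feature vectors), and max_features='sqrt' is substituted for the reported 'auto', which recent scikit-learn releases removed (for classifiers, 'auto' historically meant 'sqrt'):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# Hyperparameters as described in the text.
clf = ExtraTreesClassifier(
    n_estimators=100,      # 100 extremely randomized trees
    criterion="gini",
    max_depth=None,        # grow until leaves are pure or too small to split
    min_samples_split=4,
    min_samples_leaf=1,
    max_features="sqrt",   # stand-in for the deprecated 'auto'
    bootstrap=False,       # each tree sees the full training set
    class_weight=None,     # all classes weighted equally
    random_state=10,       # reproducible across runs
    n_jobs=None,           # single process
)

# Stand-in data with the study's overall sample count (488 training + split).
X, y = make_classification(n_samples=488, n_features=64, random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=10)
clf.fit(X_tr, y_tr)
print(round(clf.score(X_te, y_te), 3))
```

Disabling bootstrap while randomizing split thresholds is what distinguishes extra trees from a random forest: variance is injected through the split selection rather than through resampling.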
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.5.">Evaluation metrics</head><p>The study used various evaluation metrics to properly evaluate the models' performance. The metrics used were the Matthews correlation coefficient (MCC), sensitivity (Sen), specificity (Spe), precision (Pre), recall (Rec), F1 score (F1s), kappa score (Kpp), and accuracy (Acc) <ref type="bibr">[55]</ref><ref type="bibr">[56]</ref><ref type="bibr">[57]</ref><ref type="bibr">[58]</ref><ref type="bibr">[59]</ref><ref type="bibr">[60]</ref>. These metrics are formulated in terms of the confusion-matrix counts, where TP represents true positive, TN represents true negative, FP represents false positive, and FN represents false negative.</p></div>
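As a minimal sketch, the standard definitions of these metrics can be computed directly from the confusion-matrix counts; the example counts below are illustrative, not the paper's results:

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Standard classification metrics from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)            # sensitivity (= recall)
    spe = tn / (tn + fp)            # specificity
    pre = tp / (tp + fp)            # precision
    f1 = 2 * pre * sen / (pre + sen)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"Acc": acc, "Sen": sen, "Spe": spe,
            "Pre": pre, "F1s": f1, "MCC": mcc}

# Illustrative counts for a 122-sample independent test set.
m = confusion_metrics(tp=52, tn=50, fp=11, fn=9)
print(round(m["Acc"], 3))
```

MCC is the most informative single number here because it stays balanced even when the two classes are of unequal size.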
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Experimental results</head><p>To identify the best model for D site prediction, we conducted a thorough analysis, evaluating a wide range of models. Our goal was to identify the best-fit model and the features most suitable for the prediction of D sites. Table <ref type="table">3</ref> provides a complete performance evaluation of the selected models. The corresponding results for 5-fold cross-validation are provided in the supplementary material in Table_S1.</p><p>According to Table <ref type="table">3</ref>, most of the applied models delivered promising results with the various extracted features. Among them, ET using PS2 achieved the most superior results. Therefore, ET as a classifier and PS2 as the feature encoding method were used to build the DHUpredET model. DHUpredET obtains an accuracy of 85.3 % with excellent performance in the other evaluation criteria. DHUpredET enhances the prediction accuracy by 10 % compared to the physicochemical-based encoding methods. Among the NLP-based embeddings, the ADB and RF models acquired the best results using LSA and FastText features, respectively. Overall evaluations indicate that PS2 features are the most optimal features. Fig. <ref type="figure">4</ref> shows the results of our comparison study using both the independent test method and the 5-fold cross-validation procedure. Our analysis focuses on comparing F1 scores and accuracy metrics. As shown in this figure, the results achieved for the independent test set and 5-fold cross-validation are consistent, which demonstrates the generality and robustness of our proposed method. DHUpredET consistently achieves excellent F1s and Acc across both assessment methodologies. This implies that DHUpredET has been effective at reducing both false positives and false negatives, indicating its effectiveness for its intended application. 
By obtaining excellent scores in both measures across evaluation techniques (independent test and cross-validation), DHUpredET demonstrates stability and robustness in performance evaluation.</p><p>The receiver operating characteristic (ROC) curve and precision-recall (PR) curve are two commonly used evaluation tools in the data science field <ref type="bibr">[61]</ref><ref type="bibr">[62]</ref><ref type="bibr">[63]</ref>. The ROC curve illustrates the relationship between Sen and Spe, while the PR curve provides information about a model's efficiency, particularly when dealing with unbalanced datasets in which the number of negative cases significantly exceeds the positive ones. Fig. <ref type="figure">5</ref> shows the ROC and PR curves for our models using the independent test. Notably, the DHUpredET model performs well in all the subplots (A-I). In subplot A (PS2, containing ROC and PR curves), DHUpredET receives noteworthy scores of 91.1 % for the ROC curve and 91.5 % for the PR curve. These results highlight that the model appropriately identifies both positive and negative categories, excelling at detecting positive examples while remaining specific for negative ones.</p><p>On the other hand, the PR curves demonstrate the model's outstanding performance and its usefulness in the presence of imbalanced datasets. As shown in subplot D (Binary, containing ROC and PR curves), our model performed better with 92.3 % in the PR curve. Based on the findings of the present study, DHUpredET reliably performed the best across all subplots (A-I) compared to the other models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Feature analysis</head><p>As our study found that PS2 features provide the most informative attributes, in Fig. <ref type="figure">6</ref> we analyzed the contribution of each feature to our model's decision-making under PS2 feature extraction. Based on the results, we can conclude that the PS2_299 feature is the most important in our model, as indicated by its importance score of 0.035. The PS2_471 feature follows closely, with a notable significance score of 0.030, suggesting that it contributes considerably to the model's predictive capabilities. Furthermore, the PS2_316 feature appears as a significant contributor, albeit with a lower relevance score of 0.025. These feature-importance ratings provide valuable insight into our model's explanatory capacity. It is apparent that the three features PS2_299, PS2_471, and PS2_316 play vital roles in the model's decision-making process. Their prominence emphasizes their importance in capturing the underlying patterns involved in D site identification.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Comparison of DHUpredET with the previous models</head><p>In Table <ref type="table">4</ref>, we compare DHUpredET to the state-of-the-art methods, evaluating their performance across key metrics including Acc, Sen, AUC, MCC, PR, and Spe.</p><p>As shown in Table <ref type="table">4</ref>, DHUpredET achieves an accuracy of 85.3 %, MCC of 70.53 %, Sen of 56.9 %, AUC of 91.1 %, and PR of 91.5 %. These results greatly outperform previous methodologies. The DHUpredET model improved accuracy by more than 5 % compared to all other models on the independent test dataset. In addition to the large increase in accuracy, our model performed well across the other evaluation metrics. The considerable improvement in MCC of more than 0.2 and in Sen of more than 10 % demonstrates that our model efficiently identifies positive classes while maintaining a balanced Spe for negative classes. Additionally, the AUC score increased by more than 3 %, while the PR score improved by 1 %. These results indicate that our model outperforms state-of-the-art approaches and is well suited to properly discriminating between positive and negative classes in DHU prediction.</p><p>The comparison diagram in Fig. <ref type="figure">7</ref> visually illustrates the performance gap between the DHUpredET model and the other methods. DHUpredET occupies a larger area, indicating its superior performance across multiple evaluation metrics compared to the other methods.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Discussion</head><p>Dysregulation of RNA modifications has been associated with a variety of diseases, including malignancy and neurological disorders. Identifying DHU modification sites may provide insight into disease causes and possible treatment targets. Experimental approaches for detecting RNA modifications, such as DHU, can be time-consuming and costly, with varying degrees of sensitivity and specificity. Abnormal DHU modifications have been associated with illnesses such as cancer, neurological problems, and metabolic diseases. Understanding RNA modifications could help in the development of tailored therapies such as RNA-based medicines and modified tRNA therapy. Detecting DHU modification in specific RNA sequences can help with RNA-targeted therapeutics and precision diagnostics. PS2 captures the localized sequence context surrounding the modification point. This feature analyzes sequence variations to predict RNA modifications, using the two neighboring nucleotides at each position to encode sequence patterns. PS2 characteristics aid in distinguishing modified from unmodified regions in RNA sequences. Its sequence specificity makes it beneficial for assessing unknown RNA sequences with our proposed model. Identifying RNA modifications also facilitates gene editing.</p><p>Computational techniques provide an alternative way of predicting modification sites with adequate accuracy, improving the prioritization of experimental validation efforts. These approaches are important because they allow for the systematic analysis of massive datasets that would be hard to manage manually and provide a scalable means of understanding complicated biological processes by allowing for the full identification of D sites across a wide range of RNA molecules. 
The computer-based approach is especially useful for preliminary high-throughput screening. To conduct an exhaustive comparative analysis, we utilized a variety of feature extraction approaches and analyzed 30 algorithms for each feature set. After a comprehensive analysis, we evaluated the final models based on their performance and selected the best features and methods. We explored integrating several feature sets, such as physicochemical features only, compositional features only, and all features paired together, and observed that PS2 features outperformed all of them. As a result, we shifted our attention to using specific features, favoring both the best-fit models and features. The aim was to develop a cost-effective model without jeopardizing functionality. After rigorous testing and hyperparameter optimization, we determined that the ET model was the best choice. Using PS2 features, which emphasize local interdependence between nearby nucleotides, enables the model to improve its prediction effectiveness by exploiting complex nucleotide interactions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>Dihydrouridine is an RNA modification that influences both the structural and functional patterns of RNA, playing an essential role in protein synthesis and cellular adaptation processes. Here we developed a new machine learning technique, called DHUpredET, for predicting DHU sites using an extra tree classifier and PS2 feature encoding. Our proposed framework achieved 85.3 % prediction accuracy, significantly outperforming the state-of-the-art models found in the literature by over 5 %. We also comprehensively studied the features that are most strongly connected to the identification of D sites. The DHUpredET model improves our ability to find dihydrouridine sites with excellent accuracy and enhances our understanding of the underlying processes that control RNA modifications from a computational perspective. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>CRediT authorship contribution statement</head></div>
		</body>
		</text>
</TEI>
