<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Persistent Sheaf Laplacian Analysis of Protein Flexibility</title></titleStmt>
			<publicationStmt>
				<publisher>The Journal of Physical Chemistry B</publisher>
				<date>05/01/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10616129</idno>
					<idno type="doi">10.1021/acs.jpcb.5c01287</idno>
					<title level='j'>The Journal of Physical Chemistry B</title>
<idno type="ISSN">1520-6106</idno>
<biblScope unit="volume">129</biblScope>
<biblScope unit="issue">17</biblScope>					

					<author>Nicole Hayes</author><author>Xiaoqi Wei</author><author>Hongsong Feng</author><author>Ekaterina Merkurjev</author><author>Guo-Wei Wei</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Protein flexibility, measured by the B-factor or Debye-Waller factor, is essential for protein functions such as structural support, enzyme activity, cellular communication, and molecular transport. Theoretical analysis and prediction of protein flexibility are crucial for protein design, engineering, and drug discovery. In this work, we introduce the persistent sheaf Laplacian (PSL), an effective tool in topological data analysis, to model and analyze protein flexibility. By representing the local topology and geometry of protein atoms through the multiscale harmonic and nonharmonic spectra of PSLs, the proposed model effectively captures protein flexibility and provides accurate, robust predictions of protein B-factors. Our PSL model demonstrates an increase in accuracy of 32% compared to the classical Gaussian network model (GNM) in predicting B-factors for a data set of 364 proteins. Additionally, we construct a blind machine learning prediction method utilizing global and local protein features. Extensive computations and comparisons validate the effectiveness of the proposed PSL model for B-factor predictions.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>Proteins are pivotal to life, playing an essential role in many biological processes, including signaling, gene regulation, transcription, translation, and interaction with other proteins or substrate molecules. <ref type="bibr">1</ref> They are composed of amino acids, which form polypeptide chains and fold into specific three-dimensional (3D) structures. There are four levels of protein structure: primary, secondary, tertiary, and quaternary. The primary structure is the linear sequence of amino acids, whereas the secondary structure refers to &#945;-helices and &#946;-sheets due to hydrogen bonds and electrostatic interactions. The tertiary structure corresponds to the 3D shape of a single polypeptide chain, while the quaternary structure describes the global arrangement of multiple polypeptide chains into a functional complex. <ref type="bibr">2</ref> Proteins serve various functions; most notably, they catalyze metabolic reactions (enzymes), provide structural support (e.g., collagen in connective tissues), facilitate cellular communication (e.g., receptors and signaling molecules), and transport molecules (e.g., hemoglobin for oxygen transport). These functions originate from their 3D structures. In particular, flexibility is a vital characteristic of protein structure that is essential to protein functions. <ref type="bibr">3</ref> Specifically, protein flexibility enables proteins to adapt to various shapes and conditions, which facilitates their interactions with other molecules, such as DNA, RNA, ions, cofactors, ligands, and other small molecules. Under physiological conditions, proteins undergo constant thermal fluctuation, which enables them to bind substrates, catalyze reactions, and transmit signals. 
Enzymes, for example, exhibit an induced-fit mechanism, where their active sites adapt complementary shapes to accommodate substrates, improving catalytic efficiency. In a similar way, molecular motors, such as myosins and kinesins, utilize flexibility to enable directed movement during muscle contraction and intracellular transport.</p><p>Protein flexibility can be measured by the B-factor, also known as the Debye-Waller factor, which measures the attenuation of X-ray or neutron scattering due to the thermal motion of atoms in protein crystallography. Specifically, the B-factor is defined according to the mean-square displacement of a scattering center in X-ray diffraction data. <ref type="bibr">4,</ref><ref type="bibr">5</ref> The B-factor is used to describe the flexibility of atoms and/or amino acids within a protein structure, and it further provides valuable information about the protein's thermal motion, structural stability, activity, and other protein functions. <ref type="bibr">6</ref> Protein flexibility has been intensively studied in computational biophysics in recent decades. <ref type="bibr">[7]</ref><ref type="bibr">[8]</ref><ref type="bibr">[9]</ref><ref type="bibr">[10]</ref> In addition to the thoroughly investigated flexibility of proteins involved in folding, folded proteins (i.e., proteins in their native conformations) are also flexible and, in fact, exhibit internal motion in neighborhoods of their native conformations. <ref type="bibr">11,</ref><ref type="bibr">12</ref> In a seminal work, McCammon et al. <ref type="bibr">11</ref> investigated such local motion in a small folded globular protein using a molecular dynamics (MD) approach, demonstrating the fluid-like characteristics of the internal motions. However, analyzing the dynamics of a large protein would require simulations at time scales that are intractable for the MD approach. 
<ref type="bibr">13</ref> Consequently, other methods have since emerged using a time-harmonic approximation <ref type="bibr">14</ref> to the protein's potential energy function used in MD, resulting in time-independent techniques. Such methods include normal-mode analysis (NMA) <ref type="bibr">[14]</ref><ref type="bibr">[15]</ref><ref type="bibr">[16]</ref><ref type="bibr">[17]</ref><ref type="bibr">[18]</ref> and elastic network models (ENMs). <ref type="bibr">[19]</ref><ref type="bibr">[20]</ref><ref type="bibr">[21]</ref><ref type="bibr">[22]</ref><ref type="bibr">[23]</ref><ref type="bibr">[24]</ref> Some of the most popular methods <ref type="bibr">13,</ref><ref type="bibr">25,</ref><ref type="bibr">26</ref> for protein flexibility analysis include the Gaussian network model (GNM) <ref type="bibr">21,</ref><ref type="bibr">27,</ref><ref type="bibr">28</ref> and the anisotropic network model (ANM), <ref type="bibr">19</ref> both of which are types of ENMs. The GNM approach treats the protein as a network, with the residues representing the junctions. B-factors are then approximated using the first few eigenvalues of the connectivity matrix, which correspond to the long-time dynamics of proteins that MD simulations are unable to capture. <ref type="bibr">29</ref> Moreover, multiple methods have emerged as modifications of the original GNM and ANM models, including the generalized GNM (gGNM), multiscale GNM (mGNM), and multiscale ANM (mANM). <ref type="bibr">26</ref> Such methods attempt to improve the efficiency and accuracy of GNM and ANM. 
Due to their ability to capture multiscale information intrinsic to protein structures, mGNM and mANM models have been shown <ref type="bibr">26</ref> to significantly improve B-factor predictions of proteins compared to the original GNM and ANM methods.</p><p>Other algorithms, such as the flexibility-rigidity index (FRI), <ref type="bibr">13</ref> which relies on the theory of continuum elasticity with atomic rigidity (CEWAR), have also improved results for B-factor prediction over the original GNM method. The FRI is based on the assumption that protein functions depend solely upon the protein's structure and environment, and it therefore assesses flexibility and rigidity by analyzing the topological connectivity and geometric compactness of protein structures. A benefit of the flexibility-rigidity index is that it bypasses the Hamiltonian interaction matrix and matrix diagonalization. Consequently, the FRI has significantly reduced computational complexity compared to other algorithms for protein flexibility analysis. Additional modifications, including fast FRI (fFRI), <ref type="bibr">25</ref> anisotropic FRI (aFRI), <ref type="bibr">25</ref> and multiscale FRI (mFRI), <ref type="bibr">30</ref> have been developed to further improve the efficiency of FRI as well as its accuracy on structures that are difficult for the NMA, GNM, and FRI algorithms. <ref type="bibr">30</ref> Recently, many machine learning approaches have been developed for protein flexibility analysis. For example, sequence-based predictions have been reported, <ref type="bibr">[31]</ref><ref type="bibr">[32]</ref><ref type="bibr">[33]</ref> and other machine-learning-based predictions of protein flexibility have also been proposed. <ref type="bibr">[33]</ref><ref type="bibr">[34]</ref><ref type="bibr">[35]</ref> More recently, a method that utilizes both sequence information and structure information has been developed for protein B-factor prediction. 
<ref type="bibr">36</ref> In 2019, persistent topological Laplacians (PTLs) <ref type="bibr">37,</ref><ref type="bibr">38</ref> were first introduced to overcome certain drawbacks of persistent homology, a key technique used in topological data analysis (TDA). <ref type="bibr">39,</ref><ref type="bibr">40</ref> Many PTLs have been proposed in the past few years, including the persistent combinatorial Laplacian, the persistent path Laplacian, the persistent sheaf Laplacian (PSL), <ref type="bibr">41</ref> the persistent directed graph Laplacian, and the persistent hyperdigraph Laplacian. <ref type="bibr">42</ref> Most of these algorithms are global, offering topological and geometric descriptions of all objects in their topological space. In other words, they generate information about the protein as a whole. However, for protein flexibility analysis, one must have a method to describe the local properties of individual atoms. The PSL model serves such a function, as it allows the assignment of a specific weight at each node (or atom); thus, it provides local topological and geometric information in its spectra, making it suitable for protein flexibility analysis.</p><p>The aim of the present work is to demonstrate the utility of the PSL model for protein flexibility analysis via the prediction of protein B-factors. The remainder of this manuscript is organized as follows: all results of this work are given in Section 2. Section 2.1 summarizes our results on protein subsets from the literature, and Section 2.2 presents the performance of the PSL model on individual proteins that are challenging for the GNM. Section 2.3 details the results for blind machine learning prediction using the PSL model. In Section 3, we describe the algorithms used in this manuscript, including some background on persistent homology and cellular sheaves.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">RESULTS</head><p>In this section, we present our results for experiments applying the persistent sheaf Laplacian (PSL) model as outlined in the previous section. Figure <ref type="figure">1</ref> summarizes the methods used to generate the results throughout this section.</p><p>2.1. Results on Protein Subsets. 2.1.1. Data Sets. To demonstrate the persistent sheaf Laplacian model's performance on proteins of various sizes, we conducted computational experiments on four data sets. Three of these data sets were constructed by Park et al. <ref type="bibr">14</ref> as sets of relatively small-, medium-, and large-sized protein structures. There are 33 proteins in the set of small-sized proteins, 36 in the set of medium-sized proteins, and 35 in the set of large-sized proteins. The fourth data set is a superset constructed by Opron et al. <ref type="bibr">25,</ref><ref type="bibr">30</ref> consisting of (1) the three aforementioned sets, (2) 40 proteins of varying sizes randomly selected from the Protein Data Bank (PDB), <ref type="bibr">43</ref> and (3) 263 high-resolution protein structures used by Xia et al. <ref type="bibr">13</ref> in tests of their FRI algorithm, with the duplicates subsequently removed (note that in their earlier paper, Opron et al. <ref type="bibr">25</ref> used a set of 365 proteins, but their later manuscript <ref type="bibr">30</ref> excluded the protein with PDB ID 1AGN due to an unrealistic B-factor. The present paper utilizes the updated set consisting of 364 proteins).</p><p>Additionally, all protein data sets used for B-factor prediction in the present study were preprocessed to contain only the C &#945; atoms from their respective proteins. 
As discussed by Xia et al., <ref type="bibr">13</ref> the B-factor for an arbitrary atom in a protein is associated with that atom's flexibility, but it may be affected by diffraction in data collection, preventing a direct interpretation of flexibility. However, the B-factors of C &#945; atoms correlate directly with their atomic flexibility. Accordingly, our B-factor predictions in this work can be interpreted as atomic flexibility predictions.</p><p>Table <ref type="table">1</ref> displays the results of the PSL model compared to other methods on the data sets of small, medium, and large proteins as well as the superset.</p><p>2.1.2. Parameters and Results. For all PSL results in this section and Section 2.2, we utilized a filtration induced by three radii: 6, 9, and 12 &#197;. For each radius, we generate a zeroth persistent sheaf Laplacian matrix L 0 and compute its eigenvalues; we then compute the maximum, minimum, mean, and median of the set of nonzero eigenvalues, as well as the number of zero eigenvalues. These quantities comprise five features for each radius, resulting in 15 features in total for each residue. To obtain the B-factor predictions in this section, we performed linear regression using the set of PSL features for the full set of 364 proteins as well as for the subsets.</p><p>To better assess the performance of the PSL method relative to other approaches and to avoid overfitting, we did not perform an extensive search for the optimal filtration radii and eigenvalue statistic parameters for each task below. Rather, we conducted experiments on the set of 364 proteins with a few sets of parameters and chose those that yielded a good average Pearson correlation coefficient over the entire set. 
The above parameters may be tuned to further improve model performance for a given task: higher-order persistent sheaf Laplacian matrices and their respective eigenvalues may also be used to generate such features, and other statistics may be used as well, such as the standard deviation of the nonzero eigenvalues. Moreover, suitable filtration radii may be chosen to capture desired multiscale information for a given protein. Another example of PSL feature generation can be seen in Section 2.3.2.</p><p>The PSL model achieves improved performance over all other compared methods on all data sets shown in Table <ref type="table">1</ref>. In particular, the PSL model improves on the benchmark GNM by 32%. (Table <ref type="table">1</ref> note: experiments were conducted on the full set of 364 proteins as well as the three subsets of small, medium, and large protein structures described by Park et al. <ref type="bibr">14</ref> ASPH denotes the atom-specific persistent homology method developed by Bramer et al., <ref type="bibr">5</ref> with results using the Bottleneck (B) and Wasserstein (W) metrics displayed. Both sets of ASPH results used both an exponential and a Lorentz kernel for least-squares fitting. opFRI and pfFRI results are from Opron et al., <ref type="bibr">25</ref> and GNM and NMA results are from Park et al. <ref type="bibr">14</ref>)</p><p>As Opron et al. discussed in their 2015 work, <ref type="bibr">30</ref> the Gaussian network model (GNM) experiences difficulty in predicting B-factors for certain protein structures. In addition to the comparison shown in Table <ref type="table">1</ref>, in this section we examine a few case studies of particular proteins to demonstrate the success of the PSL model on such structures. All protein structural visualizations were generated using the visual molecular dynamics (VMD) software, <ref type="bibr">44</ref> and residues of each protein are assigned colors based on their experimental or predicted B-factors. Lower B-factors are shown in blue (corresponding to "colder" or more rigid residues), and higher B-factors are shown in red (corresponding to "warmer" or more flexible residues). All GNM results were obtained using the default GNM model with a cutoff of 7 &#197;.</p><p>Calmodulin is a calcium sensor within the cell and plays a significant role in numerous cellular pathways. Its flexibility allows it to interact with varied target proteins. Figure <ref type="figure">2</ref> displays the predicted and experimental B-factors for the calcium-binding protein calmodulin (PDB ID: 1CLL) <ref type="bibr">43</ref> using our persistent sheaf Laplacian model as well as the Gaussian network model. We observe that the Gaussian network model produces a large error in B-factor prediction for residues from about 65 to 85. These residues correspond to a flexible hinge region of the protein. <ref type="bibr">30</ref> The root-mean-square error (RMSE) for the PSL model is 9.14 for calmodulin, a 23% decrease from the GNM model's RMSE of 11.9.</p></div>
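The eigenvalue-statistics pipeline of Section 2.1.2 can be sketched in Python. The helper names `spectral_features` and `residue_features`, the toy Laplacian standing in for a zeroth persistent sheaf Laplacian, and the zero-eigenvalue tolerance are illustrative assumptions, not the authors' code.

```python
import numpy as np

def spectral_features(L, tol=1e-8):
    """Five statistics of a Laplacian spectrum (Section 2.1.2): the number of
    zero eigenvalues, plus the max, min, mean, and median of the nonzero ones.
    The tolerance used to call an eigenvalue 'zero' is an assumption."""
    vals = np.linalg.eigvalsh(L)          # L is symmetric positive semidefinite
    nonzero = vals[vals > tol]
    return [int(np.sum(vals <= tol)),
            float(nonzero.max()), float(nonzero.min()),
            float(nonzero.mean()), float(np.median(nonzero))]

def residue_features(l0_matrices):
    """Concatenate the five statistics for each filtration radius (6, 9, and
    12 Angstroms in the paper), giving 15 features per residue."""
    return [x for L in l0_matrices for x in spectral_features(L)]

# Toy example: the graph Laplacian of a 3-node path stands in for a zeroth
# persistent sheaf Laplacian L0 (its eigenvalues are 0, 1, and 3).
L0 = np.array([[1., -1., 0.], [-1., 2., -1.], [0., -1., 1.]])
features = residue_features([L0, L0, L0])   # one residue, three radii, 15 features
```

Feeding one such 15-dimensional vector per residue into a linear regression against experimental B-factors reproduces the fitting setup described above.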
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Individual Protein Case Studies</head><p>Next, we consider a monomeric cyan fluorescent protein (mTFP) that emits cyan light. It is used in biological experiments to visualize specific targets. Figure <ref type="figure">3</ref> shows experimental B-factors and predicted B-factors of the protein mTFP1 (PDB ID: 2HQK). Again, the predicted B-factors shown were computed using the Gaussian network model and our PSL model. As in the results for the protein calmodulin, the GNM is unable to correctly predict B-factors for one range of residues. Here, however, the Gaussian network model overestimates the B-factors in this region, visible in the GNM structural representation as the red &#945;-helix in the center of the &#946;-barrel. <ref type="bibr">30</ref> Opron et al. <ref type="bibr">30</ref> observed that using a cutoff of 8 &#197; for the GNM somewhat resolves this error, and they suggested that the GNM may experience difficulty in this region due to its use of hard thresholds based on connectivity parameters. The persistent sheaf Laplacian model is significantly more accurate in this region, likely because it captures atom-specific information as well as molecular information at multiple scales. Overall, the PSL model improves the RMSE on mTFP1 to 3.43 from 8.74 for the GNM, a nearly 61% decrease.</p><p>We further consider a probable antibiotics synthesis protein from Thermus thermophilus. In Figure <ref type="figure">4</ref>, we investigate the experimental and predicted B-factors of this protein (PDB ID: 1V70). On this protein, our persistent sheaf Laplacian model is able to predict the B-factors accurately across all residues, while the Gaussian network model experiences a high level of inaccuracy on residues from about 0 to 10. This vast over-prediction contributes to a very high RMSE value for the GNM, at 17.9. 
Our PSL model achieves a significantly lower RMSE of 2.78 on the protein 1V70, 84% lower than that of the GNM.</p><p>Finally, we studied the ribosomal protein L14 (PDB ID: 1WHI), <ref type="bibr">30</ref> one of the most conserved ribosomal proteins. It functions as an organizational component of the translational apparatus. In Figure <ref type="figure">5</ref>, we show the experimental and predicted B-factors for the ribosomal protein L14. Again, we observe that the GNM overestimates the flexibility of some regions of this protein, most significantly for the residues around 60-80. The RMSE for the PSL model on this protein is nearly half that of the GNM model, whose RMSE is 6.59.</p><p>2.3. Blind Machine Learning Prediction. 2.3.1. Data Sets. Two data sets, one from Opron et al. <ref type="bibr">25,</ref><ref type="bibr">30</ref> and the other from Park et al., <ref type="bibr">14</ref> are used in our work. The first data set contains 364 proteins, <ref type="bibr">25,</ref><ref type="bibr">30</ref> and the second <ref type="bibr">14</ref> has three sets of proteins of small, medium, and large sizes, which are subsets of the 364-protein set.</p><p>In our blind predictions, proteins 1OB4, 1OB7, 2OXL, and 3MD5 from the superset are excluded because the STRIDE software cannot generate features for these proteins. We exclude protein 1AGN due to the known problems with this protein's data. <ref type="bibr">25,</ref><ref type="bibr">30</ref> Additional proteins from the superset are also excluded. Proteins 1NKO, 2OCT, and 3FVA are excluded because these proteins have unphysical B-factors (i.e., zero values). We also excluded proteins 3DWV, 3MGN, 4DPZ, 2J32, 3MEA, 3A0M, 3IVV, 3W4Q, 3P6J, and 2DKO due to inconsistencies between the protein data processed with STRIDE and the original PDB data. A total of 346 proteins are used for blind predictions. These data can be found in our provided GitHub repository.</p><p>2.3.2. PSL Features. 
The second approach to B-factor prediction that we examined is blind prediction of protein B-factors. We use PSL features as local descriptors of protein structures, applying three cutoff distances, i.e., 7, 10, and 13 &#197;, to define the atom groups used to construct a sheaf Laplacian matrix. For each cutoff distance, we generate a sheaf Laplacian matrix, L 1 , with a filtration radius matching the cutoff distance. From each matrix, we extract five features: the count of zero eigenvalues, and the maximum, minimum, mean, and standard deviation of the nonzero eigenvalues. Together, these provide 15 PSL features for blind machine learning predictions.</p><p>2.3.3. Additional Features. In addition to PSL features, we extract a range of global and local protein features for building machine learning models. Each PDB structure is associated with global features, such as the R-value, resolution, and the number of heavy atoms, which are extracted from the PDB files. These features enable the comparison of the B-factors in different proteins. The local characteristics of each protein consist of packing density, amino acid type, occupancy, and secondary structure information generated by STRIDE. <ref type="bibr">45</ref> STRIDE provides comprehensive secondary structure details for a protein based on its atomic coordinates from a PDB file, classifying each atom into categories such as &#945;-helix, 3-10 helix, &#960;-helix, extended conformation, isolated bridge, turn, or coil. Furthermore, STRIDE provides &#981; and &#968; angles and residue solvent-accessible area, contributing a total of 12 secondary features. In our implementation, we use one-hot encoding for both amino acid types and the 12 secondary features. The packing density of each C &#945; atom in a protein is calculated based on the density of surrounding atoms, with short-, medium-, and long-range packing density features defined for each C &#945; atom. 
The packing density of the ith C &#945; atom is defined as the ratio N d /N, where d represents the specified cutoff distance in &#197;, N d denotes the number of atoms within the Euclidean distance d from the ith atom, and N is the total number of heavy atoms in the protein. The packing density cutoff values used in this study are provided in Table <ref type="table">2</ref>.</p><p>Our PSL features, combined with the global and local features provided for each PDB file, offer a comprehensive feature set for each C &#945; atom in the protein. For blind predictions, we integrate these features with machine learning algorithms to build regression models. To evaluate the performance of our machine learning model on blind predictions, we conducted two validation tasks: 10-fold cross-validation and leave-one-(protein)-out validation. For 10-fold cross-validation, we designed two types of experiments: one based on splitting by PDB files and another based on splitting by all C &#945; atoms collected from the PDB files. Our modeling and predictions are centered on the B-factors of C &#945; atoms.</p><p>2.3.4. Evaluation Metrics. To assess our method for B-factor prediction, we use the Pearson correlation coefficient (PCC), defined as PCC = &#8721; i (B e i &#8722; B&#772; e )(B t i &#8722; B&#772; t )/[&#8721; i (B e i &#8722; B&#772; e ) 2 &#8721; i (B t i &#8722; B&#772; t ) 2 ] 1/2 , where B e i and B t i are the experimental and predicted B-factors of the ith residue, and B&#772; e and B&#772; t are the averaged B-factors.</p><p>2.3.5. Machine Learning Algorithms. For the blind predictions, instead of using more sophisticated methods, <ref type="bibr">[46]</ref><ref type="bibr">[47]</ref><ref type="bibr">[48]</ref> we consider two simple machine learning algorithms, namely gradient-boosting decision trees (GBDT) and random forests (RF), to highlight the proposed PSL method. The hyperparameters of these two algorithms are given in Table <ref type="table">3</ref>.</p></div>
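The two quantities defined in Sections 2.3.3 and 2.3.4 can be sketched as follows. The helper names `packing_density` and `pearson_cc` are hypothetical, the N_d/N normalization follows the variables named in the text, and counting atom i itself in N_d is an assumption.

```python
import numpy as np

def packing_density(coords, i, d):
    """Packing density of the ith atom: N_d / N, the number of atoms within
    Euclidean distance d (Angstroms) of atom i divided by the total number of
    heavy atoms. Whether atom i counts itself is an assumption."""
    coords = np.asarray(coords, float)
    dist = np.linalg.norm(coords - coords[i], axis=1)
    return float(np.sum(dist <= d)) / len(coords)

def pearson_cc(b_exp, b_pred):
    """Pearson correlation coefficient between experimental and predicted
    B-factors (Section 2.3.4)."""
    be = np.asarray(b_exp, float) - np.mean(b_exp)
    bt = np.asarray(b_pred, float) - np.mean(b_pred)
    return float(be @ bt / np.sqrt((be @ be) * (bt @ bt)))
```

Evaluating `packing_density` at the short-, medium-, and long-range cutoffs of Table 2 yields the three local density features per C-alpha atom.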
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.6.">Machine Learning Results</head><p>We carried out several experiments, the first of which is a leave-one-(protein)-out prediction using the four data sets described above. We trained models five times independently with different random seeds and calculated the average Pearson correlation coefficients over the resulting sets of modeling predictions. Our results are shown in Table <ref type="table">4</ref>, where the GBDT-based models yield better predictions than the RF-based models, as expected.</p><p>In our study, we additionally carried out 10-fold cross-validation at the protein level. In each fold, we use nine of the ten subsets of the 346 proteins to train our model, while the remaining subset is reserved for testing. Specifically, features of C &#945; atoms in the training proteins are pooled together to train the models, while those in the test proteins are used for evaluation. This process is repeated across ten different splits. Table <ref type="table">5</ref> shows the average PCC values for the two types of machine learning models. Again, the GBDT model gives better predictions than the RF model.</p><p>We also performed an alternative C &#945; -level 10-fold cross-validation. The data set consists of more than 74,000 C &#945; atoms from 364 proteins. In each of ten independent models, nine of the ten subsets of C &#945; atoms are used to train the models, while the remaining subset is used for testing. As shown in Table <ref type="table">6</ref>, GBDT modeling yields slightly better predictions than RF-based modeling.</p></div>
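Protein-level splitting, where every C-alpha atom of a protein lands in the same fold, is what distinguishes the two 10-fold experiments above. A minimal sketch, assuming a simple round-robin assignment of shuffled proteins to folds (any grouping that keeps proteins intact would serve):

```python
import numpy as np

def protein_level_folds(protein_ids, k=10, seed=0):
    """Assign C-alpha atoms to k cross-validation folds at the protein level:
    all atoms from one PDB entry share a fold, so test proteins are never seen
    during training. Returns (train_indices, test_indices) per fold."""
    rng = np.random.default_rng(seed)
    proteins = np.array(sorted(set(protein_ids)))
    rng.shuffle(proteins)                              # randomize protein order
    fold_of = {p: i % k for i, p in enumerate(proteins)}  # round-robin folds
    fold = np.array([fold_of[p] for p in protein_ids])
    return [(np.where(fold != f)[0], np.where(fold == f)[0]) for f in range(k)]
```

The alternative atom-level cross-validation would instead shuffle atom indices directly, so atoms from a single protein can appear in both the training and test sides.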
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">METHODS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Persistent Homology and Persistent Laplacians.</head><p>As one of the most abstract mathematical subjects, homology excessively simplifies complex geometry. In contrast, persistent homology balances simplification and information retrieval in data analysis and is widely used in topological data analysis. <ref type="bibr">39,</ref><ref type="bibr">40</ref> However, persistent homology has several drawbacks, including its insensitivity to homotopic shape evolution. To address this challenge, the persistent spectral graph, also known as the persistent Laplacian, was introduced on simplicial complexes in 2019. <ref type="bibr">37</ref> Since then, various persistent Laplacians, or persistent topological Laplacians, have been proposed for different topological objects, such as path complexes, directed flag complexes, hyperdigraphs, and cellular sheaves. <ref type="bibr">42</ref> Given a finite set V, a simplicial complex X is a collection of subsets of V such that if a set &#963; is in X, then any subset of &#963; is also in X. A set &#963; that consists of q + 1 elements is referred to as a q-simplex. If &#963; is a subset of &#964;, then we say that &#963; is a face of &#964; and denote the face relation by &#963; &#10877; &#964;. If X and Y are simplicial complexes and X &#8834; Y, then X is referred to as a subcomplex of Y. A simplicial complex X gives rise to a simplicial chain complex &#8943; &#8594; C q+1 (X) &#8594; C q (X) &#8594; C q&#8722;1 (X) &#8594; &#8943;, where the real vector space C q (X) is generated by the q-simplices. An element of C q (X) is called a q-chain. Fixing a total ordering of V, the boundary operator &#8706; q is the linear map defined by &#8706; q [v a 0 , ..., v a q ] = &#8721; i=0 q (&#8722;1) i [v a 0 , ..., v&#770; a i , ..., v a q ], where the symbol v&#770; a i means that v a i is deleted. The total ordering of V ensures that the boundary operator is well-defined. The q-th homology group H q = ker &#8706; q /im &#8706; q+1 is well-defined since &#8706; q &#8728; &#8706; q+1 = 0. 
Now suppose X is a subcomplex of Y, and let &#953;: C q (X) &#8594; C q (Y) denote the inclusion maps. The inclusion &#953; induces a map &#953; &#8226; : H q (X) &#8594; H q (Y), and the q-th persistent homology for the pair (X, Y) is the image &#953; &#8226; (H q (X)). Usually the ranks of persistent homology groups are represented by barcodes, where each bar represents a topological feature that persists in the filtration, offering a multiscale topological characterization of the input point cloud. <ref type="bibr">39,</ref><ref type="bibr">40</ref> Recently, the theory of persistent Laplacians <ref type="bibr">37</ref> has been proposed to extract additional information from a point cloud. A persistent Laplacian is a positive semidefinite operator whose kernel is isomorphic to the corresponding persistent homology group. The additional information provided by the nonzero eigenvalues of persistent Laplacians can be learned by machine learning algorithms. Since C q (X) is generated by q-simplices, it is equipped with a canonical inner product. Let C q+1 X,Y denote the subspace of chains in C q+1 (Y) whose boundaries lie in C q (X), and let &#8706; q+1 X,Y be the restriction of &#8706; q+1 to this subspace. The q-th persistent Laplacian &#916; q X,Y is defined by &#916; q X,Y = &#8706; q+1 X,Y (&#8706; q+1 X,Y ) &#8224; + (&#8706; q X ) &#8224; &#8706; q X , where &#8224; denotes the adjoint of a linear morphism. Using basic linear algebra we can prove that the kernel of &#916; q X,Y is isomorphic to &#953; &#8226; (H q (X)). Generally speaking, any method that utilizes multiscale Laplacians to analyze data can be referred to as a persistent Laplacian method.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Cellular Sheaves and Persistent Sheaf Laplacians.</head><p>Molecular structures often contain important nonspatial information, and many applications of topological methods to molecular data require the integration of nonspatial information. For example, we can use a generalized distance to model the biochemical interaction between atoms, or use only specific types of atoms as input to persistent homology <ref type="bibr">49</ref> or persistent Laplacians. 
<ref type="bibr">37</ref> An alternative approach is to integrate biological information through the construction of (co)chain complexes and to extend persistent homology and persistent Laplacians to new settings. For example, one can construct a filtration of cellular sheaves and consider the persistence module of sheaf cochain complexes instead of simplicial complexes and simplicial chain complexes. <ref type="bibr">50</ref> Roughly speaking, a cellular sheaf &#8497; is a simplicial complex X with an assignment to each simplex &#963; of X of a finite-dimensional vector space &#8497;(&#963;) (referred to as the stalk of &#8497; over &#963;) and to each face relation &#963; &#10877; &#964; (i.e., &#963; &#8834; &#964;) a linear morphism of vector spaces &#8497; &#963;&#10877;&#964; (referred to as the restriction map of the face relation &#963; &#10877; &#964;), satisfying the rule &#8497; &#964;&#10877;&#961; &#8728; &#8497; &#963;&#10877;&#964; = &#8497; &#963;&#10877;&#961; and the rule that &#8497; &#963;&#10877;&#963; is the identity map of &#8497;(&#963;). We can view the stalks as information stored for each simplex and the restriction maps as the way this information interacts. A cellular sheaf gives rise to a sheaf cochain complex, whose q-th sheaf cochain group C q (X; &#8497;) is the direct sum of the stalks over the q-dimensional simplices. To define the coboundary maps d, we can globally orient the simplicial complex X and obtain a signed incidence relation, an assignment to each face relation &#963; &#10877; &#964; of an integer [&#963; : &#964;] = &#177;1; the coboundary map is then defined by (d q x)(&#964;) = &#8721; &#963;&#10877;&#964; [&#963; : &#964;] &#8497; &#963;&#10877;&#964; (x(&#963;)), where the sum runs over the q-simplices &#963; that are faces of the (q + 1)-simplex &#964;.</p><p>Now suppose we have a sheaf &#8497; on X and a sheaf &#119970; on Y such that X &#8838; Y and the stalks and restriction maps of &#8497; are identical to those of &#119970; over X. If each stalk is an inner product space, then the coboundary maps of &#8497; and &#119970;, together with their adjoints and the projection map &#960; from C q (Y; &#119970;) to its subspace C q (X; &#8497;), can be assembled, exactly as in the persistent Laplacian construction above, into the q-th persistent sheaf Laplacian &#916; q X,Y . When X = Y, the persistent sheaf Laplacian is equal to the sheaf Laplacian of &#8497;. When &#8497; and &#119970; are constant sheaves, persistent sheaf Laplacians coincide with persistent Laplacians. 
Since a sheaf cochain complex is constructed from stalks and restriction maps, we expect persistent sheaf cohomology and persistent sheaf Laplacians to contain additional information beyond that of the underlying simplicial complex.</p><p>If a simplicial complex X is labeled (each vertex is associated with a quantity), then a sheaf can be constructed as follows. Let F be a nowhere-zero real-valued function on the vertices of X. We let each stalk be ℝ, and for the face relation [v_0, ..., v_n] ⩽ [v_0, ..., v_n, v_{n+1}, ..., v_m] (here orientation is not relevant), the linear morphism is the scalar multiplication by the product F(v_{n+1}) F(v_{n+2}) ⋯ F(v_m).</p><p>For a labeled point cloud (a point cloud where each point is associated with a quantity), if we construct a filtration of the point cloud, then for each complex in the filtration we can construct a sheaf as described above. This leads to a filtration of sheaves, as in persistent sheaf cohomology <ref type="bibr">51</ref> and persistent sheaf Laplacians. <ref type="bibr">41</ref> The harmonic spectra of PSLs reveal topological invariants, while the nonharmonic spectra capture geometric information about the data. <ref type="bibr">41,</ref><ref type="bibr">42</ref> In this work, we use sheaf Laplacians to construct features for individual C α atoms. For a given atom A, we first pick a cutoff distance and consider only the nearby C α atoms within this cutoff. Then we choose a radius and build an alpha complex X out of these C α atoms. A cellular sheaf on X is constructed as follows. We denote an atom in X by v_i and assign it a label q_i; we let each stalk be ℝ. For the face relation v_i ⩽ v_iv_j, the restriction map is the scalar multiplication by q_j/r_ij, where r_ij is the length of the edge v_iv_j. For the face relation v_iv_j ⩽ v_iv_jv_k, the restriction map is the scalar multiplication by q_k/(r_ik r_jk).
Since we want to distinguish the C &#945; atom A from the other atoms, we let the label of A be 0, and the labels of other nearby C &#945; atoms be 1. The features are then obtained from the spectra of sheaf Laplacians for this specific C &#945; atom A. In this manner, we can construct sheaf Laplacian features for all C &#945; atoms.</p></div>
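The persistent Laplacian construction described above can be made concrete numerically. The following is a minimal sketch on an assumed toy filtration (two vertices X, completed to an edge in Y; not data from this work): the persistent boundary matrix is assembled, and the harmonic and nonharmonic spectra are read off the degree-0 persistent Laplacian.

```python
import numpy as np

# Toy filtration pair (an assumption for illustration):
# X = two vertices {a, b};  Y = X plus the edge ab.
# Since C_0(X) = C_0(Y), every 1-chain of Y has boundary in C_0(X),
# so the persistent boundary matrix equals the boundary matrix of Y.
d1 = np.array([[-1.0],
               [ 1.0]])  # boundary of edge ab: b - a

# Degree-0 persistent Laplacian: only the "up" term survives in degree 0.
L0 = d1 @ d1.T  # = d1_{X,Y} composed with its adjoint

eigvals = np.linalg.eigvalsh(L0)
betti0 = int(np.sum(np.isclose(eigvals, 0.0)))  # harmonic spectrum
print(betti0)   # prints 1: one persistent connected component
print(eigvals)  # the nonzero eigenvalue carries geometric information
```

The kernel dimension recovers the rank of the persistent homology group (one component persists after the edge merges a and b into a single class), while the nonzero eigenvalue is the extra spectral information that a barcode alone does not provide.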
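The per-atom sheaf Laplacian features can likewise be sketched in code. The snippet below is a simplified illustration, not the paper's implementation: the coordinates and cutoff are assumed toy values, and a plain distance threshold stands in for the alpha-complex construction. It builds the sheaf coboundary with the stated restriction maps (q_j/r_ij on vertex–edge relations), distinguishes atom A by the label 0, and extracts simple spectral features from the degree-0 sheaf Laplacian.

```python
import numpy as np
from itertools import combinations

# Assumed toy C-alpha coordinates; atom A is index 0.
coords = np.array([[0.0, 0.0, 0.0],   # atom A (label 0, distinguished)
                   [3.8, 0.0, 0.0],
                   [3.8, 3.8, 0.0],
                   [0.0, 3.8, 0.0]])
labels = np.array([0.0, 1.0, 1.0, 1.0])  # q_A = 0, nearby atoms = 1
cutoff = 6.0  # assumed edge threshold (in place of an alpha complex)

edges = [(i, j) for i, j in combinations(range(len(coords)), 2)
         if np.linalg.norm(coords[i] - coords[j]) <= cutoff]

# Sheaf coboundary d0: for the edge v_i v_j, the stalk over v_i maps in
# via q_j / r_ij and the stalk over v_j via q_i / r_ij (signs from a
# global orientation of the edges).
d0 = np.zeros((len(edges), len(coords)))
for row, (i, j) in enumerate(edges):
    r_ij = np.linalg.norm(coords[i] - coords[j])
    d0[row, i] = -labels[j] / r_ij
    d0[row, j] = labels[i] / r_ij

L0 = d0.T @ d0  # degree-0 sheaf Laplacian
spectrum = np.linalg.eigvalsh(L0)
harmonic = int(np.sum(np.isclose(spectrum, 0.0)))
nonharmonic = spectrum[~np.isclose(spectrum, 0.0)]

# Example features for atom A: harmonic count, smallest and total
# nonharmonic eigenvalues (illustrative choices, not the paper's list).
features = [harmonic, float(nonharmonic.min()), float(nonharmonic.sum())]
print(features)
```

Repeating this over a range of cutoffs and radii yields the multiscale feature vector for each C α atom; the zero label on A makes the spectrum sensitive to how A sits among its neighbors.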
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">CONCLUSION</head><p>Protein flexibility is crucial for protein functions, and its prediction is essential for understanding protein properties, protein design, and protein engineering. However, the intrinsic complexity of proteins and their interactions presents challenges for understanding protein flexibility. To address this, many effective computational approaches have been developed to predict B-factor values, which reflect protein flexibility. In the literature, a variety of techniques have been proposed, including NMA, <ref type="bibr">16</ref> GNM, <ref type="bibr">20,</ref><ref type="bibr">21</ref> pfFRI, <ref type="bibr">25</ref> ASPH, 5 opFRI, <ref type="bibr">25</ref> and EH. <ref type="bibr">52</ref> In this study, we propose a persistent sheaf Laplacian (PSL) model for protein B-factor prediction. Sheaf theory, a branch of algebraic geometry, serves as the foundation for PSL, a novel approach to topological data analysis (TDA). Unlike many global TDA tools, PSL is a localized method that captures the local topology of a point within the data. Like other TDA methods, PSL also provides a multiscale analysis of the system under study.</p><p>The multiscale nature of PSL allows it to capture atomic interactions across different distance ranges, enabling a more effective analysis of protein flexibility. This characteristic makes the proposed method superior to traditional approaches, such as GNM, which fail to account for atomic interactions beyond a specific cutoff distance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>For cross-protein prediction, we further enhance the PSL model by integrating additional global and local features intrinsic to protein structures and structure determination conditions. This integration enables the blind prediction of protein B-factors, which is particularly valuable for assessing protein flexibility when experimental B-factors are unavailable. The proposed PSL model has been validated on various data sets, demonstrating its effectiveness and robustness in protein flexibility analysis.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>https://doi.org/10.1021/acs.jpcb.5c01287 J. Phys. Chem. B 2025, 129, 4169-4178</p></note>
		</body>
		</text>
</TEI>
