<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Coevolving residues inform protein dynamics profiles and disease susceptibility of nSNVs</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>11/29/2018</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10110288</idno>
					<idno type="doi">10.1371/journal.pcbi.1006626</idno>
					<title level='j'>PLOS Computational Biology</title>
<idno>1553-7358</idno>
<biblScope unit="volume">14</biblScope>
<biblScope unit="issue">11</biblScope>					

					<author>Brandon M. Butler</author><author>I. Can Kazan</author><author>Avishek Kumar</author><author>S. Banu Ozkan</author><author>Robert L. Jernigan</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[The conformational dynamics of proteins is rarely used in methodologies used to predict the impact of genetic mutations due to the paucity of three-dimensional protein structures as compared to the vast number of available sequences. Until now a threedimensional (3D) structure has been required to predict the conformational dynamics of a protein. We introduce an approach that estimates the conformational dynamics of a protein, without relying on structural information. This de novo approach utilizes coevolving residues identified from a multiple sequence alignment (MSA) using Potts models. These coevolving residues are used as contacts in a Gaussian network model (GNM) to obtain protein dynamics. B-factors calculated using sequence-based GNM (Seq-GNM) are in agreement with crystallographic B-factors as well as theoretical Bfactors from the original GNM that utilizes the 3D structure. Moreover, we demonstrate the ability of the calculated B-factors from the Seq-GNM approach to discriminate genomic variants according to their phenotypes for a wide range of proteins. These results suggest that protein dynamics can be approximated based on sequence information alone, making it possible to assess the phenotypes of nSNVs in cases where a 3D structure is unknown. We hope this work will promote the use of dynamics information in genetic disease prediction at scale by circumventing the need for 3D structures.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>A 3D structure is still required to computationally obtain protein dynamics, drastically limiting the extent to which conformational dynamics can be incorporated into genomic analysis. The reason for this is that there are exponentially more sequences than experimental structures. Currently, UniProtKB contains more than 100 million sequence entries, whereas the PDB reports the number of known 3D structures to be around 140,000 <ref type="bibr">[1]</ref>. Furthermore, the number of known sequences is increasing at an exponential rate, compared to the much slower addition of new experimental PDB structures. This is due to the advent of high-throughput genomic sequencing, which is providing an unprecedented amount of data for genomic analysis. The vast amount of sequence data has driven the rapid classification of novel genetic variations through genome-wide association studies <ref type="bibr">[2,</ref><ref type="bibr">3]</ref>. A large catalogue of non-synonymous single nucleotide variants (nSNVs) occurs in coding regions that can severely impact protein function, potentially leading to disease <ref type="bibr">[4]</ref>. There are many in silico methods developed using evolutionary methodologies such as positional conservation and phylogeny and those that combine evolutionary approaches with biochemical and structural properties to diagnose neutral and disease associated nSNVs <ref type="bibr">[5]</ref><ref type="bibr">[6]</ref><ref type="bibr">[7]</ref><ref type="bibr">[8]</ref><ref type="bibr">[9]</ref><ref type="bibr">[10]</ref><ref type="bibr">[11]</ref>. However, the accuracy of the majority of these in silico prediction methods is significantly lower for predicting the impact of nSNVs at highly evolving sites <ref type="bibr">[12]</ref><ref type="bibr">[13]</ref><ref type="bibr">[14]</ref><ref type="bibr">[15]</ref><ref type="bibr">[16]</ref>. Protein dynamics can also be used to elucidate the functional impact of nSNVs and mechanisms of disease <ref type="bibr">[5,</ref><ref type="bibr">17]</ref>. Our previous studies have evinced that a site-specific conformational dynamics analysis is capable of diagnosing nSNVs irrespective of evolutionary conservation <ref type="bibr">[5,</ref><ref type="bibr">18,</ref><ref type="bibr">19]</ref> and recently has been incorporated as an additional feature for in silico prediction tools <ref type="bibr">[20]</ref>. However, only a small fraction of the catalogued nSNVs in the coding regions (i.e. missense variants) have 3D experimental structures, <ref type="bibr">[20]</ref>, impeding broad application of protein dynamics in in silico tool predictions.</p><p>Coevolution, on the other hand, has become a valuable tool for its ability to predict structural contacts of 3D structures, particularly using global information through Potts models <ref type="bibr">[21]</ref><ref type="bibr">[22]</ref><ref type="bibr">[23]</ref><ref type="bibr">[24]</ref><ref type="bibr">[25]</ref><ref type="bibr">[26]</ref><ref type="bibr">[27]</ref>. Coevolving residues are inferred from a multiple sequence alignment (MSA) of a given protein family, whereby if two given amino acids exhibit concordant patterns of evolution throughout the MSA then they are assumed to be in close spatial proximity in the folded 3D structure. This evolutionary principle can be leveraged so that sequence information can be used to describe protein topology, making de novo structure prediction possible <ref type="bibr">[24,</ref><ref type="bibr">27]</ref>. It has been reported that only one correct contact for every 12 residues in a protein is necessary for accurate topology-level modeling <ref type="bibr">[28]</ref>. In addition to structure prediction, coevolution analysis has also been used to identify critical interactions between protein complexes <ref type="bibr">[22]</ref> important functional sites <ref type="bibr">[24]</ref> and allosteric response <ref type="bibr">[29]</ref>. The use of coevolution for structure prediction is largely possible for two reasons. First, the amount of sequence data for different protein families is sufficient to be leveraged by this technique to make predictions. Second, the methods for inferring coevolving residues from an MSA are becoming increasingly robust <ref type="bibr">[30]</ref><ref type="bibr">[31]</ref><ref type="bibr">[32]</ref><ref type="bibr">[33]</ref><ref type="bibr">[34]</ref>.</p><p>Inferring evolutionary couplings from an MSA are based on two primary approaches categorized as local <ref type="bibr">[35]</ref><ref type="bibr">[36]</ref><ref type="bibr">[37]</ref> and global approaches <ref type="bibr">[37]</ref><ref type="bibr">[38]</ref><ref type="bibr">[39]</ref>. The global approaches detangle direct evolutionary couplings from indirect couplings which enables them to capture spatial contacts <ref type="bibr">[40]</ref>. Regardless of the method, the accuracy of detecting coevolving residues that correspond to structural contacts is fundamentally limited by the number of sequence homologs in the MSA. While most of the current methods use only the sequence homologs of the protein family belonging to target sequence, integrating multiple orthology protein families (i.e. families that share similar phylogeny and retain similar functions) was used to increase the number of homologs to produce a more accurate statistical inference <ref type="bibr">[41]</ref>. RaptorX, leverages this joint family methodology; it uses an ultra-deep neural network combining coevolution information with sequence conservation information to infer 3D contacts and has produced higher accuracy than other methods <ref type="bibr">[42]</ref><ref type="bibr">[43]</ref><ref type="bibr">[44]</ref>.</p><p>In this paper, we will demonstrate the efficacy of our novel sequence-based GNM approach, called Seq-GNM, to estimate the dynamics profile of a protein with no a priori knowledge of its 3D structure. This de novo approach based on a Gaussian network model (GNM) enables the prediction of the magnitude of mean-square fluctuations of residues, which are proportional to the B-factors determined by X-ray crystallography experiments. However, instead of using a cutoff distance to determine 3D contacts as does the original structure-based GNM, we use coevolving residues (evolutionary couplings) in our model. We show that the theoretical predictions from our Seq-GNM are in reasonable agreement with experimental crystallographic B-factors as well as the values obtained from the structure GNM models that use spatial contacts. We also extend this analysis to determine the capacity of our model to assess the functional impact of nSNVs. We will demonstrate that the dynamics predicted by Seq-GNM can adequately classify disease and benign nSNVs across the proteome.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B-factor Correlations: Sequence, Structure, and Experimental</head><p>We considered a high-resolution protein (2.25 &#197;) that is involved in amino acid catabolism, acyl-CoA dehydrogenase (1JQI), as an example case to examine the B-factor profiles and predicted contact maps using Seq-GNM. Coevolution analysis using direct coupling analysis (DCA) has been shown to recapitulate accurate structural contact maps for a wide range of proteins <ref type="bibr">[21,</ref><ref type="bibr">23,</ref><ref type="bibr">24,</ref><ref type="bibr">27,</ref><ref type="bibr">31,</ref><ref type="bibr">45]</ref>. As expected, the contact maps of Seq-GNM and structural GNM are similar (Fig <ref type="figure">1</ref>). In a comparison of their B-factor profiles, both Seq-GNM and structural GNM exhibit good agreement with observed B-factors, capturing flexible and rigid positions. Using evolutionary coupling (EC) values obtained from RaptorX, the correlation between the Seq-GNM and observed B-factors is 0.77, whereas the correlation between the structural GNM and observed B-factors is 0.57 (Fig <ref type="figure">1a</ref>). Similarly, using EC values obtained by EVcouplings produced a correlation of 0.60 between the Seq-GNM and observed B-factors (Fig <ref type="figure">1b</ref>). The scores obtained from EVcouplings are still reasonable, yet relatively lower correlations compared to those obtained by the RaptorX. This is likely due to the relatively noisy contact map predictions by EVcouplings compared to the more reliable contact maps produced by RaptorX (we think this is due to their inclusion of multiple orthology protein families) <ref type="bibr">[42]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Fig 1. B-factors plots.</head><p>A plot of theoretical B-factors as calculated by our Seq-GNM (blue), the original GNM obtained from structure (orange), and observed experimental B-factors (black) for acyl-CoA dehydrogenase (PDB id: 1JQI) along with predicted contact maps by Seq-GNM (using a threshold score, shown as blue) and the contact map of the structure (using 10&#197; cut-off distance) (a) The Seq-GNM with values obtained from RaptorX produced a correlation of 0.56 with experiment, and 0.77 with the GNM obtained from structure. Moreover, the contact maps reveal the predicted contacts between the Seq-GNM and structural GNM approaches are remarkably similar. (b) The Seq-GNM that uses values obtained from EVcouplings produced a correlation of 0.60 with experiment, and 0.68 with the GNM obtained from structure. The B-factor obtained by applying GNM to the experimental structure yields a correlation of 0.57. The contact map captures the dominant contacts with noise coming from poorly predicted EVcouplings scores</p><p>The Seq-GNM produces a correlation with crystallographic B-factors of 0.60, which is within the same range as those produced by the GNM from structure of 0.57. Moreover, theoretical B-factor profiles obtained from both methods were able to identify the catalytic sites on all of the proteins. protein, while the right panel shows the theoretical values predicted by the Seq-GNM. We investigated whether secondary structure was a factor in how the B-factors were distributed across the protein, and if certain secondary structure domains would exhibit less agreement with experiment. In this context, the proteins were selected so that they had a variety of secondary structure components-2JAO contains primarily alpha helices, 1UMK is mainly composed of beta-sheets, and 1F2J is a combination of alpha helices and beta-sheets. For 2JAO, the exterior helices that are flexible (red) in the observed structure are all reproduced in the predicted structure. The one highly rigid (blue) helix in the observed structure was more flexible in the predicted structure but was still in overall agreement. There is a surprising amount of similarity between the observed and predicted structure of 1F2J, considering that it contains both alpha-helix and beta-sheet elements.</p><p>Similarly, 1UMK showed good agreement, except for some miniscule differences. This gives further evidence that the magnitudes of residue fluctuations predicted by the Seq-GNM model is representative of the crystallographic B-factor profiles for many proteins.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Fig 2. Color-coded ribbon diagrams using experimental and theoretical B-factors obtained by Seq-GNM.</head><p>The observed crystallographic B-factors (left) and the predicted B-factors from the Seq-GNM superimposed on the structure. The three proteins selected-2JAO, 1F2J, and 1UMK-were highresolution structures are better than 2.0 &#197;. The B-factors are color-coded according to their Bfactor profile on a spectrum of blue-white-red where blue represents the lowest B-factors (less mobility) and red represents the highest B-factors (more mobility). The B-factor scores were converted to a percentile rank so that they could be compared across different proteins. Each protein was also rotated 180&#176; so that both sides could be visualized and compared. Moreover, the proteins were selected so that they had a variety of secondary structure components-2JAO contains primarily alpha helices, 1UMK is mainly composed of beta-sheets, and 1F2J is a combination of alpha helices and beta-sheets.</p><p>In order to compare predicted B-factors with crystallographic B-factors, we extracted a subset of 39 structures that had a resolution better than 2.0&#197; to obtain more realistic crystallographic B-factors (unreliable B-factors are common for many PDB structures) <ref type="bibr">[18,</ref><ref type="bibr">46]</ref>. The same cutoff of 2.0&#197; was used in an earlier study to compare GNM predicted B-factors with those determined by crystallography <ref type="bibr">[47]</ref>. For all 39 structures, the Seq-GNM (using EC values from RaptorX) and structure GNM were used to estimate Seq-GNM had low correlations, the EC threshold could be tuned to yield much higher correlations. If this were done on a case-by-case basis, the overall correlation distributions would be even more similar. Thus, the EC threshold may be used as a tuning parameter to enhance the correlation coefficient for purposes of model optimization. analysis that co-evolution can identify positions of protein interfaces and protein-protein interaction partners and successfully reconstruct protein complexes and interaction network <ref type="bibr">[23,</ref><ref type="bibr">30,</ref><ref type="bibr">48]</ref>. Thus, it is not surprising to see that it yields good correlations with the experimental B-factors. Conversely, predicted B-factors from structure can only improve when the oligomeric structure is used for the GNM analysis. Even when using high-resolution X-ray structures, there is still some uncertainty about the realistic nature of crystallographic B-factors. For this reason, we thought a more plausible way to determine the efficacy of the Seq-GNM was to compare it directly with the structure GNM. The structure GNM is a robust method to describe thermal fluctuations in a protein, and in many cases, it performs as good or better than the ANM or MD <ref type="bibr">[47,</ref><ref type="bibr">49]</ref>.</p><p>We systematically evaluated the performance of the Seq-GNM and structure GNM for the entire set of 139 structures and obtained the correlation coefficients for each protein (Fig</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>5).</head><p>The average correlation of B-factors between the Seq-GNM and structure GNM model is 0.63 when using EC contacts from RaptorX and 0.43 when using contacts from EVcouplings. As seen in Fig <ref type="figure">5a</ref>, the distribution of correlation coefficients increases until 0.8, and then subsequently decreases. Interestingly, there are still an appreciable number of sequences yielding high correlations from 0.8 to 1.0. A distinguishing feature of the distribution is the pronounced peak in the bin from 0.7 to 0.8, indicating that significant fraction of our data set yields high correlations between 0.7 and 0.8. This is evidence that the Seq-GNM is efficiently capturing protein dynamics and supports the theory that ECs can be used as a substitute to 3D structure contacts in the GNM and still produce reliable dynamics profiles. The results of Seq-GNM based on contacts predicted by RaptorX usually yields B-factors that are closer to experimental B-factors as it uses structural information in its neural networks leading to better EC values and correlations with structure <ref type="bibr">[44]</ref>. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Assessing nSNV Phenotypes Using the Seq-GNM</head><p>Crystallographic B-factors have previously been used to assess the impact of nSNVs on protein function <ref type="bibr">[18,</ref><ref type="bibr">[50]</ref><ref type="bibr">[51]</ref><ref type="bibr">[52]</ref><ref type="bibr">[53]</ref><ref type="bibr">[54]</ref>. A study <ref type="bibr">[51]</ref> found that mutations on lysozyme that impaired function exhibited lower than average temperature factors, suggesting that rigid sites on the protein are more susceptible to destabilizing nSNVs than flexible sites <ref type="bibr">[55]</ref>.</p><p>Another study revealed a relationship between crystallographic B-factors and the impact of nSNVs on protein function <ref type="bibr">[56]</ref>. A commonly used tool to diagnose neutral and disease associated nSNVs, PolyPhen-2, uses evolutionary information, structural information, and crystallographic B-factors in its prediction model <ref type="bibr">[49]</ref>. These studies indicate that crystallographic B-factors can be used to predict the tolerance of a given residue to an nSNV (i.e., whether or not the occurrence of an nSNV would impact function).</p><p>We investigated whether B-factors predicted by the Seq-GNM were indicative of biological phenotype for nSNVs in the human population. A total of 738 nSNVs were mapped to the 139 enzymes, where 436 are disease-associated and 302 are neutral. S1</p><p>Table shows the number of disease and neutral nSNVs that occur on each protein. The Seq-GNM (using EC contacts from RaptorX and EVcouplings) was computed systematically for all 139 enzymes to obtain their dynamics profiles. The theoretical Bfactors scores were converted into a percentile rank so that the values could be compared across different proteins.</p><p>We initially looked at two human enzymes, human lysozyme (PDB: 1C7P) and human cytochrome reductase (PDB: 1UMK). They were chosen because they were short proteins that each contain a disease and neutral nSNV. Human lysozyme is a glycoside hydrolase that functions in the immune system by causing damage to cell walls of bacteria. Human cytochrome b5 reductase is involved in many oxidation/reduction reactions including converting methemoglobin to hemoglobin <ref type="bibr">[55]</ref>.</p><p>Each structure is color-coded according to its theoretical B-factor profile on a spectrum of blue-white-red. Sites that exhibit high mobility (flexible) are red, and sites that have low mobility (rigid) are blue. Regions that are characterized by low mobility are usually important for maintaining stability and function, thus a mutation could act to destabilize the protein and impair its function. Fig <ref type="figure">6a</ref> show the disease mutation I56T occurring on a rigid site with a B-factor of 0.0075. The neutral mutation T70N has a Bfactor of 0.96 indicating that it is a highly mobile site. Both I56T and T70N occur on loop regions. Although loops are generally more flexible, three alpha-helical domains encompass the loop containing I56T, which implies that it may be involved in interactions that contribute to stabilizing the functional conformation. Thus, the I56T mutation may disrupt these critical interactions and impair the enzymatic function. In the case of cytochrome reductase (Fig <ref type="figure">6b</ref>), the disease mutation R57Q is also on a rigid site with a B-factor of 0.14. Instead of being located near the core, R57Q is highly exposed protruding outwardly from a beta-barrel. However, since beta-barrels often harbor functional residues, the R57Q mutation may disrupt certain interactions critical for modulating function. The neutral mutation T116S is located on a loop and has a B-factor of 0.96, indicating that is it has a high mobility. In our earlier proteome wide study of over 100 human protein structures, we have shown that sites that are highly flexible (e.g., loop regions, or superficial sites) are typically more robust to mutations. Conversely, rigid sites are more susceptible to mutations that may disrupt function <ref type="bibr">[18,</ref><ref type="bibr">19]</ref>. For these two cases, the B-factors produced by Seq-GNM successfully distinguished between the disease and neutral nSNVs, without using the 3D structures. These findings prompted us to analyze the proteome-wide set of 139 enzymes to determine if the B-factors were indicative of phenotype for all 436 disease and 302 neutral nSNVs. The raw B-factor values were converted into a percentile rank (%B-factor) and then binned into 5 bins of size 0.2. We computed the observed-to-expected ratio of Bfactors, where the expected values were based on the B-factor distribution of all 51,618 sites across all 139 proteins, and the observed values were based on the B-factors of the 436 disease sites. The same process was done for the 302 neutral nSNVs. Under the null hypothesis that predicted B-factor of the disease associated nSNVs yields similar distribution of all the positions gathered from 139 enzyme sequences, the ratio of expected and observed sites harboring disease mutations for each %B-factor bin should be close to 1, which would imply that B-factor does not distinguish sites that are prone to disease. This is the null hypothesis that disease sites are distributed uniformly between sites with low and high mobility. However, the null hypothesis was rejected for the 436 disease nSNVs (p &lt;0.001). Fig <ref type="figure">7</ref> shows the observed-to-expected ratio plot of disease and neutral nSNVs, which indicates that disease nSNVs are overabundant at low %Bfactor sites (&lt;0.4) and under abundant at high %B-factor sites. Conversely, neutral nSNVs are overabundant at high %B-factor sites (&gt;0.6) and under abundant at low %B-factor sites. This evidence suggests that the occurrence of an nSNV on a site with a low B-factor is likely damaging based on the position irrelative of the substitution. This is in agreement with our previous proteome-wide study showing that substitutions at rigid sites are more often associated with diseases <ref type="bibr">[18]</ref>. Conversely, an nSNV on a high B-factor site is usually benign. Low B-factors usually signify a residue that is crucial for modulating functional motions (e.g., a hinge). Thus, mutations on these sites can severely impact function. High B-factor sites are more flexible (e.g., loops) and more robust to mutations. disease and neutral nSNVs using co-evolution obtained from only multiple sequence alignment. Moreover, it can be used as an additional feature for in silico predictions <ref type="bibr">[12]</ref>. 0.76, this is increased to 0.81 after including the B-factors of Seq-GNM (Fig <ref type="figure">8c-d</ref>). This result also demonstrates the efficacy of Seq-GNM in disease prediction as a complementary metric to other metrics used as features in classifiers.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Discussion</head><p>While we and others <ref type="bibr">[5,</ref><ref type="bibr">19,</ref><ref type="bibr">[59]</ref><ref type="bibr">[60]</ref><ref type="bibr">[61]</ref><ref type="bibr">[62]</ref><ref type="bibr">[63]</ref> have shown that the integration of conformational dynamics into genomic analysis will help next generation of approaches to predict the impact of novel missense mutations on the human proteome, the inherent limitations in availability of 3D structures compared to the vast number of sequences must be overall accuracy to 0.78. This analysis demonstrates that the Seq-GNM makes it possible to obtain estimates of dynamics without using a 3D structure, which will allow for the integration of conformational dynamics into large-scale analysis of genomic variants.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Methods</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Dataset</head><p>A curated set of 139 structures was selected for several reasons. First, they have high query coverage (&gt;80%) and sequence identity (&gt;80%) as found from a BLAST search, and the structures had already been modeled using the Modeller software package <ref type="bibr">[64]</ref> to account for any missing residues. Second, genetic variants were previously mapped onto these structures, such that the positions containing known nSNVs were already determined, enabling us to easily compare our results using sequence coevolution with the genetic variation data. A total of 738 genetic variants were obtained from the HumVar database <ref type="bibr">[58]</ref>, which was comprised of 436 disease and 302 neutral nSNVs. Finally, the structures were either monomers or the single-chain unit of a multimer with &lt;600 residues, allowing for tractable calculations of residue coevolution using the RaptorX web server <ref type="bibr">[42,</ref><ref type="bibr">44]</ref>, and EVfold (EVcouplings) <ref type="bibr">[21]</ref>. A table</p><p>are connected by springs with a single-parameter harmonic potential. In this structurebased GNM, the interacting residue pairs within the cutoff range are represented as contacts in the Kirchhoff (connectivity matrix).</p><p>In the proposed sequence-based GNM (Seq-GNM) approach we will instead use coevolving residue pairs (evolutionary couplings) as contacts in the Kirchhoff. In this way, the 3D structure is no longer a prerequisite to form a GNM. To construct the Kirchhoff, a threshold is defined where any evolutionary coupling scores above that threshold are sufficiently coupled such that they are spatially close in 3D structure. If a given evolutionary coupling pair meets the threshold criteria, it is assigned a value in the Kirchhoff for non-bonded contacts of -1 multiplied by its evolutionary coupling score (i.e., -1&#215;ECscore). This will permit that the strength of each connection will attenuate proportionally to the evolutionary coupling strength. The Kirchhoff can be decomposed into the individual contributions from the bonded contacts representing the chain connectivity (Rouse chain) and that from the non-bonded contacts <ref type="bibr">[56]</ref>. In the Seq-GNM the contribution of non-bonded contacts to the Kirchhoff is constructed according to</p><p>For the local chain connectivity (Rouse chain), we don't take into account evolutionary couplings, and matrix was constructed such that every residue pair i, i &#177; 1 to i, i &#177; 3 is in contact as</p><p>Then the overall Kirchhoff is the combination of the two contributions &#120548; &#119894;&#119895; = &#120548; &#119894;&#119895; &#119888;&#119888; + &#120548; &#119894;&#119895; &#119899;&#119887; . The vibrational dynamics due to thermal fluctuations can then be evaluated in the same way as the original GNM by inverting the Kirchhoff matrix. The magnitude of meansquare fluctuations is then written in terms of the inverse Kirchhoff as</p><p>This is proportional to the Debye-Waller temperature factors or B-factors, which describe the attenuation of X-ray scattering due to the thermal motions of atoms (&#119861; &#119894; = 8&#120587; 2 &#10216;(&#120549;&#119929; &#119894; ) 2 &#10217; 3 &#8260; ). Here there is no single-parameter force constant as in the GNM obtained from structure <ref type="bibr">[52]</ref>, and the pair-wise interactions are simply the strength of the evolutionary couplings as given by their ranked scores. The theoretical predictions of our Seq-GNM can be compared to the predictions of the original GNM obtained from structure as well as observed crystallographic B-factors. A general workflow of our method is presented as a flow diagram in Fig <ref type="figure">9</ref>. A workflow of our method to use predicted evolutionary couplings to determine protein dynamics and assess the functional impact of nSNVs. The initial input is an amino acid sequence, which is used to obtain MSA. Using MSA evolutionary coupling pairs are predicted through RaptorX and EVcouplings. The high scored evolutionary coupling pairs are assigned as contacts in our Seq-GNM to compute the dynamics profile of each protein. The dynamic profiles obtained from Seq-GNM can give insight into the functional impact of nSNVs. This was done for a curated set of 139 structures.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Optimizing Threshold Value for EC Scores</head><p>To ensure consistency when analyzing different proteins with varying lengths, we converted the raw scores of evolutionary couplings (EC) into a percentile rank. We computed the Seq-GNM for all 139 structures using a constant threshold percentile rank EC value to assign contacts and measured the correlation between the B-factors predicted by our Seq-GNM to the GNM obtained from structure. We used only the top  EVcouplings with that of the structural GNM for all 139 structures using a constant threshold for EC contacts. The GNM analysis was conducted 8 times using a constant threshold (between 0.92 and 0.99) each time. The best average correlations were produced when the constant threshold value of 0.98 was used.</p></div></body>
		</text>
</TEI>
