<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>A Suggestion of Converting Protein Intrinsic Disorder to Structural Entropy Using Shannon’s Information Theory</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>06/01/2019</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10137153</idno>
					<idno type="doi">10.3390/e21060591</idno>
					<title level='j'>Entropy</title>
<idno>1099-4300</idno>
<biblScope unit="volume">21</biblScope>
<biblScope unit="issue">6</biblScope>					

					<author>Hao-Bo Guo</author><author>Yue Ma</author><author>Gerald Tuskan</author><author>Hong Qin</author><author>Xiaohan Yang</author><author>Hong Guo</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We propose a framework to convert the protein intrinsic disorder content to structural entropy (H) using Shannon’s information theory (IT). The structural capacity (C), which is the sum of H and structural information (I), is equal to the amino acid sequence length of the protein. The structural entropy of the residues expands a continuous spectrum, ranging from 0 (fully ordered) to 1 (fully disordered), consistent with Shannon’s IT, which scores the fully-determined state 0 and the fully-uncertain state 1. The intrinsically disordered proteins (IDPs) in a living cell may participate in maintaining the high-energy-low-entropy state. In addition, under this framework, the biological functions performed by proteins and associated with the order or disorder of their 3D structures could be explained in terms of information-gains or entropy-losses, or the reverse processes.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Protein structures have played a central role in molecular biology ever since the "lock-and-key" model for enzymatic catalysis <ref type="bibr">[1]</ref>, along with the sequence-structure-function dogma <ref type="bibr">[2]</ref> and the protein folding landscape theory <ref type="bibr">[3]</ref>. However, many proteins from different organisms have been found to lack stable, well-folded tertiary structures in their native states. These proteins are termed the intrinsically disordered proteins (IDPs) or intrinsically disordered protein regions (IDPRs) <ref type="bibr">[4]</ref>. The term "intrinsic" here means that the quantity is encoded in the protein primary amino acid sequence, on the same ground as the IDP denominator <ref type="bibr">[5]</ref>. The abundance of IDPs and IDPRs, and the vital roles they play in cellular organisms and viruses, has been appreciated in a plethora of publications during the past two decades <ref type="bibr">[6,</ref><ref type="bibr">7]</ref>. Despite the extensive discussions of IDPs in literature, "paradoxes" seem to present in IDP structures and functions, including reduced structural content versus enhanced structural heterogeneity, simplified folding landscapes versus complexified functionalities, etc. <ref type="bibr">[8]</ref> The intrinsic disorder content of each residue in a protein can be predicted from its primary amino acid sequence. It is now well understood that the IDPs lack the hydrophobic cores necessary for successful folding <ref type="bibr">[9]</ref> and that the charged residues (R, K, H, E, and D) are disorder-promoting <ref type="bibr">[10]</ref>. An early study showed the association between the ratios of the numbers of charged over hydrophobic residues and the disorder content for a relatively small set of proteins <ref type="bibr">[11]</ref>; however, this oversimplified Entropy 2019, 21, 591 2 of 14 approach was not predictive for larger protein sets. As representative present algorithms, the Predictor Of Natural Disordered Regions (PONDR) family predictors use artificial neural networks to integrate attributors, including the amino acid composition, sequence complexity, hydropathy, and net charge-and they yield reasonably accurate intrinsic disorder predictions <ref type="bibr">[12]</ref>.</p><p>Developing accurate intrinsic disorder predictors is an endless path <ref type="bibr">[13]</ref>, considering the astronomical size of the protein world-e.g., 20 100 different proteins of 100 amino acids-and the crowded and dynamic cellular environments surrounding the proteins. Quality assessment has been performed among different protein disorder predictors <ref type="bibr">[14,</ref><ref type="bibr">15]</ref>, which adopt the same standard for the intrinsic disorder score, representing the probability of being disordered-i.e., a fully ordered residue scores 0 (0%) and a fully disordered residue scores 1 (100%), respectively. The need for a unified guideline to report a protein intrinsic disorder has also been demanded <ref type="bibr">[16]</ref>. In intrinsic disorder interpretations, ambiguities exist for both amino acids and proteins. For example, if an amino acid has an ID score of 0.5, is it an ordered or disordered residue? Similarly, for a protein with 50% disordered residues, is it an ordered or disordered protein?</p><p>In the present work, we suggest a conversion to the intrinsic disorder contents of the residues in the proteins into entropy contents, similar to the process developed by Shannon over 70 years ago <ref type="bibr">[17]</ref>. In information theory (IT), the Shannon entropy uses a scoring system very similar to the ID predictors: A fully determined state scores 0 and a fully uncertain state scores 1, respectively. It is clear that the intrinsic disorder contents consist of the information that the protein carries. However, a closer examination must be taken to determine if the intrinsic disorder content could be treated as the disorder entropy, or structural entropy. Below we will show that the protein intrinsic disorder (D) cannot be regarded as the structural entropy (H) directly. In addition, a means for converting D to H was proposed that provides a conceptual framework that quantitatively defines the structural information carried by a protein in vitro.</p><p>Hereafter, we will use structural entropy-or simply entropy-in accordance with Shannon's insight that the information and entropy serve as measures of each other: Information is negative entropy, or negentropy (Equation (2) and see below). Note that the term "structural" defines a meaning for the information; however, the semantic aspect of information was originally discarded by Shannon <ref type="bibr">[17]</ref> and later by some molecular biologists <ref type="bibr">[18]</ref>. Still, other biologists argued that meanings or purposes are crucial to biological information <ref type="bibr">[19,</ref><ref type="bibr">20]</ref>. Discussions on syntactics versus semantics in IT can be found in other publications <ref type="bibr">[21]</ref>, but are beyond the scope of this paper.</p><p>A fundamental difference between D and H is that the structural entropy, H, is an additive quantity, whereas the ID score, D, is not (see the Appendix in the supplementary material (SM)). That is, adding up structural entropies of all residues leads to the structural entropy of the entire protein, whereas we cannot infer the ID score of a protein by adding up the ID scores of all its residues. We suggest that the (intrinsic) structural entropy either of a single residue in the protein or of the entire protein can be directly calculated from a disorder predictor. Our suggestion indicates that it is feasible to quantitatively compare structural entropies of two proteins, regardless of their respective disorder percentages, which are rough and qualitative measures of the protein disorder.</p><p>Our suggestion may also help to understand protein structures and functions in the cells quantitatively and provide an easy answer to the paradoxes mentioned above: Reduction in information is equivalent to an increase in entropy or complexity. Under this framework, the processes in a living cell might be quantitatively viewed as the procedures for maintaining the low entropy levels.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Shannon's Equation: Structural Entropy (H), Information (I), and Capacity (C)</head><p>Shannon's equation <ref type="bibr">[17]</ref> implies:</p><p>Entropy 2019, 21, 591</p><p>where H(X) is defined as the entropy-or uncertainty-of X, which is any source of information with the number of states n and p i is the probability of the i-th state of X. K is an arbitrary scaling factor, which means that the base of the logarithm function could vary depending on the unit of the entropy. For example (set K = 1), a base of two corresponds to the unit of the binary digit (bit), and the natural base (ln) leads to the unit of the natural unit (nat). The maximum value of H(X) H max (X), determines the upper limit or the capacity of X, C(X), which Shannon considered the channel capacity for communications <ref type="bibr">[17]</ref>. The second half of Equation (1), though not explicitly stated by Shannon, indicates that, for a given capacity C(X), the entropy H(X) could serve as a measure of the information I(X) that one has received-that is, &#8710;I(X) = -&#8710;H(X).</p><p>(2)</p><p>As a consequence, the information change &#8710;I was termed negentropy (negative entropy) <ref type="bibr">[22]</ref>, which is what Maxwell's demon receives to generate order out of disorder <ref type="bibr">[23]</ref>.</p><p>To connect the intrinsic disorder contents obtained by regular disorder predictors, the structural entropy of a protein X with a length of L is defined as:</p><p>where d i is the intrinsic disorder content of the i-th residue.</p><p>In the fully disordered state (d i = 1 for all residues), the protein exhibits the maximal entropy, expressed as the structural capacity of:</p><p>which carries zero structural information I(X). When all residues are fully ordered (d i = 0 for all residues), H(X) = 0 since lim x&#8594;0</p><p>x log x = 0, and the I(X) reaches the maximum of L. Normally, 0 &#8804; d i &#8804; 1 and 0 &#8804; H(X) [or I(X)] &#8804; L. The criteria for using the logarithmic measure suggested by Shannon <ref type="bibr">[17]</ref> then holds. In Equation (3), each residue is in a two-state system, with the probabilities x i in one state and (1 -x i ) in the other state, and the total number of states is n = 2 L . More details and potential alternatives to this approach have been summarized in the Appendix in Supplementary. Equation <ref type="bibr">(4)</ref> indicates that the structural capacity (C) of a protein is its amino acid (AA) sequence length (L). In addition, it had been reported that C (protein length L was used in the original report, but since it is equivalent to C, only C will be used hereafter to prevent unexpected confusion) has an exponential distribution or a gamma distribution <ref type="bibr">[24]</ref>. We used three models, the exponential, gamma, and power law distributions, to fit the structural capacities arranged in hierarchical ranks from the smallest to the largest (Table <ref type="table">S1</ref>). The exponential model provides the best overall fit to the structural capacities in all proteomes studied in the present work. The gamma model, in some cases, produces a better fit than the exponential model, but it fails to recapture the small-C regions for the two animal proteomes (human and fruit fly). The power law model fits the small-C but fails at the large-C regions. Therefore, the exponential model is the best choice here to describe the protein structural capacity distributions, exemplified in Figure <ref type="figure">1A,</ref><ref type="figure">D</ref>. Fittings of all proteomes studied here are summarized in Table <ref type="table">S1</ref> in the SM.</p><p>An exponential fit of C implies a linear fit of logC (with Cs ranked hierarchically), which may therefore be a better measure of the protein-capacity distributions in proteomes <ref type="bibr">[24]</ref>. Interestingly, with a bin size of &#8710;log 2 C = 1, Gaussian distributions can be observed in logC of all proteomes exemplified in Figure <ref type="figure">1B,</ref><ref type="figure">D</ref>. The underlying mechanisms of this Gaussian distribution have not been discussed in the literature; however, the distribution density can also be understood as the potential of mean force, or free energy, using the relationship:</p><p>where R is the gas constant, T is the temperature, and &#961; i is the density of proteins at the i-th histogram of protein capacities. A preferential free-energy well (insets in Figure <ref type="figure">1B</ref>,D) may present in the proteome, confining the protein structural capacity distributions. The locations and depths of this well vary among organisms: For the human proteome, the well depth is 3.16 kcal/mol centered at C = 592, whereas for JCVI-Syn3.0 the well depth is 1.70 kcal/mol centered at C = 401.</p><p>where di is the intrinsic disorder content of the i-th residue.</p><p>In the fully disordered state (di = 1 for all residues), the protein exhibits the maximal entropy, expressed as the structural capacity of:</p><p>which carries zero structural information I(X). When all residues are fully ordered (di = 0 for all residues), H(X) = 0 since lim &#8594; &#119909;log&#119909; = 0, and the I(X) reaches the maximum of L. Normally, 0 &#8804; di &#8804; 1 and 0 &#8804; H(X) [or I(X)] &#8804; L. The criteria for using the logarithmic measure suggested by Shannon <ref type="bibr">[17]</ref> then holds. In Equation ( <ref type="formula">3</ref>), each residue is in a two-state system, with the probabilities xi in one state and (1 -xi) in the other state, and the total number of states is n = 2 L . More details and potential alternatives to this approach have been summarized in the Appendix.  <ref type="table">S1</ref> of the Supplementary Material (SM).</p><p>Equation <ref type="bibr">(4)</ref> indicates that the structural capacity (C) of a protein is its amino acid (AA) sequence length (L). In addition, it had been reported that C (protein length L was used in the original report,  <ref type="table">S1</ref> of the Supplementary Material (SM).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Entropy-to-Information Ratio, R</head><p>The extremely dynamic environment in the cell renders difficulties in detection or prediction of the changes of the structural entropies and structural information. However, the sum of both quantities equals the structural capacity that is the same as the AA sequence length, and the information gain always equals the entropy loss and vice versa. Hence, it is useful to know the relationships between the intrinsic structural entropies and the information that can be derived (Equation ( <ref type="formula">3</ref>)) from the residual disorder contents, which, in turn, can be predicted based on the protein amino acid sequences.</p><p>Here, we define the entropy-to-information ratio, R, as R = H:I. R ranges from 0 (for a fully ordered protein with H = 0 and I = C) to &#8734; (for a fully disordered protein with H = C and I = 0). When R &gt; 1, the protein is entropy rich (H &gt; I), and when R &lt; 1, the protein is information rich (H &lt; I).</p><p>We noticed no apparent correlation between C and R for all proteomes-e.g., the Pearson correlation coefficients between C and R is -0.03 for both H. sapiens and Arabidopsis thaliana, and a similar trend can be found in other proteomes studied in the present work. This lack of correlation permits the construction of a protein space with two attributes C and R: The former represents the sum and the latter represents the ratio of the structural entropy (H) and information (I), respectively. This space is termed the protein CR-space hereafter. In the CR-space, C i sets the upper limit and R i gives the intrinsic ratio between the two quantities of the total H i and I i of the i-th protein P i . Based on the Gaussian distribution in both log 2 C and log 2 R, the protein distributions in the CR-space are represented in Figure <ref type="figure">2</ref>, for proteins expressed in representative prokaryotes (bacteria and archaea, Figure <ref type="figure">2A</ref>), eukaryotes (Figure <ref type="figure">2B</ref>), viruses (Figure <ref type="figure">2C</ref>), and datasets collected from the DisProt database [25] and protein data bank (PDB) <ref type="bibr">[26]</ref> (Figure <ref type="figure">2D</ref>). similar trend can be found in other proteomes studied in the present work. This lack of correlation permits the construction of a protein space with two attributes C and R: The former represents the sum and the latter represents the ratio of the structural entropy (H) and information (I), respectively. This space is termed the protein CR-space hereafter. In the CR-space, Ci sets the upper limit and Ri gives the intrinsic ratio between the two quantities of the total Hi and Ii of the i-th protein Pi. Based on the Gaussian distribution in both log2C and log2R, the protein distributions in the CR-space are represented in Figure <ref type="figure">2</ref>, for proteins expressed in representative prokaryotes (bacteria and archaea, Figure <ref type="figure">2A</ref>), eukaryotes (Figure <ref type="figure">2B</ref>), viruses (Figure <ref type="figure">2C</ref>), and datasets collected from the DisProt database [25] and protein data bank (PDB) <ref type="bibr">[26]</ref> (Figure <ref type="figure">2D</ref>).  <ref type="bibr">[25,</ref><ref type="bibr">26]</ref>. Each protein i is represented by a black dot. The horizontal axis is C i = H i + I i and the vertical axis is the ratio between H i and I i such that R i = H i /I i . The horizontal dashed line (red) corresponds to R i = 1; proteins above this line are entropy-rich with H i &gt; I i , and those under this line are information-rich with H i &lt; I i . The vertical dashed line serves as a reference, using the median capacity of the human proteome at C i = 417. The total H-to-I ratios (&#931;H:&#931;I) are given in the maps. Note that the boundaries of the horizontal (R) and vertical (C) axes vary across different proteomes and/or protein sets.</p><p>Table <ref type="table">1</ref> summarized the protein distributions in the CR-space of all proteomes and datasets, and more discussions will be given as follows. All data are represented using (C, R), where C is the structural capacity (vertical axis in Figures <ref type="figure">2</ref> and<ref type="figure">4</ref>) and R is the H-to-I ratio (horizontal axis in Figures <ref type="figure">2</ref> and<ref type="figure">4</ref>), respectively. The data for proteins with extreme C's or R's are given.</p><p>For those with multiple entries, a range for C or R and the total number of proteins is listed. The infinite (&#8734;) values in R come from the totally disordered proteins with I = 0 and H = C, respectively. N represents the total number of different proteins in the proteome or dataset; for eukaryotic proteomes, only the primary protein at each gene locus is counted. The last column gives the total H divided by total I of all proteins in the proteome or dataset. No average R's are given because there are proteins with R of &#8734; in all eukaryotes, datasets, and the archaea Lokiarchaeum and Nanoarchaeum equitans. Details of the names of organisms and datasets can be found in the Methods section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Prokaryotes vs. Eukaryotes</head><p>Figure <ref type="figure">2A</ref>,B show the protein distributions of prokaryotic and eukaryotic proteomes, respectively. The total number of proteins vary from the smallest synthesized bacterium JCVI-Syn3.0 that has only 438 proteins <ref type="bibr">[27]</ref> to the monocot plant Oryza sativa (rice) that has 48,782 proteins <ref type="bibr">[28,</ref><ref type="bibr">29]</ref>. A significant difference between prokaryotes and eukaryotes is the overall R (&#931;H:&#931;I) of the proteomes: Values for prokaryotes range from 1.29 (Nanoarchaeum equitans) to 1.57 (Ignicoccus hospitalis) and values for eukaryotes range from 2.10 (Amborella trichopoda) to 2.53 (H. sapiens).</p><p>Table <ref type="table">1</ref> also shows the extreme C and R of all proteomes. The most distinct difference between prokaryotes and eukaryotes is the number of totally disordered proteins that have R max = &#8734;, which is induced by I = 0 (H = C) of these proteins. In the prokaryotes, only one such protein can be found in Lokiarchaeum <ref type="bibr">[30]</ref>, which exhibits strong potential to represent the ancestral host archaeon that accidentally swallowed the bacterium (Rickettsiale being a potential relative <ref type="bibr">[31]</ref>), which fortunately turned into a symbiont and eventually evolved to the mitochondria of eukaryotic cells. All eukaryotic proteomes, however, contain fully disordered proteins, from two in Saccharomyces cerevisiae to 37 in H. sapiens.</p><p>It should be mentioned that there are 25 seleno-proteins in the human proteome, which has an unusual type of residue of selenocysteine (one letter code U, or three letter code Sec), though the disorder contents of these proteins cannot be predicted by the current version of PONDR predictors (see the SM for a list of the human seleno-proteins). These proteins were not included in the final analysis of the current study.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Viruses</head><p>Figure <ref type="figure">2C</ref> shows the protein distributions of two giant DNA viruses (giruses) and two RNA viruses. The R's of viruses vary significantly from 0.96 (human coronavirus) to 2.09 (Pandoravirus).</p><p>Although the RNA viruses generally contain less than 10 protein-encoding genes, the protein intrinsic disorder-i.e., structural entropy-may be reflective of immune evasion and virus-host interactions, especially for the Ebola virus, in which all proteins are entropy-rich and have R &gt; 1 <ref type="bibr">[32]</ref>. On the other hand, the giruses have proteomes similar to those of some cellular organisms in both total protein numbers and their distributions in the CR-space. For example, &#931;H:&#931;I of Mimivirus is 1.56, close to that of the prokaryotes, and &#931;H:&#931;I of Pandoravirus is 2.09, close to that of some of the eukaryotes, such as the basal angiosperm A. trichopoda. Moreover, Pandoravirus has a fully disordered protein with a capacity of 29.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Datasets</head><p>The present work focuses on the intrinsic disorder and omits the environmental cues, which will certainly have great impact on determinations of the structural information and entropies. In contrast, the proteins collected in the datasets (DisProt [25] and PDB <ref type="bibr">[26]</ref>) have been validated (or somewhat biased) experimentally. The protein distributions of the selected datasets are shown in Figure <ref type="figure">2D</ref>.</p><p>Not surprisingly, the proteins from the DisProt database exhibit the highest &#931;H:&#931;I of 2.99. However, several proteins (8 out of 803, or 1%) have R &lt; 1. This difference may be caused either by the protein disorder prediction methods used here or by inherent limits of the validation protocol of the database. The eight information-rich proteins from the DisProt database, as well as their sequences, have been listed in the SM. Apparently, the human Titin <ref type="bibr">[33]</ref> protein (C = 34,350, R = 2.52) is collected in DisProt. However, a closer look finds that that protein in DisProt points to the nuclear magnetic resonance (NMR) structure (PDB entry 1BPV) of the type-1 module of the Titin protein (C = 112, R = 2.30). Nevertheless, the Titin protein does have an extremely long IDPR based on our calculation (Figure <ref type="figure">S1</ref> in the SM).</p><p>We further analyzed the protein sequence sets from the PDB. After removing redundant entries, each sequence in the following datasets represents a unique sequence: NMR set for sequences of structures collected by solution NMR (8,476 entries), X-ray 1.5 set for sequences from X-ray crystallographic structures with resolution higher than 1.5 &#197; (5,842 entries), and X-ray 3.0 set for sequences from X-ray structures with resolution lower than 3.0 &#197; (10,636 entries). The NMR method largely limited the size of proteins with the majority of Cs smaller than 200. However, the structures detected by NMR generally have higher flexibility, leading to larger R's (&#931;H:&#931;I = 2.65). The X-ray 1.5 and X-ray 3.0 sets have &#931;H:&#931;I of 1.78 and 1.73, respectively, but no significant difference was observed between these two datasets. It seems that the structural entropies of protein structures deposited in the PDB are method sensitive but are not determined by the resolution of the crystallography.</p><p>It is worth noting that many structures deposited in the PDB (especially those solved by X-ray crystallography) have missing residues. However, here we are interested in their sequences that contain the missing residues. We also noticed that there are fully disordered (i.e., R = &#8734;) proteins in all datasets from PDB (Table <ref type="table">1</ref>). It may not be surprising to solve the highly disordered proteins using NMR, especially for relatively small proteins. However, even the X-ray 1.5 set contains fully disordered proteins. We listed the fully disordered proteins with C &gt; 20 in both X-ray 1.5 and X-ray 3.0 sets in the SM (Tables <ref type="table">S2</ref> and<ref type="table">S3</ref>). First, these proteins are relatively short with the longest one (1JCD, X-ray 1.5 set) having 52 residues. It may be possible that the PONDR predictor did not perform well for these sequences, particularly the collagen sequences (Tables <ref type="table">S2</ref> and<ref type="table">S3</ref>). On the other hand, all listed proteins either are subunits of big protein complexes or form oligomers (mostly homo-trimers), suggesting coupled folding and binding may help these IDPs attain the folded structures <ref type="bibr">[34]</ref>. The mutual folding induced by the binding (MFIB) database has collected 205 IDPs from PDB (both X-ray and NMR structures), with their folding induced by binding <ref type="bibr">[35]</ref>. None of the proteins (Tables <ref type="table">S2</ref> and<ref type="table">S3</ref>) have been collected in the MFIB database, but several entries listed in Table <ref type="table">S3</ref> have also been collected in the intrinsically disordered proteins with extensive annotations and literature (IDEAL) <ref type="bibr">[36]</ref> database. Our findings may help expand the collection of MFIB, IDEAL, and similar databases.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Random Sequences</head><p>We randomly generated 500 proteins with random capacities in the range <ref type="bibr">[50,</ref><ref type="bibr">800]</ref>. Interestingly, this random set has a &#931;H:&#931;I ratio of 1.020 (Figure <ref type="figure">S5</ref> in the SM), indicating that the total entropy and information are roughly 1:1. This ratio is smaller than those from cellular proteomes shown in Figure <ref type="figure">2</ref> and is close to the viral proteome of human coronavirus (Figure <ref type="figure">2C</ref>), suggesting that when total structural entropy is found to be relatively larger than total structural information in cellular organisms, that relationship is not random.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Evolution of Protein Structural Entropies and Information</head><p>Based on the analysis of cellular organisms and viruses, it seems that the &#931;H:&#931;I ratio may set a boundary at approximately 1.6 to 2.0 for separating prokaryotes and eukaryotes. In addition, if there were such a boundary, the two giant DNA viruses (giruses) studied here would be located at different branches, with Mimivirus close to or within the prokaryotic branch and Pandoravirus close to or within the eukaryotic branch, respectively.</p><p>Previously, by using the protein length L (structural capacity C) and the disorder content D, distributions of proteins in the space with attributes L and D have been used to reconstruct a phylogenetic tree that indicates intriguing evolutionary dynamics associated with the variations of L and D in the proteomes <ref type="bibr">[37]</ref>. A similar approach is used here to study protein distributions in the HI-space.</p><p>The CR-space is split into 5 &#215; 4 blocks (Figure <ref type="figure">3A</ref> and Table <ref type="table">2</ref>). The gene densities <ref type="bibr">[38]</ref> at different blocks are then calculated, and the distance between two organisms A and B is defined using the Euclidean Equation:</p><p>where D AB is the distance between organisms A and B and X ij (X = A or B) is the gene density at the ij-th (1 &#8804; I &#8804; 5, 1 &#8804; j &#8804; 4) block. The calculated distance matrix is converted to a phylogenetic tree, as shown in Figure <ref type="figure">3B</ref>. The RNA viruses are excluded in this tree because both have fewer than 10 proteins and therefore are not statistically significant.</p><p>Entropy 2018, 20, x FOR PEER REVIEW 8 of 13</p><p>and D in the proteomes <ref type="bibr">[37]</ref>. A similar approach is used here to study protein distributions in the HIspace. The CR-space is split into 5 &#215; 4 blocks (Figure <ref type="figure">3A</ref> and Table <ref type="table">2</ref>). The gene densities <ref type="bibr">[38]</ref> at different blocks are then calculated, and the distance between two organisms A and B is defined using the Euclidean Equation:  </p><p>As expected, the phylogenetic tree clearly separates eukaryotes from prokaryotes. This tree also correctly separates both animals (H. sapiens and Drosophila melanogaster) and plants (A. thaliana and O. sativa), whereas the basal angiosperm (A. trichopoda) and moss (Physcomitrella patens) are at the basal locations of the eukaryotic branch. Mimivirus is located at the prokaryotic branch and Pandoravirus is located at the eukaryotic branch, sitting with the moss P. patens. This trend is also expected from the &#931;H:&#931;I ratios. However, similar to our previous study <ref type="bibr">[37]</ref>, for all prokaryotes, the phylogeny failed to separate bacteria from archaea without introducing other attributes. Further investigations, such as considering the domain components in the proteins <ref type="bibr">[39]</ref>, will be required to understand the evolution of the protein entropy.</p><p>The phylogenetic tree presented in Figure <ref type="figure">3B</ref> (tree of proteomes (ToP)) is by no means for capturing the evolution of species that are usually derived based on multi-sequence alignments (MSAs) of genes. However, this ToP might be visualized as the evolution of structural entropies. It seems that higher complexity favors higher entropic content, especially the entropy-to-information ratios, in the proteins.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Structural Entropy of Regulatory Proteins</head><p>Two categories of proteins-the transcription factors (TFs) and kinases-are further examined from three model organisms, Escherichia coli, A. thaliana, and H. sapiens. Both categories are abundant in the organism and play regulatory roles in the cells. Whilst regulations by the TFs often directly affect the expressions or translations of the genes or transcripts through recognitions and/or interactions with the nucleic acids, the kinases perform post-translational regulations by catalyzing the phosphorylation reactions to the target AAs of proteins (often Ser, Thr, or Tyr) through the adenosine triphosphate (ATP) cofactors.</p><p>Figure <ref type="figure">4</ref> shows that the eukaryotes (the plant A. thaliana and animal H. sapiens) possess higher &#931;H:&#931;I ratios than the prokaryote (E. coli), which is consistent with the trend in Figure <ref type="figure">2</ref>. Moreover, the regulatory mechanisms may influence the H:I ratios (R's) of the proteins such that the recognition and interactions with the DNA or RNA targets for the TFs may involve larger amounts of information gains or entropy losses compared to the enzymatic catalysis by the kinases via a lock-and-key-like model. Therefore, we may propose that more intrinsic structural entropies are required for the regulations by TFs than the those required by the kinases, or possibly other enzymes catalyzing different reactions in the cells. These differences are more significant in the eukaryotes (H. sapiens and A. thaliana) than in the prokaryotes (E. coli). Moreover, the A. thaliana TFs have a larger &#931;H:&#931;I ratio compared to those of H. sapiens, indicating that the plant TFs may possess overall higher structural entropies than do the animal TFs.</p><p>from three model organisms, Escherichia coli, A. thaliana, and H. sapiens. Both categories are abundant in the organism and play regulatory roles in the cells. Whilst regulations by the TFs often directly affect the expressions or translations of the genes or transcripts through recognitions and/or interactions with the nucleic acids, the kinases perform post-translational regulations by catalyzing the phosphorylation reactions to the target AAs of proteins (often Ser, Thr, or Tyr) through the adenosine triphosphate (ATP) cofactors. Figure <ref type="figure">4</ref> shows that the eukaryotes (the plant A. thaliana and animal H. sapiens) possess higher &#931;H:&#931;I ratios than the prokaryote (E. coli), which is consistent with the trend in Figure <ref type="figure">2</ref>. Moreover, the regulatory mechanisms may influence the H:I ratios (R's) of the proteins such that the recognition and interactions with the DNA or RNA targets for the TFs may involve larger amounts of information gains or entropy losses compared to the enzymatic catalysis by the kinases via a lock-and-key-like model. Therefore, we may propose that more intrinsic structural entropies are required for the regulations by TFs than the those required by the kinases, or possibly other enzymes catalyzing different reactions in the cells. These differences are more significant in the eukaryotes (H. sapiens and A. thaliana) than in the prokaryotes (E. coli). Moreover, the A. thaliana TFs have a larger &#931;H:&#931;I ratio compared to those of H. sapiens, indicating that the plant TFs may possess overall higher structural entropies than do the animal TFs.  <ref type="figure">2</ref>, each protein is represented by a black dot in the maps, the horizontal axis is for the structural capacity C i, and the vertical axis is for the H:I ratios R i . &#931;H:&#931;I is given for each group of proteins.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Discussion</head><p>Many algorithms have been developed to predict the probabilities of being disordered for amino acids in proteins, which are presented as the disorder contents of the residues; however, it is not straightforward to quantify how "disordered" a protein is using these approaches. Here, we suggest converting the disorder content to structural entropy, which is continuous and additive and therefore makes it feasible to quantitatively compare two proteins in terms of the degrees of structural entropy and/or information they possess.</p><p>As Erwin Schr&#246;dinger noted, "life feeds on negative entropy" <ref type="bibr">[40]</ref>. The nature of the negative entropy (negentropy) is, actually, the information. Our results suggest that proteins and other biomolecules also feed on information. The capacity of every protein is fixed (Equation ( <ref type="formula">4</ref>)), and, therefore, the higher the information content it receives, the lower the entropy it possesses. A recent paper further indicates that the IDPs in the cell are either globally or partially structured. <ref type="bibr">[41]</ref> In other words, the structural entropy carried by the IDPs is fully or partially registered by the structural information from the crowded and dynamic intracellular environment of the living cell.</p><p>It might also be expected that the structural entropy/information proposed here has implications in biological processes where structural information-gain/entropy-loss (or the reverse) occur, such as in gene transcriptions, translations, protein folding pathways, protein-protein or protein-ligand interactions, and enzymatic reactions, to name a few. A recent work <ref type="bibr">[42]</ref> reported the general transcriptional coactivator CBP that binds to two different proteins, the negative feedback regulator CITED2 (target T1) and the hypoxia-inducible factor HIF-1&#945; (target T2). The binding affinities of CBP to both substrates are very close. However, even at a modest concentration, the substrate T1 rapidly substitutes T2 in binding to CBP. In addition, it was shown that T1 exhibits higher disorder contents-and therefore higher structural entropy-than T2. Under our approach, the higher specificity of T1 can be understood as the higher information-gain of CBP upon binding to T1 than that of binding to T2.</p><p>This approach might be extended to DNA, RNA, and other biomolecules in cases where their entropies and information could be estimated. It should be emphasized here that to quantify the information or entropy, the meaning of the information might be needed. This may especially be the case for the biological information. For instance, the meaning of the information discussed in the present work is the structure of proteins-or, more precisely, the order-to-disorder continuum in the protein world.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Methods</head><p>Proteomes of organisms from all three domains of life have been surveyed in the present work. Three bacteria include the alphaproteobacterium Rickettsiales bacterium Ac37b <ref type="bibr">[43]</ref>, which may be a close precursor of eukaryotic mitochondria, the cyanobacterium Synechococcus elongatus PCC 7942 <ref type="bibr">[44]</ref>, which may represent the precursor of plant plastids, and, recently synthesized, the smallest known free-living organism JCVI-Syn3.0 <ref type="bibr">[27]</ref>. Three archaea include the smallest known free-living archaeon I. hospitalis; its parasite, the smallest known archaeon N. equitans <ref type="bibr">[45]</ref>; and Lokiarchaeum (sp. GC14_75) <ref type="bibr">[30]</ref>. The latter was reported to possess eukaryotic genes, such that its close relative, may have served as the host to accommodate an obligate parasitic bacterium, and the archaeon and bacterium accidently merged to produce the first eukaryotic cell some 2 billion years ago. The eukaryotic organisms include three plant species: the basal angiosperm Amborella trichopoda <ref type="bibr">[46]</ref>, the monocot plant O. sativa <ref type="bibr">[29]</ref> (rice), and the eudicot and model plant A. thaliana <ref type="bibr">[47]</ref>; two animal species: H. sapiens <ref type="bibr">[48,</ref><ref type="bibr">49]</ref> (human) and D. melanogaster <ref type="bibr">[50]</ref> (fruit fly); the moss P. patens <ref type="bibr">[51]</ref>; and the yeast S. cerevisiae <ref type="bibr">[52]</ref>. Several viral proteomes were also studied, including two RNA viruses with less than 10 genes from the Ebola virus and human coronavirus (which causes the common cold), plus two giant DNA viruses (giruses), the Mimivirus <ref type="bibr">[53]</ref>, and Pandoravirus (or P. salinus) <ref type="bibr">[54]</ref>. In addition, protein sequences from the database of experimentally confirmed disordered proteins, DisProt (v7.0) [25], and those from the protein data bank <ref type="bibr">[26]</ref> (PDB, up to December 16, 2016) are also analyzed. The redundant sequences from the PDB have been removed, and each entry represents a unique protein sequence in the analysis: For each of the protein sets, if there are multiple identical copies of a sequence, such as those from identical chains from a homo-dimer or other oligomers, only one unique sequence is retained for further analysis.</p><p>The PONDR-VSL2 algorithm <ref type="bibr">[55]</ref> was applied to predict the ID content of all residues in a protein. The obtained disorder content was then converted to probabilities that associated with the structural entropy, using Equation 3. A neighbor-joining method, from the T-REX web server <ref type="bibr">[56]</ref>, was used to convert the distance matrix to phylogenetic trees. More details on converting the disorder contents to structural entropy, and the fitting of the structural capacities using different models, can be found in the supplementary materials (SM).</p><p>GitHub repository: <ref type="url">https://github.com/haoboguo/Protein-Structural-Entropy</ref>. Code written in R (random.seq.R) was used to generate 500 random sequences with random capacities in the range <ref type="bibr">[50,</ref><ref type="bibr">800]</ref>. A shell-script was written (entropy.csh) to perform protein structural entropy/information/capacity evaluations, using Equation (3). The PONDR-VSL2 predictor was downloaded from <ref type="url">http://www.dabi. temple.edu/disprot/predictor.php</ref>. JAVA is required to run this code to perform disorder predictions. See descriptions in above repository.</p></div></body>
		</text>
</TEI>
