Title: New amino acid substitution matrix brings sequence alignments into agreement with structure matches
Abstract: Protein sequence matching presently fails to identify many structures that are highly similar, even when they are known to have the same function. The high packing densities in globular proteins lead to interdependent substitutions, which have not previously been considered for amino acid similarities. At present, sequence matching compares sequences based only upon the similarities of single amino acids, ignoring the fact that in densely packed proteins there are additional conservative substitutions representing exchanges between two interacting amino acids, such as a small‐large pair changing to a large‐small pair; these are substitutions that are not individually so conservative. Here we show that including information for such pairs of substitutions yields improved sequence matches, and that these yield significant gains in the agreement between sequence alignments and structure matches of the same protein pair. The results show sequence segments matched where structure segments are aligned. There are gains for all 2002 collected cases in which the sequence alignments were not previously congruent with the structure matches. Our results also demonstrate a significant gain in detecting homology for “twilight zone” protein sequences. The amino acid substitution metrics derived have many other potential applications, for annotations, protein design, mutagenesis design, and empirical potential derivation.
Award ID(s):
1661391
PAR ID:
10452433
Author(s) / Creator(s):
 ;  
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Proteins: Structure, Function, and Bioinformatics
Volume:
89
Issue:
6
ISSN:
0887-3585
Page Range / eLocation ID:
p. 671-682
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Residues in proteins that are in close spatial proximity are more prone to covary, as their interactions are likely to be preserved due to structural and evolutionary constraints. If we can detect and quantify such covariation, physical contacts may then be predicted in the structure of a protein solely from the sequences that decorate it. To carry out such predictions, and following the work of others, we have implemented a multivariate Gaussian model to analyze correlation in multiple sequence alignments. We have explored and tested several numerical encodings of amino acids within this model. We have shown that 1D encodings based on amino acid biochemical and biophysical properties, as well as higher-dimensional encodings computed from the principal components of experimentally derived mutation/substitution matrices, do not perform as well as a simple twenty-dimensional encoding in which each amino acid is represented by a vector with a one along its own dimension and zeros elsewhere. The optimum obtained from representations based on substitution matrices is reached by using 10 to 12 principal components; the corresponding performance is lower than that obtained with the 20-dimensional binary encoding. We also highlight the importance of the prior when constructing the multivariate Gaussian model of a multiple sequence alignment.
  2. Protein-protein interactions play critical roles in biology, but the structures of many eukaryotic protein complexes are unknown, and there are likely many interactions not yet identified. We take advantage of advances in proteome-wide amino acid coevolution analysis and deep-learning–based structure modeling to systematically identify and build accurate models of core eukaryotic protein complexes within the Saccharomyces cerevisiae proteome. We use a combination of RoseTTAFold and AlphaFold to screen through paired multiple sequence alignments for 8.3 million pairs of yeast proteins, identify 1505 likely to interact, and build structure models for 106 previously unidentified assemblies and 806 that have not been structurally characterized. These complexes, which have as many as five subunits, play roles in almost all key processes in eukaryotic cells and provide broad insights into biological function. 
  3. Most mammalian cells make both β- and γ-actin, two proteins which shape the cell’s internal skeleton and its ability to migrate. The molecules share over 99% of their sequence, yet they play distinct roles. In fact, deleting the β-actin gene in mice causes death in the womb, while the animals can survive with comparatively milder issues without their γ-actin gene. How two similar proteins can have such different biological roles is a long-standing mystery. A closer look could hold some clues: β- and γ-actin may contain the same blocks (or amino acids), but the genetic sequences that encode these proteins differ by about 13%. This is because different units of genetic information – known as synonymous codons – can encode the same amino acid. These ‘silent substitutions’ have no effect on the sequence of the proteins, yet a cell reads synonymous codons (and therefore produces proteins) at different speeds. To find out the impact of silent substitutions, Vedula et al. swapped the codons for the two proteins, forcing mouse cells to produce β-actin using γ-actin codons, and vice versa. Cells with non-manipulated γ-actin and those with β-actin made using γ-actin codons could move much faster than cells with β-actin. This suggested that silent substitutions were indeed affecting the role of the protein. Vedula et al. found that cells read γ-codons – and therefore made γ-actin – much more slowly than β-codons: this also affected how quickly the protein could be dispatched where it was needed in the cell. Slower production meant that bundles of γ-actin were shorter, which allowed cells to move faster by providing a weaker anchoring system. Overall, this work provides new links between silent substitutions and protein behavior, a relatively new research area which is likely to shed light on other protein families. 
  4. Protein sequence space is vast; nature uses only an infinitesimal fraction of possible sequences to sustain life. Are there solutions to biological problems other than those provided by nature? Can we create artificial proteins that sustain life? To investigate these questions, we have created combinatorial collections, or libraries, of novel sequences with no homology to those found in living organisms. Previously designed libraries contained numerous functional proteins. However, they often formed dynamic, rather than well-ordered structures, which complicated structural and mechanistic characterization. To address this challenge, we describe the development of new libraries based on the de novo protein S-824, a 4-helix bundle with a very stable 3-dimensional structure. Distinct from previous libraries, we targeted variability to a specific region of the protein, seeking to create potential functional sites. By characterizing variant proteins from this library, we demonstrate that the S-824 scaffold tolerates diverse amino acid substitutions in a putative cavity, including buried polar residues suitable for catalysis. We designed and created a DNA library encoding 1.7 × 10⁶ unique protein sequences. This new library of stable de novo α-helical proteins is well suited for screens and selections for a range of functional activities in vitro and in vivo.
  5. Short-range interactions and long-range contacts drive the 3D folding of structured proteins. The proteins’ structure has a direct impact on their biological function. However, nearly 40% of the eukaryotic proteome is composed of intrinsically disordered proteins (IDPs) and protein regions that fluctuate between ensembles of numerous conformations. Therefore, to understand their biological function, it is critical to depict how the structural ensemble statistics correlate to the IDPs’ amino acid sequence. Here, using small-angle X-ray scattering and time-resolved Förster resonance energy transfer (trFRET), we study the intramolecular structural heterogeneity of the neurofilament low intrinsically disordered tail domain (NFLt). Using theoretical results of polymer physics, we find that the Flory scaling exponent of NFLt subsegments correlates linearly with their net charge, ranging from statistics of ideal to self-avoiding chains. Surprisingly, measuring the same segments in the context of the whole NFLt protein, we find that regardless of the peptide sequence, the segments’ structural statistics are more expanded than when measured independently. Our findings show that while polymer physics can, to some level, relate the IDP’s sequence to its ensemble conformations, long-range contacts between distant amino acids play a crucial role in determining intramolecular structures. This emphasizes the necessity of advanced polymer theories to fully describe IDP ensembles, with the hope that it will allow us to model their biological function.