skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: New amino acid substitution matrix brings sequence alignments into agreement with structure matches
Abstract Protein sequence matching presently fails to identify many structures that are highly similar, even when they are known to have the same function. The high packing densities in globular proteins lead to interdependent substitutions, which have not previously been considered for amino acid similarities. At present, sequence matching compares sequences based only upon the similarities of single amino acids, ignoring the fact that in densely packed protein, there are additional conservative substitutions representing exchanges between two interacting amino acids, such as a small‐large pair changing to a large‐small pair substitutions that are not individually so conservative. Here we show that including information for such pairs of substitutions yields improved sequence matches, and that these yield significant gains in the agreements between sequence alignments and structure matches of the same protein pair. The result shows sequence segments matched where structure segments are aligned. There are gains for all 2002 collected cases where the sequence alignments that were not previously congruent with the structure matches. Our results also demonstrate a significant gain in detecting homology for “twilight zone” protein sequences. The amino acid substitution metrics derived have many other potential applications, for annotations, protein design, mutagenesis design, and empirical potential derivation.  more » « less
Award ID(s):
1661391
PAR ID:
10452433
Author(s) / Creator(s):
 ;  
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Proteins: Structure, Function, and Bioinformatics
Volume:
89
Issue:
6
ISSN:
0887-3585
Page Range / eLocation ID:
p. 671-682
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Most mammalian cells make both β- and γ-actin, two proteins which shape the cell’s internal skeleton and its ability to migrate. The molecules share over 99% of their sequence, yet they play distinct roles. In fact, deleting the β-actin gene in mice causes death in the womb, while the animals can survive with comparatively milder issues without their γ-actin gene. How two similar proteins can have such different biological roles is a long-standing mystery. A closer look could hold some clues: β- and γ-actin may contain the same blocks (or amino acids), but the genetic sequences that encode these proteins differ by about 13%. This is because different units of genetic information – known as synonymous codons – can encode the same amino acid. These ‘silent substitutions’ have no effect on the sequence of the proteins, yet a cell reads synonymous codons (and therefore produces proteins) at different speeds. To find out the impact of silent substitutions, Vedula et al. swapped the codons for the two proteins, forcing mouse cells to produce β-actin using γ-actin codons, and vice versa. Cells with non-manipulated γ-actin and those with β-actin made using γ-actin codons could move much faster than cells with β-actin. This suggested that silent substitutions were indeed affecting the role of the protein. Vedula et al. found that cells read γ-codons – and therefore made γ-actin – much more slowly than β-codons: this also affected how quickly the protein could be dispatched where it was needed in the cell. Slower production meant that bundles of γ-actin were shorter, which allowed cells to move faster by providing a weaker anchoring system. Overall, this work provides new links between silent substitutions and protein behavior, a relatively new research area which is likely to shed light on other protein families. 
    more » « less
  2. Residues in proteins that are in close spatial proximity are more prone to covariate as their interactions are likely to be preserved due to structural and evolutionary constraints. If we can detect and quantify such covariation, physical contacts may then be predicted in the structure of a protein solely from the sequences that decorate it. To carry out such predictions, and following the work of others, we have implemented a multivariate Gaussian model to analyze correlation in multiple sequence alignments. We have explored and tested several numerical encodings of amino acids within this model. We have shown that 1D encodings based on amino acid biochemical and biophysical properties, as well as higher dimensional encodings computed from the principal components of experimentally derived mutation/substitution matrices, do not perform as well as a simple twenty dimensional encoding with each amino acid represented with a vector of one along its own dimension and zero elsewhere. The optimum obtained from representations based on substitution matrices is reached by using 10 to 12 principal components; the corresponding performance is less than the performance obtained with the 20-dimensional binary encoding. We highlight also the importance of the prior when constructing the multivariate Gaussian model of a multiple sequence alignment. 
    more » « less
  3. Protein-protein interactions play critical roles in biology, but the structures of many eukaryotic protein complexes are unknown, and there are likely many interactions not yet identified. We take advantage of advances in proteome-wide amino acid coevolution analysis and deep-learning–based structure modeling to systematically identify and build accurate models of core eukaryotic protein complexes within the Saccharomyces cerevisiae proteome. We use a combination of RoseTTAFold and AlphaFold to screen through paired multiple sequence alignments for 8.3 million pairs of yeast proteins, identify 1505 likely to interact, and build structure models for 106 previously unidentified assemblies and 806 that have not been structurally characterized. These complexes, which have as many as five subunits, play roles in almost all key processes in eukaryotic cells and provide broad insights into biological function. 
    more » « less
  4. null (Ed.)
    Protein sequence space is vast; nature uses only an infinitesimal fraction of possible sequences to sustain life. Are there solutions to biological problems other than those provided by nature? Can we create artificial proteins that sustain life? To investigate these questions, we have created combinatorial collections, or libraries, of novel sequences with no homology to those found in living organisms. Previously designed libraries contained numerous functional proteins. However, they often formed dynamic, rather than well-ordered structures, which complicated structural and mechanistic characterization. To address this challenge, we describe the development of new libraries based on the de novo protein S-824, a 4-helix bundle with a very stable 3-dimensional structure. Distinct from previous libraries, we targeted variability to a specific region of the protein, seeking to create potential functional sites. By characterizing variant proteins from this library, we demonstrate that the S-824 scaffold tolerates diverse amino acid substitutions in a putative cavity, including buried polar residues suitable for catalysis. We designed and created a DNA library encoding 1.7 × 106 unique protein sequences. This new library of stable de novo α-helical proteins is well suited for screens and selections for a range of functional activities in vitro and in vivo. 
    more » « less
  5. Phylogenomic analyses have revolutionized the study of biodiversity, but they have revealed that estimated tree topologies can depend, at least in part, on the subset of the genome that is analyzed. For example, estimates of trees for avian orders differ if protein-coding or non-coding data are analyzed. The bird tree is a good study system because the historical signal for relationships among orders is very weak, which should permit subtle non-historical signals to be identified, while monophyly of orders is strongly corroborated, allowing identification of strong non-historical signals. Hydrophobic amino acids in mitochondrially-encoded proteins, which are expected to be found in transmembrane helices, have been hypothesized to be associated with non-historical signals. We tested this hypothesis by comparing the evolution of transmembrane helices and extramembrane segments of mitochondrial proteins from 420 bird species, sampled from most avian orders. We estimated amino acid exchangeabilities for both structural environments and assessed the performance of phylogenetic analysis using each data type. We compared those relative exchangeabilities with values calculated using a substitution matrix for transmembrane helices estimated using a variety of nuclear- and mitochondrially-encoded proteins, allowing us to compare the bird-specific mitochondrial models with a general model of transmembrane protein evolution. To complement our amino acid analyses, we examined the impact of protein structure on patterns of nucleotide evolution. Models of transmembrane and extramembrane sequence evolution for amino acids and nucleotides exhibited striking differences, but there was no evidence for strong topological data type effects. However, incorporating protein structure into analyses of mitochondrially-encoded proteins improved model fit. Thus, we believe that considering protein structure will improve analyses of mitogenomic data, both in birds and in other taxa. 
    more » « less