Title: Artificial intelligence-aided protein engineering: from topological data analysis to deep protein language models
Abstract Protein engineering is an emerging field in biotechnology that has the potential to revolutionize various areas, such as antibody design, drug discovery, food security, ecology, and more. However, the mutational space involved is too vast to be handled through experimental means alone. Leveraging accumulating protein databases, machine learning (ML) models, particularly those based on natural language processing (NLP), have considerably expedited protein engineering. Moreover, advances in topological data analysis (TDA) and artificial intelligence-based protein structure prediction, such as AlphaFold2, have made more powerful structure-based ML-assisted protein engineering strategies possible. This review aims to offer a comprehensive, systematic, and indispensable set of methodological components, including TDA and NLP, for protein engineering and to facilitate their future development.
Award ID(s):
2052983; 1900473
PAR ID:
10441080
Author(s) / Creator(s):
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Briefings in Bioinformatics
Volume:
24
Issue:
5
ISSN:
1467-5463
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Motivation: Volumetric 3D object analyses are being applied in research fields such as structural bioinformatics, biophysics, and structural biology, with potential integration of artificial intelligence/machine learning (AI/ML) techniques. One such method, 3D Zernike moments, has proven valuable in analyzing protein structures (e.g., protein fold classification, protein–protein interaction analysis, and molecular dynamics simulations). Their compactness and efficiency make them amenable to large-scale analyses. Established methods for deriving 3D Zernike moments, however, can be inefficient, particularly when higher-order terms are required, hindering broader applications. As the volume of experimental and computationally predicted protein structure information continues to increase, structural biology has become a “big data” science requiring more efficient analysis tools. Results: This application note presents a Python-based software package, ZMPY3D, to accelerate computation of 3D Zernike moments by vectorizing the mathematical formulae and using graphics processing units (GPUs). The package offers popular GPU-supported libraries such as CuPy and TensorFlow together with NumPy implementations, aiming to improve computational efficiency, adaptability, and flexibility in future algorithm development. The ZMPY3D package can be installed via PyPI, and the source code is available from GitHub. Functions for volumetric protein 3D structural similarity scoring and for computing the superposition transformation matrix have both been implemented, creating a powerful computational tool that will allow the research community to combine 3D Zernike moments with existing AI/ML tools, to advance research and education in protein structure bioinformatics. Availability and implementation: ZMPY3D, implemented in Python, is available on GitHub (https://github.com/tawssie/ZMPY3D) and PyPI, released under the GPL License.
  2. Abstract The structural and compositional diversity of proteins provides a range of functions for fabricating novel and advanced materials. Recent progress in protein engineering offers flexible approaches and new functionalities, making the fabricated materials potentially applicable across a broad spectrum of fields. Such engineering strategies, which apply proteins alone or together with other molecules, yield numerous functional materials such as patterned nanometal materials/nanometallic compounds and well‐designed nanocomposites. Advantages in materials’ tunability, property improvement (e.g., electronic and mechanical properties), functionalities, and biocompatibility have been demonstrated, thus providing alternatives to existing materials made by conventional methods. This review summarizes and discusses the strategies for fabricating functional materials using proteins as the critical contributors. Owing to their versatility, proteins contribute to engineering functional materials by acting as structure‐control agents, reaction agents, and battery components, which are emphasized in this review. The strategies for each group of functions are detailed. Properties of protein‐engineered functional materials and their potential applications in microelectronics, energy storage and conversion, sensor devices, and related fields are also reviewed.
  3. Background: Relationships between bio-entities (genes, proteins, diseases, etc.) constitute a significant part of our knowledge. Most of this information is documented as unstructured text in different forms, such as books, articles, and online pages. Automatically extracting such information and storing it in structured form could help researchers access it more easily and also make it possible to incorporate it into advanced integrative analysis. In this study, we developed a novel approach to extract bio-entity relationship information using Natural Language Processing (NLP) and a graph-theoretic algorithm. Methods: Our method, called GRGT (Grammatical Relationship Graph for Triplets), extracts not only the pairs of terms that have certain relationships but also the type of relationship (the word describing the relationship). In addition, the directionality of the relationship can be extracted. Our method is based on the assumption that a triplet exists for a pair of interactions. A triplet is defined as two terms (entities) and an interaction word describing the relationship between the two terms in a sentence. We first use a sentence parsing tool to obtain the sentence structure represented as a dependency graph, where words are nodes and edges are typed dependencies. The shortest paths among the pairs of words in the triplet are then extracted, which form the basis for our information extraction method. A flexible pattern-matching scheme is then used to match a triplet graph with an unknown relationship to the triplet graphs with labels (True or False) in the database. Results: We applied the method to three benchmark datasets to extract protein-protein interactions (PPIs) and obtained better precision than the top-performing methods in the literature. Conclusions: We have developed a method to extract protein-protein interactions from biomedical literature. PPIs extracted by our method have higher precision than those extracted by other methods, suggesting that our method can be used to effectively extract PPIs and deposit them into databases. Beyond extracting PPIs, our method could be easily extended to extract relationship information between other bio-entities.
  4. Abstract Structures of proteins and protein–protein complexes are determined by the same physical principles and thus share a number of similarities. At the same time, there could be differences because, in order to function, proteins interact with other molecules, undergo conformational changes, and so forth, which might impose different restraints on the tertiary versus quaternary structures. This study focuses on structural properties of protein–protein interfaces in comparison with the protein core, based on the wealth of currently available structural data and new structure‐based approaches. The results showed that physicochemical characteristics, such as amino acid composition, residue–residue contact preferences, and hydrophilicity/hydrophobicity distributions, are similar in the protein core and protein–protein interfaces. On the other hand, characteristics that reflect evolutionary pressure, such as structural composition and packing, are largely different. The results provide important insight into fundamental properties of protein structure and function. At the same time, the results contribute to a better understanding of the ways to dock proteins. Recent progress in predicting structures of individual proteins follows the advancement of deep learning techniques and new approaches to residue coevolution data. The protein core could potentially provide large amounts of data for applying deep learning to docking. However, our results showed that the core motifs are significantly different from those at protein–protein interfaces, and thus may not be directly useful for docking. At the same time, such differences may help to overcome a major obstacle in applying coevolutionary data to docking: discriminating the intramolecular information that is not directly relevant to docking.
  5. Abstract Programmable behavior combined with tailored stiffness and tunable biomechanical response are key requirements for developing successful materials. However, these properties are still an elusive goal for protein-based biomaterials. Here, we use protein-polymer interactions to manipulate the stiffness of protein-based hydrogels made from bovine serum albumin (BSA) by using polyelectrolytes such as polyethyleneimine (PEI) and poly-L-lysine (PLL) at various concentrations. This approach gives protein hydrogels a tunable, wide-range stiffness, from ~10 to 64 kPa, without affecting the protein mechanics and nanostructure. We use the 6-fold increase in stiffness induced by PEI to program BSA hydrogels in various shapes. By utilizing the characteristic protein unfolding, we can induce reversible shape-memory behavior of these composite materials using chemical denaturing solutions. The approach demonstrated here, based on protein engineering and polymer reinforcing, may enable the development and investigation of smart biomaterials and extend protein hydrogel capabilities beyond their conventional applications.
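Illustrative sketch for item 3 above: the GRGT abstract describes representing a parsed sentence as a dependency graph and extracting the shortest path between the words of a triplet. The Python below is a minimal sketch of that single step only, not the authors' implementation. The toy sentence, the hand-written dependency edges, and the names `shortest_path` and `EDGES` are assumptions made for illustration; a real pipeline would obtain the edges from a sentence parser.

```python
from collections import deque


def shortest_path(edges, start, goal):
    """Breadth-first search over an undirected view of the dependency edges.

    `edges` is a list of (head, dependent, label) tuples. Returns the list of
    words on one shortest path from `start` to `goal`, or None if the two
    words are not connected.
    """
    graph = {}
    for head, dependent, _label in edges:
        graph.setdefault(head, []).append(dependent)
        graph.setdefault(dependent, []).append(head)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Hypothetical, hand-written dependency edges (head, dependent, label) for the
# toy sentence "BRCA1 binds to BARD1". These are illustrative assumptions, not
# parser output and not data from the paper.
EDGES = [
    ("binds", "BRCA1", "nsubj"),
    ("binds", "BARD1", "obl"),
    ("BARD1", "to", "case"),
]

if __name__ == "__main__":
    # The interaction word "binds" lies on the path between the two entities.
    print(shortest_path(EDGES, "BRCA1", "BARD1"))
```

In this toy case the path between the two entity words passes through the interaction word, which is the kind of structure the abstract says is then matched against labeled triplet graphs. The pattern-matching and labeling steps are not sketched here.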