Title: Featurization strategies for polymer sequence or composition design by machine learning
The emergence of data-intensive scientific discovery and machine learning has dramatically changed the way in which scientists and engineers approach materials design. Nevertheless, for designing macromolecules or polymers, one limitation is the lack of appropriate methods or standards for converting systems into chemically informed, machine-readable representations. This featurization process is critical to building predictive models that can guide polymer discovery. Although standard molecular featurization techniques have been deployed on homopolymers, such approaches capture neither the multiscale nature nor topological complexity of copolymers, and they have limited application to systems that cannot be characterized by a single repeat unit. Herein, we present, evaluate, and analyze a series of featurization strategies suitable for copolymer systems. These strategies are systematically examined in diverse prediction tasks sourced from four distinct datasets that enable understanding of how featurization can impact copolymer property prediction. Based on this comparative analysis, we suggest directly encoding polymer size in polymer representations when possible, adopting topological descriptors or convolutional neural networks when the precise polymer sequence is known, and using chemically informed unit representations when developing extrapolative models. These results provide guidance and future directions regarding polymer featurization for copolymer design by machine learning.
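As a minimal sketch of two of the recommendations above (chemically informed unit representations and directly encoding polymer size), the following Python builds a feature vector from a composition-weighted average of per-monomer descriptors and appends the log of the degree of polymerization. The monomer labels, descriptor values, and the function name `featurize_copolymer` are hypothetical placeholders chosen for illustration; they are not taken from the paper, and the paper's own featurization schemes may differ.

```python
import math

# Hypothetical per-monomer descriptors (illustrative values only, not real data):
# [hydrophobicity score, formal charge, relative size]
UNIT_DESCRIPTORS = {
    "A": [0.8, 0.0, 1.0],
    "B": [0.1, 1.0, 1.5],
    "C": [0.4, -1.0, 0.7],
}


def featurize_copolymer(composition, degree_of_polymerization):
    """Return a composition-weighted descriptor vector plus a size feature.

    composition: dict mapping monomer label -> amount (any positive scale;
                 amounts are normalized to fractions here).
    degree_of_polymerization: number of repeat units per chain (> 0).
    """
    if degree_of_polymerization <= 0:
        raise ValueError("degree_of_polymerization must be positive")
    total = sum(composition.values())
    if total <= 0:
        raise ValueError("composition must contain positive amounts")

    n_descriptors = len(next(iter(UNIT_DESCRIPTORS.values())))
    features = [0.0] * n_descriptors
    for monomer, amount in composition.items():
        fraction = amount / total
        for i, value in enumerate(UNIT_DESCRIPTORS[monomer]):
            features[i] += fraction * value

    # Encode polymer size directly; log scale is an assumption of this sketch.
    features.append(math.log10(degree_of_polymerization))
    return features


example_features = featurize_copolymer({"A": 0.5, "B": 0.5}, 100)
print(example_features)
```

A composition-averaged vector like this discards sequence information, so it fits cases where only the overall composition is known; when the precise sequence is available, the abstract instead points toward topological descriptors or convolutional neural networks.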
Award ID(s):
2118861
PAR ID:
10339788
Journal Name:
Molecular Systems Design & Engineering
Volume:
7
Issue:
6
ISSN:
2058-9689
Page Range / eLocation ID:
661 to 676
Sponsoring Org:
National Science Foundation
More Like this
  1. Polymer–protein hybrids are intriguing materials that can bolster protein stability in non‐native environments, thereby enhancing their utility in diverse medicinal, commercial, and industrial applications. One stabilization strategy involves designing synthetic random copolymers with compositions attuned to the protein surface, but rational design is complicated by the vast chemical and composition space. Here, a strategy is reported to design protein‐stabilizing copolymers based on active machine learning, facilitated by automated material synthesis and characterization platforms. The versatility and robustness of the approach are demonstrated by the successful identification of copolymers that preserve, or even enhance, the activity of three chemically distinct enzymes following exposure to thermal denaturing conditions. Although systematic screening results in mixed success, active learning appropriately identifies unique and effective copolymer chemistries for the stabilization of each enzyme. Overall, this work broadens the capabilities to design fit‐for‐purpose synthetic copolymers that promote or otherwise manipulate protein activity, with extensions toward the design of robust polymer–protein hybrid materials.
  2. The field of polymer membrane design is primarily based on empirical observation, which limits discovery of new materials optimized for separating a given gas pair. Instead of relying on exhaustive experimental investigations, we trained a machine learning (ML) algorithm, using a topological, path-based hash of the polymer repeating unit. We used a limited set of experimental gas permeability data for six different gases in ~700 polymeric constructs that have been measured to date to predict the gas-separation behavior of over 11,000 homopolymers not previously tested for these properties. To test the algorithm’s accuracy, we synthesized two of the most promising polymer membranes predicted by this approach and found that they exceeded the upper bound for CO₂/CH₄ separation performance. This ML technique, which is trained using a relatively small body of experimental data (and no simulation data), evidently represents an innovative means of exploring the vast phase space available for polymer membrane design.
  3. Protein engineering is an emerging field in biotechnology that has the potential to revolutionize various areas, such as antibody design, drug discovery, food security, ecology, and more. However, the mutational space involved is too vast to be handled through experimental means alone. Leveraging accumulative protein databases, machine learning (ML) models, particularly those based on natural language processing (NLP), have considerably expedited protein engineering. Moreover, advances in topological data analysis (TDA) and artificial intelligence-based protein structure prediction, such as AlphaFold2, have made more powerful structure-based ML-assisted protein engineering strategies possible. This review aims to offer a comprehensive, systematic, and indispensable set of methodological components, including TDA and NLP, for protein engineering and to facilitate their future development.
  4. Study of the permeability of small organic molecules across lipid membranes plays a significant role in designing potential drugs in the field of drug discovery. Approaches to design promising drug molecules have gone through many stages, from experiment-based trial-and-error approaches, to the well-established avenue of the quantitative structure–activity relationship, and currently to the stage guided by machine learning (ML) and artificial intelligence techniques. In this work, we present a study of the permeability of small drug-like molecules across lipid membranes by two types of ML models, namely the least absolute shrinkage and selection operator (LASSO) and deep neural network (DNN) models. Molecular descriptors and fingerprints are used for featurization of organic molecules. Using molecular descriptors, the LASSO model uncovers that the electro-topological, electrostatic, polarizability, and hydrophobicity/hydrophilicity properties are the most important physical properties in determining the membrane permeability of small drug-like molecules. Additionally, with molecular fingerprints, the LASSO model suggests that certain chemical substructures can significantly affect the permeability of organic molecules, which closely connects to the identified main physical properties. Moreover, the DNN model using molecular fingerprints can help develop a more accurate mapping between molecular structures and their membrane permeability than LASSO models. Our results provide a deep understanding of drug–membrane interactions and useful guidance for the inverse molecular design of drug-like molecules. Last but not least, while the current focus is on the permeability of drug-like molecules, the methodology of this work is general and can be applied to other complex physical chemistry problems to gain molecular insights.
  5. Machine-learning (ML) approaches have proven to be of great utility in modern materials innovation pipelines. Generally, ML models are trained on predetermined past data and then used to make predictions for new test cases. Active-learning, however, is a paradigm in which ML models can direct the learning process itself through providing dynamic suggestions/queries for the “next-best experiment.” In this work, the authors demonstrate how an active-learning framework can aid in the discovery of polymers possessing high glass transition temperatures (Tg). Starting from an initial small dataset of polymer Tg measurements, the authors use Gaussian process regression in conjunction with an active-learning framework to iteratively add Tg measurements of candidate polymers to the training dataset. The active-learning framework employs one of three decision-making strategies (exploitation, exploration, or balanced exploitation/exploration) for selection of the “next-best experiment.” The active-learning workflow terminates once 10 polymers possessing a Tg greater than a certain threshold temperature are selected. The authors statistically benchmark the performance of the aforementioned three strategies (against a random selection approach) with respect to the discovery of high-Tg polymers for this particular demonstrative materials design challenge.
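The three decision-making strategies named in item 5 can be sketched as scoring rules over a model's predicted mean and uncertainty for each untested candidate. The Python below is an illustration under stated assumptions, not the authors' implementation: the function name `select_next_experiment`, the `kappa` weight, and the choice of an upper-confidence-bound rule (mean plus `kappa` times standard deviation) for the balanced strategy are all assumptions, and the prediction values are made up. In practice the means and standard deviations would come from a fitted Gaussian process regression model.

```python
def select_next_experiment(means, stds, strategy="balanced", kappa=1.0):
    """Return the index of the candidate to measure next.

    means, stds: predicted mean and standard deviation per candidate,
                 e.g. from a Gaussian process regression model.
    strategy:    "exploitation" picks the highest predicted mean,
                 "exploration" picks the highest predictive uncertainty,
                 "balanced" picks the highest mean + kappa * std
                 (an upper-confidence-bound rule, assumed for this sketch).
    """
    if len(means) != len(stds) or len(means) == 0:
        raise ValueError("means and stds must be non-empty and equal length")

    if strategy == "exploitation":
        scores = list(means)
    elif strategy == "exploration":
        scores = list(stds)
    elif strategy == "balanced":
        scores = [m + kappa * s for m, s in zip(means, stds)]
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    # Index of the highest-scoring candidate (first one wins on ties).
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical predictions for four candidate polymers (Tg in kelvin).
predicted_means = [380.0, 420.0, 400.0, 350.0]
predicted_stds = [5.0, 10.0, 45.0, 60.0]

for name in ("exploitation", "exploration", "balanced"):
    index = select_next_experiment(predicted_means, predicted_stds, name)
    print(name, index)
```

In a full loop, the selected candidate would be measured, added to the training set, and the model refit before the next selection, stopping once the target number of high-Tg polymers has been found.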