This content will become publicly available on September 26, 2024

Title: Quantifying Pairwise Similarity for Complex Polymers
Defining the similarity between chemical entities is an essential task in polymer informatics, enabling ranking, clustering, and classification. Despite its importance, the pairwise chemical similarity of polymers remains an open problem. Here, a similarity function for polymers with well-defined backbones is designed based on polymers’ stochastic graph representations generated from canonical BigSMILES, a structurally based line notation for describing macromolecules. The stochastic graph representations are separated into three parts: repeat units, end groups, and polymer topology. The earth mover’s distance is utilized to calculate the similarity of the repeat units and end groups, while the graph edit distance is used to calculate the similarity of the topology. These three values can be linearly or nonlinearly combined to yield an overall pairwise chemical similarity score for polymers that is largely consistent with the chemical intuition of expert users and is adjustable based on the relative importance of different chemical features for a given similarity problem. This method gives a reliable solution to quantitatively calculate the pairwise chemical similarity score for polymers and represents a vital step toward building search engines and quantitative design tools for polymer data.
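As a rough illustration of the final combination step described above, the sketch below merges three per-component similarity scores (repeat units, end groups, topology) into one overall score, once linearly and once nonlinearly. The function names, the example weights, and the use of a weighted geometric mean as the nonlinear option are assumptions made for illustration; the abstract does not give the combination formulas. The three input scores are taken as given here, since the earth mover's distance and graph edit distance calculations are not implemented in this sketch.

```python
from math import prod


def _validate(scores, weights):
    """Check that scores lie in [0, 1] and weights are usable."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    if any(s < 0.0 or s > 1.0 for s in scores):
        raise ValueError("each similarity score must lie in [0, 1]")
    if any(w < 0.0 for w in weights) or sum(weights) <= 0.0:
        raise ValueError("weights must be non-negative and not all zero")


def combine_linear(scores, weights):
    """Linear combination: weighted arithmetic mean of the scores."""
    _validate(scores, weights)
    total = sum(weights)
    return sum(w * s for s, w in zip(scores, weights)) / total


def combine_geometric(scores, weights):
    """Nonlinear combination (assumed form): weighted geometric mean.

    A single low component score pulls the result down more strongly
    than it does in the linear combination.
    """
    _validate(scores, weights)
    total = sum(weights)
    return prod(s ** (w / total) for s, w in zip(scores, weights))


# Hypothetical component scores: (repeat units, end groups, topology).
example_scores = (0.9, 0.5, 1.0)
# Hypothetical weights that emphasize repeat-unit chemistry.
example_weights = (0.6, 0.2, 0.2)

if __name__ == "__main__":
    print("linear:   ", combine_linear(example_scores, example_weights))
    print("geometric:", combine_geometric(example_scores, example_weights))
```

Changing the weights is one way to express the adjustability mentioned in the abstract: raising the topology weight, for example, makes two polymers with the same monomers but different architectures score as less similar.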
Publisher / Repository:
ACS Publications
Page Range / eLocation ID:
7344 to 7357
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Molecular search is important in chemistry, biology, and informatics for identifying molecular structures within large data sets, improving knowledge discovery and innovation, and making chemical data FAIR (findable, accessible, interoperable, reusable). Search algorithms for polymers are significantly less developed than those for small molecules because polymer search relies on searching by polymer name, which can be challenging because polymer naming is overly broad (e.g., polyethylene), complicated for complex chemical structures, and often does not correspond to official IUPAC conventions. Chemical structure search in polymers is limited to substructures, such as monomers, without awareness of connectivity or topology. This work introduces a novel query language and graph traversal search algorithm for polymers that provides the first search method able to fully capture all of the chemical structures present in polymers. The BigSMARTS query language, an extension of the small-molecule SMARTS language, allows users to write queries that localize monomer and functional group searches to different parts of the polymer, like the middle block of a triblock, the side chain of a graft, and the backbone of a repeat unit. The substructure search algorithm is based on the traversal of graph representations of the generating functions for the stochastic graphs of polymers. Operationally, the algorithm first identifies cycles representing the monomers and then the end groups and finally performs a depth-first search to match entire subgraphs. To validate the algorithm, hundreds of queries were searched against hundreds of target chemistries and topologies from the literature, with approximately 440,000 query–target pairs. This tool provides a detailed algorithm that can be implemented in search engines to provide search results with full matching of the monomer connectivity and polymer topology.
  2. We propose a chemical language processing model to predict polymers’ glass transition temperature (Tg) through a polymer language (SMILES, Simplified Molecular Input Line Entry System) embedding and a recurrent neural network. This model receives only the SMILES strings of a polymer’s repeat units as inputs and treats the SMILES strings as sequential data at the character level. With this method, there is no need to calculate any additional molecular descriptors or fingerprints of polymers, which makes the approach computationally efficient. More importantly, it avoids the difficulty of generating molecular descriptors for repeat units containing the polymerization point ‘*’. Results show that the trained model demonstrates reasonable prediction performance on the Tg of unseen polymers. In addition, the model is applied to high-throughput screening of an unlabeled polymer database to identify high-temperature polymers that are desired for applications in extreme environments. Our work demonstrates that the SMILES strings of polymer repeat units can be used as an effective feature representation to develop a chemical language processing model for predictions of polymer Tg. The framework of this model is general and can be used to construct structure–property relationships for other polymer properties.

  3. Abstract

    Carbohydrates are the fundamental building blocks of many natural polymers; their wide bioavailability, high chemical functionality, and stereochemical diversity make them attractive starting materials for the development of new synthetic polymers. In this work, one such carbohydrate, d‐glucopyranoside, was utilized to produce a hydrophobic five‐membered cyclic carbonate monomer to afford sugar‐based amphiphilic copolymers and block copolymers via organocatalyzed ring‐opening polymerizations with 4‐methylbenzyl alcohol and methoxy poly(ethylene glycol) as initiator and macroinitiator, respectively. To modulate the amphiphilicities of these polymers, acidic benzylidene cleavage reactions were performed to deprotect the sugar repeat units and present hydrophilic hydroxyl side chain groups. Assembly of the polymers under aqueous conditions revealed interesting morphological differences based on the polymer molar mass and repeat unit composition. The initial polymers, prior to the removal of the benzylidenes, underwent a morphological change from micelles to vesicles as the sugar block length was increased, causing a decrease in the hydrophilic–hydrophobic ratio. Deprotection of the sugar block increased the hydrophilicity and gave micellar morphologies. This tunable polymeric platform holds promise for the production of advanced materials for implementation in a diverse range of applications. © 2018 Wiley Periodicals, Inc. J. Polym. Sci., Part A: Polym. Chem. 2019, 57, 432–440

  4. Abstract

    Noncoding RNAs (ncRNAs) have recently attracted considerable attention due to their key roles in biology. The ncRNA–protein interaction (NPI) is often explored to reveal biological activities that ncRNAs may affect, such as biological traits and diseases. Traditional experimental methods can accomplish this work but are often labor-intensive and expensive. Machine learning and deep learning methods have achieved great success by exploiting sufficient sequence or structure information. Graph Neural Network (GNN)-based methods consider the topology in ncRNA–protein graphs and perform well on tasks like NPI prediction. Based on GNNs, some pairwise constraint methods have been developed for homogeneous networks but have not been used for NPI prediction on heterogeneous networks. In this paper, we construct a pairwise constrained NPI predictor based on a dual Graph Convolutional Network (GCN), called NPI-DGCN. To our knowledge, our method is the first to train a heterogeneous graph-based model using a pairwise learning strategy. Instead of binary classification, we use a rank layer to calculate the score of an ncRNA–protein pair. Moreover, our model is the first to predict NPIs on the ncRNA–protein bipartite graph rather than the homogeneous graph. We transform the original ncRNA–protein bipartite graph into two homogeneous graphs on which to explore second-order implicit relationships. At the same time, we model direct interactions between the two homogeneous graphs to explore explicit relationships. Experimental results on the four standard datasets indicate that our method achieves competitive performance with other state-of-the-art methods. And the model is available at

  5. Controlling network growth and architecture of 3D-conjugated porous polymers (CPPs) is challenging and therefore has limited the ability to systematically tune the network architecture and study its impact on doping efficiency and conductivity. We have proposed that π-face masking straps mask the π-face of the polymer backbone and therefore help to control π–π interchain interactions in higher dimensional π-conjugated materials unlike the conventional linear alkyl pendant solubilizing chains that are incapable of masking the π-face. Herein, we used cycloaraliphane-based π-face masking strapped monomers and show that the strapped repeat units, unlike the conventional monomers, help to overcome the strong interchain π–π interactions, extend network residence time, tune network growth, and increase chemical doping and conductivity in 3D-conjugated porous polymers. The straps doubled the network crosslinking density, which resulted in 18 times higher chemical doping efficiency compared to the control non-strapped-CPP. The straps also provided synthetic tunability and generated CPPs of varying network size, crosslinking density, dispersibility limit, and chemical doping efficiency by changing the knot to strut ratio. For the first time, we have shown that the processability issue of CPPs can be overcome by blending them with insulating commodity polymers. The blending of CPPs with poly(methylmethacrylate) (PMMA) has enabled them to be processed into thin films for conductivity measurements. The conductivity of strapped-CPPs is three orders of magnitude higher than that of the poly(phenyleneethynylene) porous network.
