This content will become publicly available on November 13, 2024

Title: BigSMARTS: A Topologically Aware Query Language and Substructure Search Algorithm for Polymer Chemical Structures
Molecular search is important in chemistry, biology, and informatics for identifying molecular structures within large data sets, improving knowledge discovery and innovation, and making chemical data FAIR (findable, accessible, interoperable, reusable). Search algorithms for polymers are significantly less developed than those for small molecules because polymer search relies on searching by polymer name, which is challenging because polymer names are often overly broad (e.g., polyethylene), complicated for complex chemical structures, and frequently inconsistent with official IUPAC conventions. Chemical structure search in polymers is limited to substructures, such as monomers, without awareness of connectivity or topology. This work introduces a novel query language and graph traversal search algorithm for polymers that provides the first search method able to fully capture all of the chemical structures present in polymers. The BigSMARTS query language, an extension of the small-molecule SMARTS language, allows users to write queries that localize monomer and functional group searches to different parts of the polymer, like the middle block of a triblock, the side chain of a graft, and the backbone of a repeat unit. The substructure search algorithm is based on the traversal of graph representations of the generating functions for the stochastic graphs of polymers. Operationally, the algorithm first identifies cycles representing the monomers and then the end groups and finally performs a depth-first search to match entire subgraphs. To validate the algorithm, hundreds of queries were searched against hundreds of target chemistries and topologies from the literature, with approximately 440,000 query–target pairs. This tool provides a detailed algorithm that can be implemented in search engines to provide search results with full matching of the monomer connectivity and polymer topology.
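The final step described in the abstract, a depth-first search that matches a query subgraph against a target graph, can be illustrated with a generic backtracking matcher. The sketch below is not the BigSMARTS algorithm and does not parse BigSMARTS or BigSMILES strings; it assumes a toy target graph in which a three-atom cycle stands in for a repeat unit flanked by two end groups. The function name `find_match`, the node labels, and the adjacency-dictionary representation are all hypothetical choices made for illustration.

```python
from typing import Dict, Optional, Set

Adjacency = Dict[int, Set[int]]   # node id -> set of neighbouring node ids
Labels = Dict[int, str]           # node id -> label (e.g. an element symbol)


def find_match(q_adj: Adjacency, q_lab: Labels,
               t_adj: Adjacency, t_lab: Labels) -> Optional[Dict[int, int]]:
    """Return one query-node -> target-node mapping, or None if none exists.

    Plain depth-first backtracking: query nodes are assigned one at a time,
    and an assignment is kept only if labels agree and every query edge to an
    already-assigned neighbour is also an edge in the target.
    """
    order = list(q_adj)  # fixed order in which query nodes are assigned

    def extend(mapping: Dict[int, int]) -> Optional[Dict[int, int]]:
        if len(mapping) == len(order):
            return dict(mapping)              # every query node is placed
        q = order[len(mapping)]
        used = set(mapping.values())
        for t in t_adj:
            if t in used or t_lab[t] != q_lab[q]:
                continue
            # Edges to already-mapped query neighbours must exist in the target.
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                found = extend(mapping)
                if found is not None:
                    return found
                del mapping[q]                # backtrack and try the next target node
        return None

    return extend({})


# Toy target (assumption): End - [C - C - O] - End, where the cycle 1-2-3-1
# stands in for a repeat unit whose last atom links back to its first.
target_adj: Adjacency = {0: {1}, 1: {0, 2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
target_lab: Labels = {0: "End", 1: "C", 2: "C", 3: "O", 4: "End"}

# Query: the path C - C - O.
query_adj: Adjacency = {0: {1}, 1: {0, 2}, 2: {1}}
query_lab: Labels = {0: "C", 1: "C", 2: "O"}

# Query with a label absent from the target: C - N.
missing_adj: Adjacency = {0: {1}, 1: {0}}
missing_lab: Labels = {0: "C", 1: "N"}

hit = find_match(query_adj, query_lab, target_adj, target_lab)
miss = find_match(missing_adj, missing_lab, target_adj, target_lab)
```

Here `hit` maps the three query atoms onto the repeat-unit cycle of the toy target, and `miss` is `None` because the target has no node labelled `N`. A topology-aware search as described in the abstract would additionally identify the monomer cycles and end groups before this matching step; that part is not attempted here.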
Award ID(s):
2134795
NSF-PAR ID:
10479694
Publisher / Repository:
ACS Publications
Journal Name:
Journal of Chemical Information and Modeling
Volume:
63
Issue:
21
ISSN:
1549-9596
Page Range / eLocation ID:
6555 to 6568
Sponsoring Org:
National Science Foundation
More Like this
  1. We consider the problem of type-directed component-based synthesis where, given a set of (typed) components and a query type, the goal is to synthesize a term that inhabits the query. Classical approaches based on proof search in intuitionistic logics do not scale up to the standard libraries of modern languages, which span hundreds or thousands of components. Recent graph-reachability-based methods proposed for Java do scale, but only apply to monomorphic data and components: polymorphic data and components infinitely explode the size of the graph that must be searched, rendering synthesis intractable. We introduce type-guided abstraction refinement (TYGAR), a new approach for scalable type-directed synthesis over polymorphic datatypes and components. Our key insight is that we can overcome the explosion by building a graph over abstract types which represent a potentially unbounded set of concrete types. We show how to use graph reachability to search for candidate terms over abstract types, and introduce a new algorithm that uses proofs of untypeability of ill-typed candidates to iteratively refine the abstraction until a well-typed result is found. We have implemented TYGAR in H+, a tool that takes as input a set of Haskell libraries and a query type, and returns a Haskell term that uses functions from the provided libraries to implement the query type. Our support for polymorphism allows H+ to work with higher-order functions and type classes, and enables more precise queries due to parametricity. We have evaluated H+ on 44 queries using a set of popular Haskell libraries with a total of 291 components. H+ returns an interesting solution within the first five results for 32 out of 44 queries. Our results show that TYGAR allows H+ to rapidly return well-typed terms, with the median time to first solution of just 1.4 seconds. Moreover, we observe that gains from iterative refinement over exhaustive enumeration are more pronounced on harder queries.
  2. Given a database of vectors, a cosine threshold query returns all vectors in the database having cosine similarity to a query vector above a given threshold θ. These queries arise naturally in many applications, such as document retrieval, image search, and mass spectrometry. The present paper considers the efficient evaluation of such queries, providing novel optimality guarantees and exhibiting good performance on real datasets. We take as a starting point Fagin's well-known Threshold Algorithm (TA), which can be used to answer cosine threshold queries as follows: an inverted index is first built from the database vectors during pre-processing; at query time, the algorithm traverses the index partially to gather a set of candidate vectors to be later verified for θ-similarity. However, directly applying TA in its raw form misses significant optimization opportunities. Indeed, we first show that one can take advantage of the fact that the vectors can be assumed to be normalized, to obtain an improved, tight stopping condition for index traversal and to efficiently compute it incrementally. Then we show that one can take advantage of data skewness to obtain better traversal strategies. In particular, we show a novel traversal strategy that exploits a common data skewness condition which holds in multiple domains including mass spectrometry, documents, and image databases. We show that under the skewness assumption, the new traversal strategy has a strong, near-optimal performance guarantee. The techniques developed in the paper are quite general since they can be applied to a large class of similarity functions beyond cosine.
  3. Query understanding plays a key role in exploring users’ search intents. However, it is inherently challenging since it needs to capture semantic information from short and ambiguous queries and often requires massive task-specific labeled data. In recent years, pre-trained language models (PLMs) have advanced various natural language processing tasks because they can extract general semantic information from large-scale corpora. However, directly applying them to query understanding is sub-optimal because existing strategies rarely aim to boost search performance. On the other hand, search logs contain user clicks between queries and URLs, which provide rich information about users’ search behavior beyond the content of the queries. Therefore, in this paper, we aim to fill this gap by exploring search logs. In particular, we propose a novel graph-enhanced pre-training framework, GE-BERT, which leverages both query content and the query graph to capture both the semantic information of queries and users’ search behavior. Extensive experiments on offline and online tasks have demonstrated the effectiveness of the proposed framework.
  4. Topochemical polymerizations hold the promise of producing high molecular weight and stereoregular single crystalline polymers by first aligning monomers before polymerization. However, monomer modifications often alter the crystal packing and result in non‐reactive polymorphs. Here, we report a systematic study on the side chain functionalization of the bis(indandione) derivative system that can be polymerized under visible light. Precisely engineered side chains help organize the monomer crystals in a one‐dimensional fashion to maintain the topochemical reactivity. By optimizing the side chain length and end group of monomers, the elastic modulus of the resulting polymer single crystals can also be greatly enhanced. Lastly, using ultrasonication, insoluble polymer single crystals can be processed into free‐standing and robust polymer thin films. This work provides new insights on the molecular design of topochemical reactions and paves the way for future applications of this fascinating family of materials.