


Title: Molecular Representations for Machine Learning
This primer helps the reader understand the basic categories of molecular representations and provides computational tools to generate molecular descriptors in each of these categories. After reading this primer, you will be able to use various methods to generate machine and/or human interpretable representations of molecular systems for inputs to machine learning models or for general chemical data science applications.
Award ID(s):
2143354
PAR ID:
10560309
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
American Chemical Society
Date Published:
Subject(s) / Keyword(s):
Machine learning artificial intelligence computational chemistry chemoinformatics molecular representations
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Human categorization is one of the most important and successful targets of cognitive modeling, with decades of model development and assessment using simple, low-dimensional artificial stimuli. However, it remains unclear how these findings relate to categorization in more natural settings, involving complex, high-dimensional stimuli. Here, we take a step towards addressing this question by modeling human categorization over a large behavioral dataset, comprising more than 500,000 judgments over 10,000 natural images from ten object categories. We apply a range of machine learning methods to generate candidate representations for these images, and show that combining rich image representations with flexible cognitive models captures human decisions best. We also find that in the high-dimensional representational spaces these methods generate, simple prototype models can perform comparably to the more complex memory-based exemplar models dominant in laboratory settings.
  2. RNAs are often studied in nonnative sequence contexts to facilitate structural studies. However, seemingly innocuous changes to an RNA sequence may perturb the native structure and generate inaccurate or ambiguous structural models. To facilitate the investigation of native RNA secondary structure by selective 2′ hydroxyl acylation analyzed by primer extension (SHAPE), we engineered an approach that couples minimal enzymatic steps to RNA chemical probing and mutational profiling (MaP) reverse transcription (RT) methods—a process we call template switching and mutational profiling (Switch-MaP). In Switch-MaP, RT templates and additional library sequences are added postprobing through ligation and template switching, capturing reactivities for every nucleotide. For a candidate SAM-I riboswitch, we compared RNA structure models generated by the Switch-MaP approach to those of traditional primer-based MaP, including RNAs with or without appended structure cassettes. Primer-based MaP masked reactivity data in the 5′ and 3′ ends of the RNA, producing ambiguous ensembles inconsistent with the conserved SAM-I riboswitch secondary structure. Structure cassettes enabled unambiguous modeling of an aptamer-only construct but introduced nonnative interactions in the full-length riboswitch. In contrast, Switch-MaP provided reactivity data for all nucleotides in each RNA and enabled unambiguous modeling of secondary structure, consistent with the conserved SAM-I fold. Switch-MaP is a straightforward alternative approach to primer-based and cassette-based chemical probing methods that precludes primer masking and the formation of alternative secondary structures due to nonnative sequence elements. 
  3. We investigate the problem of learning to generate 3D parametric surface representations for novel object instances, as seen from one or more views. Previous work on learning shape reconstruction from multiple views uses discrete representations such as point clouds or voxels, while continuous surface generation approaches lack multi-view consistency. We address these issues by designing neural networks capable of generating high-quality parametric 3D surfaces which are also consistent between views. Furthermore, the generated 3D surfaces preserve accurate image pixel to 3D surface point correspondences, allowing us to lift texture information to reconstruct shapes with rich geometry and appearance. Our method is supervised and trained on a public dataset of shapes from common object categories. Quantitative results indicate that our method significantly outperforms previous work, while qualitative results demonstrate the high quality of our reconstructions.
  4. The emergence of data-intensive scientific discovery and machine learning has dramatically changed the way in which scientists and engineers approach materials design. Nevertheless, for designing macromolecules or polymers, one limitation is the lack of appropriate methods or standards for converting systems into chemically informed, machine-readable representations. This featurization process is critical to building predictive models that can guide polymer discovery. Although standard molecular featurization techniques have been deployed on homopolymers, such approaches capture neither the multiscale nature nor topological complexity of copolymers, and they have limited application to systems that cannot be characterized by a single repeat unit. Herein, we present, evaluate, and analyze a series of featurization strategies suitable for copolymer systems. These strategies are systematically examined in diverse prediction tasks sourced from four distinct datasets that enable understanding of how featurization can impact copolymer property prediction. Based on this comparative analysis, we suggest directly encoding polymer size in polymer representations when possible, adopting topological descriptors or convolutional neural networks when the precise polymer sequence is known, and using chemically informed unit representations when developing extrapolative models. These results provide guidance and future directions regarding polymer featurization for copolymer design by machine learning. 
  5. The linear decomposition attack provides a serious obstacle to direct applications of noncommutative groups and monoids (or semigroups) in cryptography. To overcome this issue we propose to look at monoids with only big representations, in the sense made precise in the paper, and undertake a systematic study of such monoids. One of our main tools is Green’s theory of cells (Green’s relations). A large supply of monoids is delivered by monoidal categories. We consider simple examples of monoidal categories of diagrammatic origin, including the Temperley–Lieb, the Brauer and partition categories, and discuss lower bounds for their representations. 