Abstract Generative deep learning methods have recently been proposed for generating 3D molecules using equivariant graph neural networks (GNNs) within a denoising diffusion framework. However, such methods are unable to learn important geometric properties of 3D molecules, as they adopt molecule-agnostic and non-geometric GNNs as their 3D graph denoising networks, which notably hinders their ability to generate valid large 3D molecules. In this work, we address these gaps by introducing the Geometry-Complete Diffusion Model (GCDM) for 3D molecule generation, which outperforms existing 3D molecular diffusion models by significant margins across conditional and unconditional settings for the QM9 dataset and the larger GEOM-Drugs dataset, respectively. Importantly, we demonstrate that GCDM’s generative denoising process enables the model to generate a significant proportion of valid and energetically stable large molecules at the scale of GEOM-Drugs, whereas previous methods fail to do so with the features they learn. Additionally, we show that extensions of GCDM can not only effectively design 3D molecules for specific protein pockets but can also be repurposed to consistently optimize the geometry and chemical composition of existing 3D molecules for molecular stability and property specificity, demonstrating new versatility of molecular diffusion models. Code and data are freely available on GitHub.
Reference-free NOE NMR analysis
Nuclear Overhauser Effect (NOE) methods in NMR are an important tool for 3D structural analysis of small molecules. Quantitative NOE methods conventionally rely on reference distances, known distances that have to be spectrally separated and are not always available. Here we present a new method for evaluation and 3D structure selection that does not require a reference distance, instead utilizing structures optimized by molecular mechanics, enabling NOE evaluation even on molecules without suitable reference groups.
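For context, the conventional reference-distance calculation that this method avoids is usually written, under the isolated spin-pair approximation, as r = r_ref (I_ref / I)^(1/6), because NOE intensity falls off as r^-6. The Python sketch below illustrates only that conventional relation, not the reference-free method of the paper; the function name and the numeric values are hypothetical.

```python
def noe_distance(intensity, ref_intensity, ref_distance):
    """Estimate an interproton distance from an NOE intensity.

    Assumes the isolated spin-pair approximation, in which NOE
    intensity scales as r**-6, so r = r_ref * (I_ref / I) ** (1/6).
    """
    if intensity <= 0 or ref_intensity <= 0:
        raise ValueError("intensities must be positive")
    return ref_distance * (ref_intensity / intensity) ** (1.0 / 6.0)


# Illustrative values only (not taken from the paper): a cross-peak
# 64 times weaker than the reference pair at 1.78 angstroms.
d = noe_distance(intensity=1.0, ref_intensity=64.0, ref_distance=1.78)
```

Because of the sixth-root dependence, a cross-peak 64 times weaker than the reference corresponds to only twice the reference distance, which is why a well-separated reference signal matters so much in the conventional approach.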
- Award ID(s): 1726525
- PAR ID: 10214890
- Date Published:
- Journal Name: Chemical Science
- Volume: 11
- Issue: 36
- ISSN: 2041-6520
- Page Range / eLocation ID: 9930 to 9936
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Existing generative models for 3D shapes are typically trained on a large 3D dataset, often of a specific object category. In this paper, we investigate a deep generative model that learns from only a single reference 3D shape. Specifically, we present a multi-scale GAN-based model designed to capture the input shape's geometric features across a range of spatial scales. To avoid the large memory and computational cost induced by operating on the 3D volume, we build our generator atop the tri-plane hybrid representation, which requires only 2D convolutions. We train our generative model on a voxel pyramid of the reference shape, without the need for any external supervision or manual annotation. Once trained, our model can generate diverse and high-quality 3D shapes, possibly of different sizes and aspect ratios. The resulting shapes present variations across different scales and at the same time retain the global structure of the reference shape. Through extensive evaluation, both qualitative and quantitative, we demonstrate that our model can generate 3D shapes of various types.
Hydration free energies of small molecules are commonly used as benchmarks for solvation models. However, errors in predicting hydration free energies are partially due to the force fields used and not just the solvation model. To address this, we have used the 3D reference interaction site model (3D-RISM) of molecular solvation and existing benchmark explicit solvent calculations with a simple element count correction (ECC) to identify problems with the non-bond parameters in the general AMBER force field (GAFF). 3D-RISM was used to calculate hydration free energies of all 642 molecules in the FreeSolv database, and a partial molar volume correction (PMVC), ECC, and their combination (PMVECC) were applied to the results. The PMVECC produced a mean unsigned error of 1.01 ± 0.04 kcal/mol and a root mean squared error of 1.44 ± 0.07 kcal/mol, better than the benchmark explicit solvent calculations from FreeSolv, and required less than 15 s of computing time per molecule on a single CPU core. Importantly, parameters for PMVECC showed systematic errors for molecules containing Cl, Br, I, and P. Applying ECC to the explicit solvent hydration free energies found the same systematic errors. The results strongly suggest that some small adjustments to the Lennard–Jones parameters for GAFF will lead to improved hydration free energy calculations for all solvent models.
We describe MPSE: a Multi-Perspective Simultaneous Embedding method for visualizing high-dimensional data, based on multiple pairwise distances between the data points. Specifically, MPSE computes positions for the points in 3D and provides different views into the data by means of 2D projections (planes) that preserve each of the given distance matrices. We consider two versions of the problem: fixed projections and variable projections. MPSE with fixed projections takes as input a set of pairwise distance matrices defined on the data points, along with the same number of projections, and embeds the points in 3D so that the pairwise distances are preserved in the given projections. MPSE with variable projections takes as input a set of pairwise distance matrices and embeds the points in 3D while also computing the appropriate projections that preserve the pairwise distances. The proposed approach can be useful in multiple scenarios: from creating simultaneous embeddings of multiple graphs on the same set of vertices, to reconstructing a 3D object from multiple 2D snapshots, to analyzing data from multiple points of view. We provide a functional prototype of MPSE that is based on an adaptive and stochastic generalization of multi-dimensional scaling to multiple distances and multiple variable projections. We provide an extensive quantitative evaluation with datasets of different sizes and different numbers of projections, as well as several examples that illustrate the quality of the resulting solutions.
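The fixed-projection variant described above can be read as minimizing an MDS-style stress summed over the views. The Python sketch below is an assumption-laden illustration of such an objective, not the authors' implementation: the function names, the unweighted squared-error form, and the example points are all hypothetical.

```python
import math
from itertools import combinations


def project(P, x):
    """Apply a 2x3 projection matrix P (two rows) to a 3D point x."""
    return [sum(p * c for p, c in zip(row, x)) for row in P]


def view_distances(X, P):
    """Pairwise 2D distances between the points of X as seen through P."""
    Y = [project(P, x) for x in X]
    return [[math.dist(a, b) for b in Y] for a in Y]


def mpse_stress(X, projections, distance_matrices):
    """Sum over views of squared errors between target and projected distances.

    Assumed objective: an unweighted, unnormalized stress; the real
    method may weight or normalize terms differently.
    """
    total = 0.0
    for P, D in zip(projections, distance_matrices):
        Y = [project(P, x) for x in X]
        for i, j in combinations(range(len(X)), 2):
            total += (D[i][j] - math.dist(Y[i], Y[j])) ** 2
    return total


# Hypothetical example: two fixed views (xy-plane and xz-plane).
P_xy = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
P_xz = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
X_true = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
targets = [view_distances(X_true, P_xy), view_distances(X_true, P_xz)]

stress_exact = mpse_stress(X_true, [P_xy, P_xz], targets)
X_moved = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 1.0]]
stress_moved = mpse_stress(X_moved, [P_xy, P_xz], targets)
```

An embedding that reproduces every target matrix in its own view has zero stress, while moving a point increases it. In the variable-projection variant, the projection matrices would be optimized together with the 3D positions rather than held fixed.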
Abstract Placing a new species on an existing phylogeny has increasing relevance to several applications. Placement can be used to update phylogenies in a scalable fashion and can help identify unknown query samples using (meta-)barcoding, skimming, or metagenomic data. Maximum likelihood (ML) methods of phylogenetic placement exist, but these methods are not scalable to reference trees with many thousands of leaves, limiting their ability to enjoy the benefits of dense taxon sampling in modern reference libraries. They also rely on assembled sequences for the reference set and aligned sequences for the query. Thus, ML methods cannot analyze data sets where the reference consists of unassembled reads, a scenario relevant to emerging applications of genome skimming for sample identification. We introduce APPLES, a distance-based method for phylogenetic placement. Compared to ML, APPLES is an order of magnitude faster and more memory efficient, and unlike ML, it is able to place on large backbone trees (tested for up to 200,000 leaves). We show that using dense references improves accuracy substantially, so that APPLES on dense trees is more accurate than ML on sparser trees, where it can run. Finally, APPLES can accurately identify samples without assembled references or aligned queries using k-mer-based distances, a scenario that ML cannot handle. APPLES is available publicly at github.com/balabanmetin/apples.