
Title: A generalized deep learning approach for local structure identification in molecular simulations
Identifying local structure in molecular simulations is of utmost importance. The most common existing approach is to calculate a geometrical quantity referred to as an order parameter. In simple cases, order parameters are physically intuitive and trivial to develop (e.g., ion-pair distance); in most cases, however, developing an order parameter is a much more difficult endeavor (e.g., crystal structure identification). Using ideas from computer vision, we adapt a specific type of neural network called a PointNet to identify local structural environments in molecular simulations. A primary challenge in applying machine learning techniques to simulation is selecting the appropriate input features. This challenge is system-specific and requires significant human input and intuition. In contrast, our approach is a generic framework that requires no system-specific feature engineering and operates on the raw output of the simulations, i.e., atomic positions. We demonstrate the method on crystal structure identification in Lennard-Jones (four different phases), water (eight different phases), and mesophase (six different phases) systems. The method achieves accuracies as high as 99.5% in crystal structure identification. The method is applicable to heterogeneous nucleation, and it can even predict the crystal phases of atoms near external interfaces. We demonstrate the versatility of our approach by using it to identify surface hydrophobicity based solely upon the positions and orientations of surrounding water molecules. Our results suggest the approach will be broadly applicable to many types of local structure in simulations.
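The core idea in the abstract, a network that consumes raw atomic positions and does not care about the order in which neighbors are listed, can be illustrated with a minimal PointNet-style sketch. This is not the authors' implementation: the layer sizes, the 12-neighbor cutoff, the random (untrained) weights, the absence of periodic boundary handling, and all function names (`local_environment`, `make_layer`, `dense`, `pointnet_scores`) are assumptions made for illustration only.

```python
import math
import random

def local_environment(positions, center, k):
    """Return the k nearest neighbors of atom `center`, as coordinates relative to it.

    Assumption: open boundaries; a real simulation would need the minimum-image
    convention for periodic boxes.
    """
    cx, cy, cz = positions[center]
    rel = [(x - cx, y - cy, z - cz)
           for i, (x, y, z) in enumerate(positions) if i != center]
    rel.sort(key=lambda r: r[0] ** 2 + r[1] ** 2 + r[2] ** 2)
    return rel[:k]

def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer (untrained placeholder)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(layer, x, relu=True):
    """Apply one dense layer to a feature vector, optionally followed by ReLU."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out

def pointnet_scores(points, point_layers, head_layer):
    """Return per-class scores for one point cloud."""
    # Shared MLP: the same weights are applied to every point independently.
    features = []
    for p in points:
        h = list(p)
        for layer in point_layers:
            h = dense(layer, h)
        features.append(h)
    # Symmetric function (max-pool over points) makes the result order-independent.
    pooled = [max(column) for column in zip(*features)]
    return dense(head_layer, pooled, relu=False)

# Hypothetical usage: 50 random "atoms", 12 neighbors, 4 output classes
# (e.g. four phases); the weights are random, so the scores are meaningless.
rng = random.Random(0)
positions = [(rng.uniform(0, 5), rng.uniform(0, 5), rng.uniform(0, 5)) for _ in range(50)]
cloud = local_environment(positions, center=0, k=12)
point_layers = [make_layer(3, 16, rng), make_layer(16, 32, rng)]
head_layer = make_layer(32, 4, rng)
scores = pointnet_scores(cloud, point_layers, head_layer)

# Shuffling the neighbor order must not change the output.
shuffled = cloud[:]
rng.shuffle(shuffled)
assert scores == pointnet_scores(shuffled, point_layers, head_layer)
```

The max-pool is the design choice that matters here: because it is symmetric in its arguments, no canonical ordering of neighbors or hand-built descriptor is needed, which matches the abstract's claim of operating directly on atomic positions.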
Award ID(s):
1725573
NSF-PAR ID:
10109492
Journal Name:
Chemical Science
ISSN:
2041-6520
Sponsoring Org:
National Science Foundation
More Like this
  1. ABSTRACT: Molecular simulations with atomistic or coarse-grained force fields are a powerful approach for understanding and predicting the self-assembly phase behavior of complex molecules. Amphiphiles, block oligomers, and block polymers can form mesophases with different ordered morphologies describing the spatial distribution of the blocks, but entirely amorphous nature for local packing and chain conformation. Screening block oligomer chemistry and architecture through molecular simulations to find promising candidates for functional materials is aided by effective and straightforward morphology identification techniques. Capturing 3-dimensional periodic structures, such as ordered network morphologies, is hampered by the requirement that the number of molecules in the simulated system and the shape of the periodic simulation box need to be commensurate with those of the resulting network phase. Common strategies for structure identification include structure factors and order parameters, but these fail to identify imperfect structures in simulations with incorrect system sizes. Building upon pioneering work by DeFever et al. [Chem. Sci. 2019, 10, 7503−7515] who implemented a PointNet (i.e., a neural network designed for computer vision applications using point clouds) to detect local structure in simulations of single-bead particles and water molecules, we present a PointNet for detection of nonlocal ordered morphologies of complex block oligomers. Our PointNet was trained using atomic coordinates from molecular dynamics simulation trajectories and synthetic point clouds for ordered network morphologies that were absent from previous simulations. In contrast to prior work on simple molecules, we observe that large point clouds with 1000 or more points are needed for the more complex block oligomers. The trained PointNet model achieves an accuracy as high as 0.99 for globally ordered morphologies formed by linear diblock, linear triblock, and 3-arm and 4-arm star-block oligomers, and it also allows for the discovery of emerging ordered patterns from nonequilibrium systems.
  2. Context. The excitation of the filamentary gas structures surrounding giant elliptical galaxies at the center of cool-core clusters, also known as brightest cluster galaxies (BCGs), is key to our understanding of active galactic nucleus (AGN) feedback, and of the impact of environmental and local effects on star formation. Aims. We investigate the contribution of thermal radiation from the cooling flow surrounding BCGs to the excitation of the filaments. We explore the effects of small levels of extra heating (turbulence), and of metallicity, on the optical and infrared lines. Methods. Using the CLOUDY code, we modeled the photoionization and photodissociation of a slab of gas of optical depth A_V ≤ 30 mag at constant pressure in order to calculate self-consistently all of the gas phases, from ionized gas to molecular gas. The ionizing source is the extreme ultraviolet (EUV) and soft X-ray radiation emitted by the cooling gas. We tested these models comparing their predictions to the rich multi-wavelength observations from optical to submillimeter, now achieved in cool core clusters. Results. Such models of self-irradiated clouds, when reaching sufficiently large A_V, lead to a cloud structure with ionized, atomic, and molecular gas phases. These models reproduce most of the multi-wavelength spectra observed in the nebulae surrounding the BCGs, not only the low-ionization nuclear emission region-like optical diagnostics, [O III] λ5007 Å/Hβ, [N II] λ6583 Å/Hα, and ([S II] λ6716 Å + [S II] λ6731 Å)/Hα, but also the infrared emission lines from the atomic gas. [O I] λ6300 Å/Hα, instead, is overestimated across the full parameter space, except for very low A_V. The modeled ro-vibrational H2 lines also match observations, which indicates that near- and mid-infrared H2 lines are mostly excited by collisions between H2 molecules and secondary electrons produced naturally inside the cloud by the interaction between the X-rays and the cold gas in the filament. However, there is still some tension between ionized and molecular line tracers (i.e., CO), which requires optimization of the cloud structure and the density of the molecular zone. The limited range of parameters over which predictions match observations allows us to constrain, in spite of degeneracies in the parameter space, the intensity of X-ray radiation bathing filaments, as well as some of their physical properties like A_V or the level of turbulent heating rate. Conclusions. The reprocessing of the EUV and X-ray radiation from the plasma cooling is an important powering source of line emission from filaments surrounding BCGs. CLOUDY self-irradiated X-ray excitation models coupled with a small level of turbulent heating manage to simultaneously reproduce a large number of optical-to-infrared line ratios when all the gas phases (from ionized to molecular) are modeled self-consistently. Releasing some of the simplifications of our model, like the constant pressure, or adding the radiation fields from the AGN and stars, as well as a combination of matter- and radiation-bounded cloud distribution, should improve the predictions of line emission from the different gas phases.
  3. Purpose Product developers using life cycle toxicity characterization models to understand the potential impacts of chemical emissions face serious challenges related to large data demands and high input data uncertainty. This motivates greater focus on model sensitivity toward input parameter variability to guide research efforts in data refinement and design of experiments for existing and emerging chemicals alike. This study presents a sensitivity-based approach for estimating toxicity characterization factors given high input data uncertainty and using the results to prioritize data collection according to parameter influence on characterization factors (CFs). Proof of concept is illustrated with the UNEP-SETAC scientific consensus model USEtox. Methods Using Monte Carlo analysis, we demonstrate a sensitivity-based approach to prioritize data collection with an illustrative example of aquatic ecotoxicity CFs for the vitamin B derivative niacinamide, which is an antioxidant used in personal care products. We calculate CFs via 10,000 iterations assuming plus-or-minus one order of magnitude variability in fate and exposure-relevant data inputs, while uncertainty in effect factor data is modeled as a central t distribution. Spearman’s rank correlation indices are used for all variable inputs to identify parameters with the largest influence on CFs. Results and discussion For emissions to freshwater, the niacinamide CF is near log-normally distributed with a geometric mean of 0.02 and geometric standard deviation of 8.5 PAF m³ day/kg. Results of Spearman’s rank correlation show that degradation rates in air, water, and soil are the most influential parameters in calculating CFs, thus benefiting the most from future data refinement and experimental research. Kow, sediment degradation rate, and vapor pressure were the least influential parameters on CF results. These results may be very different for other, e.g., more lipophilic chemicals, where Kow is known to drive many fate and exposure aspects in multimedia modeling. Furthermore, non-linearity between input parameters and CF results prevents transferring sensitivity conclusions from one chemical to another. Conclusions A sensitivity-based approach for data refinement and research prioritization can provide guidance to database managers, life cycle assessment practitioners, and experimentalists to concentrate efforts on the few parameters that are most influential on toxicity characterization model results. Researchers can conserve resources and address parameter uncertainty by applying this approach when developing new or refining existing CFs for the inventory items that contribute most to toxicity impacts.
  4. Abstract. We present a rapid method for apportioning the sources of atmospheric organic aerosol composition measured by gas chromatography–mass spectrometry methods. Here, we specifically apply this new analysis method to data acquired on a thermal desorption aerosol gas chromatograph (TAG) system. Gas chromatograms are divided by retention time into evenly spaced bins, within which the mass spectra are summed. A previous chromatogram binning method was introduced for the purpose of chromatogram structure deconvolution (e.g., major compound classes) (Zhang et al., 2014). Here we extend the method development for the specific purpose of determining aerosol samples' sources. Chromatogram bins are arranged into an input data matrix for positive matrix factorization (PMF), where the sample number is the row dimension and the mass-spectra-resolved eluting time intervals (bins) are the column dimension. Then two-dimensional PMF can effectively do three-dimensional factorization on the three-dimensional TAG mass spectra data. The retention time shift of the chromatogram is corrected by applying the median values of the different peaks' shifts. Bin width affects chemical resolution but does not affect PMF retrieval of the sources' time variations for low-factor solutions. A bin width smaller than the maximum retention shift among all samples requires retention time shift correction. A six-factor PMF comparison among aerosol mass spectrometry (AMS), TAG binning, and conventional TAG compound integration methods shows that the TAG binning method performs similarly to the integration method. However, the new binning method incorporates the entirety of the data set and requires significantly less pre-processing of the data than conventional single compound identification and integration. In addition, while a fraction of the most oxygenated aerosol does not elute through an underivatized TAG analysis, the TAG binning method does have the ability to achieve molecular level resolution on other bulk aerosol components commonly observed by the AMS.
  5. Abstract

    A common way to integrate and analyze large amounts of biological “omic” data is through pathway reconstruction: using condition-specific omic data to create a subnetwork of a generic background network that represents some process or cellular state. A challenge in pathway reconstruction is that adjusting pathway reconstruction algorithms’ parameters produces pathways with drastically different topological properties and biological interpretations. Due to the exploratory nature of pathway reconstruction, there is no ground truth for direct evaluation, so parameter tuning methods typically used in statistics and machine learning are inapplicable. We developed the pathway parameter advising algorithm to tune pathway reconstruction algorithms to minimize biologically implausible predictions. We leverage background knowledge in pathway databases to select pathways whose high-level structure resembles that of manually curated biological pathways. At the core of this method is a graphlet decomposition metric, which measures topological similarity to curated biological pathways. In order to evaluate pathway parameter advising, we compare its performance in avoiding implausible networks and reconstructing pathways from the NetPath database with other parameter selection methods across four pathway reconstruction algorithms. We also demonstrate how pathway parameter advising can guide reconstruction of an influenza host factor network. Pathway parameter advising is method agnostic; it is applicable to any pathway reconstruction algorithm with tunable parameters.
