Title: Robustness test of the spacegroupMining model for determining space groups from atomic pair distribution function data
Machine learning models based on convolutional neural networks have been used for predicting space groups of crystal structures from their atomic pair distribution functions (PDFs). However, the PDFs used to train the model are calculated using a fixed set of parameters that reflect specific experimental conditions, and the accuracy of the model when given PDFs generated with different choices of these parameters is unknown. In this work, it is shown that the top-1 and top-6 accuracies of the model are robust when it is applied to PDFs generated with different choices of the experimental parameters $r_{max}$, $Q_{max}$ and $Q_{damp}$ and of the atomic displacement parameters.
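The top-1 and top-6 accuracies mentioned in the abstract are instances of top-k accuracy: the fraction of samples for which the true class is among the k highest-scoring predictions. The following is a minimal sketch of that metric only, not the authors' evaluation code. The function name `top_k_accuracy` and the toy scores and labels are hypothetical and do not represent real space-group results.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int,
) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, true_label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if true_label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy data (hypothetical): 3 samples scored over 4 classes.
toy_scores = [
    [0.10, 0.60, 0.20, 0.10],  # true class 1 is ranked first
    [0.50, 0.30, 0.15, 0.05],  # true class 1 is ranked second
    [0.40, 0.30, 0.20, 0.10],  # true class 3 is ranked last
]
toy_labels = [1, 1, 3]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
print(f"top-1: {top1:.3f}, top-2: {top2:.3f}")
```

In a robustness test of the kind described, the same metric would be recomputed on PDFs generated under each alternative parameter choice and compared with the value obtained under the training conditions.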
Award ID(s):
1922234
PAR ID:
10384138
Journal Name:
Journal of Applied Crystallography
Volume:
55
Issue:
3
ISSN:
1600-5767
Page Range / eLocation ID:
626 to 630
Sponsoring Org:
National Science Foundation
More Like this
  1. A method is presented for predicting the space group of a structure given a calculated or measured atomic pair distribution function (PDF) from that structure. The method utilizes machine learning models trained on more than 100 000 PDFs calculated from structures in the 45 most heavily represented space groups. In particular, a convolutional neural network (CNN) model is presented which yields a promising result in that it correctly identifies the space group among the top-6 estimates 91.9% of the time. The CNN model also successfully identifies space groups for 12 out of 15 experimental PDFs. Interesting aspects of the failed estimates are discussed, which indicate that the CNN is failing in ways similar to conventional indexing algorithms applied to conventional powder diffraction data. This preliminary success of the CNN model shows the possibility of model-independent assessment of PDF data on a wide class of materials.
  2. A novel automated high-throughput screening approach, ClusterFinder, is reported for finding candidate structures for atomic pair distribution function (PDF) structural refinements. Finding starting models for PDF refinements is notoriously difficult when the PDF originates from nanoclusters or small nanoparticles. The reported ClusterFinder algorithm can screen $10^4$ to $10^5$ candidate structures from structural databases such as the Inorganic Crystal Structure Database (ICSD) in minutes, using the crystal structures as templates in which it looks for atomic clusters that result in a PDF similar to the target measured PDF. The algorithm returns a rank-ordered list of clusters for further assessment by the user. The algorithm has performed well for simulated and measured PDFs of metal–oxido clusters such as Keggin clusters. This is therefore a powerful approach to finding structural cluster candidates in a modelling campaign for PDFs of nanoparticles and nanoclusters.
  3. In this work, we find empirical evidence that the scale-dependent statistical properties of solar wind and magnetohydrodynamic (MHD) turbulence can be described in terms of a family of parametric probability distribution functions (PDFs) known as Normal Inverse Gaussian (NIG). Understanding these PDFs is one of the most important goals in turbulence theory, as they are inherently connected to the intermittent properties of solar wind turbulence. We investigate the properties of PDFs of Elsasser increments based on a large statistical sample from solar wind observations and high-resolution numerical simulations of MHD turbulence. In order to measure the PDFs and their corresponding properties, three experiments are presented: fast and slow solar wind for experimental data and a simulation of reduced MHD (RMHD) turbulence. Conditional statistics on a 23-yr-long sample of WIND data near 1 au and high-resolution pseudo-spectral simulation of steadily driven RMHD turbulence on a $2048^3$ mesh are used to construct scale-dependent PDFs. The empirical PDFs are fitted to NIG distributions, which depend on four free parameters. Our analysis shows that NIG distributions accurately capture the evolution of the PDFs, with scale-dependent parameters, from large scales characterized by a Gaussian distribution, turning to exponential tails within the inertial range and stretched exponentials at dissipative scales. We also show that empirically-measured NIG parameters exhibit well-defined scaling properties that are similar across the three empirical data sets, which may be indicative of universal behaviour.
  4. As large language models (LLMs) are adopted as a fundamental component of language technologies, it is crucial to accurately characterize their performance. Because choices in prompt design can strongly influence model behavior, this design process is critical in effectively using any modern pre-trained generative language model. In this work, we focus on LLM sensitivity to a quintessential class of meaning-preserving design choices: prompt formatting. We find that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with performance differences of up to 76 accuracy points when evaluated using LLaMA-2-13B. Sensitivity remains even when increasing model size, the number of few-shot examples, or performing instruction tuning. Our analysis suggests that work evaluating LLMs with prompting-based methods would benefit from reporting a range of performance across plausible prompt formats, instead of the currently-standard practice of reporting performance on a single format. We also show that format performance only weakly correlates between models, which puts into question the methodological validity of comparing models with an arbitrarily chosen, fixed prompt format. To facilitate systematic analysis we propose FormatSpread, an algorithm that rapidly evaluates a sampled set of plausible prompt formats for a given task, and reports the interval of expected performance without accessing model weights. Furthermore, we present a suite of analyses that characterize the nature of this sensitivity, including exploring the influence of particular atomic perturbations and the internal representation of particular formats. 
  5. Multilevel regression discontinuity designs have been increasingly used in education research to evaluate the effectiveness of policy and programs. It is common to ignore a level of nesting in a three-level data structure (students nested in classrooms/teachers nested in schools), whether unwittingly during data analysis or due to resource constraints during the planning phase. This study investigates the consequences of ignoring the intermediate or top level in blocked three-level regression discontinuity designs (BIRD3; treatment is at level 1) during data analysis and planning. Monte Carlo simulation results indicated that ignoring a level during analysis did not affect the accuracy of treatment effect estimates; however, it affected the precision (standard errors, power, and Type I error rates). Ignoring the intermediate level did not cause a significant problem. Power rates were slightly underestimated, whereas Type I error rates were stable. In contrast, ignoring the top level resulted in overestimated power rates; however, severe inflation in Type I error rendered this strategy ineffective. As for the design phase, when the intermediate level was ignored, it is viable to use parameters from a two-level blocked regression discontinuity model (BIRD2) to plan a BIRD3 design. However, level 2 parameters from the BIRD2 model should be substituted for level 3 parameters in the BIRD3 design. When the top level was ignored, using parameters from the BIRD2 model to plan a BIRD3 design should be avoided.